Preface
  Who this book is for
  Conventions
  Reader feedback
  Customer support
  Downloading the example code
  Downloading the color images of this book
  Errata
  Piracy
  Questions

Getting Started
  Installing Enthought Canopy
  Giving the installation a test run
  If you occasionally get problems opening your IPYNB files
  Using and understanding IPython (Jupyter) Notebooks
  Python basics - Part 1
  Understanding Python code
  Importing modules
  Data structures
  Experimenting with lists
  Pre colon
  Post colon
  Negative syntax
  Adding list to list
  The append function
  Complex data structures
  Dereferencing a single element
  The sort function
  Reverse sort
  Tuples
  Dereferencing an element
  List of tuples
  Dictionaries
  Iterating through entries
  Python basics - Part 2
  Functions in Python
  Lambda functions - functional programming
  Understanding boolean expressions
  The if statement
  The if-else loop
  Looping
  The while loop
  Exploring activity
  Running Python scripts
  More options than just the IPython/Jupyter Notebook
  Running Python scripts in command prompt
  Using the Canopy IDE
  Summary

Statistics and Probability Refresher, and Python Practice
  Types of data
  Numerical data
  Discrete data
  Continuous data
  Categorical data
  Ordinal data
  Mean, median, and mode
  Mean
  Median
  The factor of outliers
  Mode
  Using mean, median, and mode in Python
  Calculating mean using the NumPy package
  Visualizing data using matplotlib
  Calculating median using the NumPy package
  Analyzing the effect of outliers
  Calculating mode using the SciPy package
  Some exercises
  Standard deviation and variance
  Variance
  Measuring variance
  Standard deviation
  Identifying outliers with standard deviation
  Population variance versus sample variance
  The Mathematical explanation
  Analyzing standard deviation and variance on a histogram
  Using Python to compute standard deviation and variance
  Try it yourself
  Probability density function and probability mass function
  The probability density function and probability mass functions
  Probability density functions
  Probability mass functions
  Types of data distributions
  Uniform distribution
  Normal or Gaussian distribution
  The exponential probability distribution or Power law
  Binomial probability mass function
  Poisson probability mass function
  Percentiles and moments
  Percentiles
  Quartiles
  Computing percentiles in Python
  Moments
  Computing moments in Python
  Summary

Matplotlib and Advanced Probability Concepts
  A crash course in Matplotlib
  Generating multiple plots on one graph
  Saving graphs as images
  Adjusting the axes
  Adding a grid
  Changing line types and colors
  Labeling axes and adding a legend
  A fun example
  Generating pie charts
  Generating bar charts
  Generating scatter plots
  Generating histograms
  Generating box-and-whisker plots
  Try it yourself
  Covariance and correlation
  Defining the concepts
  Measuring covariance
  Correlation
  Computing covariance and correlation in Python
  Computing correlation – The hard way
  Computing correlation – The NumPy way
  Correlation activity
  Conditional probability
  Conditional probability exercises in Python
  Conditional probability assignment
  My assignment solution
  Bayes' theorem
  Summary

Predictive Models
  Linear regression
  The ordinary least squares technique
  The gradient descent technique
  The coefficient of determination or r-squared
  Computing r-squared
  Interpreting r-squared
  Computing linear regression and r-squared using Python
  Activity for linear regression
  Polynomial regression
  Implementing polynomial regression using NumPy
  Computing the r-squared error
  Activity for polynomial regression
  Multivariate regression and predicting car prices
  Multivariate regression using Python
  Activity for multivariate regression
  Multi-level models
  Summary

Machine Learning with Python
  Machine learning and train/test
  Unsupervised learning
  Supervised learning
  Evaluating supervised learning
  K-fold cross validation
  Using train/test to prevent overfitting of a polynomial regression
  Activity
  Bayesian methods - Concepts
  Implementing a spam classifier with Naïve Bayes
  Activity
  K-Means clustering
  Limitations to k-means clustering
  Clustering people based on income and age
  Activity
  Measuring entropy
  Decision trees - Concepts
  Decision tree example
  Walking through a decision tree
  Random forests technique
  Decision trees - Predicting hiring decisions using Python
  Ensemble learning – Using a random forest
  Activity
  Ensemble learning
  Support vector machine overview
  Using SVM to cluster people by using scikit-learn
  Activity
  Summary

Recommender Systems
  What are recommender systems?
  User-based collaborative filtering
  Limitations of user-based collaborative filtering
  Item-based collaborative filtering
  Understanding item-based collaborative filtering
  How item-based collaborative filtering works?
  Collaborative filtering using Python
  Finding movie similarities
  Understanding the code
  The corrwith function
  Improving the results of movie similarities
  Making movie recommendations to people
  Understanding movie recommendations with an example
  Using the groupby command to combine rows
  Removing entries with the drop command
  Improving the recommendation results
  Summary

More Data Mining and Machine Learning Techniques
  K-nearest neighbors - concepts
  Using KNN to predict a rating for a movie
  Activity
  Dimensionality reduction and principal component analysis
  Dimensionality reduction
  Principal component analysis
  A PCA example with the Iris dataset
  Activity
  Data warehousing overview
  ETL versus ELT
  Reinforcement learning
  Q-learning
  The exploration problem
  The simple approach
  The better way
  Fancy words
  Markov decision process
  Dynamic programming
  Summary

Dealing with Real-World Data
  Bias/variance trade-off
  K-fold cross-validation to avoid overfitting
  Example of k-fold cross-validation using scikit-learn
  Data cleaning and normalization
  Cleaning web log data
  Applying a regular expression on the web log
  Modification one - filtering the request field
  Modification two - filtering post requests
  Modification three - checking the user agents
  Filtering the activity of spiders/robots
  Modification four - applying website-specific filters
  Activity for web log data
  Normalizing numerical data
  Detecting outliers
  Dealing with outliers
  Activity for outliers
  Summary

Apache Spark - Machine Learning on Big Data
  Installing Spark
  Installing Spark on Windows
  Installing Spark on other operating systems
  Installing the Java Development Kit
  Installing Spark
  Spark introduction
  It's scalable
  It's fast
  It's young
  It's not difficult
  Components of Spark
  Python versus Scala for Spark
  Spark and Resilient Distributed Datasets (RDD)
  The SparkContext object
  Creating RDDs
  Creating an RDD using a Python list
  Loading an RDD from a text file
  More ways to create RDDs
  RDD operations
  Transformations
  Using map()
  Actions
  Introducing MLlib
  Some MLlib Capabilities
  Special MLlib data types
  The vector data type
  LabeledPoint data type
  Rating data type
  Decision Trees in Spark with MLlib
  Exploring decision trees code
  Creating the SparkContext
  Importing and cleaning our data
  Creating a test candidate and building our decision tree
  Running the script
  K-Means Clustering in Spark
  Within set sum of squared errors (WSSSE)
  Running the code
  TF-IDF
  TF-IDF in practice
  Using TF-IDF
  Searching Wikipedia with Spark MLlib
  Import statements
  Creating the initial RDD
  Creating and transforming a HashingTF object
  Computing the TF-IDF score
  Using the Wikipedia search engine algorithm
  Running the algorithm
  Using the Spark 2.0 DataFrame API for MLlib
  How Spark 2.0 MLlib works
  Implementing linear regression
  Summary

Testing and Experimental Design
  A/B testing concepts
  A/B tests
  Measuring conversion for A/B testing
  How to attribute conversions
  Variance is your enemy
  T-test and p-value
  The t-statistic or t-test
  The p-value
  Measuring t-statistics and p-values using Python
  Running A/B test on some experimental data
  When there's no real difference between the two groups
  Does the sample size make a difference?
  Sample size increased to six-digits
  Sample size increased to seven-digits
  A/A testing
  Determining how long to run an experiment for
  A/B test gotchas
  Novelty effects
  Seasonal effects
  Selection bias
  Auditing selection bias issues
  Data pollution
  Attribution errors
  Summary