Chapter 13. Capstone: Python for Data Analytics

At the end of Chapter 8 you extended what you learned about R to explore and test relationships in the mpg dataset. We’ll do the same in this chapter, using Python. We’ve conducted the same work in Excel and R, so I’ll focus less on the whys of our analysis in favor of the hows of doing it in Python.

To get started, let’s call in all the necessary modules. Some of these are new: from scipy, we’ll import the stats submodule. To do this, we’ll use the from keyword to tell Python which package to look in, then the usual import keyword to choose the submodule. As the name suggests, we’ll use the stats submodule of scipy to conduct our statistical analysis. We’ll also be using a new package called sklearn, or scikit-learn, to validate our model on a train/test split. This package has become a dominant resource for machine learning and also comes installed with Anaconda.

In [1]: import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        from scipy import stats
        from sklearn import linear_model
        from sklearn import model_selection
        from sklearn import metrics

With the usecols argument of read_csv() we can specify which columns to read into the DataFrame:

In [2]: mpg = pd.read_csv('datasets/mpg/mpg.csv',
            usecols=['mpg', 'weight', 'horsepower', 'origin', 'cylinders'])
        mpg.head()

Out[2]:
     mpg  cylinders  horsepower  weight origin
 0  18.0          8         130    3504    USA
 1  15.0          8         165    3693    USA
 2  18.0          8         150    3436    USA
 3  16.0          8         150    3433    USA
 4  17.0          8         140    3449    USA

Exploratory Data Analysis

Let’s start with the descriptive statistics:

In[3]: mpg.describe()

Out[3]:
              mpg   cylinders  horsepower       weight
count  392.000000  392.000000  392.000000   392.000000
mean    23.445918    5.471939  104.469388  2977.584184
std      7.805007    1.705783   38.491160   849.402560
min      9.000000    3.000000   46.000000  1613.000000
25%     17.000000    4.000000   75.000000  2225.250000
50%     22.750000    4.000000   93.500000  2803.500000
75%     29.000000    8.000000  126.000000  3614.750000
max     46.600000    8.000000  230.000000  5140.000000

Because origin is a categorical variable, it doesn’t show up as part of describe() by default. Let’s explore this variable instead with a frequency table, which can be done in pandas with the crosstab() function. First, we’ll specify what data to place on the index: origin. We’ll get a count for each level by setting the columns argument to the string 'count':

In [4]: pd.crosstab(index=mpg['origin'], columns='count')

Out[4]:
col_0   count
origin
Asia       79
Europe     68
USA       245
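
As an aside, crosstab() also accepts a normalize argument if you’d rather see proportions than raw counts. A quick sketch:

    # Each cell as a share of the grand total
    pd.crosstab(index=mpg['origin'], columns='count', normalize=True)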

To make a two-way frequency table, we can instead set columns to another categorical variable, such as cylinders:

In [5]: pd.crosstab(index=mpg['origin'], columns=mpg['cylinders'])

Out[5]:
cylinders  3   4  5   6    8
origin
Asia       4  69  0   6    0
Europe     0  61  3   4    0
USA        0  69  0  73  103

Next, let’s retrieve descriptive statistics for mpg by each level of origin. I’ll do this by chaining together two methods, then subsetting the results:

In[6]: mpg.groupby('origin').describe()['mpg']

Out[6]:
        count       mean       std   min    25%   50%     75%   max
origin
Asia     79.0  30.450633  6.090048  18.0  25.70  31.6  34.050  46.6
Europe   68.0  27.602941  6.580182  16.2  23.75  26.0  30.125  44.3
USA     245.0  20.033469  6.440384   9.0  15.00  18.5  24.000  39.0

We can also visualize the overall distribution of mpg, as in Figure 13-1:

In[7]: sns.displot(data=mpg, x='mpg')
Figure 13-1. Histogram of mpg

Now let’s make a boxplot as in Figure 13-2 comparing the distribution of mpg across each level of origin:

In[8]: sns.boxplot(x='origin', y='mpg', data=mpg, color='pink')
Figure 13-2. Boxplot of mpg by origin

Alternatively, we can set the col argument of displot() to origin to create faceted histograms, such as in Figure 13-3:

In[9]: sns.displot(data=mpg, x="mpg", col="origin")
Figure 13-3. Faceted histogram of mpg by origin

Hypothesis Testing

Let’s again test for a difference in mileage between American and European cars. For ease of analysis, we’ll split the observations in each group into their own DataFrames.

In[10]: usa_cars = mpg[mpg['origin']=='USA']
        europe_cars = mpg[mpg['origin']=='Europe']

Independent Samples T-test

We can now use the ttest_ind() function from scipy.stats to conduct the t-test. This function expects two numpy arrays as arguments; pandas Series also work:

In[11]: stats.ttest_ind(usa_cars['mpg'], europe_cars['mpg'])

Out[11]: Ttest_indResult(statistic=-8.534455914399228,
            pvalue=6.306531719750568e-16)

Unfortunately, the output here is rather sparse: while it does include the p-value, it doesn’t include the confidence interval. To run a t-test with richer output, check out the researchpy module.
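
That said, recent versions of scipy also attach a confidence_interval() method to the result object. If yours doesn’t, here is a minimal sketch that computes the 95 percent confidence interval for the difference in means by hand, under the same equal-variance assumption ttest_ind() makes by default:

    import numpy as np

    x, y = usa_cars['mpg'], europe_cars['mpg']
    diff = x.mean() - y.mean()

    # Pooled standard error of the difference in means
    df = len(x) + len(y) - 2
    pooled_var = ((len(x) - 1) * x.var() + (len(y) - 1) * y.var()) / df
    se = np.sqrt(pooled_var * (1 / len(x) + 1 / len(y)))

    # 95% confidence interval for the difference in means
    t_crit = stats.t.ppf(0.975, df)
    (diff - t_crit * se, diff + t_crit * se)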

Let’s move on to analyzing our continuous variables. We’ll start with a correlation matrix. We can use the corr() method from pandas, including only the relevant variables:

In[12]: mpg[['mpg','horsepower','weight']].corr()

Out[12]:
                 mpg  horsepower    weight
mpg         1.000000   -0.778427 -0.832244
horsepower -0.778427    1.000000  0.864538
weight     -0.832244    0.864538  1.000000
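
If you’d like to see this matrix graphically, seaborn’s heatmap() function will render it. A minimal sketch:

    # Correlation matrix as a heatmap; annot=True prints each coefficient
    corr = mpg[['mpg', 'horsepower', 'weight']].corr()
    sns.heatmap(corr, annot=True)
    plt.title('Correlation matrix')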

Next, let’s visualize the relationship between weight and mpg with a scatterplot as shown in Figure 13-4:

In[13]: sns.scatterplot(x='weight', y='mpg', data=mpg)
        plt.title('Relationship between weight and mileage')
Figure 13-4. Scatterplot of mpg by weight

Alternatively, we could produce scatterplots across all pairs of our dataset with the pairplot() function from seaborn. Histograms of each variable are included along the diagonal, as seen in Figure 13-5:

In[14]: sns.pairplot(mpg[['mpg','horsepower','weight']])
Figure 13-5. Pairplot of mpg, horsepower, and weight

Linear Regression

Now it’s time for a linear regression. To do this, we’ll use linregress() from scipy, which also expects two numpy arrays or pandas Series. We’ll specify our independent and dependent variables with the x and y arguments, respectively:

In[15]: # Linear regression of weight on mpg
        stats.linregress(x=mpg['weight'], y=mpg['mpg'])

Out[15]: LinregressResult(slope=-0.007647342535779578,
   intercept=46.21652454901758, rvalue=-0.8322442148315754,
   pvalue=6.015296051435726e-102, stderr=0.0002579632782734318)

Again, you’ll see that some of the output you may be used to is missing here. Be careful: the rvalue included is the correlation coefficient, not R-squared. For a richer linear regression output, check out the statsmodels module.
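
Because rvalue is just the correlation coefficient, squaring it recovers R-squared:

    # The result's attributes are accessible by name;
    # squaring rvalue gives R-squared
    results = stats.linregress(x=mpg['weight'], y=mpg['mpg'])
    results.rvalue ** 2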

Last but not least, let’s overlay our regression line onto a scatterplot. seaborn has a separate function to do just that: regplot(). As usual, we’ll specify our independent and dependent variables and where to get the data. This results in Figure 13-6:

In[16]: # Fit regression line to scatterplot
        sns.regplot(x="weight", y="mpg", data=mpg)
        plt.xlabel('Weight (lbs)')
        plt.ylabel('Mileage (mpg)')
        plt.title('Relationship between weight and mileage')
Figure 13-6. Scatterplot with fit regression line of mpg by weight

Train/Test Split and Validation

At the end of Chapter 9 you learned how to apply a train/test split when building a linear regression model in R.

We’ll use the train_test_split() function to split our dataset into four DataFrames: not just training and testing subsets, but also independent and dependent variables. We’ll pass in a DataFrame containing our independent variable first, then one containing the dependent variable. Using the random_state argument, we’ll seed the random number generator so the results remain consistent for this example:

In[17]: X_train, X_test, y_train, y_test = model_selection.train_test_split(
            mpg[['weight']], mpg[['mpg']], random_state=1234)

By default, the data is split 75/25 between training and testing subsets:

In[18]:  y_train.shape

Out[18]: (294, 1)


In[19]:  y_test.shape

Out[19]: (98, 1)
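
If you want a different proportion, train_test_split() takes a test_size argument. A quick sketch (the rest of the chapter keeps the default split, so the hypothetical names below end in 2):

    # Sketch only: a 70/30 split instead of the default 75/25
    X_train2, X_test2, y_train2, y_test2 = model_selection.train_test_split(
        mpg[['weight']], mpg[['mpg']], test_size=0.3, random_state=1234)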

Now, let’s fit the model to the training data. First we’ll specify the linear model with LinearRegression() and assign it to regr, then we’ll train the model with regr.fit(). To get the predicted values for the test dataset, we can use predict(). This results in a numpy array, not a pandas DataFrame, so the head() method won’t work to print the first few rows; we can, however, slice it:

In[20]:  # Create linear regression object
         regr = linear_model.LinearRegression()

         # Train the model using the training sets
         regr.fit(X_train, y_train)

         # Make predictions using the testing set
         y_pred = regr.predict(X_test)

         # Print first five observations
         y_pred[:5]

Out[20]:  array([[14.86634263],
         [23.48793632],
         [26.2781699 ],
         [27.69989655],
         [29.05319785]])

The coef_ attribute returns the coefficient of our trained model:

In[21]:  regr.coef_

Out[21]: array([[-0.00760282]])
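
Similarly, the intercept_ attribute returns the intercept, so together the two attributes spell out the fitted line:

    # Fitted line: predicted mpg = intercept_ + coef_ * weight
    regr.intercept_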

To get more information about the model, such as the coefficient p-values or R-squared, try fitting it with the statsmodels package.
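
Here’s a minimal sketch of what that might look like, fit to the same training data (assuming statsmodels is installed; it also comes with Anaconda):

    import statsmodels.api as sm

    # statsmodels expects the intercept as an explicit constant column
    X_train_const = sm.add_constant(X_train)
    sm_results = sm.OLS(y_train, X_train_const).fit()

    # summary() reports coefficients, p-values, R-squared, and more
    print(sm_results.summary())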

For now, we’ll evaluate the performance of the model on our test data, this time using the metrics submodule of sklearn. We’ll pass our actual and predicted values to the r2_score() and mean_squared_error() functions, which return the R-squared and mean squared error (MSE), respectively.

In[22]:  metrics.r2_score(y_test, y_pred)

Out[22]: 0.6811923996681357


In[23]:  metrics.mean_squared_error(y_test, y_pred)

Out[23]: 21.63348076436662
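
To report the RMSE instead, take the square root of the MSE:

    import numpy as np

    # RMSE is the square root of the mean squared error
    np.sqrt(metrics.mean_squared_error(y_test, y_pred))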

Conclusion

The usual caveat applies to this chapter: we’ve just scratched the surface of the analysis that’s possible on this or any other dataset. But I hope you feel you’ve hit your stride working with data in Python.

Exercises

Take another look at the ais dataset, this time using Python. Read the Excel workbook in from the book repository and complete the following. You should be pretty comfortable with this analysis by now.

  1. Visualize the distribution of red blood cell count (rcc) by sex (sex).

  2. Is there a significant difference in red blood cell count between the two sexes?

  3. Produce a correlation matrix of the relevant variables in this dataset.

  4. Visualize the relationship of height (ht) and weight (wt).

  5. Regress ht on wt. Find the equation of the fit regression line. Is there a significant relationship?

  6. Split your regression model into training and testing subsets. What is the R-squared and RMSE on your test model?
