At the end of Chapter 8 you extended what you learned about R to explore and test relationships in the mpg dataset. We’ll do the same in this chapter, using Python. We’ve conducted the same work in Excel and R, so I’ll focus less on the whys of our analysis in favor of the hows of doing it in Python.
To get started, let’s call in all the necessary modules. Some of these are new: from scipy, we’ll import the stats submodule. To do this, we’ll use the from keyword to tell Python what module to look for, then the usual import keyword to choose a submodule. As the name suggests, we’ll use the stats submodule of scipy to conduct our statistical analysis. We’ll also be using a new package called sklearn, or scikit-learn, to validate our model on a train/test split. This package has become a dominant resource for machine learning and also comes installed with Anaconda.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import linear_model
from sklearn import model_selection
from sklearn import metrics
With the usecols argument of read_csv() we can specify which columns to read into the DataFrame:
In [2]:
mpg = pd.read_csv('datasets/mpg/mpg.csv',
                  usecols=['mpg', 'weight', 'horsepower', 'origin', 'cylinders'])
mpg.head()

Out [2]:
    mpg  cylinders  horsepower  weight origin
0  18.0          8         130    3504    USA
1  15.0          8         165    3693    USA
2  18.0          8         150    3436    USA
3  16.0          8         150    3433    USA
4  17.0          8         140    3449    USA
Let’s start with the descriptive statistics:
In [3]:
mpg.describe()

Out [3]:
              mpg   cylinders  horsepower       weight
count  392.000000  392.000000  392.000000   392.000000
mean    23.445918    5.471939  104.469388  2977.584184
std      7.805007    1.705783   38.491160   849.402560
min      9.000000    3.000000   46.000000  1613.000000
25%     17.000000    4.000000   75.000000  2225.250000
50%     22.750000    4.000000   93.500000  2803.500000
75%     29.000000    8.000000  126.000000  3614.750000
max     46.600000    8.000000  230.000000  5140.000000
Because origin is a categorical variable, by default it doesn’t show up as part of describe(). Let’s explore this variable instead with a frequency table. This can be done in pandas with the crosstab() function. First, we’ll specify what data to place on the index: origin. We’ll get a count for each level by setting the columns argument to count:
In [4]:
pd.crosstab(index=mpg['origin'], columns='count')

Out [4]:
col_0   count
origin
Asia       79
Europe     68
USA       245
To make a two-way frequency table, we can instead set columns to another categorical variable, such as cylinders:
In [5]:
pd.crosstab(index=mpg['origin'], columns=mpg['cylinders'])

Out [5]:
cylinders  3   4  5   6    8
origin
Asia       4  69  0   6    0
Europe     0  61  3   4    0
USA        0  69  0  73  103
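If proportions are more useful than raw counts, crosstab() also accepts a normalize argument; this is standard pandas behavior, shown here as a quick sketch rather than part of our main analysis:

# Proportions instead of raw counts; normalize='index'
# scales each row to sum to 1
pd.crosstab(index=mpg['origin'], columns=mpg['cylinders'], normalize='index')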
Next, let’s retrieve descriptive statistics for mpg by each level of origin. I’ll do this by chaining together two methods, then subsetting the results:
In [6]:
mpg.groupby('origin').describe()['mpg']

Out [6]:
        count       mean       std   min    25%   50%     75%   max
origin
Asia     79.0  30.450633  6.090048  18.0  25.70  31.6  34.050  46.6
Europe   68.0  27.602941  6.580182  16.2  23.75  26.0  30.125  44.3
USA     245.0  20.033469  6.440384   9.0  15.00  18.5  24.000  39.0
We can also visualize the overall distribution of mpg, as in Figure 13-1:
In [7]:
sns.displot(data=mpg, x='mpg')
Now let’s make a boxplot as in Figure 13-2 comparing the distribution of mpg across each level of origin:
In [8]:
sns.boxplot(x='origin', y='mpg', data=mpg, color='pink')
Alternatively, we can set the col argument of displot() to origin to create faceted histograms, such as in Figure 13-3:
In [9]:
sns.displot(data=mpg, x="mpg", col="origin")
Let’s again test for a difference in mileage between American and European cars. For ease of analysis, we’ll split the observations in each group into their own DataFrames.
In [10]:
usa_cars = mpg[mpg['origin'] == 'USA']
europe_cars = mpg[mpg['origin'] == 'Europe']
We can now use the ttest_ind() function from scipy.stats to conduct the t-test. This function expects two numpy arrays as arguments; pandas Series also work:
In [11]:
stats.ttest_ind(usa_cars['mpg'], europe_cars['mpg'])

Out [11]:
Ttest_indResult(statistic=-8.534455914399228, pvalue=6.306531719750568e-16)
Unfortunately, the output here is rather scarce: while it does include the p-value, it doesn’t include the confidence interval. To run a t-test with more output, check out the researchpy module.
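Alternatively, you can compute the confidence interval yourself. Here’s a minimal sketch that derives a 95% confidence interval for the difference in means, assuming the pooled-variance (equal variances) setup that ttest_ind() uses by default:

import numpy as np

x, y = usa_cars['mpg'], europe_cars['mpg']
nx, ny = len(x), len(y)

# Pooled standard deviation, matching ttest_ind's default equal_var=True
sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
             / (nx + ny - 2))
se = sp * np.sqrt(1 / nx + 1 / ny)  # standard error of the mean difference

diff = x.mean() - y.mean()
t_crit = stats.t.ppf(0.975, df=nx + ny - 2)  # two-sided 95% critical value
print((diff - t_crit * se, diff + t_crit * se))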
Let’s move on to analyzing our continuous variables. We’ll start with a correlation matrix. We can use the corr() method from pandas, including only the relevant variables:
In [12]:
mpg[['mpg', 'horsepower', 'weight']].corr()

Out [12]:
                 mpg  horsepower    weight
mpg         1.000000   -0.778427 -0.832244
horsepower -0.778427    1.000000  0.864538
weight     -0.832244    0.864538  1.000000
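If you’d like a visual read on these correlations, seaborn’s heatmap() function can render the same matrix as a color-coded grid; this is an optional aside, not something we’ll rely on later in the chapter:

# Color-code the correlation matrix; annot=True prints each coefficient
corr = mpg[['mpg', 'horsepower', 'weight']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')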
Next, let’s visualize the relationship between weight and mpg with a scatterplot as shown in Figure 13-4:
In [13]:
sns.scatterplot(x='weight', y='mpg', data=mpg)
plt.title('Relationship between weight and mileage')
Alternatively, we could produce scatterplots across all pairs of our dataset with the pairplot() function from seaborn. Histograms of each variable are included along the diagonal, as seen in Figure 13-5:
In [14]:
sns.pairplot(mpg[['mpg', 'horsepower', 'weight']])
Now it’s time for a linear regression. To do this, we’ll use linregress() from scipy, which also looks for two numpy arrays or pandas Series. We’ll specify our independent and dependent variables with the x and y arguments, respectively:
In [15]:
# Linear regression of weight on mpg
stats.linregress(x=mpg['weight'], y=mpg['mpg'])

Out [15]:
LinregressResult(slope=-0.007647342535779578, intercept=46.21652454901758,
rvalue=-0.8322442148315754, pvalue=6.015296051435726e-102,
stderr=0.0002579632782734318)
Again, you’ll see that some of the output you may be used to is missing here. Be careful: the rvalue included is the correlation coefficient, not R-squared. For a richer linear regression output, check out the statsmodels module.
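As a sketch of what that looks like (assuming statsmodels is installed; it also comes with Anaconda), the OLS summary includes R-squared, coefficient p-values, and confidence intervals:

import statsmodels.api as sm

# statsmodels does not add an intercept automatically
X = sm.add_constant(mpg['weight'])
model = sm.OLS(mpg['mpg'], X).fit()
print(model.summary())  # full regression table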
Last but not least, let’s overlay our regression line on a scatterplot. seaborn has a separate function to do just that: regplot(). As usual, we’ll specify our independent and dependent variables and where to get the data. This results in Figure 13-6:
In [16]:
# Fit regression line to scatterplot
sns.regplot(x="weight", y="mpg", data=mpg)
plt.xlabel('Weight (lbs)')
plt.ylabel('Mileage (mpg)')
plt.title('Relationship between weight and mileage')
At the end of Chapter 9 you learned how to apply a train/test split when building a linear regression model in R.
We will use the train_test_split() function to split our dataset into four DataFrames: not just by training and testing but also by independent and dependent variables. We’ll pass in a DataFrame containing our independent variable first, then one containing the dependent variable. Using the random_state argument, we’ll seed the random number generator so the results remain consistent for this example:
In [17]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    mpg[['weight']], mpg[['mpg']], random_state=1234)
By default, the data is split 75/25 between training and testing subsets:
In [18]:
y_train.shape

Out [18]:
(294, 1)

In [19]:
y_test.shape

Out [19]:
(98, 1)
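If you want a different split, train_test_split() accepts a test_size argument; the 0.3 in this sketch is just an illustrative choice:

# Hold out 30% of observations for testing instead of the default 25%
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    mpg[['weight']], mpg[['mpg']], test_size=0.3, random_state=1234)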
Now, let’s fit the model to the training data. First we’ll specify the linear model with LinearRegression(), then we’ll train the model with regr.fit(). To get the predicted values for the test dataset, we can use predict(). This results in a numpy array, not a pandas DataFrame, so the head() method won’t work to print the first few rows. We can, however, slice it:
In [20]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Print first five observations
y_pred[:5]

Out [20]:
array([[14.86634263],
       [23.48793632],
       [26.2781699 ],
       [27.69989655],
       [29.05319785]])
The coef_ attribute returns the coefficient of our fitted model:
In [21]:
regr.coef_

Out [21]:
array([[-0.00760282]])
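The companion intercept_ attribute holds the intercept, so together the two give the fitted equation. A quick sketch (the print formatting is mine; both attributes are standard scikit-learn):

# Both attributes are 2-D here because y_train was a one-column DataFrame
print(f"mpg = {regr.intercept_[0]:.2f} + ({regr.coef_[0][0]:.5f} * weight)")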
To get more information about the model, such as the coefficient p-values or R-squared, try fitting it with the statsmodels package.
For now, we’ll evaluate the performance of the model on our test data, this time using the metrics submodule of sklearn. We’ll pass our actual and predicted values to the r2_score() and mean_squared_error() functions, which return the R-squared and the mean squared error (MSE), respectively. Take the square root of the MSE to get the RMSE.
In [22]:
metrics.r2_score(y_test, y_pred)

Out [22]:
0.6811923996681357

In [23]:
metrics.mean_squared_error(y_test, y_pred)

Out [23]:
21.63348076436662
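For the RMSE itself, a one-line sketch using numpy (some newer scikit-learn versions also accept squared=False in mean_squared_error(), though that is version-dependent):

import numpy as np

# RMSE is the square root of the MSE, in the original units (mpg)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(rmse)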
Take another look at the ais dataset, this time using Python. Read the Excel workbook in from the book repository and complete the following. You should be pretty comfortable with this analysis by now.
1. Visualize the distribution of red blood cell count (rcc) by sex (sex).
2. Is there a significant difference in red blood cell count between the two groups of sex?
3. Produce a correlation matrix of the relevant variables in this dataset.
4. Visualize the relationship of height (ht) and weight (wt).
5. Regress ht on wt. Find the equation of the fit regression line. Is there a significant relationship?
6. Split your regression model into training and testing subsets. What are the R-squared and RMSE on your test model?