This part of the book follows the methods described in Jared Lander’s R for Everyone. The rationale is that since you have learned the methods of data manipulation in Python using Pandas, you can save out the cleaned data set if you need to use a method from another analytics language. Also, this part covers many of the basic modeling techniques and serves as an introduction to data analytics/machine learning. Other great references are Andreas Müller and Sarah Guido’s Introduction to Machine Learning With Python and Sebastian Raschka and Vahid Mirjalili’s Python Machine Learning.
The goal of linear regression is to draw a straight-line relationship between a response variable (also known as an outcome or dependent variable) and a predictor variable (also known as a feature, covariate, or independent variable).
Let’s take another look at our tips data set.
import pandas as pd
import seaborn as sns
tips = sns.load_dataset('tips')
print(tips.head())
In our simple linear regression, we’d like to see how the total_bill relates to or predicts the tip.
We can use the statsmodels library to perform our simple linear regression. We will use the formula API from statsmodels.
import statsmodels.formula.api as smf
To perform this simple linear regression, we use the ols function, which computes the ordinary least squares value; it is one method of estimating the parameters of a linear regression. Recall that the formula for a line is y = mx + b, where y is our response variable, x is our predictor, b is the intercept, and m is the slope; the slope and intercept are the parameters we are estimating.
The formula notation has two parts, separated by a tilde, ~. To the left of the tilde is the response variable, and to the right of the tilde is the predictor.
model = smf.ols(formula='tip ~ total_bill', data=tips)
Once we have specified our model, we can fit the data to the model by using the fit method.
results = model.fit()
To look at our results, we can call the summary method on the results.
print(results.summary())
Here we can see the Intercept of the model and the coefficient for total_bill. We can use these parameters in our formula for the line, y = (0.105)x + 0.920. To interpret these numbers, we say that for every one unit increase in total_bill (i.e., every time the bill increases by a dollar), the tip increases by 0.105 (i.e., 10.5 cents).
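If we want to use the fitted line for prediction, the results object has a predict method that accepts new data. Here is a minimal sketch, where the $20 bill is just an illustrative value; the prediction should match plugging 20 into the line, roughly 0.105 * 20 + 0.920 ≈ 3.02.
# predict the tip for a hypothetical $20 bill
new_bill = pd.DataFrame({'total_bill': [20]})
print(results.predict(new_bill))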
If we just want the coefficients, we can call the params attribute on the results.
print(results.params)
Depending on your field, you may also need to report a confidence interval, which identifies the range of values the estimated parameter is likely to take on. The 95% confidence interval lies between the values in the [0.025 and 0.975] columns of the summary output. We can extract these values with the conf_int method.
print(results.conf_int())
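By default, conf_int returns a 95% interval. Its alpha parameter controls the width; for example, a 90% interval would be computed as follows.
# 90% confidence interval (alpha is the total probability excluded)
print(results.conf_int(alpha=0.10))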
We can also use the sklearn library to fit various machine learning models. To perform the same analysis as in Section 12.2.1, we need to import the linear_model module from this library.
from sklearn import linear_model
We can then create our linear regression object.
# create our LinearRegression object
lr = linear_model.LinearRegression()
Next, we need to specify the predictor, X, and the response, y. To do this, we pass in the columns we want to use for the model.
# note it is an uppercase X
# and a lowercase y
# this will fail because our X is one-dimensional (a single column)
predicted = lr.fit(X=tips['total_bill'], y=tips['tip'])
Since sklearn is built to take numpy arrays, there will be times when you have to do some data manipulation to pass your dataframe into sklearn. The error message in the preceding output essentially tells us that the matrix passed is not in the correct shape. We need to reshape our inputs. Depending on whether we have a single feature (which is the case here) or a single sample (i.e., one observation), we specify reshape(-1, 1) or reshape(1, -1), respectively.
Calling reshape directly on the column will raise either a DeprecationWarning (Pandas 0.17) or a ValueError (Pandas 0.19), depending on the version of Pandas being used. To properly reshape our data, we must use the values attribute (otherwise you may get another error or warning). When we call values on a Pandas dataframe or series, we get the numpy ndarray representation of the data.
# note it is an uppercase X
# and a lowercase y
# we fix the data by putting it in the correct shape for sklearn
predicted = lr.fit(X=tips['total_bill'].values.reshape(-1, 1),
y=tips['tip'])
Since sklearn works on numpy ndarrays, you may see code that explicitly passes the numpy vector into the X or y parameter: y=tips['tip'].values.
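To see why the reshape was needed, we can compare shapes and types directly; this quick check is only illustrative and is not part of the model fit.
# a Series is one-dimensional; sklearn expects a two-dimensional X
print(tips['total_bill'].shape)
print(tips['total_bill'].values.reshape(-1, 1).shape)
# values returns the underlying numpy ndarray
print(type(tips['total_bill'].values))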
Unfortunately, sklearn doesn’t provide us with the nice summary tables that statsmodels does. This mainly reflects the different schools of thought behind the two libraries, namely statistics versus computer science/machine learning. To obtain the coefficients in sklearn, we call the coef_ attribute on the fitted model.
print(predicted.coef_)
To get the intercept, we call the intercept_ attribute.
print(predicted.intercept_)
Notice that we get the same results as we did with statsmodels. That is, people in our data set are tipping about 10% of their bill amount.
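As with statsmodels, we can make predictions from the fitted object. The $20 bill below is only an illustrative value; note that the input must again be two-dimensional.
# predict the tip for a hypothetical $20 bill;
# the nested list gives sklearn the 2D input it expects
print(predicted.predict([[20]]))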
In simple linear regression, a continuous response variable is regressed on a single predictor. Alternatively, we can use multiple regression to put multiple predictors in a model.
Fitting a multiple regression model to a data set is very similar to fitting a simple linear regression model. Using the formula interface, we “add” the other covariates to the right-hand side.
model = smf.ols(formula='tip ~ total_bill + size', data=tips).fit()
print(model.summary())
The interpretations are exactly the same as before, although each parameter is interpreted “with all other variables held constant.” That is, for every one unit increase (dollar) in total_bill, the tip increases by 0.09 (i.e., 9 cents) as long as the size of the group does not change.
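One way to see the “held constant” interpretation is to predict tips for two bills one dollar apart while holding the group size fixed; the difference between the two predictions equals the total_bill coefficient. The values below are illustrative.
# two hypothetical bills a dollar apart, with size held at 2
new_data = pd.DataFrame({'total_bill': [20, 21], 'size': [2, 2]})
print(model.predict(new_data))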
To this point, we have used only continuous predictors in our model. If we call the info method on our tips data set, however, we can see that our data includes categorical variables.
print(tips.info())
When we want to model a categorical variable, we have to create dummy variables. That is, each unique value in the category becomes a new binary feature. For example, sex in our data can hold one of two values, Female or Male.
print(tips.sex.unique())
statsmodels will automatically create dummy variables for us. To avoid multicollinearity, we typically drop one of the dummy variables. That is, if we have a column that indicates whether an individual is female, then we know that if the person is not female (in our data), that person must be male. In such a case, we can effectively drop the dummy variable that codes for males and still have the same information.
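To see what dummy coding looks like before statsmodels does it for us, we can inspect the dummy columns for sex directly (the get_dummies function is covered in more detail later in this chapter). Either column is redundant given the other, which is why one can be dropped.
# each sex category becomes its own binary column
print(pd.get_dummies(tips['sex']).head())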
Here’s the model that uses all the variables in our data.
model = smf.ols(
    formula='tip ~ total_bill + size + sex + smoker + day + time',
    data=tips).fit()
We can see from the summary that statsmodels automatically creates dummy variables and drops the reference variable to avoid multicollinearity.
print(model.summary())
The interpretation of these parameters is the same as before. However, our interpretation of categorical variables must be stated in relation to the reference variable (i.e., the dummy variable that was dropped from the analysis). For example, the coefficient for sex[T.Female] is 0.0324. We interpret this value in relation to the reference value, Male; that is, we say that when the sex changes from Male to Female, the tip increases by 0.0324. For the day variable:
print(tips.day.unique())
We see that our summary is missing Thur, so that is the reference variable against which the day coefficients are interpreted.
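If we prefer a different reference level, the patsy formula syntax used by statsmodels lets us set one explicitly with the C() and Treatment helpers; the choice of Sat below is arbitrary and only shows the mechanics.
# make Sat (an arbitrary choice) the reference level for day
model_day_ref = smf.ols(
    formula="tip ~ total_bill + C(day, Treatment(reference='Sat'))",
    data=tips).fit()
print(model_day_ref.params)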
The syntax for multiple regression in sklearn is very similar to the syntax for simple linear regression with this library. To add more features into the model, we pass in the columns we want to use.
lr = linear_model.LinearRegression()
# since we are performing multiple regression
# we no longer need to reshape our X values
predicted = lr.fit(X=tips[['total_bill', 'size']],
y=tips['tip'])
print(predicted.coef_)
We can get the intercept from the model just as we did earlier.
print(predicted.intercept_)
We have to manually create our dummy variables for sklearn. Luckily, Pandas has a function, get_dummies, that will do this work for us. This function converts all the categorical variables into dummy variables automatically, so we do not need to pass in individual columns one at a time. sklearn has a OneHotEncoder class that does something similar.1
1. sklearn OneHotEncoder documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
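As an aside, a minimal OneHotEncoder sketch might look like the following; this assumes scikit-learn 0.21 or later for the drop parameter. We will stick with get_dummies for the rest of this section.
from sklearn.preprocessing import OneHotEncoder
# drop='first' removes the reference level,
# mirroring drop_first=True in get_dummies
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(tips[['sex', 'smoker', 'day', 'time']])
print(encoded.toarray()[:5])
Returning to get_dummies: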
tips_dummy = pd.get_dummies(
tips[['total_bill', 'size',
'sex', 'smoker', 'day', 'time']])
print(tips_dummy.head())
To drop the reference variable, we can pass in drop_first=True.
x_tips_dummy_ref = pd.get_dummies(
tips[['total_bill', 'size',
'sex', 'smoker', 'day', 'time']], drop_first=True)
print(x_tips_dummy_ref.head())
We fit the model just as we did earlier.
lr = linear_model.LinearRegression()
predicted = lr.fit(X=x_tips_dummy_ref,
y=tips['tip'])
print(predicted.coef_)
We obtain the intercept in the same way.
print(predicted.intercept_)
One of the annoying things when trying to interpret a model from sklearn is that the coefficients are not labeled. The labels are omitted because the numpy ndarray is unable to store this type of metadata. If we want our output to resemble something from statsmodels, we need to manually store the labels and append the coefficients to them.
import numpy as np
# create and fit the model
lr = linear_model.LinearRegression()
predicted = lr.fit(X=x_tips_dummy_ref, y=tips['tip'])
# get the intercept along with other coefficients
values = np.append(predicted.intercept_, predicted.coef_)
# get the names of the values
names = np.append('intercept', x_tips_dummy_ref.columns)
# put everything in a labeled dataframe
results = pd.DataFrame(values, index=names,
                       columns=['coef'])  # you need the square brackets here
print(results)
This chapter introduced the basics of fitting models using the statsmodels and sklearn libraries. The concepts of adding features to a model and creating dummy variables are used constantly when fitting models. Thus far, we have focused on fitting linear models, where the response variable is a continuous variable. In later chapters, we’ll fit models where the response variable is not continuous.