12. Linear Models

12.1 Introduction

This part of the book follows the methods described in Jared Lander’s R for Everyone. The rationale is that since you have learned the methods of data manipulation in Python using Pandas, you can save out the cleaned data set if you need to use a method from another analytics language. Also, this part covers many of the basic modeling techniques and serves as an introduction to data analytics/machine learning. Other great references are Andreas Müller and Sarah Guido’s Introduction to Machine Learning with Python and Sebastian Raschka and Vahid Mirjalili’s Python Machine Learning.

12.2 Simple Linear Regression

The goal of linear regression is to draw a straight-line relationship between a response variable (also known as an outcome or dependent variable) and a predictor variable (also known as a feature, covariate, or independent variable).

Let’s take another look at our tips data set.

import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')
print(tips.head())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

In our simple linear regression, we’d like to see how the total_bill relates to or predicts the tip.
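
Before fitting anything, a quick plot can confirm that a straight line is a reasonable description of the data. Here is a minimal sketch using seaborn's regplot (it assumes matplotlib is installed to display the figure).

import matplotlib.pyplot as plt

# scatter plot of tip against total_bill with a fitted regression line
ax = sns.regplot(x='total_bill', y='tip', data=tips)
ax.set_xlabel('total_bill')
ax.set_ylabel('tip')
plt.show()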

12.2.1 Using statsmodels

We can use the statsmodels library to perform our simple linear regression. We will use the formula API from statsmodels.

import statsmodels.formula.api as smf

To perform this simple linear regression, we use the ols function, which computes the ordinary least squares value; it is one method to estimate parameters in a linear regression. Recall that the formula for a line is y = mx + b, where y is our response variable, x is our predictor, b is the intercept, and m is the slope, the parameter we are estimating.

The formula notation has two parts, separated by a tilde, ~. To the left of the tilde is the response variable, and to the right of the tilde is the predictor.

model = smf.ols(formula='tip ~ total_bill', data=tips)

Once we have specified our model, we can fit the data to the model by using the fit method.

results = model.fit()

To look at our results, we can call the summary method on the results.

print(results.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.457
Model:                            OLS   Adj. R-squared:                  0.454
Method:                 Least Squares   F-statistic:                     203.4
Date:                Tue, 12 Sep 2017   Prob (F-statistic):           6.69e-34
Time:                        06:25:09   Log-Likelihood:                -350.54
No. Observations:                 244   AIC:                             705.1
Df Residuals:                     242   BIC:                             712.1
Df Model:                           1   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9203      0.160      5.761      0.000      0.606        1.235
total_bill     0.1050      0.007     14.260      0.000      0.091        0.120
==============================================================================
Omnibus:                       20.185   Durbin-Watson:                   2.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.750
Skew:                           0.443   Prob(JB):                     6.35e-09
Kurtosis:                       4.711   Cond. No.                         53.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.

Here we can see the Intercept of the model and the coefficient for total_bill. We can use these parameters in our formula for the line, y = (0.105)x + 0.920. To interpret these numbers, we say that for every one unit increase in total_bill (i.e., every time the bill increases by a dollar), the tip increases by 0.105 (i.e., 10.5 cents).
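
As a quick sanity check (a sketch, not part of the original output), we can use the fitted results to predict the tip on a hypothetical $30 bill; plugging into the line gives roughly 0.92 + 0.105 * 30, or about 4.07.

# predict the tip for a hypothetical $30 bill
new_bill = pd.DataFrame({'total_bill': [30]})
print(results.predict(new_bill))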

If we just want the coefficients, we can call the params attribute on the results.

print(results.params)

Intercept    0.920270
total_bill   0.105025
dtype: float64

Depending on your field, you may also need to report a confidence interval, which identifies the range of plausible values for the estimated parameter. The 95% confidence interval corresponds to the [0.025 0.975] columns in the summary table. We can also extract these values using the conf_int method.

print(results.conf_int())

                   0         1
Intercept   0.605622  1.234918
total_bill  0.090517  0.119532

12.2.2 Using sklearn

We can also use the sklearn library to fit various machine learning models. To perform the same analysis as in Section 12.2.1, we need to import the linear_model module from this library.

from sklearn import linear_model

We can then create our linear regression object.

# create our LinearRegression object
lr = linear_model.LinearRegression()

Next, we need to specify the predictor, X, and the response, y. To do this, we pass in the columns we want to use for the model.

# note that it is an uppercase X
# and a lowercase y
# this will fail because our X has only 1 variable
predicted = lr.fit(X=tips['total_bill'], y=tips['tip'])

Traceback (most recent call last):
  File "<ipython-input-1-40e6128e301f>", line 2, in <module>
    predicted = lr.fit(X=tips['total_bill'], y=tips['tip'])
ValueError: Expected 2D array, got 1D array instead:
array=[ 16.99  10.34  21.01  23.68  24.59  25.29 ...  22.67  17.82  18.78].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.

Since sklearn is built to take numpy arrays, there will be times when you have to do some data manipulation to pass your dataframe into sklearn. The error message in the preceding output essentially tells us that the array passed in is not in the correct shape. We need to reshape our inputs. Depending on whether we have a single feature (which is the case here) or a single sample (i.e., one observation with multiple features), we will specify reshape(-1, 1) or reshape(1, -1), respectively.

Calling reshape directly on the column will raise either a DeprecationWarning (Pandas 0.17) or a ValueError (Pandas 0.19), depending on the version of Pandas being used. To properly reshape our data, we must use the values attribute (otherwise you may get another error or warning). When we call values on a Pandas dataframe or series, we get the numpy ndarray representation of the data.
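
To make the shapes concrete, here is a small sketch comparing the 1D array sklearn rejects with the 2D, single-column array it expects.

# the column by itself is a 1D array of length 244
print(tips['total_bill'].values.shape)                 # (244,)

# reshape(-1, 1) turns it into a 2D array with a single column
print(tips['total_bill'].values.reshape(-1, 1).shape)  # (244, 1)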

# note that it is an uppercase X
# and a lowercase y
# we fix the data by putting it in the correct shape for sklearn
predicted = lr.fit(X=tips['total_bill'].values.reshape(-1, 1),
                   y=tips['tip'])

Since sklearn works on numpy ndarrays, you may see code that explicitly passes in the numpy vector into the X or y parameter: y=tips['tip'].values.

Unfortunately, sklearn doesn’t provide us with the nice summary tables that statsmodels does. This mainly reflects the different schools of thought—namely, statistics and computer science/machine learning—behind these two libraries. To obtain the coefficients in sklearn, we call the coef_ attribute on the fitted model.

print(predicted.coef_)

[ 0.10502452]

To get the intercept, we call the intercept_ attribute.

predicted.intercept_

0.92026961355467307

Notice that we get the same results as we did with statsmodels. That is, people in our data set are tipping about 10% of their bill amount.
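
As a hedged sketch, we can confirm the fitted sklearn model gives the same answer by predicting the tip for the same hypothetical $30 bill we used with statsmodels; note that predict, like fit, expects a 2D input.

import numpy as np

# predict the tip for a hypothetical $30 bill; predict also wants a 2D array
print(predicted.predict(np.array([[30.0]])))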

12.3 Multiple Regression

In simple linear regression, a continuous response variable is regressed on a single predictor. With multiple regression, we can put multiple predictors into the model.

12.3.1 Using statsmodels

Fitting a multiple regression model to a data set is very similar to fitting a simple linear regression model. Using the formula interface, we “add” the other covariates to the right-hand side.

model = smf.ols(formula='tip ~ total_bill + size',
                data=tips).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.468
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     105.9
Date:                Tue, 12 Sep 2017   Prob (F-statistic):           9.67e-34
Time:                        06:25:10   Log-Likelihood:                -347.99
No. Observations:                 244   AIC:                             702.0
Df Residuals:                     241   BIC:                             712.5
Df Model:                           2   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6689      0.194      3.455      0.001       0.288       1.050
total_bill     0.0927      0.009     10.172      0.000       0.075       0.111
size           0.1926      0.085      2.258      0.025       0.025       0.361
==============================================================================
Omnibus:                       24.753   Durbin-Watson:                   2.100
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.169
Skew:                           0.545   Prob(JB):                     9.43e-11
Kurtosis:                       4.831   Cond. No.                         67.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.

The interpretations are exactly the same as before, although each parameter is interpreted “with all other variables held constant.” That is, for every one unit increase (dollar) in total_bill, the tip increases by 0.09 (i.e., 9 cents) as long as the size of the group does not change.
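
To see what “held constant” means in practice, here is a small sketch (using hypothetical values) that predicts the tip for the same $20 bill at party sizes of 2 and 3; the two predictions differ by the size coefficient, about 0.19.

# same bill, party size differing by one person
new_data = pd.DataFrame({'total_bill': [20, 20], 'size': [2, 3]})
print(model.predict(new_data))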

12.3.2 Using statsmodels With Categorical Variables

To this point, we have used only continuous predictors in our model. If we call the info method on our tips data set, however, we can see that our data includes categorical variables.

print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB
None

When we want to model a categorical variable, we have to create dummy variables. That is, each unique value in the category becomes a new binary feature. For example, sex in our data can hold one of two values, Female or Male.

print(tips.sex.unique())

[Female, Male]
Categories (2, object): [Female, Male]

statsmodels will automatically create dummy variables for us. To avoid multicollinearity, we typically drop one of the dummy variables. That is, if we have a column that indicates whether an individual is female, then we know if the person is not female (in our data), that person must be male. In such a case, we can effectively drop the dummy variable that codes for males and still have the same information.

Here’s the model that uses all the variables in our data.

model = smf.ols(
    formula='tip ~ total_bill + size + sex + smoker + day + time',
    data=tips).fit()

We can see from the summary that statsmodels automatically creates dummy variables as well as drops the reference variable to avoid multicollinearity.

print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.470
Model:                            OLS   Adj. R-squared:                  0.452
Method:                 Least Squares   F-statistic:                     26.06
Date:                Tue, 12 Sep 2017   Prob (F-statistic):           1.20e-28
Time:                        06:25:10   Log-Likelihood:                -347.48
No. Observations:                 244   AIC:                             713.0
Df Residuals:                     235   BIC:                             744.4
Df Model:                           8   Covariance Type:             nonrobust
==================================================================================
                    coef     std err          t      P>|t|     [0.025       0.975]
----------------------------------------------------------------------------------
Intercept         0.5908       0.256      2.310      0.022      0.087       1.095
sex[T.Female]     0.0324       0.142      0.229      0.819     -0.247       0.311
smoker[T.No]      0.0864       0.147      0.589      0.556     -0.202       0.375
day[T.Fri]        0.1623       0.393      0.412      0.680     -0.613       0.937
day[T.Sat]        0.0408       0.471      0.087      0.931     -0.886       0.968
day[T.Sun]        0.1368       0.472      0.290      0.772     -0.793       1.066
time[T.Dinner]   -0.0681       0.445     -0.153      0.878     -0.944       0.808
total_bill        0.0945       0.010      9.841      0.000      0.076       0.113
size              0.1760       0.090      1.966      0.051     -0.000       0.352
==============================================================================
Omnibus:                       27.860   Durbin-Watson:                   2.096
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               52.555
Skew:                           0.607   Prob(JB):                     3.87e-12
Kurtosis:                       4.923   Cond. No.                         281.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.

The interpretation of these parameters is the same as before, but our interpretation of a categorical variable must be stated in relation to the reference variable (i.e., the dummy variable that was dropped from the analysis). For example, the coefficient for sex[T.Female] is 0.0324. We interpret this value in relation to the reference value, Male; that is, we say that when the sex changes from Male to Female, the tip increases by 0.0324 (i.e., about 3 cents), with all other variables held constant. For the day variable:

print(tips.day.unique())

[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]

We see that Thur is missing from the summary, so it is the reference variable against which the other day coefficients are interpreted.
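
If we would rather interpret the day coefficients against a different baseline, the formula interface (via patsy) lets us set the reference level explicitly. A minimal sketch, assuming we want Sun as the reference:

# make Sun the reference level for day instead of Thur
model_day_ref = smf.ols(
    formula="tip ~ total_bill + size + C(day, Treatment(reference='Sun'))",
    data=tips).fit()
print(model_day_ref.params)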

12.3.3 Using sklearn

The syntax for multiple regression in sklearn is very similar to the syntax for simple linear regression with this library. To add more features into the model, we pass in the columns we want to use.

lr = linear_model.LinearRegression()

# since we are performing multiple regression
# we no longer need to reshape our X values
predicted = lr.fit(X=tips[['total_bill', 'size']],
                   y=tips['tip'])
print(predicted.coef_)

[ 0.09271334 0.19259779]

We can get the intercept from the model just as we did earlier.

print(predicted.intercept_)

0.668944740813

12.3.4 Using sklearn With Categorical Variables

We have to manually create our dummy variables for sklearn. Luckily, Pandas has a function, get_dummies, that will do this work for us. This function converts all the categorical variables into dummy variables automatically, so we do not need to pass in individual columns one at a time. sklearn has a OneHotEncoder class that does something similar.1

1. sklearn OneHotEncoder documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

tips_dummy = pd.get_dummies(
    tips[['total_bill', 'size',
          'sex', 'smoker', 'day', 'time']])
print(tips_dummy.head())

   total_bill  size  sex_Male  sex_Female  smoker_Yes  smoker_No  
0       16.99     2         0           1           0          1
1       10.34     3         1           0           0          1
2       21.01     3         1           0           0          1
3       23.68     2         1           0           0          1
4       24.59     4         0           1           0          1

   day_Thur  day_Fri  day_Sat  day_Sun  time_Lunch  time_Dinner
0         0        0        0        1           0            1
1         0        0        0        1           0            1
2         0        0        0        1           0            1
3         0        0        0        1           0            1
4         0        0        0        1           0            1

To drop the reference variable, we can pass in drop_first=True.

x_tips_dummy_ref = pd.get_dummies(
    tips[['total_bill', 'size',
          'sex', 'smoker', 'day', 'time']], drop_first=True)
print(x_tips_dummy_ref.head())

   total_bill  size  sex_Female  smoker_No  day_Fri  day_Sat  
0       16.99     2           1          1        0        0
1       10.34     3           0          1        0        0
2       21.01     3           0          1        0        0
3       23.68     2           0          1        0        0
4       24.59     4           1          1        0        0

   day_Sun  time_Dinner
0        1            1
1        1            1
2        1            1
3        1            1
4        1            1

We fit the model just as we did earlier.

lr = linear_model.LinearRegression()
predicted = lr.fit(X=x_tips_dummy_ref,
                   y=tips['tip'])

print(predicted.coef_)

[ 0.09448701  0.175992    0.03244094  0.08640832  0.1622592   0.04080082
  0.13677854 -0.0681286 ]

We obtain the intercept in the same way as before.

print(predicted.intercept_)

0.590837425951
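
For comparison, here is a hedged sketch of the OneHotEncoder approach mentioned in the footnote to Section 12.3.4; the drop and sparse_output arguments assume a recent sklearn release (older versions spell the latter sparse).

from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical columns, dropping the first level of each
# (drop='first' and sparse_output=False require newer sklearn versions)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(tips[['sex', 'smoker', 'day', 'time']])
print(encoded[:5])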

12.4 Keeping Index Labels From sklearn

One of the annoying things when trying to interpret a model from sklearn is that the coefficients are not labeled. The labels are omitted because the numpy ndarray is unable to store this type of metadata. If we want our output to resemble something from statsmodels, we need to manually store the labels and append the coefficients to them.

import numpy as np

# create and fit the model
lr = linear_model.LinearRegression()
predicted = lr.fit(X=x_tips_dummy_ref, y=tips['tip'])

# get the intercept along with other coefficients
values = np.append(predicted.intercept_, predicted.coef_)

# get the names of the values
names = np.append('intercept', x_tips_dummy_ref.columns)

# put everything in a labeled dataframe
results = pd.DataFrame(values, index=names,
    columns=['coef']  # you need the square brackets here
)

print(results)

                 coef
intercept    0.590837
total_bill   0.094487
size         0.175992
sex_Female   0.032441
smoker_No    0.086408
day_Fri      0.162259
day_Sat      0.040801
day_Sun      0.136779
time_Dinner -0.068129

12.5 Conclusion

This chapter introduced the basics of fitting models using the statsmodels and sklearn libraries. The concepts of adding features to a model and creating dummy variables are constantly used when fitting models. Thus far, we have focused on fitting linear models, where the response variable is a continuous variable. In later chapters, we’ll fit models where the response variable is not a continuous variable.
