Y. New York ACS Logistic Regression Example

import pandas as pd

acs = pd.read_csv('data/acs_ny.csv')
print(acs.columns)

Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms', 'NumChildren',
       'NumPeople', 'NumRooms', 'NumUnits', 'NumVehicles', 'NumWorkers',
       'OwnRent', 'YearBuilt', 'HouseCosts', 'ElectricBill', 'FoodStamp',
       'HeatingFuel', 'Insurance', 'Language'],
     dtype='object')

print(acs.head())

  Acres  FamilyIncome   FamilyType  NumBedrooms  NumChildren  NumPeople
0  1-10           150      Married            4            1          3
1  1-10           180  Female Head            3            2          4
2  1-10           280  Female Head            4            0          2
3  1-10           330  Female Head            2            1          2
4  1-10           330    Male Head            3            1          2

   NumRooms         NumUnits  NumVehicles  NumWorkers   OwnRent    YearBuilt
0         9  Single detached            1           0  Mortgage    1950-1959
1         6  Single detached            2           0    Rented  Before 1939
2         8  Single detached            3           1  Mortgage    2000-2004
3         4  Single detached            1           0    Rented    1950-1959
4         5  Single attached            1           0  Mortgage  Before 1939

   HouseCosts  ElectricBill  FoodStamp  HeatingFuel  Insurance        Language
0        1800            90         No          Gas       2500         English
1         850            90         No          Oil          0         English
2        2600           260         No          Oil       6600  Other European
3        1800           140         No          Oil          0         English
4         860           150         No          Gas        660         Spanish

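To model these data, we first need a binary response variable. Here we use pd.cut() to split FamilyIncome at 150,000: with the default right-closed bins, incomes up to and including 150,000 are labeled 0 and incomes above 150,000 are labeled 1.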

acs["ge150k"] = pd.cut(
     acs["FamilyIncome"],
     [0, 150000, acs["FamilyIncome"].max()],
     labels=[0, 1],
)

acs["ge150k_i"] = acs["ge150k"].astype(int)
print(acs["ge150k_i"].value_counts())

0    18294
1     4451
Name: ge150k_i, dtype: int64

In so doing, we created a binary (0/1) variable.
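
As a quick check of how the bin edges behave, here is a minimal sketch on a few made-up incomes (the values are hypothetical and chosen to sit on and around the cutoff). It uses the same pd.cut() pattern as above and compares the result with a plain boolean comparison, which is an alternative and not the approach used in this example.

```python
import pandas as pd

# hypothetical incomes, chosen to sit on and around the 150,000 cutoff
income = pd.Series([50, 150_000, 150_001, 200_000])

# same pattern as above: the bins are (0, 150000] and (150000, max]
binned = pd.cut(
    income,
    [0, 150_000, income.max()],
    labels=[0, 1],
).astype(int)

# alternative: a direct comparison, converted from True/False to 1/0
compared = (income > 150_000).astype(int)

print(binned.tolist())
print(compared.tolist())
```

Both approaches label an income of exactly 150,000 as 0. They would differ for an income of 0, which falls outside the first bin of pd.cut() and would become a missing value.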

acs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22745 entries, 0 to 22744
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Acres          22745 non-null  object
 1   FamilyIncome   22745 non-null  int64
 2   FamilyType     22745 non-null  object
 3   NumBedrooms    22745 non-null  int64
 4   NumChildren    22745 non-null  int64
 5   NumPeople      22745 non-null  int64
 6   NumRooms       22745 non-null  int64
 7   NumUnits       22745 non-null  object
 8   NumVehicles    22745 non-null  int64
 9   NumWorkers     22745 non-null  int64
 10  OwnRent        22745 non-null  object
 11  YearBuilt      22745 non-null  object
 12  HouseCosts     22745 non-null  int64
 13  ElectricBill   22745 non-null  int64
 14  FoodStamp      22745 non-null  object
 15  HeatingFuel    22745 non-null  object
 16  Insurance      22745 non-null  int64
 17  Language       22745 non-null  object
 18  ge150k         22745 non-null  category
 19  ge150k_i       22745 non-null  int64
dtypes: category(1), int64(11), object(8)
memory usage: 3.3+ MB

Let’s subset our data with just the columns we’ll use for the example.

acs_sub = acs[
  [
    "ge150k_i",
    "HouseCosts",
    "NumWorkers",
    "OwnRent",
    "NumBedrooms",
    "FamilyType",
  ]
].copy()

print(acs_sub)

   ge150k_i  HouseCosts  NumWorkers  OwnRent  NumBedrooms  FamilyType
0         0        1800           0 Mortgage            4     Married
1         0         850           0   Rented            3 Female Head
2         0        2600           1 Mortgage            4 Female Head
3         0        1800           0   Rented            2 Female Head
4         0         860           0 Mortgage            3   Male Head
...     ...         ...         ...      ...          ...         ...
22740     1        1700           2 Mortgage            5     Married
22741     1        1300           2 Mortgage            4     Married
22742     1         410           3 Mortgage            4     Married
22743     1        1600           3 Mortgage            3     Married
22744     1        6500           2 Mortgage            4     Married

[22745 rows x 6 columns]

import statsmodels.formula.api as smf

# we split the formula across two adjacent string literals to fit on
# the page; Python joins them into a single string
model = smf.logit(
    "ge150k_i ~ HouseCosts + NumWorkers + OwnRent + NumBedrooms"
    " + FamilyType",
    data=acs_sub,
)

results = model.fit()

Optimization terminated successfully.
         Current function value: 0.391651
         Iterations 7

print(results.summary())

                              Logit Regression Results
==============================================================================
Dep. Variable:              ge150k_i No. Observations:   22745
Model:                         Logit Df Residuals:       22737
Method:                          MLE Df Model:               7
Date:               Thu, 01 Sep 2022 Pseudo R-squ.:     0.2078
Time:                       01:57:02 Log-Likelihood:   -8908.1
converged:                      True LL-Null:          -11244.
Covariance Type:           nonrobust LLR p-value:        0.000
===========================================================================================
                             coef   std err       z   P>|z|  [0.025  0.975]
-------------------------------------------------------------------------------------------
Intercept                 -5.8081     0.120 -48.456   0.000  -6.043  -5.573
OwnRent[T.Outright]        1.8276     0.208   8.782   0.000   1.420   2.236
OwnRent[T.Rented]         -0.8763     0.101  -8.647   0.000  -1.075  -0.678
FamilyType[T.Male Head]    0.2874     0.150   1.913   0.056  -0.007   0.582
FamilyType[T.Married]      1.3877     0.088  15.781   0.000   1.215   1.560
HouseCosts                 0.0007  1.72e-05  42.453   0.000   0.001   0.001
NumWorkers                 0.5873     0.026  22.393   0.000   0.536   0.639
NumBedrooms                0.2365     0.017  13.985   0.000   0.203   0.270
===========================================================================================

import numpy as np

# exponentiate our results
odds_ratios = np.exp(results.params)
print(odds_ratios)

Intercept                0.003003
OwnRent[T.Outright]      6.219147
OwnRent[T.Rented]        0.416310
FamilyType[T.Male Head]  1.332901
FamilyType[T.Married]    4.005636
HouseCosts               1.000731
NumWorkers               1.799117
NumBedrooms              1.266852
dtype: float64

print(acs.OwnRent.unique())

['Mortgage' 'Rented' 'Outright']
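
Mortgage does not appear in the coefficient table, so it is the reference level for OwnRent. To make the coefficients more concrete, the sketch below turns them into a predicted probability for one hypothetical household (married, with a mortgage, 1,800 in house costs, 2 workers, 4 bedrooms). The coefficients are copied from the printed summary, so they are rounded, and the HouseCosts value in particular is only given to four decimal places; treat the result as approximate.

```python
import numpy as np

# coefficients copied from the statsmodels summary above (rounded as printed)
coefs = {
    "Intercept": -5.8081,
    "OwnRent[T.Outright]": 1.8276,
    "OwnRent[T.Rented]": -0.8763,
    "FamilyType[T.Male Head]": 0.2874,
    "FamilyType[T.Married]": 1.3877,
    "HouseCosts": 0.0007,
    "NumWorkers": 0.5873,
    "NumBedrooms": 0.2365,
}

# hypothetical household; terms left out are 0 (Mortgage is the reference
# level for OwnRent, so no OwnRent term is needed)
household = {
    "Intercept": 1,
    "FamilyType[T.Married]": 1,
    "HouseCosts": 1800,
    "NumWorkers": 2,
    "NumBedrooms": 4,
}

# linear predictor on the log-odds scale
log_odds = sum(coefs[name] * value for name, value in household.items())

# inverse logit converts log-odds to a probability
probability = 1 / (1 + np.exp(-log_odds))

print(f"log-odds: {log_odds:.4f}, probability: {probability:.3f}")
```

With a fitted statsmodels results object, results.predict() on a dataframe with the same columns does this calculation using the unrounded coefficients.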

Y.0.1 With sklearn

predictors = pd.get_dummies(acs_sub.iloc[:, 1:], drop_first=True)
print(predictors)

    HouseCosts NumWorkers NumBedrooms OwnRent_Outright OwnRent_Rented 
0         1800          0           4                0              0
1          850          0           3                0              1
2         2600          1           4                0              0
3         1800          0           2                0              1
4          860          0           3                0              0
...        ...        ...         ...              ...            ...
22740     1700          2           5                0              0
22741     1300          2           4                0              0
22742      410          3           4                0              0
22743     1600          3           3                0              0
22744     6500          2           4                0              0

   FamilyType_Male Head FamilyType_Married
0                     0                  1
1                     0                  0
2                     0                  0
3                     0                  0
4                     1                  0
...                 ...                ...
22740                 0                  1
22741                 0                  1
22742                 0                  1
22743                 0                  1
22744                 0                  1

[22745 rows x 7 columns]

from sklearn import linear_model

lr = linear_model.LogisticRegression()

results = lr.fit(X=predictors, y=acs["ge150k_i"])

/Users/danielchen/.pyenv/versions/3.10.4/envs/pfe_book/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
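
The warning suggests two remedies: raise max_iter or scale the data. The sketch below tries both on simulated data, since it is meant to run on its own; the variable names, the simulated coefficients, and the choice of C=1e6 (a very weak penalty, used here to approximate an unpenalized fit) are all assumptions for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000

# hypothetical predictors on very different scales
house_costs = rng.uniform(200, 7000, size=n)
num_workers = rng.integers(0, 4, size=n)
X = np.column_stack([house_costs, num_workers])

# simulate a 0/1 outcome from a known logistic model
log_odds = -4.0 + 0.0007 * house_costs + 0.6 * num_workers
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# scale the predictors, allow more iterations, and weaken the L2 penalty
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1e6, max_iter=1000),
)
pipe.fit(X, y)

scaler = pipe.named_steps["standardscaler"]
lr = pipe.named_steps["logisticregression"]

# convert the coefficients back to the original (unscaled) units
coef_original = lr.coef_[0] / scaler.scale_
intercept_original = lr.intercept_[0] - np.sum(
    lr.coef_[0] * scaler.mean_ / scaler.scale_
)

print("iterations used:", lr.n_iter_[0])
print("coefficients:", coef_original)
print("intercept:", intercept_original)
```

Scaling changes the units of the coefficients, which is why the last step divides by the scaler's standard deviations before comparing them with coefficients fit on the raw columns.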

Even with the convergence warning, the fitted model stores its coefficients in the coef_ attribute.

print(results.coef_)

[[ 5.83764740e-04  7.29381775e-01 2.82543789e-01 7.03519146e-02
  -2.11748592e+00 -1.02984936e+00 2.50310160e-01]]

We can get the intercept as well.

print(results.intercept_)

[-4.82088401]

We can print out our results in a more attractive format.

values = np.append(results.intercept_, results.coef_)

# get the names of the values
names = np.append("intercept", predictors.columns)

# put everything in a labeled dataframe
results = pd.DataFrame(
    values,
    index=names,
    columns=["coef"], # you need the square brackets here
)

print(results)

                          coef
intercept            -4.820884
HouseCosts            0.000584
NumWorkers            0.729382
NumBedrooms           0.282544
OwnRent_Outright      0.070352
OwnRent_Rented       -2.117486
FamilyType_Male Head -1.029849
FamilyType_Married    0.250310
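
These values do not match the statsmodels estimates. A minimal sketch that lines the two sets of printed coefficients up side by side is shown below; the numbers are copied from the outputs above, and the statsmodels term names are renamed by hand to match the dummy column names. The gap is consistent with the solver stopping before convergence and with the L2 penalty that LogisticRegression applies by default, though the output alone does not tell us how much each contributes.

```python
import pandas as pd

# coefficients copied from the statsmodels summary (rounded as printed)
statsmodels_coef = pd.Series({
    "intercept": -5.8081,
    "HouseCosts": 0.0007,
    "NumWorkers": 0.5873,
    "NumBedrooms": 0.2365,
    "OwnRent_Outright": 1.8276,
    "OwnRent_Rented": -0.8763,
    "FamilyType_Male Head": 0.2874,
    "FamilyType_Married": 1.3877,
})

# coefficients copied from the sklearn output above
sklearn_coef = pd.Series({
    "intercept": -4.820884,
    "HouseCosts": 0.000584,
    "NumWorkers": 0.729382,
    "NumBedrooms": 0.282544,
    "OwnRent_Outright": 0.070352,
    "OwnRent_Rented": -2.117486,
    "FamilyType_Male Head": -1.029849,
    "FamilyType_Married": 0.250310,
})

# align on the shared labels and compute the gap
comparison = pd.DataFrame(
    {"statsmodels": statsmodels_coef, "sklearn": sklearn_coef}
)
comparison["difference"] = comparison["sklearn"] - comparison["statsmodels"]

print(comparison)
```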

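To interpret the coefficients as odds ratios, we still need to exponentiate them. Keep in mind that these estimates differ from the statsmodels ones: the lbfgs solver stopped before converging, and LogisticRegression applies L2 regularization by default.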

results['or'] = np.exp(results['coef'])
print(results)

                            coef       or
intercept              -4.820884 0.008060
HouseCosts              0.000584 1.000584
NumWorkers              0.729382 2.073798
NumBedrooms             0.282544 1.326500
OwnRent_Outright        0.070352 1.072886
OwnRent_Rented         -2.117486 0.120334
FamilyType_Male Head   -1.029849 0.357061
FamilyType_Married      0.250310 1.284424
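
An odds ratio is always stated per one-unit change in the predictor, which is why the HouseCosts value looks so close to 1: it describes a change of a single unit of house cost. The sketch below rescales it to a 1,000-unit change, using the coefficient printed in the table above; the 1,000-unit step is an arbitrary choice for illustration.

```python
import numpy as np

# HouseCosts coefficient copied from the table above (per one unit of cost)
house_costs_coef = 0.000584

# odds ratio for a one-unit increase
or_per_unit = np.exp(house_costs_coef)

# odds ratio for a 1,000-unit increase: multiply the coefficient first,
# then exponentiate (equivalent to raising or_per_unit to the 1,000th power)
or_per_thousand = np.exp(house_costs_coef * 1000)

print(f"per unit:  {or_per_unit:.6f}")
print(f"per 1,000: {or_per_thousand:.4f}")
```

On this scale the HouseCosts effect is comparable in size to the other odds ratios in the table, which a per-unit value hides.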
