Y New York ACS Logistic Regression Example

import pandas as pd

acs = pd.read_csv('data/acs_ny.csv')
print(acs.columns)
Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms', 'NumChildren',
       'NumPeople', 'NumRooms', 'NumUnits', 'NumVehicles', 'NumWorkers',
       'OwnRent', 'YearBuilt', 'HouseCosts', 'ElectricBill', 'FoodStamp',
       'HeatingFuel', 'Insurance', 'Language'],
     dtype='object')
print(acs.head())
  Acres  FamilyIncome   FamilyType  NumBedrooms  NumChildren  NumPeople
0  1-10           150      Married            4            1          3
1  1-10           180  Female Head            3            2          4
2  1-10           280  Female Head            4            0          2
3  1-10           330  Female Head            2            1          2
4  1-10           330    Male Head            3            1          2

   NumRooms         NumUnits  NumVehicles  NumWorkers   OwnRent    YearBuilt
0         9  Single detached            1           0  Mortgage    1950-1959
1         6  Single detached            2           0    Rented  Before 1939
2         8  Single detached            3           1  Mortgage    2000-2004
3         4  Single detached            1           0    Rented    1950-1959
4         5  Single attached            1           0  Mortgage  Before 1939

   HouseCosts  ElectricBill FoodStamp HeatingFuel  Insurance        Language
0        1800            90        No         Gas       2500         English
1         850            90        No         Oil          0         English
2        2600           260        No         Oil       6600  Other European
3        1800           140        No         Oil          0         English
4         860           150        No         Gas        660         Spanish

To model these data, we first need a binary response variable. Here we use pd.cut() to split the FamilyIncome variable at $150,000 into two groups, labeled 0 and 1.

acs["ge150k"] = pd.cut(
     acs["FamilyIncome"],
     [0, 150000, acs["FamilyIncome"].max()],
     labels=[0, 1],
)

acs["ge150k_i"] = acs["ge150k"].astype(int)
print(acs["ge150k_i"].value_counts())
0    18294
1     4451
Name: ge150k_i, dtype: int64

In so doing, we created a binary (0/1) variable.
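
One detail worth checking is how the cutoff itself is handled. By default, pd.cut() builds right-closed bins, so an income of exactly 150,000 lands in the lower group. The sketch below uses a small made-up data frame (the `toy` incomes are hypothetical, not from the ACS data) to compare pd.cut() with a direct comparison that makes the boundary rule explicit.

```python
import pandas as pd

# Hypothetical toy incomes chosen to sit on and around the cutoff
toy = pd.DataFrame({"FamilyIncome": [50_000, 150_000, 150_001, 400_000]})

# pd.cut uses right-closed bins by default: (0, 150000] and (150000, max]
toy["cut_flag"] = pd.cut(
    toy["FamilyIncome"],
    [0, 150_000, toy["FamilyIncome"].max()],
    labels=[0, 1],
).astype(int)

# a direct comparison states the boundary rule explicitly (>= 150000)
toy["ge_flag"] = (toy["FamilyIncome"] >= 150_000).astype(int)

print(toy)
```

The two flags differ only for the row that sits exactly on the cutoff. Which rule is appropriate depends on how you want to define the response.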

acs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22745 entries, 0 to 22744
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Acres          22745 non-null  object
 1   FamilyIncome   22745 non-null  int64
 2   FamilyType     22745 non-null  object
 3   NumBedrooms    22745 non-null  int64
 4   NumChildren    22745 non-null  int64
 5   NumPeople      22745 non-null  int64
 6   NumRooms       22745 non-null  int64
 7   NumUnits       22745 non-null  object
 8   NumVehicles    22745 non-null  int64
 9   NumWorkers     22745 non-null  int64
 10  OwnRent        22745 non-null  object
 11  YearBuilt      22745 non-null  object
 12  HouseCosts     22745 non-null  int64
 13  ElectricBill   22745 non-null  int64
 14  FoodStamp      22745 non-null  object
 15  HeatingFuel    22745 non-null  object
 16  Insurance      22745 non-null  int64
 17  Language       22745 non-null  object
 18  ge150k         22745 non-null  category
 19  ge150k_i       22745 non-null  int64
dtypes: category(1), int64(11), object(8)
memory usage: 3.3+ MB

Let’s subset our data with just the columns we’ll use for the example.

acs_sub = acs[
  [
    "ge150k_i",
    "HouseCosts",
    "NumWorkers",
    "OwnRent",
    "NumBedrooms",
    "FamilyType",
  ]
].copy()

print(acs_sub)
   ge150k_i  HouseCosts  NumWorkers  OwnRent  NumBedrooms  FamilyType
0         0        1800           0 Mortgage            4     Married
1         0         850           0   Rented            3 Female Head
2         0        2600           1 Mortgage            4 Female Head
3         0        1800           0   Rented            2 Female Head
4         0         860           0 Mortgage            3   Male Head
...     ...         ...         ...      ...          ...         ...
22740     1        1700           2 Mortgage            5     Married
22741     1        1300           2 Mortgage            4     Married
22742     1         410           3 Mortgage            4     Married
22743     1        1600           3 Mortgage            3     Married
22744     1        6500           2 Mortgage            4     Married
[22745 rows x 6 columns]
import statsmodels.formula.api as smf

# we break up the formula string to fit on the page
model = smf.logit(
    "ge150k_i ~ HouseCosts + NumWorkers + OwnRent + NumBedrooms"
    " + FamilyType",
    data=acs_sub,
)

results = model.fit()
Optimization terminated successfully.
         Current function value: 0.391651
         Iterations 7
print(results.summary())
                              Logit Regression Results
==============================================================================
Dep. Variable:              ge150k_i No. Observations:   22745
Model:                         Logit Df Residuals:       22737
Method:                          MLE Df Model:               7
Date:               Thu, 01 Sep 2022 Pseudo R-squ.:     0.2078
Time:                       01:57:02 Log-Likelihood:   -8908.1
converged:                      True LL-Null:          -11244.
Covariance Type:           nonrobust LLR p-value:        0.000
===========================================================================================
                             coef   std err       z   P>|z|  [0.025  0.975]
-------------------------------------------------------------------------------------------
Intercept                 -5.8081     0.120 -48.456   0.000  -6.043  -5.573
OwnRent[T.Outright]        1.8276     0.208   8.782   0.000   1.420   2.236
OwnRent[T.Rented]         -0.8763     0.101  -8.647   0.000  -1.075  -0.678
FamilyType[T.Male Head]    0.2874     0.150   1.913   0.056  -0.007   0.582
FamilyType[T.Married]      1.3877     0.088  15.781   0.000   1.215   1.560
HouseCosts                 0.0007  1.72e-05  42.453   0.000   0.001   0.001
NumWorkers                 0.5873     0.026  22.393   0.000   0.536   0.639
NumBedrooms                0.2365     0.017  13.985   0.000   0.203   0.270
===========================================================================================
import numpy as np

# exponentiate our results
odds_ratios = np.exp(results.params)
print(odds_ratios)
Intercept                0.003003
OwnRent[T.Outright]      6.219147
OwnRent[T.Rented]        0.416310
FamilyType[T.Male Head]  1.332901
FamilyType[T.Married]    4.005636
HouseCosts               1.000731
NumWorkers               1.799117
NumBedrooms              1.266852
dtype: float64
print(acs.OwnRent.unique())
['Mortgage' 'Rented' 'Outright']
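
OwnRent has three levels but only two appear in the model output, so Mortgage is the reference level that the Outright and Rented odds ratios are compared against. As a minimal sketch of how to read these numbers, the code below re-creates a few of the coefficients by hand (copied from the summary table, rounded to four decimal places) and converts each odds ratio into a percent change in the odds. The `interp` data frame and its column names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# coefficients copied from the summary table above (rounded)
coefs = pd.Series(
    {
        "OwnRent[T.Outright]": 1.8276,
        "OwnRent[T.Rented]": -0.8763,
        "FamilyType[T.Married]": 1.3877,
        "NumWorkers": 0.5873,
        "NumBedrooms": 0.2365,
    }
)

interp = pd.DataFrame({"coef": coefs, "odds_ratio": np.exp(coefs)})

# percent change in the odds for a one-unit increase
# (or, for a dummy variable, versus the reference level)
interp["pct_change_odds"] = (interp["odds_ratio"] - 1) * 100

print(interp.round(3))
```

Read this way, each additional worker multiplies the odds of being in the higher income group by about 1.8, holding the other variables constant, while renting is associated with lower odds than holding a mortgage.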

Y.0.1 With sklearn

predictors = pd.get_dummies(acs_sub.iloc[:, 1:], drop_first=True)
print(predictors)
    HouseCosts NumWorkers NumBedrooms OwnRent_Outright OwnRent_Rented 
0         1800          0           4                0              0
1          850          0           3                0              1
2         2600          1           4                0              0
3         1800          0           2                0              1
4          860          0           3                0              0
...        ...        ...         ...              ...            ...
22740     1700          2           5                0              0
22741     1300          2           4                0              0
22742      410          3           4                0              0
22743     1600          3           3                0              0
22744     6500          2           4                0              0
   FamilyType_Male Head FamilyType_Married
0                     0                  1
1                     0                  0
2                     0                  0
3                     0                  0
4                     1                  0
...                 ...                ...
22740                 0                  1
22741                 0                  1
22742                 0                  1
22743                 0                  1
22744                 0                  1
[22745 rows x 7 columns]
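
To see what drop_first=True does, here is a small sketch on a made-up data frame (the `toy` values are hypothetical). The first level of each categorical column, in sorted order, is dropped and becomes the reference level. The dtype=int argument is only there so the dummy columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# hypothetical three-row frame with one numeric and one categorical column
toy = pd.DataFrame(
    {
        "HouseCosts": [1800, 850, 2600],
        "OwnRent": ["Mortgage", "Rented", "Outright"],
    }
)

# numeric columns pass through; "Mortgage" is dropped as the reference level
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies)
```

A row with zeros in every OwnRent dummy column is a Mortgage row, which matches the reference level used in the statsmodels formula above.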
from sklearn import linear_model

lr = linear_model.LogisticRegression()
results = lr.fit(X=predictors, y=acs["ge150k_i"])
/Users/danielchen/.pyenv/versions/3.10.4/envs/pfe_book/lib/python3.10/
site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
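
The warning suggests two remedies: raising max_iter or scaling the predictors. The sketch below tries both on simulated data, since the point is the pattern rather than the ACS results. The simulated columns, the true coefficients, and the `pipe` name are all assumptions made for this example. Note also that LogisticRegression applies an L2 penalty by default, so its estimates need not match the unpenalized statsmodels fit even once it converges.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# simulated predictors on very different scales (hypothetical data)
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(
    {
        "HouseCosts": rng.integers(100, 7000, size=n),
        "NumWorkers": rng.integers(0, 4, size=n),
    }
)

# simulate a 0/1 response from made-up coefficients
log_odds = -4 + 0.0007 * X["HouseCosts"] + 0.6 * X["NumWorkers"]
y = rng.binomial(1, (1 / (1 + np.exp(-log_odds))).to_numpy())

# standardize the predictors, then fit with a higher iteration limit
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# turn any ConvergenceWarning into an error so a failure is not missed
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    pipe.fit(X, y)

scaled_coefs = pipe.named_steps["logisticregression"].coef_
print(scaled_coefs)
```

Coefficients from a pipeline like this are on the standardized scale (per standard deviation of each predictor), so they are not directly comparable to coefficients fit on the raw columns.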

The fitted coefficients are stored in the model's coef_ attribute, in the same order as the columns of predictors.

print(results.coef_)
[[ 5.83764740e-04  7.29381775e-01 2.82543789e-01 7.03519146e-02
  -2.11748592e+00 -1.02984936e+00 2.50310160e-01]]

We can get the intercept as well.

print(results.intercept_)
[-4.82088401]

We can combine the intercept and coefficients into a labeled DataFrame that is easier to read.

values = np.append(results.intercept_, results.coef_)

# get the names of the values
names = np.append("intercept", predictors.columns)

# put everything in a labeled dataframe
results = pd.DataFrame(
    values,
    index=names,
    columns=["coef"], # you need the square brackets here
)

print(results)
                          coef
intercept            -4.820884
HouseCosts            0.000584
NumWorkers            0.729382
NumBedrooms           0.282544
OwnRent_Outright      0.070352
OwnRent_Rented       -2.117486
FamilyType_Male Head -1.029849
FamilyType_Married    0.250310

To interpret the coefficients as odds ratios, we still need to exponentiate them.

results['or'] = np.exp(results['coef'])
print(results)
                            coef       or
intercept              -4.820884 0.008060
HouseCosts              0.000584 1.000584
NumWorkers              0.729382 2.073798
NumBedrooms             0.282544 1.326500
OwnRent_Outright        0.070352 1.072886
OwnRent_Rented         -2.117486 0.120334
FamilyType_Male Head   -1.029849 0.357061
FamilyType_Married      0.250310 1.284424
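
Odds ratios describe relative changes. To get a predicted probability for a particular household, we can add up the log-odds and pass the total through the logistic function. The sketch below does this by hand with the coefficients from the table above. The `household` values are hypothetical, and because the fit that produced these coefficients raised a convergence warning, the result is only illustrative.

```python
import numpy as np
import pandas as pd

# coefficients copied from the sklearn table above
coef = pd.Series(
    {
        "intercept": -4.820884,
        "HouseCosts": 0.000584,
        "NumWorkers": 0.729382,
        "NumBedrooms": 0.282544,
        "OwnRent_Outright": 0.070352,
        "OwnRent_Rented": -2.117486,
        "FamilyType_Male Head": -1.029849,
        "FamilyType_Married": 0.250310,
    }
)

# hypothetical household: married, has a mortgage, two workers
household = pd.Series(
    {
        "intercept": 1,
        "HouseCosts": 1800,
        "NumWorkers": 2,
        "NumBedrooms": 4,
        "OwnRent_Outright": 0,
        "OwnRent_Rented": 0,
        "FamilyType_Male Head": 0,
        "FamilyType_Married": 1,
    }
)

# linear predictor (log-odds), then the logistic function
log_odds = (coef * household).sum()
prob = 1 / (1 + np.exp(-log_odds))

print(round(log_odds, 3), round(prob, 3))
```

With a fitted sklearn model, the predict_proba() method performs the same calculation for every row of the predictor matrix.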