import pandas as pd
acs = pd.read_csv('data/acs_ny.csv')
print(acs.columns)
Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms', 'NumChildren',
'NumPeople', 'NumRooms', 'NumUnits', 'NumVehicles', 'NumWorkers',
'OwnRent', 'YearBuilt', 'HouseCosts', 'ElectricBill', 'FoodStamp',
'HeatingFuel', 'Insurance', 'Language'],
dtype='object')
print(acs.head())
Acres FamilyIncome FamilyType NumBedrooms NumChildren NumPeople
0 1-10 150 Married 4 1 3
1 1-10 180 Female Head 3 2 4
2 1-10 280 Female Head 4 0 2
3 1-10 330 Female Head 2 1 2
4 1-10 330 Male Head 3 1 2
NumRooms NumUnits NumVehicles NumWorkers OwnRent YearBuilt
0 9 Single detached 1 0 Mortgage 1950-1959
1 6 Single detached 2 0 Rented Before 1939
2 8 Single detached 3 1 Mortgage 2000-2004
3 4 Single detached 1 0 Rented 1950-1959
4 5 Single attached 1 0 Mortgage Before 1939
HouseCosts ElectricBill FoodStamp HeatingFuel Insurance Language
0 1800 90 No Gas 2500 English
1 850 90 No Oil 0 English
2 2600 260 No Oil 6600 Other European
3 1800 140 No Oil 0 English
4 860 150 No Gas 660 Spanish
To model these data, we first need to create a binary response variable. Here we split the FamilyIncome
variable at 150,000 into a binary variable: 0 for incomes up to 150,000 and 1 for incomes above that.
acs["ge150k"] = pd.cut(
acs["FamilyIncome"],
[0, 150000, acs["FamilyIncome"].max()],
labels=[0, 1],
)
acs["ge150k_i"] = acs["ge150k"].astype(int)
print(acs["ge150k_i"].value_counts())
0 18294
1 4451
Name: ge150k_i, dtype: int64
In so doing, we created a binary (0/1) variable.
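One detail worth checking is the bin edges: pd.cut closes intervals on the right by default, so an income of exactly 150000 is labeled 0 even though the column is named ge150k. The sketch below uses a small made-up data frame (toy, not the ACS data) to compare pd.cut with a direct >= comparison. The column names cut_i and ge_i are hypothetical and only serve the illustration.

```python
import pandas as pd

# hypothetical incomes chosen to sit around the 150,000 boundary
toy = pd.DataFrame({"FamilyIncome": [50, 149_999, 150_000, 150_001, 400_000]})

# same approach as above: bins are (0, 150000] and (150000, max]
toy["cut_i"] = pd.cut(
    toy["FamilyIncome"],
    [0, 150_000, toy["FamilyIncome"].max()],
    labels=[0, 1],
).astype(int)

# alternative: a boolean comparison, which puts exactly 150000 in the 1 group
toy["ge_i"] = (toy["FamilyIncome"] >= 150_000).astype(int)

print(toy)
```

The two columns disagree only for the row with an income of exactly 150,000. Note also that pd.cut returns a missing value for anything at or below the lowest edge (here 0), which would need handling before converting to integers.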
acs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22745 entries, 0 to 22744
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Acres 22745 non-null object
1 FamilyIncome 22745 non-null int64
2 FamilyType 22745 non-null object
3 NumBedrooms 22745 non-null int64
4 NumChildren 22745 non-null int64
5 NumPeople 22745 non-null int64
6 NumRooms 22745 non-null int64
7 NumUnits 22745 non-null object
8 NumVehicles 22745 non-null int64
9 NumWorkers 22745 non-null int64
10 OwnRent 22745 non-null object
11 YearBuilt 22745 non-null object
12 HouseCosts 22745 non-null int64
13 ElectricBill 22745 non-null int64
14 FoodStamp 22745 non-null object
15 HeatingFuel 22745 non-null object
16 Insurance 22745 non-null int64
17 Language 22745 non-null object
18 ge150k 22745 non-null category
19 ge150k_i 22745 non-null int64
dtypes: category(1), int64(11), object(8)
memory usage: 3.3+ MB
Let’s subset our data with just the columns we’ll use for the example.
acs_sub = acs[
[
"ge150k_i",
"HouseCosts",
"NumWorkers",
"OwnRent",
"NumBedrooms",
"FamilyType",
]
].copy()
print(acs_sub)
ge150k_i HouseCosts NumWorkers OwnRent NumBedrooms FamilyType
0 0 1800 0 Mortgage 4 Married
1 0 850 0 Rented 3 Female Head
2 0 2600 1 Mortgage 4 Female Head
3 0 1800 0 Rented 2 Female Head
4 0 860 0 Mortgage 3 Male Head
... ... ... ... ... ... ...
22740 1 1700 2 Mortgage 5 Married
22741 1 1300 2 Mortgage 4 Married
22742 1 410 3 Mortgage 4 Married
22743 1 1600 3 Mortgage 3 Married
22744 1 6500 2 Mortgage 4 Married
[22745 rows x 6 columns]
import statsmodels.formula.api as smf
# we break up the formula string to fit on the page;
# Python joins adjacent string literals into one string
model = smf.logit(
    "ge150k_i ~ HouseCosts + NumWorkers + OwnRent + NumBedrooms"
    " + FamilyType",
    data=acs_sub,
)
results = model.fit()
Optimization terminated successfully.
Current function value: 0.391651
Iterations 7
print(results.summary())
Logit Regression Results
==============================================================================
Dep. Variable: ge150k_i No. Observations: 22745
Model: Logit Df Residuals: 22737
Method: MLE Df Model: 7
Date: Thu, 01 Sep 2022 Pseudo R-squ.: 0.2078
Time: 01:57:02 Log-Likelihood: -8908.1
converged: True LL-Null: -11244.
Covariance Type: nonrobust LLR p-value: 0.000
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept -5.8081 0.120 -48.456 0.000 -6.043 -5.573
OwnRent[T.Outright] 1.8276 0.208 8.782 0.000 1.420 2.236
OwnRent[T.Rented] -0.8763 0.101 -8.647 0.000 -1.075 -0.678
FamilyType[T.Male Head] 0.2874 0.150 1.913 0.056 -0.007 0.582
FamilyType[T.Married] 1.3877 0.088 15.781 0.000 1.215 1.560
HouseCosts 0.0007 1.72e-05 42.453 0.000 0.001 0.001
NumWorkers 0.5873 0.026 22.393 0.000 0.536 0.639
NumBedrooms 0.2365 0.017 13.985 0.000 0.203 0.270
===========================================================================================
import numpy as np
# exponentiate our results
odds_ratios = np.exp(results.params)
print(odds_ratios)
Intercept 0.003003
OwnRent[T.Outright] 6.219147
OwnRent[T.Rented] 0.416310
FamilyType[T.Male Head] 1.332901
FamilyType[T.Married] 4.005636
HouseCosts 1.000731
NumWorkers 1.799117
NumBedrooms 1.266852
dtype: float64
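Odds ratios describe multiplicative changes in the odds, not in the probability. As a rough illustration, the sketch below plugs the rounded coefficients from the summary table into the logistic function for one hypothetical household (married, mortgage, HouseCosts of 1800, 2 workers, 4 bedrooms). The household and the variable names are made up for this example, and because the coefficients are rounded the result is only approximate.

```python
import numpy as np

# rounded coefficients copied from the summary table above
coefs = {
    "Intercept": -5.8081,
    "FamilyType[T.Married]": 1.3877,
    "HouseCosts": 0.0007,
    "NumWorkers": 0.5873,
    "NumBedrooms": 0.2365,
}

# hypothetical household; Mortgage is the reference level, so it adds nothing
household = {
    "Intercept": 1,
    "FamilyType[T.Married]": 1,
    "HouseCosts": 1800,
    "NumWorkers": 2,
    "NumBedrooms": 4,
}

# linear predictor (log-odds), then the inverse logit to get a probability
log_odds = sum(coefs[name] * household[name] for name in coefs)
prob = 1 / (1 + np.exp(-log_odds))

# one more worker multiplies the odds by exp(0.5873), the NumWorkers odds ratio
odds_before = np.exp(log_odds)
odds_after = np.exp(log_odds + coefs["NumWorkers"])

print(f"log-odds: {log_odds:.3f}, probability: {prob:.3f}")
print(f"odds ratio for one more worker: {odds_after / odds_before:.3f}")
```

The ratio printed on the last line is the same for any household, which is why odds ratios are a convenient summary. The change in probability, by contrast, depends on where the household starts.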
print(acs.OwnRent.unique())
['Mortgage' 'Rented' 'Outright']
predictors = pd.get_dummies(acs_sub.iloc[:, 1:], drop_first=True)
print(predictors)
HouseCosts NumWorkers NumBedrooms OwnRent_Outright OwnRent_Rented
0 1800 0 4 0 0
1 850 0 3 0 1
2 2600 1 4 0 0
3 1800 0 2 0 1
4 860 0 3 0 0
... ... ... ... ... ...
22740 1700 2 5 0 0
22741 1300 2 4 0 0
22742 410 3 4 0 0
22743 1600 3 3 0 0
22744 6500 2 4 0 0
FamilyType_Male Head FamilyType_Married
0 0 1
1 0 0
2 0 0
3 0 0
4 1 0
... ... ...
22740 0 1
22741 0 1
22742 0 1
22743 0 1
22744 0 1
[22745 rows x 7 columns]
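To see what drop_first=True does, here is a small made-up example with only the OwnRent levels. The first level in sorted order (Mortgage) gets no column of its own and becomes the reference category, which matches the reference level in the statsmodels summary table. The dtype=int argument is an addition for this sketch, since recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# hypothetical three-row frame, one row per OwnRent level
toy = pd.DataFrame(
    {
        "HouseCosts": [1800, 850, 1200],
        "OwnRent": ["Mortgage", "Rented", "Outright"],
    }
)

# drop_first=True removes the first level (in sorted order) of each
# categorical column; dtype=int asks for 0/1 instead of True/False
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies)
```

A row with zeros in both OwnRent_Outright and OwnRent_Rented therefore represents a household with a mortgage.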
from sklearn import linear_model
lr = linear_model.LogisticRegression()
results = lr.fit(X=predictors, y=acs["ge150k_i"])
/Users/danielchen/.pyenv/versions/3.10.4/envs/pfe_book/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
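The warning itself suggests two remedies: raise max_iter or scale the predictors. The sketch below tries both on synthetic data (not the ACS data) in which one predictor is in the thousands and another is a small count, similar in spirit to HouseCosts and NumWorkers. The data-generating numbers, the variable names, and the choice of max_iter=1000 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000

# synthetic predictors on very different scales
house_costs = rng.uniform(200, 6000, size=n)
num_workers = rng.integers(0, 4, size=n)
X = np.column_stack([house_costs, num_workers])

# synthetic 0/1 response drawn from a logistic model
log_odds = -4 + 0.001 * house_costs + 0.6 * num_workers
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# standardize the predictors, then fit with a larger iteration budget
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

lr_scaled = pipe.named_steps["logisticregression"]
print("iterations used:", lr_scaled.n_iter_)
print("coefficients (per standard deviation):", lr_scaled.coef_)
```

Coefficients from a scaled fit are expressed per standard deviation of each predictor, so they are not directly comparable to unscaled ones. LogisticRegression also applies an L2 penalty by default, so even a converged fit need not match the unpenalized statsmodels estimates exactly.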
As with other scikit-learn linear models, the fitted coefficients are stored in the coef_ attribute.
print(results.coef_)
[[ 5.83764740e-04 7.29381775e-01 2.82543789e-01 7.03519146e-02
-2.11748592e+00 -1.02984936e+00 2.50310160e-01]]
We can get the intercept as well.
print(results.intercept_)
[-4.82088401]
We can print out our results in a more attractive format.
values = np.append(results.intercept_, results.coef_)
# get the names of the values
names = np.append("intercept", predictors.columns)
# put everything in a labeled dataframe
results = pd.DataFrame(
values,
index=names,
columns=["coef"], # you need the square brackets here
)
print(results)
coef
intercept -4.820884
HouseCosts 0.000584
NumWorkers 0.729382
NumBedrooms 0.282544
OwnRent_Outright 0.070352
OwnRent_Rented -2.117486
FamilyType_Male Head -1.029849
FamilyType_Married 0.250310
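To interpret the coefficients as odds ratios, we still need to exponentiate them. Keep in mind that this fit raised a ConvergenceWarning and that LogisticRegression applies an L2 penalty by default, so the values differ from the statsmodels estimates.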
results['or'] = np.exp(results['coef'])
print(results)
coef or
intercept -4.820884 0.008060
HouseCosts 0.000584 1.000584
NumWorkers 0.729382 2.073798
NumBedrooms 0.282544 1.326500
OwnRent_Outright 0.070352 1.072886
OwnRent_Rented -2.117486 0.120334
FamilyType_Male Head -1.029849 0.357061
FamilyType_Married 0.250310 1.284424