How to conduct inference with statsmodels

We will illustrate how to use logistic regression with statsmodels using a simple built-in dataset containing quarterly US macro data from 1959–2009 (see the notebook logistic_regression_macro_data.ipynb for details).

The variables and their transformations are listed in the following table:

| Variable | Description                                    | Transformation     |
|----------|------------------------------------------------|--------------------|
| realgdp  | Real gross domestic product                    | Annual Growth Rate |
| realcons | Real personal consumption expenditures         | Annual Growth Rate |
| realinv  | Real gross private domestic investment         | Annual Growth Rate |
| realgovt | Real federal expenditures and gross investment | Annual Growth Rate |
| realdpi  | Real private disposable income                 | Annual Growth Rate |
| m1       | M1 nominal money stock                         | Annual Growth Rate |
| tbilrate | 3-month treasury bill rate                     | Level              |
| unemp    | Seasonally adjusted unemployment rate (%)      | Level              |
| infl     | Inflation rate                                 | Level              |
| realint  | Real interest rate                             | Level              |

To obtain a binary target variable, we compute the 20-quarter rolling average of the annual growth rate of quarterly real GDP. We then assign 1 if current growth exceeds the moving average and 0 otherwise. Finally, we shift the indicator variable to align next quarter's outcome with the current quarter.

We use an intercept and convert the quarter values to dummy variables and train the logistic regression model as follows:

import pandas as pd
import statsmodels.api as sm

# drop_cols holds the raw level columns defined earlier in the notebook
data = pd.get_dummies(data.drop(drop_cols, axis=1), columns=['quarter'], drop_first=True).dropna()
model = sm.Logit(data.target, sm.add_constant(data.drop('target', axis=1)))
result = model.fit()
result.summary()

This produces the following summary for our model with 198 observations and 13 variables, including intercept:

[Logit Regression Results summary table]

The summary indicates that the model has been trained using maximum likelihood and reports the maximized value of the log-likelihood function, -67.9.

The LL-Null value of -136.42 is the maximized value of the log-likelihood function when only an intercept is included. It forms the basis for the pseudo-R² statistic and the Log-Likelihood Ratio (LLR) test.

The pseudo-R² statistic is a substitute for the familiar R² available under least squares. It is computed from the ratio of the maximized log-likelihood functions of the null model m0 and the full model m1 (McFadden's pseudo-R²):

R²_pseudo = 1 − ℓ(m1) / ℓ(m0)

The values vary from 0 (when the model does not improve the likelihood) to 1 where the model fits perfectly and the log-likelihood is maximized at 0. Consequently, higher values indicate a better fit.
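Using the two log-likelihood values reported in the summary, the pseudo-R² can be reproduced directly:

```python
# Log-likelihood values reported in the summary
ll_full = -67.9    # maximized log-likelihood of the full model (m1)
ll_null = -136.42  # log-likelihood with only an intercept (m0)

# McFadden's pseudo-R-squared: 1 - l(m1) / l(m0)
pseudo_r2 = 1 - ll_full / ll_null
print(round(pseudo_r2, 4))  # ≈ 0.5023
```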

The LLR test generally compares a more restricted model to a less restricted one and is computed as:

LLR = −2 (ℓ(m0) − ℓ(m1)) = 2 (ℓ(m1) − ℓ(m0))

The null hypothesis is that the restricted model fits the data just as well, but the low p-value suggests that we can reject this hypothesis and prefer the full model over the null model. This is similar to the F-test for linear regression (where we can likewise use the LLR test when we estimate the model using MLE).

The z-statistic plays the same role as the t-statistic in the linear regression output and is likewise computed as the ratio of the coefficient estimate and its standard error. The p-values indicate the probability of observing a test statistic at least as extreme, assuming the null hypothesis H0 : β = 0 that the respective population coefficient is zero. We can reject this hypothesis for the intercept, realcons, realinv, realgovt, realdpi, and unemp.
