Model definition – Bayesian logistic regression

As discussed in Chapter 6, Machine Learning Workflow, logistic regression estimates a linear relationship between a set of features and a binary outcome, which is mediated by a sigmoid function to ensure that the model produces probabilities. The frequentist approach resulted in point estimates for the parameters that measure the influence of each feature on the probability that a data point belongs to the positive class, with confidence intervals based on assumptions about the parameter distribution.
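To make the contrast concrete, here is a minimal sketch of the frequentist approach using statsmodels on synthetic data; the data-generating step and column names are illustrative stand-ins, not the code from Chapter 6:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in data for illustration only
rng = np.random.default_rng(42)
hours = rng.normal(40, 10, size=1000)
educ = rng.integers(8, 17, size=1000)
income = rng.binomial(1, 1 / (1 + np.exp(-(-8 + .05 * hours + .4 * educ))))

# Maximum-likelihood fit: point estimates plus confidence intervals
X = sm.add_constant(pd.DataFrame({'hours': hours, 'educ': educ}))
result = sm.Logit(income, X).fit()
print(result.params)      # point estimates of the coefficients
print(result.conf_int())  # 95% confidence intervals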

In contrast, Bayesian logistic regression estimates the full posterior distribution over the parameters. The posterior permits robust estimates of a Bayesian credible interval for each parameter, with the benefit of greater transparency about the model's uncertainty; we will compute such intervals after sampling from the model below.

A probabilistic program consists of observed and unobserved random variables (RVs). As we have discussed, we define the observed RVs via likelihood distributions and unobserved RVs via prior distributions. PyMC3 includes numerous probability distributions for this purpose.
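The distinction can be illustrated with a toy model; the variable names and values here are purely illustrative:

import numpy as np
import pymc3 as pm

y = np.array([1.2, 0.7, 1.9, 1.1])  # observed data

with pm.Model() as toy_model:
    # Unobserved RV: defined via a prior distribution
    mu = pm.Normal('mu', mu=0, sd=10)
    # Observed RV: defined via a likelihood, tied to data with observed=
    obs = pm.Normal('obs', mu=mu, sd=1, observed=y)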

We will use a simple dataset that classifies 30,000 individuals by income, using a threshold of $50K per year. The dataset contains information on age, sex, hours worked, and years of education. Hence, we are modeling the probability that an individual earns more than $50K per year using these features.
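The examples below assume the dataset is available as a pandas DataFrame named data with these columns; the following loading step is a hypothetical sketch with a placeholder file path:

import pandas as pd

data = pd.read_csv('income.csv')  # hypothetical path to the income data
data = data[['age', 'sex', 'hours', 'educ', 'income']].dropna()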

The PyMC3 library makes it straightforward to perform approximate Bayesian inference for logistic regression. Logistic regression models the probability that individual i earns a high income based on k features as p(y_i = 1 | x_i) = σ(β_0 + β_1 * x_i,1 + ... + β_k * x_i,k), where σ(z) = 1 / (1 + exp(-z)) is the sigmoid function.

We will use the with context manager to define a manual_logistic_model that we can refer to later as a probabilistic model:

  1. The RVs for the unobserved parameters, the intercept and two feature coefficients, are given uninformative priors: normal distributions with a mean of 0 and a standard deviation of 100.
  2. The likelihood combines the parameters with the data according to the logistic regression specification.
  3. The outcome is modeled as a Bernoulli RV with the success probability given by the likelihood:
import pymc3 as pm

with pm.Model() as manual_logistic_model:
    # Coefficients as RVs with uninformative priors
    intercept = pm.Normal('intercept', 0, sd=100)
    b1 = pm.Normal('beta_1', 0, sd=100)
    b2 = pm.Normal('beta_2', 0, sd=100)

    # Likelihood transforms the RVs into probabilities p(y=1)
    # according to the logistic regression model
    likelihood = pm.invlogit(intercept + b1 * data.hours + b2 * data.educ)

    # Outcome as a Bernoulli RV with success probability
    # given by the sigmoid function, conditioned on the observed data
    pm.Bernoulli(name='logit', p=likelihood, observed=data.income)
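To obtain the credible intervals discussed earlier, we can sample from the posterior; the sampler settings below are illustrative defaults rather than tuned values:

with manual_logistic_model:
    # Draw posterior samples via MCMC after a tuning phase
    trace = pm.sample(draws=1000, tune=1000)

print(pm.summary(trace))  # posterior means, standard deviations, intervals

pm.summary reports, among other statistics, the bounds of a highest-posterior-density interval for each parameter, which is the Bayesian credible interval described above.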