Poisson regression and ZIP regression

The ZIP model may look a little dull, but sometimes we need to estimate simple distributions such as this one, or others such as the Poisson or Gaussian distributions. We can also use Poisson or ZIP distributions as part of a linear model. As we saw with the logistic and softmax regressions, we can use an inverse link function to transform the result of a linear model into a variable in a range suitable for distributions other than the normal. Following the same idea, we can now perform a regression analysis where the output variable is a count variable, using a Poisson or a ZIP distribution. This time, we can use the exponential function, $e$, as the inverse link function. This choice guarantees that the values obtained after transforming the linear model are always positive:

$$\theta = e^{(\alpha + X\beta)}$$
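As a minimal sketch of why the exponential works as an inverse link, the snippet below pushes a few linear predictors, some of them negative, through the exponential. The function name `poisson_rate` and all coefficient values are hypothetical and chosen only for illustration; it uses only the standard library.

```python
import math


def poisson_rate(alpha, betas, xs):
    """Map a linear predictor to a positive rate via the exponential inverse link."""
    linear = alpha + sum(b * x for b, x in zip(betas, xs))
    return math.exp(linear)


# Hypothetical coefficients and predictor values (not estimated from any data).
for alpha in (-3.0, 0.0, 2.5):
    rate = poisson_rate(alpha, betas=[-0.8, 0.4], xs=[2, 1])
    print(f"alpha={alpha:+.1f} -> rate={rate:.4f}")
```

Whatever the sign of the linear predictor, the resulting rate is strictly positive, which is what the Poisson and ZIP distributions require.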

To exemplify a ZIP-regression model implementation, we are going to work with a dataset taken from the Institute for Digital Research and Education (http://www.ats.ucla.edu/stat/data). We have 250 groups of visitors to a park. For each group, the data includes the following:

  • The number of fish they caught (count)
  • How many children were in the group (child)
  • Whether they brought a camper to the park (camper)

Using this data, we are going to build a model that predicts the number of caught fish as a function of the child and camper variables. We can use pandas to load the data:

fish_data = pd.read_csv('../data/fish.csv')

I leave it as an exercise for you to explore the dataset using plots and/or a pandas function such as describe. For now, we are going to continue by implementing the ZIP_reg PyMC3 model:

with pm.Model() as ZIP_reg:
    # Priors: ψ is the mixture weight, α the intercept, β the two slopes
    ψ = pm.Beta('ψ', 1, 1)
    α = pm.Normal('α', 0, 10)
    β = pm.Normal('β', 0, 10, shape=2)
    # Exponential inverse link keeps the Poisson rate θ positive
    θ = pm.math.exp(α + β[0] * fish_data['child'] + β[1] * fish_data['camper'])
    yl = pm.ZeroInflatedPoisson('yl', ψ, θ, observed=fish_data['count'])
    trace_ZIP_reg = pm.sample(1000)
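To make the likelihood in this model more concrete, here is a minimal sketch of the zero-inflated Poisson probability mass function. It assumes the convention in which ψ is the probability that an observation comes from the Poisson process and 1 - ψ is the probability of an extra zero. The function name `zip_pmf` and the parameter values are hypothetical.

```python
import math


def zip_pmf(k, psi, theta):
    """Zero-inflated Poisson pmf; psi is the weight of the Poisson component."""
    poisson = math.exp(-theta) * theta**k / math.factorial(k)
    if k == 0:
        # Zeros come from the extra-zero process or from the Poisson itself.
        return (1 - psi) + psi * poisson
    return psi * poisson


psi, theta = 0.6, 2.0  # hypothetical values, for illustration only
total = sum(zip_pmf(k, psi, theta) for k in range(60))
print(total)                                      # numerically very close to 1
print(zip_pmf(0, psi, theta), math.exp(-theta))   # ZIP puts more mass on zero
```

The second line of output compares the probability of a zero under the ZIP with that of a plain Poisson with the same rate; the ZIP value is larger whenever ψ is below 1.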

Camper is a binary variable with a value of 0 for not-camper and 1 for camper. A variable indicating the absence/presence of an attribute is usually called a dummy variable or indicator variable. Note that when camper takes the value of 0, the term involving $\beta_1$ will also be 0 and the model reduces to a regression with a single independent variable.

To better understand the results of our inference, let's do a plot:

children = [0, 1, 2, 3, 4]
fish_count_pred_0 = []
fish_count_pred_1 = []
for n in children:
    without_camper = trace_ZIP_reg['α'] + trace_ZIP_reg['β'][:,0] * n
    with_camper = without_camper + trace_ZIP_reg['β'][:,1]
    fish_count_pred_0.append(np.exp(without_camper))
    fish_count_pred_1.append(np.exp(with_camper))


plt.plot(children, fish_count_pred_0, 'C0.', alpha=0.01)
plt.plot(children, fish_count_pred_1, 'C1.', alpha=0.01)

plt.xticks(children);
plt.xlabel('Number of children')
plt.ylabel('Fish caught')
plt.plot([], 'C0o', label='without camper')
plt.plot([], 'C1o', label='with camper')
plt.legend()

Figure 4.12

From Figure 4.12, we can see that the higher the number of children, the lower the number of fish caught. Also, people who travel with a camper generally catch more fish. If you check the $\beta$ coefficients for child and camper, you will see that we can say the following:

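  • For each additional child, the expected count of the fish caught is multiplied by $e^{\beta_0}$, a factor below 1, so it decreases
  • Being in a camper multiplies the expected count of the fish caught by $e^{\beta_1}$, a factor above 1, so it increases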

We arrive at these values by taking the exponential of the $\beta_0$ (child) and $\beta_1$ (camper) coefficients, respectively.
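The reason exponentiated coefficients can be read as multiplicative effects can be checked with a small sketch. The coefficient values below are hypothetical and are not the estimates obtained from the fish data; the function name `expected_rate` is also an assumption made for this illustration.

```python
import math

# Hypothetical coefficient values, NOT estimates from the fish data.
alpha, beta_child, beta_camper = 1.0, -0.5, 0.7


def expected_rate(child, camper):
    """Poisson rate for a group, using the exponential inverse link."""
    return math.exp(alpha + beta_child * child + beta_camper * camper)


# Adding one child multiplies the rate by exp(beta_child).
child_ratio = expected_rate(3, 0) / expected_rate(2, 0)
# Switching camper from 0 to 1 multiplies the rate by exp(beta_camper).
camper_ratio = expected_rate(2, 1) / expected_rate(2, 0)

print(child_ratio, math.exp(beta_child))
print(camper_ratio, math.exp(beta_camper))
```

Because the linear terms sit inside an exponential, a one-unit change in a predictor adds to the exponent and therefore multiplies the rate. In the ZIP model the expected count is ψ times this rate, and since ψ does not depend on the predictors here, the same multiplicative factors apply to the expected count.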
