Regularizing priors

Using informative and weakly informative priors is a way of introducing bias into a model and, if done properly, this can be a really good thing, because bias prevents overfitting and thus contributes to a model's ability to make predictions that generalize well. This idea of adding bias to reduce the generalization error, without affecting the ability of the model to adequately fit the data used to train it, is known as regularization. This regularization often takes the form of penalizing larger values of the parameters in a model. This is a way of reducing the information the model is able to represent, and thus reduces the chances that the model captures noise instead of signal.
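For concreteness, here is a minimal sketch, assuming PyMC and some synthetic data, of how a narrow Normal prior on a slope penalizes large values and hence acts as a regularizer; the data and variable names are purely illustrative, not taken from the book's examples:

```python
import numpy as np
import pymc as pm

# Illustrative synthetic data (assumed, for the sketch only)
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=1.0, size=100)

with pm.Model() as regularized_model:
    # A narrow prior on the slope shrinks it toward zero,
    # analogous to a ridge penalty on large coefficients
    beta = pm.Normal('beta', mu=0, sigma=1)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)
    mu = alpha + beta * x
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000)
```

Making the prior on `beta` narrower increases the amount of shrinkage; making it wider weakens the regularization.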

The regularization idea is so powerful and useful that it has been discovered several times, including outside the Bayesian framework. In some fields, this idea is known as Tikhonov regularization. In non-Bayesian statistics, it takes the form of two modifications of the least squares method, known as ridge regression and Lasso regression. From the Bayesian point of view, ridge regression can be interpreted as using normal distributions for the beta coefficients (of a linear model), with a small standard deviation that pushes the coefficients toward zero. In this sense, we have been doing something similar to ridge regression for every single linear model in this book (except the examples in this chapter that use SciPy!). On the other hand, Lasso regression can be interpreted from a Bayesian point of view as the MAP of the posterior computed from a model with Laplace priors for the beta coefficients. The Laplace distribution looks similar to the Gaussian distribution, but its first derivative is undefined at zero because it has a very sharp peak there (see Figure 5.14). The Laplace distribution concentrates its probability mass much closer to zero than the normal distribution does. The idea of using such a prior is to provide both regularization and variable selection: since we have this peak at zero, we expect the prior to induce sparsity, that is, we create a model with a lot of parameters and the prior automatically drives most of them to zero, keeping only the variables that are relevant to the output of the model. Unfortunately, the Bayesian Lasso does not really work like this, basically because in order to push many parameters toward zero, the Laplace prior also forces the non-zero parameters to be small. Fortunately, not everything is lost: there are Bayesian models that can be used to induce sparsity and perform variable selection, like the horseshoe and the Finnish horseshoe.
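To see the Bayesian analogue of Lasso in code, here is a minimal sketch, assuming PyMC and the same illustrative synthetic data as in the previous sketch; the only change from the ridge-like model is swapping the Normal prior on the slope for a Laplace prior:

```python
with pm.Model() as lasso_like_model:
    # Laplace prior: sharp peak at zero concentrates prior mass near zero,
    # the Bayesian counterpart of the Lasso (L1) penalty
    beta = pm.Laplace('beta', mu=0, b=1)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)
    mu = alpha + beta * x
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    idata_laplace = pm.sample(1000)
```

The classical Lasso estimate corresponds to the MAP of this model, while sampling gives the full posterior over the coefficients.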

It is important to notice that the classical versions of ridge and Lasso regression correspond to single point estimates, while the Bayesian versions give a full posterior distribution as a result:

Figure 5.14