Bayesian averaging

So far, we have learned that simply minimizing the loss function (or, equivalently, maximizing the log-likelihood function in the case of a normal distribution) is not enough to develop a machine learning model for a given problem. One has to worry about the model overfitting the training data, which will result in larger prediction errors on new datasets. The main advantage of Bayesian methods is that, in principle, one can get away from this problem without using explicit regularization and without keeping separate datasets for training and validation. This is called Bayesian model averaging and will be discussed here. It is one of the answers to the main question of this chapter: why use Bayesian inference for machine learning?

For this, let's do a full Bayesian treatment of the linear regression problem. Since we only want to explain how Bayesian inference avoids the overfitting problem, we will skip all the mathematical derivations and state only the important results here. For more details, interested readers can refer to the book by Christopher M. Bishop (reference 2 in the References section of this chapter).

The linear regression equation $Y = f(X) + \epsilon$, with the noise $\epsilon$ having a normal distribution with zero mean and variance $\sigma^2$ (equivalently, precision $\beta = 1/\sigma^2$), can be cast in a probability distribution form with $Y$ having a normal distribution with mean $f(X)$ and precision $\beta$. Therefore, linear regression is equivalent to estimating the mean of the normal distribution:

$$P(Y \mid X, \mathbf{w}, \beta) = \mathcal{N}\left(Y \mid f(X), \beta^{-1}\right)$$

Since $f(X) = \mathbf{w}^T B(X)$, where the set of basis functions $B(X)$ is known and we are assuming here that the noise precision $\beta$ is also a known constant, only $\mathbf{w}$ needs to be treated as an uncertain variable for a fully Bayesian treatment.
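For example, with a polynomial basis of order $M-1$, the basis vector would be $B(X) = (1, X, X^2, \ldots, X^{M-1})^T$, so that $f(X) = w_1 + w_2 X + \cdots + w_M X^{M-1}$ and the $M$ components of $\mathbf{w}$ are simply the polynomial coefficients.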

The first step in Bayesian inference is to compute the posterior distribution of the parameter vector $\mathbf{w}$. For this, we assume that the prior distribution of $\mathbf{w}$ is an $M$-dimensional normal distribution (since $\mathbf{w}$ has $M$ components) with mean $\mathbf{m}_0$ and covariance matrix $\mathbf{S}_0$. As we have seen in Chapter 3, Introducing Bayesian Inference, this corresponds to taking a conjugate distribution for the prior:

$$P(\mathbf{w}) = \mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0\right)$$

The corresponding posterior distribution is given by:

$$P(\mathbf{w} \mid \mathbf{Y}) = \mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N\right)$$

Here, $\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \mathbf{B}^T \mathbf{Y}\right)$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \mathbf{B}^T \mathbf{B}$.

Here, $\mathbf{B}$ is an $N \times M$ matrix formed by stacking the basis vectors $B(x_i)^T$, evaluated at the different values of $X$, on top of each other, as shown here:

$$\mathbf{B} = \begin{pmatrix} B_1(x_1) & B_2(x_1) & \cdots & B_M(x_1) \\ B_1(x_2) & B_2(x_2) & \cdots & B_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(x_N) & B_2(x_N) & \cdots & B_M(x_N) \end{pmatrix}$$
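To make the posterior update concrete, here is a minimal Python sketch of these two formulas. The Gaussian basis functions, the synthetic sine data, and all parameter values (N, M, beta, alpha, the basis centers and width) are purely illustrative assumptions, not taken from the text:

import numpy as np

def design_matrix(x, centers, width=1.0):
    # One row per data point, one column per Gaussian basis function B_j(x).
    # The Gaussian basis is an illustrative choice; any known basis works.
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

def posterior(B, y, m0, S0, beta):
    # S_N^{-1} = S_0^{-1} + beta * B^T B   (posterior covariance update)
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + beta * B.T @ B)
    # m_N = S_N (S_0^{-1} m_0 + beta * B^T Y)   (posterior mean update)
    m_N = S_N @ (S0_inv @ m0 + beta * B.T @ y)
    return m_N, S_N

# Illustrative data: y = sin(x) plus Gaussian noise with precision beta.
rng = np.random.default_rng(0)
N, M, beta, alpha = 25, 9, 25.0, 2.0
x = rng.uniform(0.0, 2.0 * np.pi, N)
y = np.sin(x) + rng.normal(0.0, 1.0 / np.sqrt(beta), N)

centers = np.linspace(0.0, 2.0 * np.pi, M)    # centers of the Gaussian basis functions
B = design_matrix(x, centers)                 # the N x M matrix shown above
m0, S0 = np.zeros(M), np.eye(M) / alpha       # zero-mean, isotropic prior
m_N, S_N = posterior(B, y, m0, S0, beta)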

Now that we have the posterior distribution for $\mathbf{w}$ as a closed-form analytical expression, we can use it to predict new values of $Y$. To get an analytical closed-form expression for the predictive distribution of $Y$, we assume that $\mathbf{m}_0 = 0$ and $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$. This corresponds to a prior with zero mean and an isotropic covariance matrix characterized by a single precision parameter $\alpha$. The predictive distribution, or the probability that the prediction for a new value $X = x$ is $y$, is given by:

$$P(y \mid x) = \int P(y \mid x, \mathbf{w}) \, P(\mathbf{w} \mid \mathbf{Y}) \, d\mathbf{w}$$

This equation is the central theme of this section. In the classical or frequentist approach, one estimates a particular value $\hat{\mathbf{w}}$ of the parameter $\mathbf{w}$ from the training dataset and finds the probability of predicting $y$ by simply using $P(y \mid x, \hat{\mathbf{w}})$. This does not address overfitting of the model unless regularization is used. In Bayesian inference, we integrate out the parameter variable $\mathbf{w}$ by using its posterior probability distribution $P(\mathbf{w} \mid \mathbf{Y})$ learned from the data. This averaging removes the need for regularization or for keeping the parameters at an optimal level through the bias-variance tradeoff. This can be seen from the closed-form expression for $P(y \mid x)$, obtained after we substitute the expressions for $P(y \mid x, \mathbf{w})$ and $P(\mathbf{w} \mid \mathbf{Y})$ for the linear regression problem and do the integration. Since both are normal distributions, the integration can be done analytically, resulting in the following simple expression for $P(y \mid x)$:

$$P(y \mid x) = \mathcal{N}\left(y \mid \mathbf{m}_N^T B(x), \sigma_N^2(x)\right)$$

Here, $\sigma_N^2(x) = \dfrac{1}{\beta} + B(x)^T \mathbf{S}_N B(x)$.

This equation implies that the variance of the predictive distribution consists of two terms: one term, $1/\beta$, coming from the inherent noise in the data, and a second term coming from the uncertainty associated with estimating the model parameter $\mathbf{w}$ from data. One can show that as the size of the training data $N$ becomes very large, the second term decreases, and in the limit $N \to \infty$ it becomes zero.
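Continuing the earlier sketch (same hypothetical Gaussian basis and the variables centers, m_N, S_N, and beta defined there), the predictive mean and variance at a new input follow directly from the last two equations; with more training data the second variance term shrinks toward the noise floor $1/\beta$:

def predict(x_new, centers, m_N, S_N, beta, width=1.0):
    b = np.exp(-(x_new - centers) ** 2 / (2.0 * width ** 2))   # basis vector B(x)
    mean = m_N @ b                        # predictive mean  m_N^T B(x)
    var = 1.0 / beta + b @ S_N @ b        # predictive variance 1/beta + B(x)^T S_N B(x)
    return mean, var

mean, var = predict(np.pi / 2.0, centers, m_N, S_N, beta)
print(f"predictive mean {mean:.3f}, variance {var:.4f} (noise floor {1.0 / beta:.4f})")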

The example shown here illustrates the power of Bayesian inference. Since one can take care of uncertainty in the parameter estimation through Bayesian averaging, one does not need to keep separate validation data, and all the data can be used for training. So, a full Bayesian treatment of a problem avoids the overfitting issue. Another major advantage of Bayesian inference, which we will not go into in this section, is the treatment of latent variables in a machine learning model. In the next section, we will give a high-level overview of the various common machine learning tasks.
