Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Generalized linear models

The generalized linear model is a group of models that try to find the M parameters that form a linear relationship between the labels y_i and the feature vector x⁽ⁱ⁾ that is as follows:

Here, are the errors of the model. The algorithm for finding the parameters tries to minimize the total error of the model defined by the cost function J:

The minimization of J is achieved using an iterative algorithm called batch gradient descent:

Here, α is called learning rate, and it is a trade-off between convergence speed and convergence precision. An alternative algorithm that is called stochastic gradient descent, that is loop for :

The θ_j is updated for each training example i instead of waiting to sum over the entire training set. The last algorithm converges near the minimum of J, typically faster than batch gradient descent, but the final solution may oscillate around the real values of the parameters. The following paragraphs describe the most common model and the corresponding cost function, J.

Linear regression

Linear regression is the simplest algorithm and is based on the model:

The cost function and update rule are:

Ridge regression

Ridge regression, also known as Tikhonov regularization, adds a term to the cost function J such that:

, where λ is the regularization parameter. The additional term has the function needed to prefer a certain set of parameters over all the possible solutions penalizing all the parameters θ_j different from 0. The final set of θ_jshrank around 0, lowering the variance of the parameters but introducing a bias error. Indicating with the superscript l the parameters from the linear regression, the ridge regression parameters are related by the following formula:

This clearly shows that the larger the λ value, the more the ridge parameters are shrunk around 0.

Lasso regression

Lasso regression is an algorithm similar to ridge regression, the only difference being that the regularization term is the sum of the absolute values of the parameters:

Logistic regression

Despite the name, this algorithm is used for (binary) classification problems, so we define the labels. The model is given the so-called logistic function expressed by:

In this case, the cost function is defined as follows:

From this, the update rule is formally the same as linear regression (but the model definition, , is different):

Note that the prediction for a point p, , is a continuous value between 0 and 1. So usually, to estimate the class label, we have a threshold at =0.5 such that:

The logistic regression algorithm is applicable to multiple label problems using the techniques one versus all or one versus one. Using the first method, a problem with K classes is solved by training K logistic regression models, each one assuming the labels of the considered class j as +1 and all the rest as 0. The second approach consists of training a model for each pair of labels ( trained models).

Probabilistic interpretation of generalized linear models

Now that we have seen the generalized linear model, let's find the parameters θ_j that satisfy the relationship:

In the case of linear regression, we can assume as normally distributed with mean 0 and variance σ² such that the probability is Probabilistic interpretation of generalized linear models equivalent to:

Probabilistic interpretation of generalized linear models

Therefore, the total likelihood of the system can be expressed as follows:

In the case of the logistic regression algorithm, we are assuming that the logistic function itself is the probability:

Then the likelihood can be expressed by:

In both cases, it can be shown that maximizing the likelihood is equivalent to minimizing the cost function, so the gradient descent will be the same.

k-nearest neighbours (KNN)

This is a very simple classification (or regression) method in which given a set of feature vectors with corresponding labels y_i, a test point x^(t) is assigned to the label value with the majority of the label occurrences in the K nearest neighbors found, using a distance measure such as the following:

Euclidean:
Manhattan:
Minkowski: (if q=2, this reduces to the Euclidean distance)

In the case of regression, the value y_t is calculated by replacing the majority of occurrences by the average of the labels . The simplest average (or the majority of occurrences) has uniform weights, so each point has the same importance regardless of their actual distance from x^(t). However, a weighted average with weights equal to the inverse distance from x^(t) may be used.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Generalized linear models

Create new playlist

Sign In

Sign Up