Modeling class probabilities via logistic regression

Although the perceptron rule offers a nice and easygoing introduction to machine learning algorithms for classification, its biggest disadvantage is that it never converges if the classes are not perfectly linearly separable. The classification task in the previous section would be an example of such a scenario. Intuitively, the reason is that the weights are continuously being updated, since there is always at least one misclassified sample in each epoch. Of course, you can change the learning rate and increase the number of epochs, but be warned that the perceptron will never converge on this dataset. To make better use of our time, we will now take a look at another simple yet more powerful algorithm for linear and binary classification problems: logistic regression. Note that, in spite of its name, logistic regression is a model for classification, not regression.

Logistic regression intuition and conditional probabilities

Logistic regression is a classification model that is very easy to implement but performs very well on linearly separable classes. It is one of the most widely used algorithms for classification in industry. Similar to the perceptron and Adaline, the logistic regression model in this chapter is also a linear model for binary classification that can be extended to multiclass classification via the OvR technique.

To explain the idea behind logistic regression as a probabilistic model, let's first introduce the odds ratio, which is the odds in favor of a particular event. The odds ratio can be written as $\frac{p}{(1-p)}$, where $p$ stands for the probability of the positive event. The term positive event does not necessarily mean good, but refers to the event that we want to predict, for example, the probability that a patient has a certain disease; we can think of the positive event as class label $y = 1$. We can then further define the logit function, which is simply the logarithm of the odds ratio (log-odds):

$$\text{logit}(p) = \log \frac{p}{(1 - p)}$$

The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real number range, which we can use to express a linear relationship between feature values and the log-odds:

$$\text{logit}\big(p(y = 1 \mid \mathbf{x})\big) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{i=0}^{m} w_i x_i = \mathbf{w}^T \mathbf{x}$$

Here, $p(y = 1 \mid \mathbf{x})$ is the conditional probability that a particular sample belongs to class 1 given its features $\mathbf{x}$.
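
As a quick numerical illustration (the helper function logit below is ours, not part of the text), the log-odds of p = 0.5 are 0, while probabilities close to 0 or 1 map to large negative or positive values:

>>> import numpy as np
>>> def logit(p):
...     return np.log(p / (1.0 - p))
>>> logit(np.array([0.1, 0.5, 0.9]))   # approx. [-2.197, 0.0, 2.197]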

Now, what we are actually interested in is predicting the probability that a certain sample belongs to a particular class, which is the inverse form of the logit function. It is also called the logistic function, sometimes simply abbreviated to the sigmoid function due to its characteristic S-shape:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$

Here, $z$ is the net input, that is, the linear combination of weights and sample features, and can be calculated as $z = \mathbf{w}^T \mathbf{x} = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m$.

Now let's simply plot the sigmoid function for some values in the range -7 to 7 to see what it looks like:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def sigmoid(z):
...     return 1.0 / (1.0 + np.exp(-z))
>>> z = np.arange(-7, 7, 0.1)
>>> phi_z = sigmoid(z)
>>> plt.plot(z, phi_z)
>>> plt.axvline(0.0, color='k')
>>> plt.axhspan(0.0, 1.0, facecolor='1.0', alpha=1.0, ls='dotted')
>>> plt.axhline(y=0.5, ls='dotted', color='k')
>>> plt.yticks([0.0, 0.5, 1.0])
>>> plt.ylim(-0.1, 1.1)
>>> plt.xlabel('z')
>>> plt.ylabel('$\phi(z)$')
>>> plt.show() 

As a result of executing the previous code example, we should now see the S-shaped (sigmoidal) curve:

[Figure: plot of the sigmoid function $\phi(z)$ over the range -7 to 7, showing its S-shaped curve crossing 0.5 at z = 0]

We can see that $\phi(z)$ approaches 1 if $z$ goes towards infinity ($z \to \infty$), since $e^{-z}$ becomes very small for large values of $z$. Similarly, $\phi(z)$ goes towards 0 for $z \to -\infty$ as the result of an increasingly large denominator. Thus, we conclude that this sigmoid function takes real number values as input and transforms them into values in the range [0, 1] with an intercept at $\phi(z) = 0.5$.

To build some intuition for the logistic regression model, we can relate it to our previous Adaline implementation in Chapter 2, Training Machine Learning Algorithms for Classification. In Adaline, we used the identity function $\phi(z) = z$ as the activation function. In logistic regression, this activation function simply becomes the sigmoid function that we defined earlier, which is illustrated in the following figure:

[Figure: comparison of Adaline and logistic regression; logistic regression passes the net input z through the sigmoid activation function instead of the identity function]

The output of the sigmoid function is then interpreted as the probability of a particular sample belonging to class 1, $\phi(z) = P(y = 1 \mid \mathbf{x}; \mathbf{w})$, given its features $\mathbf{x}$ parameterized by the weights $\mathbf{w}$. For example, if we compute $\phi(z) = 0.8$ for a particular flower sample, it means that the chance that this sample is an Iris-Versicolor flower is 80 percent. Similarly, the probability that this flower is an Iris-Setosa flower can be calculated as $P(y = 0 \mid \mathbf{x}; \mathbf{w}) = 1 - P(y = 1 \mid \mathbf{x}; \mathbf{w}) = 0.2$, or 20 percent. The predicted probability can then simply be converted into a binary outcome via a quantizer (unit step function):

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$

If we look at the preceding sigmoid plot, this is equivalent to the following:

$$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0.0 \\ 0 & \text{otherwise} \end{cases}$$
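
In code, this thresholding step amounts to a single comparison; the unit_step helper below is only an illustration of the unit step function, not part of the text's implementation:

>>> def unit_step(z):
...     # class 1 if the net input z is >= 0 (i.e., sigmoid(z) >= 0.5), else class 0
...     return np.where(z >= 0.0, 1, 0)
>>> unit_step(np.array([-2.5, 0.0, 3.1]))   # -> array([0, 1, 1])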

In fact, there are many applications where we are not only interested in the predicted class labels, but where estimating the class-membership probability is particularly useful. Logistic regression is used in weather forecasting, for example, to not only predict if it will rain on a particular day but also to report the chance of rain. Similarly, logistic regression can be used to predict the chance that a patient has a particular disease given certain symptoms, which is why logistic regression enjoys wide popularity in the field of medicine.

Learning the weights of the logistic cost function

You learned how we could use the logistic regression model to predict probabilities and class labels. Now let's briefly talk about the parameters of the model, for example, weights w. In the previous chapter, we defined the sum-squared-error cost function:

$$J(\mathbf{w}) = \sum_{i} \frac{1}{2}\Big(\phi\big(z^{(i)}\big) - y^{(i)}\Big)^2$$

We minimized this in order to learn the weights w for our Adaline classification model. To explain how we can derive the cost function for logistic regression, let's first define the likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another. The formula is as follows:

$$L(\mathbf{w}) = P(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) = \prod_{i=1}^{n} P\big(y^{(i)} \mid x^{(i)}; \mathbf{w}\big) = \prod_{i=1}^{n} \Big(\phi\big(z^{(i)}\big)\Big)^{y^{(i)}} \Big(1 - \phi\big(z^{(i)}\big)\Big)^{1 - y^{(i)}}$$

In practice, it is easier to maximize the (natural) log of this equation, which is called the log-likelihood function:

$$l(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} \Big[ y^{(i)} \log\Big(\phi\big(z^{(i)}\big)\Big) + \big(1 - y^{(i)}\big) \log\Big(1 - \phi\big(z^{(i)}\big)\Big) \Big]$$

Firstly, applying the log function reduces the potential for numerical underflow, which can occur if the likelihoods are very small. Secondly, we can convert the product of factors into a summation of factors, which makes it easier to obtain the derivative of this function via the addition trick, as you may remember from calculus.
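
The numerical advantage is easy to demonstrate: multiplying many small per-sample likelihoods quickly underflows to zero, whereas summing their logarithms stays well within floating-point range. The following is a minimal sketch of that effect (the numbers are arbitrary):

>>> likelihoods = np.full(10000, 0.8)   # 10,000 per-sample likelihoods of 0.8
>>> np.prod(likelihoods)                # the product underflows to 0.0
>>> np.sum(np.log(likelihoods))         # the log-likelihood stays representable (about -2231.4)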

Now we could use an optimization algorithm such as gradient ascent to maximize this log-likelihood function. Alternatively, let's rewrite the log-likelihood as a cost function $J(\mathbf{w})$ that can be minimized using gradient descent as in Chapter 2, Training Machine Learning Algorithms for Classification:

$$J(\mathbf{w}) = \sum_{i=1}^{n} \Big[ -y^{(i)} \log\Big(\phi\big(z^{(i)}\big)\Big) - \big(1 - y^{(i)}\big) \log\Big(1 - \phi\big(z^{(i)}\big)\Big) \Big]$$

To get a better grasp on this cost function, let's take a look at the cost that we calculate for one single-sample instance:

$$J\big(\phi(z), y; \mathbf{w}\big) = -y \log\big(\phi(z)\big) - (1 - y) \log\big(1 - \phi(z)\big)$$

Looking at the preceding equation, we can see that the first term becomes zero if $y = 0$, and the second term becomes zero if $y = 1$, respectively:

$$J\big(\phi(z), y; \mathbf{w}\big) = \begin{cases} -\log\big(\phi(z)\big) & \text{if } y = 1 \\ -\log\big(1 - \phi(z)\big) & \text{if } y = 0 \end{cases}$$

The following plot illustrates the cost for the classification of a single-sample instance for different values of $\phi(z)$:

[Figure: plot of the single-sample cost J(w) against $\phi(z)$ in the range [0, 1], with one curve for y = 1 and a dashed curve for y = 0]

We can see that the cost approaches 0 (plain blue line) if we correctly predict that a sample belongs to class 1. Similarly, we can see on the y axis that the cost also approaches 0 if we correctly predict $y = 0$ (dashed line). However, if the prediction is wrong, the cost goes towards infinity. The moral is that we penalize wrong predictions with an increasingly larger cost.
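
A plot like this one can be reproduced in a few lines by reusing the sigmoid function we defined earlier; the following is a minimal sketch rather than code from the text:

>>> def cost_1(z):
...     return - np.log(sigmoid(z))
>>> def cost_0(z):
...     return - np.log(1 - sigmoid(z))
>>> z = np.arange(-10, 10, 0.1)
>>> phi_z = sigmoid(z)
>>> # cost as a function of the predicted probability phi(z)
>>> plt.plot(phi_z, cost_1(z), label='J(w) if y=1')
>>> plt.plot(phi_z, cost_0(z), linestyle='--', label='J(w) if y=0')
>>> plt.ylim(0.0, 5.1)
>>> plt.xlim([0, 1])
>>> plt.xlabel('$\phi(z)$')
>>> plt.ylabel('J(w)')
>>> plt.legend(loc='best')
>>> plt.show()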

Training a logistic regression model with scikit-learn

If we were to implement logistic regression ourselves, we could simply replace the cost function $J(\mathbf{w})$ in our Adaline implementation from Chapter 2, Training Machine Learning Algorithms for Classification, with the new cost function:

$$J(\mathbf{w}) = -\sum_{i=1}^{n} \Big[ y^{(i)} \log\Big(\phi\big(z^{(i)}\big)\Big) + \big(1 - y^{(i)}\big) \log\Big(1 - \phi\big(z^{(i)}\big)\Big) \Big]$$

This would compute the cost of classifying all training samples per epoch and we would end up with a working logistic regression model. However, since scikit-learn implements a highly optimized version of logistic regression that also supports multiclass settings off-the-shelf, we will skip the implementation and use the sklearn.linear_model.LogisticRegression class as well as the familiar fit method to train the model on the standardized flower training dataset:

>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(C=1000.0, random_state=0)
>>> lr.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std,
...                       y_combined, classifier=lr,
...                       test_idx=range(105,150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()

After fitting the model on the training data, we plotted the decision regions, training samples and test samples, as shown here:

[Figure: decision regions of the logistic regression model on the standardized Iris training and test samples, plotted over petal length and petal width]

Looking at the preceding code that we used to train the LogisticRegression model, you might now be wondering, "What is this mysterious parameter C?" We will get to this in a second, but let's briefly go over the concept of overfitting and regularization in the next subsection first.

Furthermore, we can predict the class-membership probability of the samples via the predict_proba method. For example, we can predict the probabilities of the first sample in the test set:

>>> lr.predict_proba(X_test_std[0,:])

This returns the following array:

array([[  0.000,   0.063,   0.937]])

The preceding array tells us that the model predicts a chance of 93.7 percent that the sample belongs to the Iris-Virginica class, and a 6.3 percent chance that the sample is an Iris-Versicolor flower.
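
If we are only interested in the final class label rather than the individual probabilities, we can use the familiar predict method; equivalently, the index of the largest entry returned by predict_proba (an index into lr.classes_) identifies the predicted class. A quick sketch:

>>> # index of the largest class-membership probability
>>> lr.predict_proba(X_test_std[0, :]).argmax()
>>> # matches the label returned by the predict method
>>> lr.predict(X_test_std[0, :])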

We can show that the weight update in logistic regression via gradient descent is indeed equal to the equation that we used in Adaline in Chapter 2, Training Machine Learning Algorithms for Classification. Let's start by calculating the partial derivative of the log-likelihood function with respect to the jth weight:

$$\frac{\partial}{\partial w_j} l(\mathbf{w}) = \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \frac{\partial}{\partial w_j} \phi(z)$$

Before we continue, let's calculate the partial derivative of the sigmoid function first:

$$\frac{\partial}{\partial z} \phi(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\big(1 + e^{-z}\big)^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \phi(z)\big(1 - \phi(z)\big)$$

Now we can resubstitute $\frac{\partial}{\partial z} \phi(z) = \phi(z)\big(1 - \phi(z)\big)$ in our first equation to obtain the following:

$$\begin{aligned} \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \frac{\partial}{\partial w_j} \phi(z) &= \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \phi(z)\big(1 - \phi(z)\big) \frac{\partial}{\partial w_j} z \\ &= \Big( y \big(1 - \phi(z)\big) - (1 - y)\,\phi(z) \Big) x_j \\ &= \big( y - \phi(z) \big)\, x_j \end{aligned}$$

Remember that the goal is to find the weights that maximize the log-likelihood so that we would perform the update for each weight as follows:

$$w_j := w_j + \eta \sum_{i=1}^{n} \Big( y^{(i)} - \phi\big(z^{(i)}\big) \Big) x_j^{(i)}$$

Since we update all weights simultaneously, we can write the general update rule as follows:

$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}$$

We define $\Delta \mathbf{w}$ as follows:

$$\Delta \mathbf{w} = \eta \nabla l(\mathbf{w})$$

Since maximizing the log-likelihood is equal to minimizing the cost function $J(\mathbf{w})$ that we defined earlier, we can write the gradient descent update rule as follows:

$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \Big( y^{(i)} - \phi\big(z^{(i)}\big) \Big) x_j^{(i)}$$
$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}, \quad \Delta \mathbf{w} = -\eta \nabla J(\mathbf{w})$$

This is equal to the gradient descent rule in Adaline in Chapter 2, Training Machine Learning Algorithms for Classification.
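
To make the connection to Adaline concrete, here is a minimal sketch of a batch gradient descent implementation that uses exactly this update rule for a binary (0/1) class label vector y; the class name LogisticRegressionGD and its parameter values are illustrative and not part of the text:

import numpy as np

class LogisticRegressionGD(object):
    """Logistic regression via batch gradient descent (illustrative sketch)."""

    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta                # learning rate
        self.n_iter = n_iter          # passes over the training set
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # small random weights; w_[0] is the bias unit
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []
        for _ in range(self.n_iter):
            z = np.dot(X, self.w_[1:]) + self.w_[0]                 # net input
            output = 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))   # sigmoid activation
            errors = y - output                                     # (y - phi(z))
            self.w_[1:] += self.eta * X.T.dot(errors)               # weight update
            self.w_[0] += self.eta * errors.sum()                   # bias update
            # logistic cost J(w), recorded to monitor convergence
            cost = -y.dot(np.log(output)) - (1 - y).dot(np.log(1 - output))
            self.cost_.append(cost)
        return self

    def predict(self, X):
        z = np.dot(X, self.w_[1:]) + self.w_[0]
        return np.where(z >= 0.0, 1, 0)   # unit step on the net input

Such a classifier could then be fit to a standardized two-class subset of the training data in the same way as the Adaline classifier from Chapter 2.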

Tackling overfitting via regularization

Overfitting is a common problem in machine learning, where a model performs well on training data but does not generalize well to unseen data (test data). If a model suffers from overfitting, we also say that the model has a high variance, which can be caused by having too many parameters that lead to a model that is too complex given the underlying data. Similarly, our model can also suffer from underfitting (high bias), which means that our model is not complex enough to capture the pattern in the training data well and therefore also suffers from low performance on unseen data.

Although we have only encountered linear models for classification so far, the problem of overfitting and underfitting can be best illustrated by using a more complex, nonlinear decision boundary as shown in the following figure:

[Figure: illustration of underfitting (high bias), a good fit, and overfitting (high variance) via decision boundaries of increasing complexity]

Note

Variance measures the consistency (or variability) of the model prediction for a particular sample instance if we would retrain the model multiple times, for example, on different subsets of the training dataset. We can say that the model is sensitive to the randomness in the training data. In contrast, bias measures how far off the predictions are from the correct values in general if we rebuild the model multiple times on different training datasets; bias is the measure of the systematic error that is not due to randomness.

One way of finding a good bias-variance tradeoff is to tune the complexity of the model via regularization. Regularization is a very useful method to handle collinearity (high correlation among features), filter out noise from data, and eventually prevent overfitting. The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights. The most common form of regularization is the so-called L2 regularization (sometimes also called L2 shrinkage or weight decay), which can be written as follows:

$$\frac{\lambda}{2} \lVert \mathbf{w} \rVert^2 = \frac{\lambda}{2} \sum_{j=1}^{m} w_j^2$$

Here, $\lambda$ is the so-called regularization parameter.

Note

Regularization is another reason why feature scaling such as standardization is important. For regularization to work properly, we need to ensure that all our features are on comparable scales.

In order to apply regularization, we just need to add the regularization term to the cost function that we defined for logistic regression to shrink the weights:

$$J(\mathbf{w}) = \sum_{i=1}^{n} \Big[ -y^{(i)} \log\Big(\phi\big(z^{(i)}\big)\Big) - \big(1 - y^{(i)}\big) \log\Big(1 - \phi\big(z^{(i)}\big)\Big) \Big] + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$$

Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.
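
As a small sketch of what this looks like in code (the helper name regularized_cost and the parameter name lambda_ are illustrative, not from the text), the penalty is simply added to the unregularized logistic cost, and the bias unit is conventionally not penalized:

>>> def regularized_cost(y, output, w, lambda_):
...     # unregularized logistic cost
...     cost = -y.dot(np.log(output)) - (1 - y).dot(np.log(1 - output))
...     # L2 penalty on the weights, excluding the bias unit w[0]
...     return cost + (lambda_ / 2.0) * np.sum(w[1:] ** 2)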

The parameter C that is implemented for the LogisticRegression class in scikit-learn comes from a convention in support vector machines, which will be the topic of the next section. C is directly related to the regularization parameter $\lambda$, which is its inverse:

$$C = \frac{1}{\lambda}$$

So we can rewrite the regularized cost function of logistic regression as follows:

$$J(\mathbf{w}) = C \sum_{i=1}^{n} \Big[ -y^{(i)} \log\Big(\phi\big(z^{(i)}\big)\Big) - \big(1 - y^{(i)}\big) \log\Big(1 - \phi\big(z^{(i)}\big)\Big) \Big] + \frac{1}{2} \lVert \mathbf{w} \rVert^2$$

Consequently, decreasing the value of the inverse regularization parameter C means that we are increasing the regularization strength, which we can visualize by plotting the L2 regularization path for the two weight coefficients:

>>> weights, params = [], []
>>> for c in np.arange(-5, 5):
...     lr = LogisticRegression(C=10**c, random_state=0)
...     lr.fit(X_train_std, y_train)
...     weights.append(lr.coef_[1])
...     params.append(10**c)
>>> weights = np.array(weights)
>>> plt.plot(params, weights[:, 0], 
...          label='petal length')
>>> plt.plot(params, weights[:, 1], linestyle='--', 
...          label='petal width')
>>> plt.ylabel('weight coefficient')
>>> plt.xlabel('C')
>>> plt.legend(loc='upper left')
>>> plt.xscale('log')
>>> plt.show()

By executing the preceding code, we fitted ten logistic regression models with different values for the inverse-regularization parameter C. For the purposes of illustration, we only collected the weight coefficients of the second one-vs-rest classifier (class 1, Iris-Versicolor, versus the rest), via lr.coef_[1]. Remember that we are using the OvR technique for multiclass classification.

As we can see in the resulting plot, the weight coefficients shrink if we decrease the parameter C, that is, if we increase the regularization strength:

[Figure: L2 regularization path, showing the weight coefficients for petal length and petal width shrinking towards zero as C decreases]

Note

Since an in-depth coverage of the individual classification algorithms exceeds the scope of this book, I warmly recommend Dr. Scott Menard's Logistic Regression: From Introductory to Advanced Concepts and Applications, Sage Publications, to readers who want to learn more about logistic regression.
