Jumping from the logistic function to logistic regression

Now that we have some knowledge of the logistic function, it is easy to map it to the algorithm that stems from it. In logistic regression, the function input becomes the weighted sum of features. Given a data sample with n features, x1, x2, …, xn (x represents the feature vector and x = (x1, x2, …, xn)), and the weights (also called coefficients) of the model, w (w represents the vector (w1, w2, …, wn)), z is expressed as follows:

z = w1x1 + w2x2 + … + wnxn = wᵀx
Also, the model occasionally comes with an intercept (also called bias), w0. In this instance, the preceding linear relationship becomes:

z = w0 + w1x1 + w2x2 + … + wnxn
As for the output y(z), which is in the range of 0 to 1, in the algorithm it becomes the probability, ŷ, of the target being 1, that is, the positive class:

ŷ = P(y = 1 | x) = 1 / (1 + e^(-z)) = 1 / (1 + e^(-wᵀx))
Hence, logistic regression is a probabilistic classifier, similar to the Naïve Bayes classifier.
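
To make the mapping concrete, below is a minimal NumPy sketch of the prediction step for a single sample; the feature values, weights, and intercept are made-up toy numbers, and the sigmoid helper simply implements the logistic function introduced earlier:

>>> import numpy as np
>>> def sigmoid(z):
...     # the logistic function: maps any real-valued z into the range (0, 1)
...     return 1 / (1 + np.exp(-z))
>>> x = np.array([0.5, 1.2, -0.3])    # one sample with n = 3 features (toy values)
>>> w = np.array([1.0, -2.0, 0.5])    # weights w1, w2, w3 (toy values)
>>> w0 = 0.1                          # intercept (bias), also a toy value
>>> z = w0 + np.dot(w, x)             # weighted sum of features
>>> y_hat = sigmoid(z)                # probability of the sample being the positive class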

A logistic regression model or, more specifically, its weight vector w is learned from the training data, with the goal of predicting a positive sample as close to 1 as possible and a negative sample as close to 0 as possible. In mathematical language, the weights are trained so as to minimize the cost defined as the mean squared error (MSE), which measures the average of the squared differences between the truth and the prediction. Given m training samples, (x(1), y(1)), (x(2), y(2)), …, (x(i), y(i)), …, (x(m), y(m)), where y(i) is either 1 (positive class) or 0 (negative class), the cost function J(w) regarding the weights to be optimized is expressed as follows:

J(w) = (1/m) Σ_{i=1}^{m} (ŷ(x(i)) - y(i))^2
However, the preceding cost function is non-convex, which means that, when searching for the optimal w, many local (suboptimal) optima can be found, and the search is not guaranteed to converge to the global optimum.

Examples of a convex function and a non-convex function are plotted below:

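The sketch below uses a simple quadratic as the convex example and a quartic with two local minima as the non-convex example; both functions are arbitrary choices for illustration:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(-2.5, 2.5, 500)
>>> convex = x ** 2                         # convex: a single global minimum
>>> non_convex = x ** 4 - 4 * x ** 2 + x    # non-convex: two local minima
>>> plt.plot(x, convex, label='convex')
>>> plt.plot(x, non_convex, label='non-convex')
>>> plt.legend()
>>> plt.show()
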
To overcome this, the cost function in practice is defined as follows:

J(w) = -(1/m) Σ_{i=1}^{m} [y(i) log(ŷ(x(i))) + (1 - y(i)) log(1 - ŷ(x(i)))]
We can take a closer look at the cost of a single training sample:

j(w) = -y(i) log(ŷ(x(i))) - (1 - y(i)) log(1 - ŷ(x(i)))
If y(i) = 1, when the model predicts correctly (the positive class with 100% probability), the sample cost j is 0; the cost keeps increasing as the predicted probability of the positive class decreases; and when the model incorrectly predicts that there is no chance of the positive class, the cost is infinitely high. We can visualize this as follows:

>>> y_hat = np.linspace(0, 1, 1000)
>>> cost = -np.log(y_hat)        # cost when the true label is 1
>>> plt.plot(y_hat, cost)
>>> plt.xlabel('Prediction')
>>> plt.ylabel('Cost')
>>> plt.xlim(0, 1)
>>> plt.ylim(0, 7)
>>> plt.show()

Refer to the following screenshot for the end result:

On the contrary, if y(i) = 0, when the model predicts correctly (the positive class with 0 probability, or the negative class with 100% probability), the sample cost j is 0; the cost keeps increasing as the predicted probability of the positive class increases; and when the model incorrectly predicts that there is no chance of the negative class, the cost goes infinitely high. We can visualize this using the following code:

>>> y_hat = np.linspace(0, 1, 1000)
>>> cost = -np.log(1 - y_hat)    # cost when the true label is 0
>>> plt.plot(y_hat, cost)
>>> plt.xlabel('Prediction')
>>> plt.ylabel('Cost')
>>> plt.xlim(0, 1)
>>> plt.ylim(0, 7)
>>> plt.show()

The following screenshot is the resultant output:

Minimizing this alternative cost function serves the same goal as minimizing the MSE-based one: driving the predictions as close to the true labels as possible. The advantages of choosing it over MSE include the following:

  • It is convex, so that the optimal model weights can be found
  • It is a summation of the logarithms of the predictions, log(ŷ(x)) or log(1 - ŷ(x)), which simplifies the calculation of its derivative with respect to the weights, which we will talk about later

Due to the logarithmic function, the cost function J(w) is also called logarithmic loss, or simply log loss.
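
As a quick numerical check, here is a minimal NumPy sketch that evaluates the log loss on a handful of made-up predictions; y_true and y_pred are toy values used only for illustration:

>>> import numpy as np
>>> y_true = np.array([1, 1, 0, 0, 1])            # toy ground-truth labels
>>> y_pred = np.array([0.9, 0.6, 0.2, 0.4, 0.8])  # toy predicted probabilities of the positive class
>>> cost = -np.mean(y_true * np.log(y_pred)
...                 + (1 - y_true) * np.log(1 - y_pred))
>>> round(float(cost), 4)
0.3147

The same metric is also available in scikit-learn as sklearn.metrics.log_loss.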
