Classification with logistic regression

Logistic regression is a classification algorithm (see http://en.wikipedia.org/wiki/Logistic_regression). It can be used to predict the probability of a class or of an event occurring. A classification problem with multiple classes can be reduced to a set of binary classification problems. In the binary case, a high probability for one class means a low probability for the other. Logistic regression is based on the logistic function, whose values lie between 0 and 1, just like probabilities. The logistic function can therefore be used to transform arbitrary real values into probabilities.
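To make the role of the logistic function concrete, here is a minimal sketch using NumPy. The function name logistic and the sample scores are illustrative assumptions, not part of the book's code bundle:

```python
import numpy as np


def logistic(z):
    """Map arbitrary real values to the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical raw scores: strongly negative, neutral, strongly positive.
scores = np.array([-4.0, 0.0, 4.0])
p = logistic(scores)   # probability of the first class
p_other = 1.0 - p      # probability of the other class in the binary case
print(p, p_other)
```

A score of 0 maps to a probability of 0.5, and in the binary case the two class probabilities always sum to 1, which is why a high probability for one class implies a low probability for the other.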

We can define a function that performs classification with logistic regression. Create a classifier object as follows:

clf = LogisticRegression(random_state=12)

The random_state parameter acts like a seed for a pseudorandom generator. We touched upon the importance of cross-validation earlier in this book as a technique that helps guard against overfitting. k-fold cross-validation is a form of cross-validation in which the data is split into k partitions called folds, where k is a small integer. In k iterations, each fold is used once for validation and the rest of the data is used for training. The scikit-learn classes used here have a default k value of 3 (newer releases default to 5), but typically we may want to set it to a higher value such as 5 or 10. The results of the iterations can be combined at the end, for example by averaging the scores. scikit-learn has a KFold utility class for k-fold cross-validation. Create a KFold object with 10 folds as follows:

kf = KFold(len(y), n_folds=10)
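To illustrate what the folds look like, the following sketch builds train and test index arrays by hand with NumPy. It is not how scikit-learn implements KFold; the helper kfold_indices is a hypothetical name, and the folds here are contiguous and unshuffled:

```python
import numpy as np


def kfold_indices(n_samples, k):
    """Yield (train, test) index arrays for k contiguous folds."""
    indices = np.arange(n_samples)
    # array_split tolerates n_samples not being a multiple of k.
    folds = np.array_split(indices, k)
    for i in range(k):
        test = folds[i]
        # Every fold except the i-th one is used for training.
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, test


splits = list(kfold_indices(10, 5))
for train, test in splits:
    print(train, test)
```

Each sample index appears in exactly one test fold, so every observation is used once for validation and k - 1 times for training.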

Train the classifier on the training fold with the fit() method, as follows:

clf.fit(x[train], y[train])

The score() method measures classification accuracy on the test fold:

scores.append(clf.score(x[test], y[test]))
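Classification accuracy is the fraction of samples whose predicted class matches the true class. As a small sketch with made-up labels (the arrays below are assumptions for illustration only), it can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical true and predicted class labels for five samples.
y_true = np.array([0, 1, 1, -1, 0])
y_pred = np.array([0, 1, 0, -1, 1])

# The mean of a boolean array is the fraction of True entries.
accuracy = np.mean(y_true == y_pred)
print(accuracy)
```

Here three of the five predictions are correct, so the accuracy is 0.6.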

In this example, we will use the day of the year and the previous day's rain amount as features. Construct an array with features, as follows:

x = np.vstack((dates[:-1], rain[:-1]))
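The shape of the stacked array matters, because scikit-learn estimators expect one row per sample. The following sketch uses small made-up arrays in place of the rain and day-of-year data to show why the array is transposed before it is passed to the classifier:

```python
import numpy as np

# Hypothetical stand-ins for the day-of-year and rain arrays.
dates = np.array([1, 2, 3, 4])
rain = np.array([0.0, 2.5, -1.0, 0.0])

# Drop the last day: it has no following day to predict.
x = np.vstack((dates[:-1], rain[:-1]))

print(x.shape)    # one row per feature
print(x.T.shape)  # one row per sample, as fit() expects
```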

We define three classes: rainless days, with a rain amount of 0; days with a low amount of rain, encoded as -1 in our data; and rainy days, with a positive amount. These three classes correspond to the sign of the values in our data:

y = np.sign(rain[1:])
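As a quick illustration with made-up rain values (not taken from rain.npy), np.sign collapses the amounts into the three class labels, and the slice rain[1:] shifts the labels by one day so that each label belongs to the day after its features:

```python
import numpy as np

# Hypothetical rain amounts; -1 marks a low amount, as in the book's data.
rain = np.array([0.0, 2.5, -1.0, 0.0, 12.0])

# Labels for days 2..n: 0 = rainless, -1 = low amount, 1 = rainy.
y = np.sign(rain[1:])
print(y)
```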

Using this setup, we get an average accuracy of 57 percent. For the scikit-learn sample iris dataset, we get an average accuracy of 41 percent (refer to the log_regress.py file in this book's code bundle):

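from sklearn.linear_model import LogisticRegression
# KFold lives here in the scikit-learn release this book targets;
# releases from 0.20 onward provide it in sklearn.model_selection instead.
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np


def classify(x, y):
    """Print the mean accuracy of logistic regression over 10 folds."""
    clf = LogisticRegression(random_state=12)
    scores = []
    kf = KFold(len(y), n_folds=10)
    for train, test in kf:
        # Fit on the training fold, then score on the held-out fold.
        clf.fit(x[train], y[train])
        scores.append(clf.score(x[test], y[test]))

    print(np.mean(scores))


# Rain example: features are the day of the year and the previous day's rain.
rain = np.load('rain.npy')
dates = np.load('doy.npy')

x = np.vstack((dates[:-1], rain[:-1]))
y = np.sign(rain[1:])
classify(x.T, y)  # transpose so that rows are samples

# Iris example, using only the first two features.
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
classify(x, y)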
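The listing above targets an older scikit-learn release. As a rough sketch of how the same idea might look with a recent release, where KFold is imported from sklearn.model_selection and takes n_splits, consider the following. The shuffle=True and max_iter=1000 settings are choices made for this sketch and are not part of the book's code, so the resulting accuracy should not be expected to match the figures quoted above:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def classify(x, y, n_splits=10):
    """Return the mean accuracy of logistic regression over the folds."""
    # max_iter is raised here only to give the solver room to converge.
    clf = LogisticRegression(random_state=12, max_iter=1000)
    # shuffle=True is an assumption of this sketch, not of the listing.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=12)
    scores = []
    for train, test in kf.split(x):
        clf.fit(x[train], y[train])
        scores.append(clf.score(x[test], y[test]))
    return np.mean(scores)


# Iris example, using only the first two features.
iris = datasets.load_iris()
mean_accuracy = classify(iris.data[:, :2], iris.target)
print(mean_accuracy)
```

Returning the mean instead of printing it inside the function makes the result easier to reuse, for example when comparing several feature sets.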
