Classification with logistic regression

Logistic regression is a type of a classification algorithm (see http://en.wikipedia.org/wiki/Logistic_regression). This algorithm can be used to predict probabilities associated with a class or an event occurring. A classification problem with multiple classes can be reduced to a binary classification problem. In this simplest case, a high probability for one class, means a low probability for another class. Logistic regression is based on the logistic function, which has values in the range between 0 and 1—just like for probabilities. The logistic function can therefore be used to transform arbitrary values into probabilities.

We can define a function that performs classification with logistic regression. Create a classifier object as follows:

clf = LogisticRegression(random_state=12)

The random_state parameter acts like a seed for a pseudorandom generator. We touched upon the importance of cross-validation earlier in this book as a technique to avoid overfitting. The k-fold cross-validation is a form of cross-validation involving k (a small integer number) random data partitions called folds. In k iterations, each fold is used once for validation and the rest of the data is used for training. The classes in scikit-learn have a default k value of 3, but typically we may want to set it to a higher value such as 5 or 10. The results of the iterations can be combined at the end. The scikit-learn has a utility KFold class for k-fold cross-validation. Create a KFold object with 10 folds as follows:

kf = KFold(len(y), n_folds=10)

Train the data with the fit() method, as follows:

clf.fit(x[train], y[train])

The score() method measures classification accuracy:

scores.append(clf.score(x[test], y[test]))

In this example, we will use the day-of-the-year and previous day rain amount as features. Construct an array with features, as follows:

x = np.vstack((dates[:-1], rain[:-1]))

As classes, define first rainless days with 0 amount of rain; second, low amount of rain corresponding to -1 in our data and third, rainy days. These three classes can be linked to the sign of values in our data:

y = np.sign(rain[1:])

Using this setup, we get an average accuracy of 57 percent. For the scikit-learn sample iris dataset, we get an average accuracy of 41 percent (refer to log_regress.py file in this book's code bundle):

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np


def classify(x, y):
    clf = LogisticRegression(random_state=12)
    scores = []
    kf = KFold(len(y), n_folds=10)
    for train,test in kf:
      clf.fit(x[train], y[train])
      scores.append(clf.score(x[test], y[test]))

    print np.mean(scores)

rain = np.load('rain.npy')
dates = np.load('doy.npy')

x = np.vstack((dates[:-1], rain[:-1]))
y = np.sign(rain[1:])
classify(x.T, y)

#iris example
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
classify(x, y)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.137.223.10