Logistic regression is a classification algorithm (see http://en.wikipedia.org/wiki/Logistic_regression). It can be used to predict the probability that an observation belongs to a class or that an event occurs. A classification problem with multiple classes can be reduced to a set of binary classification problems. In the simplest, binary case, a high probability for one class means a low probability for the other. Logistic regression is based on the logistic function, which takes values between 0 and 1, just like probabilities. The logistic function can therefore be used to transform arbitrary values into probabilities.
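As a minimal sketch of that last point, the logistic function can be written with NumPy as shown here. The function name logistic and the sample inputs are chosen for illustration and are not part of this book's code bundle.

```python
import numpy as np


def logistic(z):
    """Map arbitrary real values to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Illustrative inputs only: a negative value, zero, and a positive value.
values = np.array([-5.0, 0.0, 5.0])
probabilities = logistic(values)
print(probabilities)
```

An input of 0 maps to exactly 0.5, and inputs of equal size but opposite sign map to probabilities that add up to 1, which matches the binary case described above.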
We can define a function that performs classification with logistic regression. Create a classifier object as follows:
clf = LogisticRegression(random_state=12)
The random_state parameter acts like a seed for a pseudorandom number generator. We touched upon the importance of cross-validation earlier in this book as a technique to avoid overfitting. k-fold cross-validation is a form of cross-validation that splits the data into k (a small integer) random partitions called folds. In k iterations, each fold is used once for validation and the rest of the data is used for training. In the scikit-learn version used here, the default value of k is 3, but we typically want a higher value such as 5 or 10. The results of the iterations can be combined at the end. scikit-learn has a utility KFold class for k-fold cross-validation. Create a KFold object with 10 folds as follows:
kf = KFold(len(y), n_folds=10)
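To make the partitioning concrete, the following sketch builds the train and test indices for each fold using NumPy only. The helper k_fold_indices, its seed argument, and the sample size are hypothetical and only illustrate the idea; this is not how the KFold class is implemented.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=12):
    """Yield (train, test) index arrays for k folds (illustrative helper)."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)  # shuffle the sample indices once
    folds = np.array_split(indices, k)    # k nearly equal parts
    for i in range(k):
        test = folds[i]
        # Every fold except the i-th one is used for training.
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, test


splits = list(k_fold_indices(n_samples=20, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample index appears in exactly one test fold, so every observation is used for validation once and for training in the remaining iterations.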
Train the classifier with the fit() method, as follows:
clf.fit(x[train], y[train])
The score() method measures classification accuracy:
scores.append(clf.score(x[test], y[test]))
In this example, we will use the day of the year and the previous day's rain amount as features. Construct an array with features, as follows:
x = np.vstack((dates[:-1], rain[:-1]))
We define three classes: rainless days, with a rain amount of 0; days with a low amount of rain, which corresponds to -1 in our data; and rainy days. These three classes can be linked to the sign of the values in our data:
y = np.sign(rain[1:])
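The following sketch shows how the features and the classes line up. The small dates and rain arrays are made up for illustration and do not come from the doy.npy and rain.npy files.

```python
import numpy as np

# Hypothetical sample data, not the contents of doy.npy or rain.npy.
dates = np.array([1, 2, 3, 4, 5])            # day of the year
rain = np.array([0.0, 2.5, -1.0, 0.0, 7.1])  # -1 marks a low amount of rain

# Features: day of the year and rain amount for every day except the last.
x = np.vstack((dates[:-1], rain[:-1]))
# Classes: sign of the rain amount on the following day.
y = np.sign(rain[1:])

print(x.T.shape)  # one row per sample, one column per feature
print(y)
```

Dropping the last element from the features and the first element from the classes pairs each day's values with the next day's rain class. The transpose is needed because np.vstack() stacks the features as rows, while the classifier expects one row per sample.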
Using this setup, we get an average accuracy of 57 percent. For the scikit-learn sample iris dataset, we get an average accuracy of 41 percent (refer to the log_regress.py file in this book's code bundle):
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np


def classify(x, y):
    clf = LogisticRegression(random_state=12)
    scores = []
    kf = KFold(len(y), n_folds=10)

    for train, test in kf:
        clf.fit(x[train], y[train])
        scores.append(clf.score(x[test], y[test]))

    print(np.mean(scores))


rain = np.load('rain.npy')
dates = np.load('doy.npy')

x = np.vstack((dates[:-1], rain[:-1]))
y = np.sign(rain[1:])
classify(x.T, y)

# iris example
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
classify(x, y)
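The listing above relies on the sklearn.cross_validation module, which newer scikit-learn releases no longer provide. The following sketch shows how the iris part might look with the sklearn.model_selection module instead. The n_splits argument of classify(), the max_iter setting, and returning the mean instead of printing it are assumptions made here, and the accuracy may differ from the figure quoted above because the library defaults have changed between versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def classify(x, y, n_splits=10):
    """Return the mean accuracy over the folds (sketch for newer scikit-learn)."""
    # max_iter is raised here as an assumption, to give the solver room to converge.
    clf = LogisticRegression(random_state=12, max_iter=1000)
    scores = []
    kf = KFold(n_splits=n_splits)

    # In newer releases the folds come from kf.split() rather than from
    # iterating over the KFold object itself.
    for train, test in kf.split(x):
        clf.fit(x[train], y[train])
        scores.append(clf.score(x[test], y[test]))

    return np.mean(scores)


iris = datasets.load_iris()
mean_accuracy = classify(iris.data[:, :2], iris.target)
print(mean_accuracy)
```

The loop body is unchanged from the listing; only the import and the construction and iteration of the KFold object differ.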