Chapter 17
IN THIS CHAPTER
Using linear and logistic regression
Understanding Bayes’ theorem and using it for naive classification
Predicting on the basis of cases being similar with KNN
In this new part of the book, you start to explore all the algorithms and tools necessary for learning from data (training a model with data) and being capable of predicting a numeric estimate (for example, house pricing) or a class (for instance, the species of an Iris flower) given any new example that you didn’t have before. In this chapter, you start with the simplest algorithms and work toward those that are more complex. The four algorithms in this chapter represent a good starting point for any data scientist.
Regression has a long history in statistics, from building simple but effective linear models of economic, psychological, social, or political data, to hypothesis testing for understanding group differences, to modeling more complex problems with ordinal values, binary and multiple classes, count data, and hierarchical relationships. It’s also a common tool in data science, a Swiss Army knife of machine learning that you can use for every problem. Stripped of most of its statistical properties, data science practitioners perceive linear regression as a simple, understandable, yet effective algorithm for estimations, and, in its logistic regression version, for classification as well.
Linear regression is a statistical model that defines the relationship between a target variable and a set of predictive features. It does so by using a formula of the following type:
y = bx + a.
You can translate this formula into something readable and useful for many problems. For instance, if you’re trying to guess your sales based on historical results and available data about advertising expenditures, the same preceding formula becomes
sales = b * (advertising expenditure) + a
You can demystify the formula by explaining its components: a is the value of the intercept (the value of y when x is zero) and b is a coefficient that expresses the slope of the line (the relationship between x and y). If b is positive, y increases and decreases as x increases and decreases; when b is negative, y behaves in the opposite manner. You can understand b as the unit change in y given a unit change in x. When the value of b is near zero, the effect of x on y is slight, but if the value of b is high, either positive or negative, the effect of changes in x on y is great.
Linear regression, therefore, can find the best y = bx + a and represent the relationship between your target variable, y, and your predictive feature, x. Both a (alpha) and b (beta coefficient) are estimated on the basis of the data, and they are found using the linear regression algorithm so that the difference between all the real y target values and all the y values derived from the linear regression formula is the minimum possible.
You can express this difference graphically as the sum of the squares of all the vertical distances between the data points and the regression line. Such a sum is always the minimum possible when you calculate the regression line correctly using an estimation called ordinary least squares, which is derived from statistics, or the equivalent gradient descent, which is a machine learning method. The differences between the real y values and the regression line (the predicted y values) are defined as residuals (because they are what is left after a regression: the errors).
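The least squares idea can be sketched in a few lines. The following is a minimal illustration, assuming a tiny invented dataset (the x and y values are hypothetical, not taken from the book), that computes a and b with the closed-form formula for a single predictor and shows that moving the slope away from that solution increases the sum of squared residuals.

```python
import numpy as np

# Hypothetical data invented for illustration: advertising spend (x) and sales (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form ordinary least squares for one predictor:
# b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2), a = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Residuals are the vertical distances between the data and the fitted line.
residuals = y - (b * x + a)
sse = np.sum(residuals ** 2)

# Nudging the slope away from the least squares solution increases the error.
sse_nudged = np.sum((y - ((b + 0.1) * x + a)) ** 2)

print('a = %.3f, b = %.3f' % (a, b))
print('SSE at the fitted line: %.4f' % sse)
print('SSE with a nudged slope: %.4f' % sse_nudged)
```

The same a and b come out of np.polyfit(x, y, 1), which is a convenient cross-check when you experiment with your own numbers.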
When using a single variable for predicting y, you use simple linear regression, but when working with many variables, you use multiple linear regression. When you have many variables, their scale isn’t important in creating precise linear regression predictions. But a good habit is to standardize X because the scale of the variables is quite important for some variants of regression (that you see later on), and comparing coefficients according to their impact on y gives you insight into your data.
The following example relies on the Boston dataset from Scikit-learn. It tries to guess Boston housing prices using a linear regression. The example also tries to determine which variables influence the result more, so the example standardizes the predictors.
# load_boston was removed in Scikit-learn 1.2; this example needs an earlier release
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale
boston = load_boston()
X = scale(boston.data)
y = boston.target
The regression class in Scikit-learn is part of the linear_model module. Having previously scaled the X variable, you have no other preparations or special parameters to decide when using this algorithm.
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(X, y)
Now that the algorithm is fitted, you can use the score method to report the R2 measure, which ranges from 0 to 1 and points out how much better a particular regression model is at predicting y than a simple mean would be. (The act of fitting creates a line or curve that best matches the data points provided by the data; you fit the line or curve to the data points in order to perform various tasks, such as predictions, based on the trends or patterns produced by the data.) You can also see R2 as being the quantity of target information explained by the model (the same as the squared correlation), so getting near 1 means being able to explain most of the y variable using the model.
print(regression.score(X, y))
Here is the resulting score:
0.740607742865
In this case, R2 on the previously fitted data is about 0.74, a good result for a simple model. You can interpret the R2 score as the percentage of information present in the target variable that has been explained by the model using the predictors. A score of 0.74, therefore, means that the model has fit the larger part of the information you wanted to predict and that only 26 percent of it remains unexplained.
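To make the R2 interpretation concrete, here is a minimal sketch that computes the measure by hand and compares it with the score method. It assumes synthetic data from make_regression (a stand-in chosen for illustration, not the Boston data), and the names X_demo, y_demo, and model are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=20.0, random_state=0)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R2 = 1 - (error left by the model) / (error left by predicting the mean)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# On the fitted data, R2 also equals the squared correlation
# between the real and the predicted values.
r2_corr = np.corrcoef(y_demo, pred)[0, 1] ** 2

print('manual R2:           %.6f' % r2_manual)
print('squared correlation: %.6f' % r2_corr)
print('score():             %.6f' % model.score(X_demo, y_demo))
```

All three numbers should agree, which is why you can read 1 minus R2 as the share of the target that remains unexplained.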
To understand what drives the estimates in the multiple regression model, you have to look at the coef_ attribute, which is an array containing the regression beta coefficients. The coefficients are the numbers estimated by the linear regression model in order to effectively transform the input variables in the formula into the target y prediction. Printing the boston.feature_names attribute at the same time helps you understand which variable each coefficient references (the boston.DESCR attribute describes each variable). The zip function generates an iterable of both attributes, and you can print it for reporting.
print([a + ':' + str(round(b, 2)) for a, b in zip(
    boston.feature_names, regression.coef_)])
The reported variables and their rounded coefficients (b values, or slopes, as described in the “Defining the family of linear models” section, earlier in this chapter) are
['CRIM:-0.92', 'ZN:1.08', 'INDUS:0.14', 'CHAS:0.68',
'NOX:-2.06', 'RM:2.67', 'AGE:0.02', 'DIS:-3.1', 'RAD:2.66',
'TAX:-2.08', 'PTRATIO:-2.06', 'B:0.86', 'LSTAT:-3.75']
DIS is the weighted distance to five employment centers. After LSTAT, it shows the largest absolute unit change in the output. For example, in real estate, a house that’s too far from people’s interests (such as work) loses value. By contrast, AGE and INDUS, two proportions that describe building age and whether nonretail activities are available in the area, don’t influence the result as much because the absolute values of their beta coefficients are lower than that of DIS.
Although linear regression is a simple yet effective estimation tool, it has quite a few problems. The problems can reduce the benefit of using linear regressions in some cases, but it really depends on the data. You determine whether any problems exist by employing the method and testing its efficacy. Unless you work hard on data (see Chapter 19), you may encounter these limitations:
Linear regression is well suited for estimating values, but it isn’t the best tool for predicting the class of an observation. In spite of the statistical theory that advises against it, you can actually try to classify a binary class by scoring one class as 1 and the other as 0. The results are disappointing most of the time, so the statistical theory wasn’t wrong!
The fact is that linear regression works on a continuum of numeric estimates. In order to classify correctly, you need a more suitable measure, such as the probability of class ownership. Thanks to the following formula, you can transform a linear regression numeric estimate into a probability that is more apt to describe how a class fits an observation:
probability of a class = exp(r) / (1+exp(r))
Here, r is the regression result (the sum of the variables weighted by the coefficients) and exp is the exponential function. exp(r) corresponds to Euler’s number e raised to the power of r. A linear regression using such a formula (also called a link function) for transforming its results into probabilities is a logistic regression.
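The transformation is easy to try on a few values. The following sketch, in which the function name logistic and the sample values of r are chosen only for illustration, applies the formula exp(r) / (1 + exp(r)) and shows that any regression result lands between 0 and 1.

```python
import numpy as np

def logistic(r):
    """Turn a regression result r into a probability between 0 and 1."""
    return np.exp(r) / (1 + np.exp(r))

# Hypothetical regression results, from strongly negative to strongly positive.
for r in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print('r = %+.1f -> probability %.3f' % (r, logistic(r)))
```

A result of zero maps to a probability of 0.5, and results that are equally far from zero on either side map to probabilities that sum to 1.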
Logistic regression is similar to linear regression, with the only difference being the y data, which should contain integer values indicating the class relative to the observation. Using the Iris dataset from the Scikit-learn datasets module, you can use the values 0, 1, and 2 to denote three classes that correspond to three species:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:-1, :]
y = iris.target[:-1]
To make the example easier to work with, leave a single value out so that later you can use this value to test the efficacy of the logistic regression model on it.
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(X, y)
single_row_pred = logistic.predict(
iris.data[-1, :].reshape(1, -1))
single_row_pred_proba = logistic.predict_proba(
iris.data[-1, :].reshape(1, -1))
print ('Predicted class %s, real class %s'
% (single_row_pred, iris.target[-1]))
print ('Probabilities for each class from 0 to 2: %s'
% single_row_pred_proba)
The preceding code snippet outputs the following:
Predicted class [2], real class 2
Probabilities for each class from 0 to 2:
[[ 0.00168787 0.28720074 0.71111138]]
In contrast to linear regression, logistic regression doesn’t just output the resulting class (in this case, the class 2) but also estimates the probability of the observation’s being part of all three classes. Based on the observation used for prediction, logistic regression estimates a probability of 71 percent of its being from class 2 — a high probability, but not a perfect score, therefore leaving a margin of uncertainty.
In the previous problem, logistic regression automatically handled a multiclass problem (it started with three iris species to guess). Most algorithms provided by Scikit-learn that predict probabilities or a score for a class can automatically handle multiclass problems using two different strategies:
One versus rest: The algorithm compares every class with all the remaining classes, building one model for every class. If you have ten classes, you have ten models. This approach relies on the OneVsRestClassifier class from Scikit-learn.
One versus one: The algorithm compares every class against every other individual class, building a number of models equal to n * (n-1) / 2, where n is the number of classes. If you have ten classes, you have 45 models, 10 * (10 - 1) / 2. This approach relies on the OneVsOneClassifier class from Scikit-learn.
In the case of logistic regression, the default multiclass strategy is the one versus rest. The example in this section shows how to use both the strategies with the handwritten digit dataset, containing a class for numbers from 0 to 9. The following code loads the data and places it into variables:
from sklearn.datasets import load_digits
digits = load_digits()
train = range(0, 1700)
test = range(1700, len(digits.data))
X = digits.data[train]
y = digits.target[train]
tX = digits.data[test]
ty = digits.target[test]
The observations are actually a grid of pixel values. The grid’s dimensions are 8 pixels by 8 pixels. To make the data easier to learn by machine learning algorithms, the code aligns them into a list of 64 elements. The example reserves a part of the available examples for a test.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier
OVR = OneVsRestClassifier(LogisticRegression()).fit(X, y)
OVO = OneVsOneClassifier(LogisticRegression()).fit(X, y)
print('One vs rest accuracy: %.3f' % OVR.score(tX, ty))
print('One vs one accuracy: %.3f' % OVO.score(tX, ty))
The performances of the two multiclass strategies are
One vs rest accuracy: 0.938
One vs one accuracy: 0.969
The two multiclass classes OneVsRestClassifier and OneVsOneClassifier operate by incorporating the estimator (in this case, LogisticRegression). After incorporation, they usually work just like any other learning algorithm in Scikit-learn. Interestingly, the one-versus-one strategy obtained the highest accuracy thanks to its high number of models in competition.
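You can check how many models each strategy builds by inspecting the fitted wrappers. The sketch below is a hedged illustration: it fits both wrappers on the whole digits dataset, uses hypothetical variable names (ovr_demo, ovo_demo), and raises max_iter only to reduce convergence warnings, which is an assumption rather than a setting from the example above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

digits_demo = load_digits()
X_all, y_all = digits_demo.data, digits_demo.target
n_classes = len(set(y_all))

# max_iter is raised only to limit convergence warnings (an assumption).
base = LogisticRegression(max_iter=1000)
ovr_demo = OneVsRestClassifier(base).fit(X_all, y_all)
ovo_demo = OneVsOneClassifier(base).fit(X_all, y_all)

# Each wrapper stores its fitted binary models in the estimators_ attribute.
print('classes: %i' % n_classes)
print('one vs rest models: %i' % len(ovr_demo.estimators_))
print('one vs one models:  %i' % len(ovo_demo.estimators_))
```

With ten digit classes, the one-versus-rest wrapper holds ten models, and the one-versus-one wrapper holds 10 * (10 - 1) / 2 = 45.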
You might wonder why anyone would name an algorithm Naïve Bayes. The naïve part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability.
Reverend Thomas Bayes (1701–1761) was an English statistician and a philosopher who formulated his theorem during the first half of the eighteenth century. The theorem was never published while he was alive. It has deeply revolutionized the theory of probability by introducing the idea of conditional probability — that is, probability conditioned by evidence.
Of course, it helps to start from the beginning — probability itself. Probability tells you the likelihood of an event and is expressed in a numeric form. The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and it’s empirically derived from counting the number of times the specific event happened with respect to all the events. You can calculate it from data!
When you observe events (for example, when a feature has a certain characteristic), and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.
When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is a priori because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first). For example, if you estimate the probability of an unknown person’s being a female, you might say, after some counting, that it’s 50 percent, which is the prior, or the first, probability that you will stick with.
The prior probability can change in the face of evidence, that is, something that can radically modify your expectations. For example, the evidence of whether a person is male or female could be that the person’s hair is long or short. You can estimate having long hair as an event with 35 percent probability for the general population, but within the female population, it’s 60 percent. If the percentage is higher in the female population, contrary to the general probability (the prior for having long hair), that should be some useful information that you can use.
Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is really similar to predicting a categorical variable from data: You have a target variable with different categories, and you have to guess the probability of each category on the basis of evidence, the data. Reverend Bayes provided a useful formula:
P(A|B) = P(B|A)*P(A) / P(B)
The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:
Therefore, getting back to the previous example, even if being a female is a 50 percent probability, just knowing evidence like long hair takes it up to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that the person with long hair is a female because you have a bit less than a 15 percent chance of being wrong.
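The 85.7 percent figure follows directly from the formula. Here is a small worked calculation, assuming that A stands for being female and B for having long hair, and using the probabilities quoted earlier (50 percent, 60 percent, and 35 percent); the variable names are chosen for illustration.

```python
# P(A): prior probability of being female
p_female = 0.50
# P(B|A): probability of long hair among females
p_long_given_female = 0.60
# P(B): probability of long hair in the general population
p_long = 0.35

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_female_given_long = p_long_given_female * p_female / p_long
print('P(female | long hair) = %.3f' % p_female_given_long)
```

The evidence raises the estimate from the 50 percent prior to roughly 85.7 percent because long hair is much more common among females (60 percent) than in the population at large (35 percent).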
Naïve Bayes, leveraging the simple Bayes’ rule, takes advantage of all the evidence available in order to modify the prior base probability of your predictions. Because your data contains so much evidence (that is, it has many features), Naïve Bayes combines all the probabilities derived from a simplified formula by multiplying them together.
Because you don’t know (or simply ignore) the relationships between each piece of evidence, you probably just plug all of them in to Naïve Bayes. The simple and naïve move of throwing everything that you know at the formula works well indeed, and many studies report good performance despite the fact that you make a naïve assumption. It’s okay to use everything for prediction, even though it seems as though it shouldn’t be okay given the strong association between variables. Here are some of the ways in which you commonly see Naïve Bayes used:
Naïve Bayes is also popular because it doesn’t need as much data to work. It can naturally handle multiple classes. With some slight variable modifications (transforming them into classes), it can also handle numeric variables. Scikit-learn provides three Naïve Bayes classes in the sklearn.naive_bayes module:
MultinomialNB: Uses the probabilities derived from a feature’s presence. When a feature is present, it assigns a certain probability to the outcome, which the textual data indicates for the prediction.
BernoulliNB: Provides the multinomial functionality of Naïve Bayes, but it penalizes the absence of a feature. It assigns a different probability when the feature is present than when it’s absent. In fact, it treats all features as dichotomous variables (the distribution of a dichotomous variable is a Bernoulli distribution). You can also use it with textual data.
GaussianNB: Defines a version of Naïve Bayes that expects a normal distribution of all the features. Hence, this class is suboptimal for textual data in which words are sparse (use the multinomial or Bernoulli distributions instead). If your variables have positive and negative values, this is the best choice.
Naïve Bayes is particularly popular for document classification. In textual problems, you often have millions of features involved, one for each word spelled correctly or incorrectly. Sometimes the text is associated with other nearby words in n-grams, that is, sequences of consecutive words. Naïve Bayes can learn the textual features quickly and provide fast predictions based on the input.
This section tests text classifications using the binomial and multinomial Naïve Bayes models offered by Scikit-learn. The examples rely on the 20newsgroups dataset, which contains a large number of posts from 20 kinds of newsgroups. The dataset is divided into a training set, for building your textual models, and a test set, which consists of posts that temporally follow the training set. You use the test set to test the accuracy of your predictions:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(
    subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(
    subset='test', remove=('headers', 'footers', 'quotes'))
After loading the two sets into memory, you import the two Naïve Bayes models and instantiate them. At this point, you set alpha values, which are useful for avoiding a zero probability for rare features (a zero probability would exclude these features from the analysis). You typically use a small value for alpha, as shown in the following code:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
Bernoulli = BernoulliNB(alpha=0.01)
Multinomial = MultinomialNB(alpha=0.01)
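To see why a small alpha matters, consider the following sketch of additive smoothing. It assumes hypothetical word counts for a single class and a hand-written helper, smoothed_probabilities, that illustrates the general idea; it is not a copy of Scikit-learn’s internal code.

```python
import numpy as np

# Hypothetical word counts for one class over a four-word vocabulary.
counts = np.array([3, 0, 1, 6])

def smoothed_probabilities(counts, alpha):
    """Additive smoothing: add alpha to every count before normalizing."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

unsmoothed = smoothed_probabilities(counts, alpha=0.0)
smoothed = smoothed_probabilities(counts, alpha=0.01)

print('alpha = 0:   ', unsmoothed)
print('alpha = 0.01:', smoothed)
```

Without smoothing, the word that never appeared gets a probability of exactly zero; with a small alpha, it gets a tiny but nonzero probability, while the other estimates barely change.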
In Chapter 12, you use the hashing trick to model textual data without fear of encountering new words when using the model after the training phase. You can use two different hashing tricks, one counting the words (for the multinomial approach) and one recording whether a word appeared in a binary variable (the binomial approach). You can also remove stop words, that is, common words found in the English language, such as a, the, in, and so on.
import sklearn.feature_extraction.text as txt
multinomial = txt.HashingVectorizer(stop_words='english',
binary=False, norm=None)
binary = txt.HashingVectorizer(stop_words='english',
binary=True, norm=None)
At this point, you can train the two classifiers and test them on the test set, which is a set of posts that temporally appear after the training set. The test measure is accuracy, which is the percentage of right guesses that the algorithm makes.
import numpy as np
target = newsgroups_train.target
target_test = newsgroups_test.target
multi_X = np.abs(
multinomial.transform(newsgroups_train.data))
multi_Xt = np.abs(
multinomial.transform(newsgroups_test.data))
bin_X = binary.transform(newsgroups_train.data)
bin_Xt = binary.transform(newsgroups_test.data)
Multinomial.fit(multi_X, target)
Bernoulli.fit(bin_X, target)
from sklearn.metrics import accuracy_score
for name, model, data in [('BernoulliNB', Bernoulli, bin_Xt),
                          ('MultinomialNB', Multinomial, multi_Xt)]:
    accuracy = accuracy_score(y_true=target_test,
                              y_pred=model.predict(data))
    print ('Accuracy for %s: %.3f' % (name, accuracy))
The reported accuracies for the two Naïve Bayes models are
Accuracy for BernoulliNB: 0.570
Accuracy for MultinomialNB: 0.651
You might notice that it doesn’t take long for both models to train and report their predictions on the test set. Consider that the training set is made up of more than 11,000 posts containing about 300,000 distinct words, and the test set contains about 7,500 other posts.
print('number of posts in training: %i'
% len(newsgroups_train.data))
D={word:True for post in newsgroups_train.data
for word in post.split(' ')}
print('number of distinct words in training: %i'
% len(D))
print('number of posts in test: %i'
% len(newsgroups_test.data))
Running the code returns all these useful text statistics:
number of posts in training: 11314
number of distinct words in training: 300972
number of posts in test: 7532
K-Nearest Neighbors (KNN) is not about building rules from data based on coefficients or probability. KNN works on the basis of similarities. When you have to predict something like a class, it may be best to find the observations most similar to the one you want to classify or estimate. You can then derive the answer you need from the similar cases.
Observing how many observations are similar doesn’t imply learning something, but rather measuring. Because KNN isn’t learning anything, it’s considered lazy, and you’ll hear it referenced as a lazy learner or an instance-based learner. The idea is that similar premises usually provide similar results, and it’s important not to forget to get such low-hanging fruit before trying to climb the tree!
The algorithm is fast during training because it only has to memorize data about the observations. It actually calculates more during predictions. When there are too many observations, the algorithm can become slow and memory consuming. You’re best advised not to use it with big data or it may take almost forever to predict anything! Moreover, this simple and effective algorithm works better when you have distinct data groups without too many variables involved because the algorithm is also sensitive to the dimensionality curse.
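The memorize-then-measure behavior can be sketched by hand. The following minimal illustration assumes a hypothetical toy dataset of two well-separated groups and a helper named knn_predict; it is meant to show the idea rather than to replace the Scikit-learn implementation.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: two well-separated groups in two dimensions.
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(query, k=3):
    """Classify query by majority vote among its k nearest stored points."""
    # "Training" was just storing train_X and train_y; the work happens here.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.1, 0.9])))
print(knn_predict(np.array([4.9, 5.2])))
```

Every prediction computes a distance to every stored observation, which is why prediction time and memory use grow with the size of the training set.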
The curse of dimensionality happens as the number of variables increases. Consider a situation in which you’re measuring the distance between observations: as the space becomes larger and larger, it becomes difficult to find real neighbors, which is a problem for KNN because it sometimes mistakes a far observation for a near one. To picture the idea, imagine playing chess on a multidimensional chessboard. When playing on the classic 2-D board, most pieces are near each other, and you can more easily spot opportunities and threats for your pawns when you have 32 pieces and 64 positions. However, when you start playing on a 3-D board, such as those found in some sci-fi films, your 32 pieces can become lost in 512 possible positions. Now just imagine playing with a 12-D chessboard. You can easily misunderstand what is near and what is far, which is what happens with KNN.
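A quick experiment shows the effect. The sketch below assumes uniformly random points (an illustrative choice, not data from this chapter) and compares the distance from a query point to its nearest and its farthest neighbor as the number of dimensions grows.

```python
import numpy as np

rng = np.random.RandomState(0)
ratios = {}

for dims in [2, 10, 100, 1000]:
    # 500 random observations and one random query point in the unit cube.
    points = rng.uniform(size=(500, dims))
    query = rng.uniform(size=dims)
    distances = np.sqrt(((points - query) ** 2).sum(axis=1))
    # A ratio near 1 means the nearest and farthest points look alike.
    ratios[dims] = distances.min() / distances.max()
    print('%4i dimensions: nearest/farthest ratio = %.3f'
          % (dims, ratios[dims]))
```

As the ratio creeps toward 1, the nearest neighbor is hardly nearer than the farthest one, so distance carries less and less information about similarity.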
For an example showing how to use KNN, you can start with the digit dataset again. KNN is particularly useful, just like Naïve Bayes, when you have to predict many classes, or in situations that would require you to build too many models or rely on a complex model.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
digits = load_digits()
train = range(0, 1700)
test = range(1700, len(digits.data))
pca = PCA(n_components = 25)
pca.fit(digits.data[train])
X = pca.transform(digits.data[train])
y = digits.target[train]
tX = pca.transform(digits.data[test])
ty = digits.target[test]
KNN is an algorithm that’s quite sensitive to outliers. Moreover, you have to rescale your variables and remove some redundant information. In this example, you use PCA. Rescaling is not necessary because the data represents pixels, which means that it’s already scaled.
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors=5, p=2)
kNN.fit(X, y)
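KNN uses a distance measure to determine which observations to consider as possible neighbors for the target case. You can easily change the predefined distance using the p parameter. When p is 2 (as in the preceding code), KNN uses the Euclidean distance; when p is 1, it uses the Manhattan distance.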
Usually, the Euclidean distance is the right measure, but sometimes it can give you worse results, especially when the analysis involves many correlated variables. The following code shows that the analysis seems fine with it.
print('Accuracy: %.3f' % kNN.score(tX,ty) )
print('Prediction: %s Actual: %s'
% (kNN.predict(tX[-15:,:]),ty[-15:]))
The code returns the accuracy and a sample of the predictions you can compare with the actual values in order to spot differences:
Accuracy: 0.990
Prediction: [2 2 5 7 9 5 4 8 1 4 9 0 8 9 8]
Actual: [2 2 5 7 9 5 4 8 8 4 9 0 8 9 8]
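If you want to compare distance measures, you can loop over values of p. The following sketch rebuilds the same PCA-reduced digit data under hypothetical variable names, fixes random_state for repeatability (an assumption), and reports the accuracy for the Manhattan distance (p=1) and the Euclidean distance (p=2).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

digits_demo = load_digits()

# Same split as the example: first 1,700 observations for training.
raw_train, raw_test = digits_demo.data[:1700], digits_demo.data[1700:]
y_tr, y_te = digits_demo.target[:1700], digits_demo.target[1700:]

# random_state is fixed only for repeatability (an assumption).
pca_demo = PCA(n_components=25, random_state=0).fit(raw_train)
X_tr = pca_demo.transform(raw_train)
X_te = pca_demo.transform(raw_test)

accuracies = {}
for p, name in [(1, 'Manhattan'), (2, 'Euclidean')]:
    knn_demo = KNeighborsClassifier(n_neighbors=5, p=p).fit(X_tr, y_tr)
    accuracies[name] = knn_demo.score(X_te, y_te)
    print('%s distance (p=%i): accuracy %.3f'
          % (name, p, accuracies[name]))
```

Which measure wins depends on the data, so treat p as one more parameter to test rather than a fixed choice.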
A critical parameter that you have to define in KNN is k. As k increases, KNN considers more points for its predictions, and the decisions are less influenced by noisy instances that could exercise an undue influence. Your decisions are based on an average of more observations, and they become more solid. When the k value you use is too large, you start considering neighbors that are too far, sharing less and less with the case you have to predict.
It’s an important trade-off. When the value of k is smaller, you consider a more homogeneous pool of neighbors but can more easily make an error by taking the few similar cases for granted. When the value of k is larger, you consider more cases at a higher risk of observing neighbors that are too far or that are outliers. Getting back to the previous example with handwritten digit data, you can experiment with changing the k value, as shown in the following code:
for k in [1, 5, 10, 50, 100, 200]:
    kNN = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print('for k = %3i accuracy is %.3f'
          % (k, kNN.score(tX, ty)))
After running this code, you get an overview of what happens when k changes and determine the value of k that best fits the data:
for k = 1 accuracy is 0.979
for k = 5 accuracy is 0.990
for k = 10 accuracy is 0.969
for k = 50 accuracy is 0.959
for k = 100 accuracy is 0.959
for k = 200 accuracy is 0.907
Through experimentation, you find that setting n_neighbors (the parameter representing k) to 5 is the optimum choice, resulting in the highest accuracy. Using just the nearest neighbor (n_neighbors=1) isn’t a bad choice, but setting the value above 5 instead brings decreasing results in the classification task.