Chapter 17

Exploring Four Simple and Effective Algorithms

IN THIS CHAPTER

Using linear and logistic regression

Understanding Bayes’ theorem and using it for naive classification

Predicting on the basis of cases being similar with KNN

In this new part of the book, you start to explore all the algorithms and tools necessary for learning from data (training a model with data) and being capable of predicting a numeric estimate (for example, house pricing) or a class (for instance, the species of an Iris flower) given any new example that you didn’t have before. In this chapter, you start with the simplest algorithms and work toward those that are more complex. The four algorithms in this chapter represent a good starting point for any data scientist.

You don’t have to type the source code for this chapter manually. In fact, it’s a lot easier if you use the downloadable source (see the Introduction for download instructions). The source code for this chapter appears in the P4DS4D2_17_Exploring_Four_Simple_and_Effective_Algorithms.ipynb source code file.

Guessing the Number: Linear Regression

Regression has a long history in statistics, from building simple but effective linear models of economic, psychological, social, or political data, to hypothesis testing for understanding group differences, to modeling more complex problems with ordinal values, binary and multiple classes, count data, and hierarchical relationships. It’s also a common tool in data science, a Swiss Army knife of machine learning that you can use for every problem. Stripped of most of its statistical properties, data science practitioners perceive linear regression as a simple, understandable, yet effective algorithm for estimations, and, in its logistic regression version, for classification as well.

CONSIDERING SIMPLE AND COMPLEX

Simple and complex aren’t absolute terms in machine learning; their meaning is relative to the data problem you’re facing. Some algorithms are simple summations while others require complex calculations and data manipulations (and Python deals with both the simple and complex algorithms for you). The data makes the difference. As a good practice, test multiple models, starting with the basic ones. You may discover that a simple solution performs better in many cases. For example, you may want to keep things simple and use a linear model instead of a more sophisticated approach and get more solid results. This is in essence what is implied by the “no free lunch” theorem: No one approach suits all problems, and even the most simple solution may hold the key to solving an important problem.

The “no free lunch” theorem by David Wolpert and William Macready states that “any two optimization algorithms are equivalent when their performance is averaged across all possible problems.” If the algorithms are equivalent in the abstract, no one is superior to the other unless proved in a specific, practical problem. See the discussion at http://www.no-free-lunch.org/ for more details about no-free-lunch theorems; two of them are actually used for machine learning.

Defining the family of linear models

Linear regression is a statistical model that defines the relationship between a target variable and a set of predictive features. It does so by using a formula of the following type:

y = bx + a.

You can translate this formula into something readable and useful for many problems. For instance, if you’re trying to guess your sales based on historical results and available data about advertising expenditures, the same preceding formula becomes

sales = b * (advertising expenditure) + a

Memories from your high school algebra and geometry tell you that the formulation y=bx+a is a line in a coordinate plane made of an x axis (the abscissa) and a y axis (the ordinate). Most machine learning mathematics is actually high school level, and Python can handle them nicely for you, too.

You can demystify the formula by explaining its components: a is the value of the intercept (the value of y when x is zero) and b is a coefficient that expresses the slope of the line (the relationship between x and y). If b is positive, y increases and decreases as x increases and decreases; when b is negative, y behaves in the opposite manner. You can understand b as the unit change in y given a unit change in x. When the value of b is near zero, the effect of x on y is slight, but if the absolute value of b is high, whether b is positive or negative, the effect of changes in x on y is great.

Linear regression, therefore, can find the best y = bx + a and represent the relationship between your target variable, y, and your predictive feature, x. Both a (alpha) and b (the beta coefficient) are estimated from the data. The linear regression algorithm finds them so that the sum of the squared differences between the real y target values and the y values derived from the linear regression formula is the minimum possible.

You can picture this quantity graphically as the sum of the squares of the vertical distances between the data points and the regression line. The sum is the minimum possible when you calculate the regression line correctly using an estimation called ordinary least squares, which comes from statistics, or using gradient descent, a machine learning optimization method that arrives at an equivalent result. The differences between the real y values and the regression line (the predicted y values) are called residuals (because they are what is left after a regression: the errors).
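To make the least squares idea concrete, here is a minimal sketch that estimates a and b for a single feature with the textbook closed-form formulas and then checks that nudging either value makes the sum of squared residuals larger. The advertising and sales figures are made up for illustration, and fit_line and sum_squared_residuals are hypothetical helper names, not part of Scikit-learn.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (a, b) in y = b*x + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept: the line always passes through the point of the means.
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(x, y, a, b):
    """Sum of the squared vertical distances between points and the line."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))


# Made-up data: advertising expenditure and the resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(advertising, sales)
best = sum_squared_residuals(advertising, sales, a, b)
nudged_slope = sum_squared_residuals(advertising, sales, a, b + 0.1)
nudged_intercept = sum_squared_residuals(advertising, sales, a + 0.1, b)

print('intercept a: %.3f, slope b: %.3f' % (a, b))
print('sum of squared residuals: %.4f' % best)
print('after nudging b: %.4f, after nudging a: %.4f'
      % (nudged_slope, nudged_intercept))
```

Any other pair of a and b that you try on the same data should give a larger sum than the fitted pair, which is exactly what "least squares" means.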

Using more variables

When using a single variable for predicting y, you use simple linear regression, but when working with many variables, you use multiple linear regression. When you have many variables, their scale isn’t important in creating precise linear regression predictions. But a good habit is to standardize X because the scale of the variables is quite important for some variants of regression (that you see later on) and it is insightful for your understanding of data to compare coefficients according to their impact on y.
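As a rough illustration of what standardizing means, the following sketch subtracts the mean of each feature and divides by its standard deviation, so that features measured on very different scales become comparable. The two features are made-up numbers, and the standardize helper is hypothetical; treating it as equivalent to what a library function does for you is an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]


# Two made-up features on very different scales.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
tax = [200.0, 300.0, 400.0, 500.0, 600.0]

z_rooms = standardize(rooms)
z_tax = standardize(tax)

print([round(z, 3) for z in z_rooms])
print([round(z, 3) for z in z_tax])
```

After the transformation, both features have a mean of zero and a standard deviation of one, so the size of a coefficient reflects the feature's impact on y rather than its unit of measurement.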

The following example relies on the Boston dataset from Scikit-learn. It tries to guess Boston housing prices using a linear regression. The example also tries to determine which variables influence the result more, so the example standardizes the predictors.

from sklearn.datasets import load_boston

from sklearn.preprocessing import scale

boston = load_boston()

X = scale(boston.data)

y = boston.target

The regression class in Scikit-learn is part of the linear_model module. Having previously scaled the X variable, you have no other preparations or special parameters to decide when using this algorithm.

from sklearn.linear_model import LinearRegression

regression = LinearRegression()

regression.fit(X, y)

Now that the algorithm is fitted, you can use the score method to report the R² measure, which is a measure that ranges from 0 to 1 and points out how using a particular regression model is better in predicting y than using a simple mean would be. (The act of fitting creates a line or curve that best matches the data points provided by the data; you fit the line or curve to the data points in order to perform various tasks, such as predictions, based on the trends or patterns produced by the data.) You can also see R² as being the quantity of target information explained by the model (the same as the squared correlation), so getting near 1 means being able to explain most of the y variable using the model.

print(regression.score(X, y))

Here is the resulting score:

0.740607742865

In this case, R² on the previously fitted data is about 0.74, a good result for a simple model. You can interpret the R² score as the percentage of information present in the target variable that has been explained by the model using the predictors. A score of 0.74, therefore, means that the model has fit the larger part of the information you wanted to predict and that only 26 percent of it remains unexplained.
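If you want to see where such a number comes from, the following sketch computes R² by hand as one minus the ratio between the squared errors of the model and the squared errors of a model that always predicts the mean. The target values and predictions are made up, and r_squared is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (model error / mean-only error)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]       # made-up target values
perfect = [3.0, 5.0, 7.0, 9.0]      # a model that gets everything right
mean_only = [6.0, 6.0, 6.0, 6.0]    # a model that always predicts the mean
decent = [2.5, 5.5, 6.5, 9.5]       # a model that is close, but not exact

for name, predictions in [('perfect', perfect), ('mean only', mean_only),
                          ('decent', decent)]:
    print('%s: %.2f' % (name, r_squared(y_true, predictions)))
```

A model that does no better than the mean scores 0, and the score approaches 1 as the predictions get closer to the real values.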

Calculating R² on the same set of data used for the training is considered reasonable in statistics when using linear models. In data science and machine learning, it’s always the correct practice to test scores on data that has not been used for training. Algorithms of greater complexity can memorize the data better than they learn from it, but this statement can be also true sometimes for simpler models, such as linear regression.

To understand what drives the estimates in the multiple regression model, you have to look at the coef_ attribute, which is an array containing the regression beta coefficients. The coefficients are the numbers estimated by the linear regression model in order to effectively transform the input variables in the formula into the target y prediction. Printing the boston.feature_names attribute at the same time helps you understand which variable each coefficient references. The zip function generates an iterable that pairs both attributes, and you can print it for reporting.

print([a + ':' + str(round(b, 2)) for a, b in zip(
    boston.feature_names, regression.coef_)])

The reported variables and their rounded coefficients (b values, or slopes, as described in the “Defining the family of linear models” section, earlier in this chapter) are

['CRIM:-0.92', 'ZN:1.08', 'INDUS:0.14', 'CHAS:0.68',

'NOX:-2.06', 'RM:2.67', 'AGE:0.02', 'DIS:-3.1', 'RAD:2.66',

'TAX:-2.08', 'PTRATIO:-2.06', 'B:0.86', 'LSTAT:-3.75']

DIS is the weighted distance to five employment centers. After LSTAT, it shows the largest absolute unit change. In real estate, a house that’s too far from people’s interests (such as work) loses value. By contrast, AGE and INDUS, which describe the proportion of older buildings and the proportion of nonretail business in the area, don’t influence the result as much because the absolute value of their beta coefficients is much lower than that of DIS.

Understanding limitations and problems

Although linear regression is a simple yet effective estimation tool, it has quite a few problems. The problems can reduce the benefit of using linear regressions in some cases, but it really depends on the data. You determine whether any problems exist by employing the method and testing its efficacy. Unless you work hard on data (see Chapter 19), you may encounter these limitations:

Linear regression can model only quantitative responses. When the response is a category, you need to switch to logistic regression.
If data is missing and you don’t deal with it properly, the model stops working. It’s important to impute the missing values or, using the value of zero for the variable, to create an additional binary variable pointing out that a value is missing.
Also, outliers are quite disruptive for a linear regression because linear regression tries to minimize the square value of the residuals, and outliers have big residuals, forcing the algorithm to focus more on them than on the mass of regular points.
The relation between the target and each predictor variable is based on a single coefficient — there isn’t an automatic way to represent complex relations like a parabola (there is a unique value of x maximizing y) or exponential growth. The only way you can manage to model such relations is to use mathematical transformations of x (and sometimes y) or add new variables. Chapter 19 explores both the use of transformations and the addition of variables.
The greatest limitation is that linear regression provides a summation of terms, which can vary independently of each other. It’s hard to figure out how to represent the effect of certain variables that affect the result in very different ways according to their value. A solution is to create interaction terms, that is, to multiply two or more variables to create a new variable; however, doing so requires that you know what variables to multiply and that you create the new variable before running the linear regression. In short, you can’t easily represent complex situations with your data, just simple ones.
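The point about transformations is easier to appreciate with a tiny experiment. In the following sketch, the target follows a parabola, so a straight line fitted on x explains nothing, whereas the same linear fit on the transformed variable x squared explains everything. The data is made up, and fit_line and r_squared are hypothetical helpers written for this illustration.

```python
def fit_line(x, y):
    """Simple least squares fit: returns (a, b) in y = b*x + a."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b


def r_squared(x, y, a, b):
    """Share of the variation in y explained by the line y = b*x + a."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data following a parabola: y = 3 * x**2 + 1.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [3 * xi ** 2 + 1 for xi in x]

# Fit on the raw variable: a line can't follow the curve.
a_raw, b_raw = fit_line(x, y)
r2_raw = r_squared(x, y, a_raw, b_raw)

# Fit on the transformed variable: the relation becomes linear.
x_squared = [xi ** 2 for xi in x]
a_sq, b_sq = fit_line(x_squared, y)
r2_squared = r_squared(x_squared, y, a_sq, b_sq)

print('R2 using x:    %.3f' % r2_raw)
print('R2 using x**2: %.3f' % r2_squared)
```

The algorithm hasn't changed between the two fits; only the variable you feed it has, which is why feature creation matters so much for linear models.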

Moving to Logistic Regression

Linear regression is well suited for estimating values, but it isn’t the best tool for predicting the class of an observation. In spite of the statistical theory that advises against it, you can actually try to classify a binary class by scoring one class as 1 and the other as 0. The results are disappointing most of the time, so the statistical theory wasn’t wrong!

The fact is that linear regression works on a continuum of numeric estimates. In order to classify correctly, you need a more suitable measure, such as the probability of class ownership. Thanks to the following formula, you can transform a linear regression numeric estimate into a probability that is more apt to describe how a class fits an observation:

probability of a class = exp(r) / (1+exp(r))

r is the regression result (the sum of the variables weighted by the coefficients) and exp is the exponential function. exp(r) corresponds to Euler’s number e elevated to the power of r. A linear regression using such a formula (also called a link function) for transforming its results into probabilities is a logistic regression.
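The following sketch applies the formula to a few regression results so that you can see how any number, negative or positive, gets squeezed into the range from 0 to 1. The function name class_probability is a hypothetical choice, and the r values are arbitrary.

```python
import math


def class_probability(r):
    """Turn a regression result r into a probability: exp(r) / (1 + exp(r))."""
    return math.exp(r) / (1 + math.exp(r))


# Arbitrary regression results, from strongly negative to strongly positive.
for r in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print('r = %5.1f -> probability = %.3f' % (r, class_probability(r)))
```

A result of zero maps to a probability of exactly 0.5, and results that are equally far from zero on opposite sides map to probabilities that sum to 1.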

Applying logistic regression

Logistic regression is similar to linear regression, with the only difference being the y data, which should contain integer values indicating the class relative to the observation. Using the Iris dataset from the Scikit-learn datasets module, you can use the values 0, 1, and 2 to denote three classes that correspond to three species:

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data[:-1, :]

y = iris.target[:-1]

To make the example easier to work with, leave a single value out so that later you can use this value to test the efficacy of the logistic regression model on it.

from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()

logistic.fit(X, y)

single_row_pred = logistic.predict(

iris.data[-1, :].reshape(1, -1))

single_row_pred_proba = logistic.predict_proba(

iris.data[-1, :].reshape(1, -1))

print ('Predicted class %s, real class %s'

% (single_row_pred, iris.target[-1]))

print ('Probabilities for each class from 0 to 2: %s'

% single_row_pred_proba)

The preceding code snippet outputs the following:

Predicted class [2], real class 2

Probabilities for each class from 0 to 2:

[[ 0.00168787 0.28720074 0.71111138]]

In contrast to linear regression, logistic regression doesn’t just output the resulting class (in this case, the class 2) but also estimates the probability of the observation’s being part of all three classes. Based on the observation used for prediction, logistic regression estimates a probability of 71 percent of its being from class 2 — a high probability, but not a perfect score, therefore leaving a margin of uncertainty.

Using probabilities lets you guess the most probable class, but you can also order the predictions with respect to being part of that class. This is especially useful for medical purposes: Ranking a prediction in terms of likelihood with respect to others can reveal what patients are at most risk of getting or already having a disease.
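Ranking by probability takes only a sort. The following sketch uses made-up probabilities for four hypothetical patients (in a real problem, these numbers would come from a method such as predict_proba) and orders the patients from the highest to the lowest estimated risk.

```python
# Made-up predicted probabilities of having a disease, one per patient.
predicted_risk = {
    'patient_a': 0.12,
    'patient_b': 0.71,
    'patient_c': 0.45,
    'patient_d': 0.93,
}

# Order the patients from the most to the least at risk.
ranking = sorted(predicted_risk, key=predicted_risk.get, reverse=True)

# The plain class guess only says who is above the 0.5 threshold.
predicted_class = {name: int(p >= 0.5) for name, p in predicted_risk.items()}

for name in ranking:
    print('%s: risk %.2f, predicted class %i'
          % (name, predicted_risk[name], predicted_class[name]))
```

Notice that patient_b and patient_d receive the same predicted class, yet the ranking still tells you which of the two deserves attention first.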

Considering when there are more classes

In the previous example, logistic regression automatically handled a multiclass problem (it started with three iris species to guess). Most algorithms provided by Scikit-learn that predict probabilities or a score for a class can automatically handle multiclass problems using one of two strategies:

One versus rest: The algorithm compares every class with all the remaining classes, building a model for every class. If you have ten classes to guess, you have ten models. This approach relies on the OneVsRestClassifier class from Scikit-learn.
One versus one: The algorithm compares every class against every individual remaining class, building a number of models equivalent to n * (n-1) / 2, where n is the number of classes. If you have ten classes, you have 45 models, 10 * (10 - 1) / 2. This approach relies on the OneVsOneClassifier class from Scikit-learn.
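You can verify the number of models that each strategy requires by listing the comparisons. This sketch only enumerates the pairings for ten classes; it doesn't train anything, and the variable names are hypothetical.

```python
from itertools import combinations

classes = list(range(10))  # ten classes, such as the digits from 0 to 9

# One versus rest: one model per class, comparing it against all the others.
ovr_models = [(c, 'rest') for c in classes]

# One versus one: one model for every distinct pair of classes.
ovo_models = list(combinations(classes, 2))

n = len(classes)
print('One vs rest models: %i' % len(ovr_models))
print('One vs one models: %i' % len(ovo_models))
print('n * (n - 1) / 2 = %i' % (n * (n - 1) // 2))
```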

In the case of logistic regression, the default multiclass strategy is one versus rest. The example in this section shows how to use both strategies with the handwritten digit dataset, which contains a class for each number from 0 to 9. The following code loads the data and places it into variables:

from sklearn.datasets import load_digits

digits = load_digits()

train = range(0, 1700)

test = range(1700, len(digits.data))

X = digits.data[train]

y = digits.target[train]

tX = digits.data[test]

ty = digits.target[test]

The observations are actually a grid of pixel values. The grid’s dimensions are 8 pixels by 8 pixels. To make the data easier to learn by machine learning algorithms, the code aligns them into a list of 64 elements. The example reserves a part of the available examples for a test.

from sklearn.multiclass import OneVsRestClassifier

from sklearn.multiclass import OneVsOneClassifier

OVR = OneVsRestClassifier(LogisticRegression()).fit(X, y)

OVO = OneVsOneClassifier(LogisticRegression()).fit(X, y)

print('One vs rest accuracy: %.3f' % OVR.score(tX, ty))

print('One vs one accuracy: %.3f' % OVO.score(tX, ty))

The performances of the two multiclass strategies are

One vs rest accuracy: 0.938

One vs one accuracy: 0.969

The two multiclass classes OneVsRestClassifier and OneVsOneClassifier operate by incorporating the estimator (in this case, LogisticRegression). After incorporation, they usually work just like any other learning algorithm in Scikit-learn. Interestingly, the one-versus-one strategy obtained the highest accuracy thanks to its high number of models in competition.

Making Things as Simple as Naïve Bayes

You might wonder why anyone would name an algorithm Naïve Bayes. The naïve part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability.

Reverend Thomas Bayes (1701–1761) was an English statistician and a philosopher who formulated his theorem during the first half of the eighteenth century. The theorem was never published while he was alive. It has deeply revolutionized the theory of probability by introducing the idea of conditional probability — that is, probability conditioned by evidence.

Of course, it helps to start from the beginning — probability itself. Probability tells you the likelihood of an event and is expressed in a numeric form. The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and it’s empirically derived from counting the number of times the specific event happened with respect to all the events. You can calculate it from data!

When you observe events (for example, when a feature has a certain characteristic), and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.

When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is a priori because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first). For example, if you estimate the probability of an unknown person’s being a female, you might say, after some counting, that it’s 50 percent, which is the prior, or the first, probability that you will stick with.

The prior probability can change in the face of evidence, that is, something that can radically modify your expectations. For example, the evidence of whether a person is male or female could be that the person’s hair is long or short. You can estimate having long hair as an event with 35 percent probability for the general population, but within the female population, it’s 60 percent. If the percentage is higher in the female population, contrary to the general probability (the prior for having long hair), that should be some useful information that you can use.

Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is really similar to predicting a categorical variable from data: You have a target variable with different categories, and you have to guess the probability of each category on the basis of evidence, the data. Reverend Bayes provided a useful formula:

P(A|B) = P(B|A)*P(A) / P(B)

The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:

P(A|B) is the probability of being a female (event A) given long hair (evidence B). This part of the formula defines what you want to predict. In short, it says to predict y given x where y is an outcome (male or female) and x is the evidence (long or short hair).
P(B|A) is the probability of having long hair when the person is a female. In this case, you already know that it’s 60 percent. In every data problem, you can obtain this figure easily by simple cross-tabulation of the features against the target outcome.
P(A) is the probability of being a female, a 50 percent general chance (a prior).
P(B) is the probability of having long hair, which is 35 percent (another prior).

When reading parts of the formula such as P(A|B), you should read them as follows: probability of A given B. The | symbol translates as given. A probability expressed in this way is a conditional probability, because it’s the probability of A conditioned by the evidence presented by B. In this example, plugging the numbers into the formula translates into: 60% * 50% / 35% = 85.7%.

Therefore, getting back to the previous example, even if being a female is a 50 percent probability, just knowing evidence like long hair takes it up to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that the person with long hair is a female because you have a bit less than a 15 percent chance of being wrong.
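The arithmetic is simple enough to check in a few lines. The following sketch plugs the figures from the example into Bayes' formula; the function name bayes and its argument names are hypothetical.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_long_hair_given_female = 0.60  # P(B|A)
p_female = 0.50                  # P(A), the prior
p_long_hair = 0.35               # P(B)

p_female_given_long_hair = bayes(
    p_long_hair_given_female, p_female, p_long_hair)

print('P(female | long hair) = %.1f%%' % (p_female_given_long_hair * 100))
print('Chance of being wrong = %.1f%%'
      % ((1 - p_female_given_long_hair) * 100))
```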

Finding out that Naïve Bayes isn’t so naïve

Naïve Bayes, leveraging the simple Bayes’ rule, takes advantage of all the evidence available in order to modify the prior base probability of your predictions. Because your data contains so much evidence — that is, it has many features — the data makes a big sum of all the probabilities derived from a simplified Naïve Bayes formula.

As discussed in the “Guessing the Number: Linear Regression” section, earlier in this chapter, summing variables implies that the model takes them as separate and unique pieces of information. But this isn’t true in reality, because applications exist in a world of interconnections, with every piece of information connecting to many other pieces. Using one piece of information more than once means giving more emphasis to that particular piece.

Because you don’t know (or simply ignore) the relationships between each piece of evidence, you probably just plug all of them in to Naïve Bayes. The simple and naïve move of throwing everything that you know at the formula works well indeed, and many studies report good performance despite the fact that you make a naïve assumption. It’s okay to use everything for prediction, even though it seems as though it shouldn’t be okay given the strong association between variables. Here are some of the ways in which you commonly see Naïve Bayes used:

Building spam detectors (catching all annoying e-mails in your inbox)
Sentiment analysis (guessing whether a text contains positive or negative attitudes with respect to a topic, and detecting the mood of the speaker)
Text-processing tasks such as spell correction, or guessing the language used to write or classify the text into a larger category

Naïve Bayes is also popular because it doesn’t need as much data to work. It can naturally handle multiple classes. With some slight variable modifications (transforming them into classes), it can also handle numeric variables. Scikit-learn provides three Naïve Bayes classes in the sklearn.naive_bayes module:

MultinomialNB: Uses the probabilities derived from a feature’s presence and count. When a feature is present, it contributes a certain probability to the outcome, which makes this class a natural fit for textual data such as word counts.
BernoulliNB: Provides the multinomial functionality of Naïve Bayes, but it penalizes the absence of a feature. It assigns a different probability when the feature is present than when it’s absent. In fact, it treats all features as dichotomous variables (the distribution of a dichotomous variable is a Bernoulli distribution). You can also use it with textual data.
GaussianNB: Defines a version of Naïve Bayes that expects a normal distribution of all the features. Hence, this class is suboptimal for textual data in which words are sparse (use the multinomial or Bernoulli distributions instead). If your variables have positive and negative values, this is the best choice.

Predicting text classifications

Naïve Bayes is particularly popular for document classification. In textual problems, you often have millions of features involved, one for each word spelled correctly or incorrectly. Sometimes the text is associated with other nearby words in n-grams, that is, sequences of consecutive words. Naïve Bayes can learn the textual features quickly and provide fast predictions based on the input.

This section tests text classifications using the Bernoulli (binomial) and multinomial Naïve Bayes models offered by Scikit-learn. The examples rely on the 20newsgroups dataset, which contains a large number of posts from 20 kinds of newsgroups. The dataset is divided into a training set, for building your textual models, and a test set, which consists of posts that chronologically follow the training set. You use the test set to test the accuracy of your predictions:

from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(

subset='train', remove=('headers', 'footers',

'quotes'))

newsgroups_test = fetch_20newsgroups(

subset='test', remove=('headers', 'footers',

'quotes'))

After loading the two sets into memory, you import the two Naïve Bayes models and instantiate them. At this point, you set alpha values, which are useful for avoiding a zero probability for rare features (a zero probability for a single feature would otherwise cancel out the evidence from all the other features). You typically use a small value for alpha, as shown in the following code:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

Bernoulli = BernoulliNB(alpha=0.01)

Multinomial = MultinomialNB(alpha=0.01)
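To get a feel for what alpha does, consider this minimal sketch of additive smoothing on word counts. It assumes the usual smoothed estimate, (count + alpha) / (total + alpha * vocabulary size); the word counts are made up, and smoothed_probability is a hypothetical helper rather than a Scikit-learn function.

```python
def smoothed_probability(count, class_total, vocabulary_size, alpha):
    """Estimate P(word | class) with additive smoothing."""
    return (count + alpha) / (class_total + alpha * vocabulary_size)


# Made-up counts of how often each word appears in posts of one class.
word_counts = {'free': 3, 'offer': 2, 'meeting': 0, 'report': 0}
class_total = sum(word_counts.values())
vocabulary_size = len(word_counts)

unsmoothed = {word: smoothed_probability(count, class_total,
                                         vocabulary_size, alpha=0.0)
              for word, count in word_counts.items()}
smoothed = {word: smoothed_probability(count, class_total,
                                       vocabulary_size, alpha=0.01)
            for word, count in word_counts.items()}

for word in word_counts:
    print('%-8s alpha=0: %.4f   alpha=0.01: %.4f'
          % (word, unsmoothed[word], smoothed[word]))
```

With a small alpha, words never seen with the class get a tiny probability instead of zero, while the probabilities of the frequent words barely change and everything still sums to 1.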

In Chapter 12, you use the hashing trick to model textual data without fear of encountering new words when using the model after the training phase. You can use two different hashing tricks, one counting the words (for the multinomial approach) and one recording whether a word appeared in a binary variable (the binomial approach). You can also remove stop words, that is, common words found in the English language, such as a, the, in, and so on.

import sklearn.feature_extraction.text as txt

multinomial = txt.HashingVectorizer(stop_words='english',

binary=False, norm=None)

binary = txt.HashingVectorizer(stop_words='english',

binary=True, norm=None)
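The following sketch illustrates the idea behind the two hashing approaches on a toy scale. It is not how HashingVectorizer is implemented: the use of hashlib.md5, the 16 buckets, and the bucket and hash_vector helpers are all assumptions made for the sake of a short, repeatable example.

```python
import hashlib


def bucket(word, n_buckets):
    """Map a word to a fixed position using a repeatable hash."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets


def hash_vector(text, n_buckets=16, binary=False):
    """Turn a text into a fixed-length vector of word counts or flags."""
    vector = [0] * n_buckets
    for word in text.lower().split():
        index = bucket(word, n_buckets)
        # Binary mode records presence only; otherwise count occurrences.
        vector[index] = 1 if binary else vector[index] + 1
    return vector


post = 'spam spam offer'
counts = hash_vector(post)               # for the multinomial approach
flags = hash_vector(post, binary=True)   # for the Bernoulli approach
unseen = hash_vector('zyzzyva')          # a word never seen in training

print(counts)
print(flags)
print(unseen)
```

Because the position of a word depends only on its hash, a word that never appeared during training still lands somewhere in the vector, so the model keeps working on new text. The price is that two different words can occasionally share a bucket.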

At this point, you can train the two classifiers and test them on the test set, which is a set of posts that chronologically appear after the training set. The test measure is accuracy, which is the percentage of right guesses that the algorithm makes.

import numpy as np

target = newsgroups_train.target

target_test = newsgroups_test.target

multi_X = np.abs(

multinomial.transform(newsgroups_train.data))

multi_Xt = np.abs(

multinomial.transform(newsgroups_test.data))

bin_X = binary.transform(newsgroups_train.data)

bin_Xt = binary.transform(newsgroups_test.data)

Multinomial.fit(multi_X, target)

Bernoulli.fit(bin_X, target)

from sklearn.metrics import accuracy_score

for name, model, data in [('BernoulliNB', Bernoulli, bin_Xt),
                          ('MultinomialNB', Multinomial, multi_Xt)]:
    accuracy = accuracy_score(y_true=target_test,
                              y_pred=model.predict(data))
    print('Accuracy for %s: %.3f' % (name, accuracy))

The reported accuracies for the two Naïve Bayes models are

Accuracy for BernoulliNB: 0.570

Accuracy for MultinomialNB: 0.651

You might notice that it doesn’t take long for both models to train and report their predictions on the test set. Consider that the training set is made up of more than 11,000 posts containing about 300,000 distinct words, and the test set contains about 7,500 other posts.

print('number of posts in training: %i'

% len(newsgroups_train.data))

D={word:True for post in newsgroups_train.data

for word in post.split(' ')}

print('number of distinct words in training: %i'

% len(D))

print('number of posts in test: %i'

% len(newsgroups_test.data))

Running the code returns all these useful text statistics:

number of posts in training: 11314

number of distinct words in training: 300972

number of posts in test: 7532

Learning Lazily with Nearest Neighbors

K-Nearest Neighbors (KNN) is not about building rules from data based on coefficients or probability. KNN works on the basis of similarities. When you have to predict something like a class, it may be best to find the observations most similar to the one you want to classify or estimate. You can then derive the answer you need from the similar cases.

Observing how many observations are similar doesn’t imply learning something, but rather measuring. Because KNN isn’t learning anything, it’s considered lazy, and you’ll hear it referenced as a lazy learner or an instance-based learner. The idea is that similar premises usually provide similar results, and it’s important not to forget to get such low-hanging fruit before trying to climb the tree!

The algorithm is fast during training because it only has to memorize data about the observations. It actually calculates more during predictions. When there are too many observations, the algorithm can become slow and memory consuming. You’re best advised not to use it with big data or it may take almost forever to predict anything! Moreover, this simple and effective algorithm works better when you have distinct data groups without too many variables involved because the algorithm is also sensitive to the dimensionality curse.
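A bare-bones version of the idea fits in a few lines. In this sketch, "training" amounts to keeping the examples around, and all the work happens at prediction time, when the function measures the distance from the query to every stored point and lets the nearest ones vote. The points, labels, and the knn_predict helper are made up for illustration; the Scikit-learn class is far more efficient.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k neighbors."""
    # Measure the distance from the query to every memorized observation.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest observations and let them vote.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up observations forming two distinct groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (7, 8)))
```

Every prediction scans all the stored observations, which is why the approach slows down as the data grows.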

The curse of dimensionality happens as the number of variables increases. Consider a situation in which you’re measuring the distance between observations: as the space becomes larger and larger, it becomes difficult to find real neighbors, which is a problem for KNN because it sometimes mistakes a far observation for a near one. The idea is just like playing chess on a multidimensional chessboard. When playing on the classic 2-D board, most pieces are near one another, and you can more easily spot opportunities and threats for your pawns when you have 32 pieces and 64 positions. However, when you start playing on a 3-D board, such as those found in some sci-fi films, your 32 pieces can become lost in 512 possible positions. Now just imagine playing with a 12-D chessboard. You can easily misunderstand what is near and what is far, which is what happens with KNN.
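The chessboard numbers are easy to reproduce. This short sketch counts the positions on a board with eight squares per side as the number of dimensions grows and shows what fraction of them 32 pieces can occupy; the variable names are hypothetical.

```python
PIECES = 32
SIDE = 8  # squares along each dimension of the board

occupancy = {}
for dimensions in (2, 3, 12):
    positions = SIDE ** dimensions
    occupancy[dimensions] = PIECES / positions
    print('%2i-D board: %i positions, %.10f of them occupied'
          % (dimensions, positions, occupancy[dimensions]))
```

The same number of pieces fills half of the 2-D board but a vanishing share of the 12-D one, which is why observations end up far from one another when you add variables without adding data.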

You can still make KNN smart in detecting similarities between observations by removing redundant information and simplifying the data dimensionality using the data reduction techniques, as explained in Chapter 14.

Predicting after observing neighbors

For an example showing how to use KNN, you can start with the digit dataset again. KNN is particularly useful, just like Naïve Bayes, when you have to predict many classes, or in situations that would require you to build too many models or rely on a complex model.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
# Keep the first 1,700 images for training and the rest for testing
train = range(0, 1700)
test = range(1700, len(digits.data))
# Fit PCA on the training images only, then project both sets
pca = PCA(n_components=25)
pca.fit(digits.data[train])
X = pca.transform(digits.data[train])
y = digits.target[train]
tX = pca.transform(digits.data[test])
ty = digits.target[test]

KNN is an algorithm that’s quite sensitive to outliers. Moreover, as a rule, you have to rescale your variables and remove redundant information before using it. In this example, PCA removes the redundant information. Rescaling isn’t necessary here because the data represents pixels, which means that all the variables already share the same scale.

You can avoid the problem with outliers by keeping the neighborhood small, that is, by not looking too far for similar examples.
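When the variables don’t share a common scale, rescaling matters a great deal. The following sketch illustrates the point on a different dataset, the wine dataset bundled with Scikit-learn, which this section doesn’t otherwise use; the choice of dataset, of five neighbors, and of five-fold cross-validation are all assumptions made for the sake of the example. It compares KNN on the raw variables with KNN preceded by a StandardScaler.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()  # variables measured on very different scales

# KNN on the raw variables: large-valued variables dominate the distance
raw_knn = KNeighborsClassifier(n_neighbors=5)
# KNN after standardizing each variable to zero mean and unit variance
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5))

raw_score = cross_val_score(raw_knn, wine.data, wine.target, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, wine.data, wine.target, cv=5).mean()

print('Accuracy without rescaling: %.3f' % raw_score)
print('Accuracy with rescaling: %.3f' % scaled_score)
```

Placing the scaler inside a pipeline means that each cross-validation fold computes the scaling statistics from its training part only.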

Knowing the data type can save you a lot of time and many mistakes. For example, in this case, you know that the data represents pixel values. Doing EDA (as described in Chapter 13) is always the first step and can provide you with useful insights, but getting additional information about how the data was obtained and what the data represents is also a good practice and can be just as useful. To see KNN in action, you hold out the cases in tX, which KNN can’t look up when searching for neighbors, and use them to test the predictions.

from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors=5, p=2)
kNN.fit(X, y)

KNN uses a distance measure to determine which observations to consider as possible neighbors for the target case. You can easily change the predefined distance using the p parameter:

When p is 2, use the Euclidean distance (discussed as part of the clustering topic in Chapter 15).
When p is 1, use the Manhattan distance metric, which is the sum of the absolute differences between observations along each variable. In a 2-D square, when you go from one corner to the opposite one, the Manhattan distance is the same as walking along two sides of the perimeter, whereas Euclidean is like walking on the diagonal. Although the Manhattan distance isn’t the shortest route, it’s often a more realistic measure than Euclidean distance, and it tends to be less sensitive to noise and high dimensionality.
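The two distances are easy to compute by hand. This short sketch uses the opposite corners of a unit square as an example; the minkowski function is a hypothetical helper written for this illustration, where p plays the same role as the p parameter described above.

```python
import numpy as np

def minkowski(a, b, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

corner = np.array([0.0, 0.0])
opposite = np.array([1.0, 1.0])

manhattan = minkowski(corner, opposite, p=1)  # walk along two sides
euclidean = minkowski(corner, opposite, p=2)  # walk along the diagonal

print('Manhattan: %.3f' % manhattan)
print('Euclidean: %.3f' % euclidean)
```

The Manhattan distance comes out as 2 (two sides of length 1), whereas the Euclidean distance is the length of the diagonal, the square root of 2.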

Usually, the Euclidean distance is the right measure, but sometimes it can give you worse results, especially when the analysis involves many correlated variables. The following code shows that the analysis seems fine with it.

print('Accuracy: %.3f' % kNN.score(tX,ty) )
print('Prediction: %s Actual: %s'
      % (kNN.predict(tX[-15:,:]),ty[-15:]))

The code returns the accuracy and a sample of the predictions you can compare with the actual values in order to spot differences:

Accuracy: 0.990

Prediction: [2 2 5 7 9 5 4 8 1 4 9 0 8 9 8]

Actual: [2 2 5 7 9 5 4 8 8 4 9 0 8 9 8]

Choosing your k parameter wisely

A critical parameter that you have to define in KNN is k. As k increases, KNN considers more points for its predictions, and the decisions are less influenced by noisy instances that could exercise an undue influence. Your decisions are based on an average of more observations, and they become more solid. When the k value you use is too large, you start considering neighbors that are too far, sharing less and less with the case you have to predict.

It’s an important trade-off. When the value of k is small, you consider a more homogeneous pool of neighbors but can more easily make an error by taking the few similar cases for granted. When the value of k is large, you consider more cases, at a higher risk of observing neighbors that are too far or that are outliers. Getting back to the previous example with handwritten digit data, you can experiment with changing the k value, as shown in the following code:

for k in [1, 5, 10, 50, 100, 200]:
    kNN = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print('for k = %3i accuracy is %.3f'
          % (k, kNN.score(tX, ty)))

After running this code, you get an overview of what happens when k changes and determine the value of k that best fits the data:

for k = 1 accuracy is 0.979

for k = 5 accuracy is 0.990

for k = 10 accuracy is 0.969

for k = 50 accuracy is 0.959

for k = 100 accuracy is 0.959

for k = 200 accuracy is 0.907

Through experimentation, you find that setting n_neighbors (the parameter representing k) to 5 is the best choice among the values tested, resulting in the highest accuracy. Using just the nearest neighbor (n_neighbors=1) isn’t a bad choice, but setting the value above 5 lowers the accuracy of the classification task.

As a rule of thumb, when your dataset doesn’t have many observations, set k as a number near the square root of the number of available observations. However, there is no general rule, and trying different k values is always a good way to optimize your KNN performance. Always start from low values and work toward higher values.
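Rather than judging each k on a single test split, you can let cross-validation do the comparison. The following sketch is one possible approach, not the method used earlier in this section: it wraps PCA and KNN in a pipeline and uses GridSearchCV to try a few k values. The list of candidate values and the choice of five folds are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

digits = load_digits()

# PCA sits inside the pipeline so that each fold refits it on its own
# training part, just as the earlier example fits it on the train cases
pipeline = Pipeline([
    ('pca', PCA(n_components=25)),
    ('knn', KNeighborsClassifier()),
])

# Candidate k values, going from low to high
search = GridSearchCV(pipeline,
                      {'knn__n_neighbors': [1, 3, 5, 10, 50]},
                      cv=5)
search.fit(digits.data, digits.target)

print('best k: %i' % search.best_params_['knn__n_neighbors'])
print('cross-validated accuracy: %.3f' % search.best_score_)
```

The k that cross-validation selects can differ from the one that scores best on a single split, because every observation takes a turn in the test fold.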
