© Valentina Porcu 2018
Valentina Porcu, Python for Data Mining Quick Syntax Reference, https://doi.org/10.1007/978-1-4842-4113-4_11

11. Scikit-learn

Valentina Porcu, Nuoro, Italy

Scikit-learn is one of the most important and most used packages for machine learning with Python. It features many functions for various predictive algorithms. In this chapter, we examine some of the algorithms included in the Scikit-learn package. Given the breadth of the subject, the examples presented reflect the most used models. Those of you who have no prior knowledge of machine learning may find it difficult to understand some of the techniques presented in this chapter, because they are not explained in detail.

What Is Machine Learning?

Machine learning is a branch of data analysis that transforms datasets built in a particular way into predictions that can be applied to new data. Machine learning uses data we already have to predict future behaviors. Machine-learning techniques have been a real revolution in data mining and they have a great impact on a variety of fields of application.

Machine learning is widespread among many applications used every day. Large companies such as Amazon, Netflix, Google, Apple, and Facebook use machine-learning algorithms for various reasons. For instance, Facebook uses machine learning to recognize faces in images; Amazon and Netflix analyze customer preferences (the last thing you viewed or bought) to propose new products that might match your interests.

Google, for example, uses machine learning in translation and automated driving, and also to suggest the route with less traffic based on our habits or on the place we usually go on a given day of the week. Machine learning also helps us separate spam messages from legitimate ones, often using probabilistic methods or by combining multiple methods (for instance, adding probabilistic methods to keywords and user-defined rules).

Apple and Microsoft use machine learning to provide us with a voice assistant that helps us work with a phone or tablet using only our voice. Other companies are currently refining artificial intelligence methods, automated driving, and more.

The field of machine learning gained wider attention when a supercomputer (Watson), developed by IBM, took part in the Jeopardy! quiz show.

Machine learning has also been used to predict election results, notably by Nate Silver, a statistician who, in October 2012, published forecasts of the US presidential election whose results were very close to the actual outcome.

Predictive data mining is used in the healthcare field. Patient data and clinical records can help identify people who are at greater risk of contracting certain conditions and illnesses, such as diabetes or heart disease. DNA analysis and genetic kits have been used, for example, to detect genes responsible for or otherwise related to certain types of cancer, including breast cancer.

One of the oldest uses of machine learning is handwriting recognition—in particular, of handwritten addresses and zip codes. Recognition is based on many examples of each handwritten digit and was implemented with neural networks (work conducted at Bell Labs).

Research in machine learning and related topics, such as deep learning and artificial intelligence, improves every day and is at the forefront of the computing world. Web sites such as Kaggle regularly publish contests in which subscribers try to solve a given problem. One of the most famous contests of this kind was announced by Netflix, which in 2006 offered a $1,000,000 USD prize for a better recommendation system. The winning system was never implemented by Netflix because it was too complex and computationally expensive.

Let’s look at the various modules and techniques in the Scikit-learn package.

Import Datasets Included in Scikit-learn

First, the scikit-learn package includes some datasets, which we can import like this:
>>> from sklearn import datasets
To import one of the datasets, type
>>> iris = datasets.load_iris()

The iris dataset is made up of measurements of the petals and sepals of three different types of iris: versicolor, virginica, and setosa. It contains 150 cases divided equally among the three types of flowers, with four measurement variables plus the species label.

In the previous chapters we saw how to import a dataset in .csv format from a file on our local computer or from a web site using a link. Scikit-learn also includes some of the most used datasets for data mining, such as iris and Boston. The format used in Scikit-learn can be a bit confusing for a beginner. The whole dataset is included in a single object, but not in a tabular format like the .csv files we examined in the previous chapters. Datasets in Scikit-learn contain the data in array format, the target or label (the variable we want to predict), and some other information, such as a description of the variables (in the DESCR object) and the column names in two distinct objects: feature_names for the variables and target_names for the target or label.
>>> type(iris)
sklearn.datasets.base.Bunch
>>> iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
We can display the actual dataset:
>>> iris.data
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
[...]
In this way, we display only the numeric data. To see the actual classification, we must proceed as follows:
>>> iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# the target is numeric because scikit-learn does not accept categorical data by default, so it must be encoded in numeric form
# we display the names
>>> iris.target_names
array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10')
We can display the number of cases and variables as follows:
>>> iris.data.shape
(150, 4)
We can acquire a description of data by using .DESCR:
>>> iris.DESCR

(The figure shows the output of iris.DESCR: a text description of the dataset, including the number of instances, the attribute information, and summary statistics.)
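If you prefer the tabular format used in the previous chapters, the pieces of the Bunch object can be assembled into a pandas DataFrame. This is a minimal sketch, assuming pandas is installed and the iris object has been loaded as shown earlier:
>>> import pandas as pd
# build a DataFrame from the data array, using the stored column names
>>> iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
# add the target as an extra column, mapped back to the species names
>>> iris_df['species'] = [iris.target_names[i] for i in iris.target]
>>> iris_df.head()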

Creation of Training and Testing Datasets

In machine learning, we typically start with a labeled dataset, often in .csv format, and divide it into two parts: a training dataset (about 70%–80% of the cases), which is used to train the algorithm, and a testing dataset (the remaining 20%–30%), which is used to test the efficacy of the algorithm. Scikit-learn provides a function that creates this split for us. It allows us to compare actual data with the values predicted by the algorithm and see how well the model works. The training and testing datasets are each further divided in two: one part with the variables we use to build the model, and one with the variable we want to learn to predict, so we end up with four pieces.
>>> from sklearn.model_selection import train_test_split
# x contains the predictor variables and y the label we want to predict
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
# we only need to specify the percentage of the test dataset—in this case, 30% (0.3)
If we apply this to our iris dataset, for example, we create four objects: a training object that contains the four variables of the iris dataset and 70% of the cases, a test object that contains the remaining 30% of the cases, and the label or target variable, likewise divided into a training part and a testing part.
>>> x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.3)
Let’s check the size of the various objects created:
>>> x_train.shape
(105, 4)
>>> x_test.shape
(45, 4)
>>> y_train.shape
(105,)
>>> y_test.shape
(45,)
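Because the split is random, the exact cases that end up in each part change from run to run. Two optional arguments of train_test_split can help here; this is just a sketch of standard parameters, not something used in the rest of the chapter: random_state makes the split reproducible, and stratify keeps the proportions of the label classes equal in the training and testing parts.
>>> x_train, x_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size = 0.3,
...     random_state = 42,        # fixed seed: the same split on every run
...     stratify = iris.target)   # preserve the 50/50/50 class balance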

Preprocessing

Scikit-learn permits preprocessing of data (although we do not need to do so with the iris dataset).
>>> from sklearn import preprocessing
# for example, rescaling the iris data array (this assumes pandas imported as pd)
# iris_scaled = pd.DataFrame(preprocessing.scale(iris.data))
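The commented line above rescales the whole dataset at once. In practice, the scaling parameters are usually learned on the training data only and then applied to the test data; here is a minimal sketch using StandardScaler, assuming the x_train and x_test objects created earlier:
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
# learn the mean and standard deviation from the training data only
>>> x_train_scaled = scaler.fit_transform(x_train)
# apply the same transformation to the test data
>>> x_test_scaled = scaler.transform(x_test)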

Regression

Regression analysis is used to explain the relationship between a variable, y, called the response variable or dependent variable, and one or more independent variables.

To calculate the regression, let’s import the correct model from Scikit-learn:
>>> from sklearn.linear_model import LinearRegression
# we simplify the work a bit by creating a copy of the regression model
>>> lr = LinearRegression()
# we create the model using the training objects
>>> lr.fit(x_train, y_train)
# we view the coefficients
>>> print(lr.intercept_)
>>> lr.coef_
# we predict the membership for the test objects
>>> pred = lr.predict(x_test)
>>> print(pred)
Now let’s look at the code that allows us to apply metrics to measure model efficacy:
>>> import numpy as np
>>> from sklearn import metrics
>>> print('MAE', metrics.mean_absolute_error(y_test, pred))
>>> print('MSE', metrics.mean_squared_error(y_test, pred))
>>> print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, pred)))
>>> metrics.explained_variance_score(y_test, pred)
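Another common summary of the fit is the coefficient of determination, R², available in the same metrics module. (Keep in mind that predicting the numeric iris class with a linear regression is only an illustration; the same code applies unchanged to a genuine numeric target.)
# coefficient of determination: 1.0 is a perfect fit, 0.0 is no better than the mean
>>> print('R2', metrics.r2_score(y_test, pred))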

K-Nearest Neighbors

The k-nearest neighbors algorithm is a supervised algorithm used for data prediction and data mining. It is also used for pattern recognition (such as facial recognition), for identifying patterns in genetic code, for identifying illnesses, and for film and music recommendation systems. The logic behind the k-nearest neighbors algorithm can be summed up in the Latin phrase “Similes cum similibus facillime congregantur”—meaning, “similar ones gather together most easily.” In short, we use the algorithm to analyze the cases in a dataset to find similar elements. New cases are then assigned to existing groups depending on how close they are to the members of one group and how far they are from the others. The algorithm calculates the distance of the unclassified item from the other items and assigns it the class of the closest element (or elements, k). k is nothing more than the number of nearby observations used to determine the class of an item whose class is unknown. For example, if we set k equal to two, we assign the class based on the item's two closest elements; if we set it to three, we use the three closest elements, and so on.
# from this point onward, we do not run the models on a specific dataset; we limit ourselves to giving an idea of the code for the various classification models
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors = 3)
# x_train = the dataset variables, excluding the label
# y_train = the training labels
# x_test = the new data to classify
# y_test = the labels of the new data
>>> knn.fit(x_train, y_train)
>>> y_pred = knn.predict(x_test)
>>> print(y_pred)
When dealing with classification, we use methods other than those of regression to test the adequacy of a model:
>>> from sklearn.metrics import classification_report, confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
>>> print(classification_report(y_test, y_pred))
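For readers who want something they can actually run, here is a minimal sketch that simply fills the placeholders above with the iris training and testing objects created earlier in this chapter:
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.metrics import classification_report, confusion_matrix
>>> knn = KNeighborsClassifier(n_neighbors = 3)
# train on the 105 training cases of the iris split
>>> knn.fit(x_train, y_train)
# predict the species of the 45 test cases
>>> y_pred = knn.predict(x_test)
>>> print(confusion_matrix(y_test, y_pred))
>>> print(classification_report(y_test, y_pred))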

Cross-validation

Cross-validation consists of dividing a dataset into a number of equal parts, generally indicated by k (often five or ten parts), and then testing the adequacy of the prediction model on these k groups.
>>> from sklearn.model_selection import cross_val_score
# model stands for any scikit-learn estimator, such as the knn classifier above
>>> cv5 = cross_val_score(model, x_train, y_train, cv = 5)
>>> cv10 = cross_val_score(model, x_train, y_train, cv = 10)
>>> print(np.mean(cv5))
>>> print(np.mean(cv10))
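A minimal concrete sketch, assuming the knn classifier and the iris training objects from the previous sections (with numpy imported as np):
>>> import numpy as np
>>> from sklearn.model_selection import cross_val_score
# five-fold cross-validation of the k-nearest neighbors classifier
>>> cv5 = cross_val_score(knn, x_train, y_train, cv = 5)
>>> print(cv5)           # one accuracy score per fold
>>> print(np.mean(cv5))  # average accuracy across the five folds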

Support Vector Machine

A support vector machine (SVM) determines the boundary between items belonging to two different classes by projecting them into a multidimensional space and finding the hyperplane that maximizes the margin between the two sets of data.
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(x_train, y_train)
>>> clf.score(x_test, y_test)
>>> from sklearn.model_selection import cross_val_predict
>>> pred = cross_val_predict(clf, iris.data, iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, pred)
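The kernel parameter controls how the data are projected; 'linear' was used above, but other kernels are available. A small sketch comparing it with the radial basis function kernel (the parameter values are illustrative, not tuned):
# same fit and score, but with a nonlinear kernel
>>> clf_rbf = SVC(kernel='rbf', C=1).fit(x_train, y_train)
>>> clf_rbf.score(x_test, y_test)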

Decision Trees

The basic idea behind a decision tree is a divide et impera (divide-and-conquer) model in which, at each step, we reduce the variability within the nodes. We start with the entire dataset, which is then divided into smaller groups that are progressively more homogeneous with respect to their internal characteristics.
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn import metrics
>>> dtc = DecisionTreeClassifier()
>>> dtc = dtc.fit(x_train, y_train)
# predictions are made on the test variables, not on the test labels
>>> pred = dtc.predict(x_test)
>>> metrics.confusion_matrix(y_test, pred)
>>> metrics.accuracy_score(y_test, pred)
>>> print(metrics.classification_report(y_test, pred))
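A fitted tree also tells us which variables drove the splits; feature_importances_ is a standard attribute of the fitted classifier:
# importance of each input variable (the values sum to 1.0)
>>> print(dtc.feature_importances_)
# if the model was fit on the iris split, the corresponding names are
>>> print(iris.feature_names)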

KMeans

KMeans is an unsupervised method, which means we do not have a label to guide us during classification. For this reason, clustering is a helpful exploratory analysis method, because it allows us to group the elements of a dataset based on how similar or dissimilar they are.

Clustering includes a set of methods that allow segmentation of a heterogeneous population into homogeneous subgroups. Of all the clustering methods available, KMeans is one of the most important. The basic idea is that we divide the items of a set into homogeneous groups without labeling them initially. Label-free data must be grouped in such a way that items are not only homogeneous within their own cluster, but also heterogeneous with respect to the elements of the other clusters. After splitting the items into clusters, we can assign new items to one of the clusters found in the first dataset.
>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=4)
# df stands for a numeric DataFrame or array to be clustered
>>> kmeans.fit(df)
>>> pred = kmeans.predict(df)
>>> pred
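A minimal concrete sketch on the iris measurements: three clusters, because we know there are three species, with the label used only afterward to check how well the clusters match it.
>>> from sklearn.cluster import KMeans
>>> kmeans_iris = KMeans(n_clusters = 3)
>>> kmeans_iris.fit(iris.data)
# coordinates of the three cluster centers
>>> print(kmeans_iris.cluster_centers_)
# compare the cluster assignments with the true species labels
>>> print(kmeans_iris.labels_)
>>> print(iris.target)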

This was just a cursory discussion of machine learning using the Scikit-learn package. Machine learning is a challenging topic and therefore not easy to sum up in a few pages. I thought it would be helpful to expose you to some predictive data mining concepts and the various Scikit-learn modules that can be used for machine learning.

Managing Dates

Managing dates using Python is important, especially when dealing with time series representations. We can handle dates using the datetime package and pandas. First, we must import datetime.
>>> import datetime as dt
# we create a first object that contains time
>>> t1 = dt.time(19, 43, 30)
>>> print(t1)
19:43:30
# to create an object featuring a date, we use date
>>> dt.date.today()
datetime.date(2017, 3, 28)
# we can query the created object about the year, the month, the day
>>> today = dt.date.today()
>>> today.year
2017
>>> today.month
3
>>> today.day
28
>>> t2 = dt.date(2016, 5, 20)
>>> print(t2)
2016-05-20
# we can query an object to find the year, month, and day
>>> t2.year
2016
>>> t2.month
5
>>> t2.day
20
# we can find the exact hour and minute from our computer
>>> dt.datetime.now()
datetime.datetime(2017, 3, 30, 13, 4, 52, 591324)
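Datetime objects can be formatted as strings, and strings parsed back into dates, with strftime() and strptime(); a small sketch using the t2 object created above:
# format a date as a string
>>> t2.strftime("%d/%m/%Y")
'20/05/2016'
# parse a string into a datetime object
>>> dt.datetime.strptime("20/05/2016", "%d/%m/%Y")
datetime.datetime(2016, 5, 20, 0, 0)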

Resources for parsing a date are available at http://strftime.org/ .

Let’s carry on with date management using pandas.
>>> import pandas as pd
# we can manage various date formats through Timestamp
>>> pd.Timestamp("2016-3-7")
>>> pd.Timestamp("2016/4/10")
>>> pd.Timestamp("2015, 12, 10")
>>> pd.Timestamp("2015, 12, 10 12:42:57")
>>> date1 = ["2016/4/10", "2015, 12, 10", "2015, 12, 10 12:42:57"]
>>> print(date1)
['2016/4/10', '2015, 12, 10', '2015, 12, 10 12:42:57']
>>> type(date1)
list
>>> pd.to_datetime(date1)
DatetimeIndex(['2016-04-10 00:00:00', '2015-12-10 00:00:00',
               '2015-12-10 12:42:57'],
              dtype='datetime64[ns]', freq=None)
# we create another object that contains our dates, but also some other element
>>> date2 = ["2016/4/10", "2015, 12, 10", "2015, 12, 10 12:42:57", "October", "2011", "test"]
# if we pass this object to to_datetime, we get an error
>>> pd.to_datetime(date2)
# we can handle the errors by setting the errors parameter to 'coerce'
>>> pd.to_datetime(date2, errors = "coerce")
DatetimeIndex(['2016-04-10 00:00:00', '2015-12-10 00:00:00',
               '2015-12-10 12:42:57',                 'NaT',
               '2011-01-01 00:00:00',                 'NaT'],
              dtype='datetime64[ns]', freq=None)

Dates that are not recognized are identified as NaT.
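If the unparsable entries are not needed, the NaT values can be dropped directly from the resulting index; a small sketch:
# keep only the entries that were parsed successfully
>>> pd.to_datetime(date2, errors = "coerce").dropna()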

Let’s carry on and create a range of dates:
>>> period1 = pd.date_range(start = "2016  01 01", end = "2016 12 31")
>>> print(period1)
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
               '2016-01-09', '2016-01-10',
               ...
               '2016-12-22', '2016-12-23', '2016-12-24', '2016-12-25',
               '2016-12-26', '2016-12-27', '2016-12-28', '2016-12-29',
               '2016-12-30', '2016-12-31'],
              dtype='datetime64[ns]', length=366, freq="D")
The frequency argument (freq='D') means that a daily interval is set, but we can modify it:
# for example, by inserting ten days
>>> pd.date_range(start = "2016  01 01", end = "2016 12 31", freq = "10D")
DatetimeIndex(['2016-01-01', '2016-01-11', '2016-01-21', '2016-01-31',
               '2016-02-10', '2016-02-20', '2016-03-01', '2016-03-11',
               '2016-03-21', '2016-03-31', '2016-04-10', '2016-04-20',
               '2016-04-30', '2016-05-10', '2016-05-20', '2016-05-30',
               '2016-06-09', '2016-06-19', '2016-06-29', '2016-07-09',
               '2016-07-19', '2016-07-29', '2016-08-08', '2016-08-18',
               '2016-08-28', '2016-09-07', '2016-09-17', '2016-09-27',
               '2016-10-07', '2016-10-17', '2016-10-27', '2016-11-06',
               '2016-11-16', '2016-11-26', '2016-12-06', '2016-12-16',
               '2016-12-26'],
              dtype='datetime64[ns]', freq="10D")
# or 12 hours
>>> pd.date_range(start = "2016  01 01", end = "2016 12 31", freq = "12H")
DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 12:00:00',
               '2016-01-02 00:00:00', '2016-01-02 12:00:00',
               '2016-01-03 00:00:00', '2016-01-03 12:00:00',
               '2016-01-04 00:00:00', '2016-01-04 12:00:00',
               '2016-01-05 00:00:00', '2016-01-05 12:00:00',
               ...
# with frequency on Monday
>>> pd.date_range(start = "2016  01 01", end = "2016 12 31", freq = "W-Mon")
DatetimeIndex(['2016-01-04', '2016-01-11', '2016-01-18', '2016-01-25',
               '2016-02-01', '2016-02-08', '2016-02-15', '2016-02-22',
               '2016-02-29', '2016-03-07', '2016-03-14', '2016-03-21',
               '2016-03-28', '2016-04-04', '2016-04-11', '2016-04-18',
...
# with frequency on Wednesday
>>> pd.date_range(start = "2016  01 01", end = "2016 12 31", freq = "W-Wed")
DatetimeIndex(['2016-01-06', '2016-01-13', '2016-01-20', '2016-01-27',
               '2016-02-03', '2016-02-10', '2016-02-17', '2016-02-24',
               '2016-03-02', '2016-03-09', '2016-03-16', '2016-03-23',
               '2016-03-30', '2016-04-06', '2016-04-13', '2016-04-20',
...
There are other methods that allow us to handle dates. Here is another way of creating a range of dates:
>>> range1 = pd.date_range(start = "2016  01 01", end = "2016 03 31", freq = "D")
# and now we use the .weekday_name attribute
>>> range1.weekday_name
array(['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday',
       'Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday',
       'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday',
       'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday',
       'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
...
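Note that weekday_name was removed in later versions of pandas; in recent releases, the equivalent call is the day_name() method:
# equivalent in pandas 1.0 and later
>>> range1.day_name()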
Using Jupyter, we can view the methods for managing dates using the Tab key (Figure 11-1).
../images/469457_1_En_11_Chapter/469457_1_En_11_Fig1_HTML.jpg
Figure 11-1

Managing dates with the Tab key

Data Sources

When starting out with data mining and machine learning, we use many datasets to understand how the algorithms work. Many data mining datasets can be downloaded from the University of California at Irvine (UCI) Machine Learning Repository. The UCI web site ( http://archive.ics.uci.edu/ml/index.php ) (Figure 11-2) includes most of the datasets commonly used in data science, such as iris, Boston, Wine, the SMS Spam Collection, and many more ( http://archive.ics.uci.edu/ml/datasets.html ).
../images/469457_1_En_11_Chapter/469457_1_En_11_Fig2_HTML.jpg
Figure 11-2

Some of the datasets on the UCI web site

More recently, Kaggle has also begun to encourage data scientists to publish datasets ( https://www.kaggle.com/datasets ) to promote exchange among data scientists (Figure 11-3).
../images/469457_1_En_11_Chapter/469457_1_En_11_Fig3_HTML.jpg
Figure 11-3

Some of the datasets on the Kaggle web site

As we have seen, the Scikit-learn package also includes datasets that can be imported. For more information on Scikit-learn and the featured datasets, you can browse the package documentation at http://scikit-learn.org/stable/datasets/ .

A companion package to pandas, called pandas-datareader, features tools to extract data from some online sources ( https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#google-finance )—particularly those dealing with stock exchange repositories, such as Yahoo! Finance and Google Finance.
