Data mining and machine learning

We are going to focus on three kinds of problems: classification, dimensionality reduction, and clustering. Each of them is used in both data mining and machine learning to draw conclusions about the data. We explain each of these settings in its own section.

Classification

Classification is an example of supervised learning. There is a set of training data with an attribute that classifies it into one of several categories, and the goal is to predict the value of that attribute for new data. For example, with our running database, we could use all the data from the year 2013 to figure out which financial complaints were resolved positively for the customer, which ones were resolved without relief, and which ones remained in progress. This offers us good insight into, for instance, which companies respond to consumer complaints faster and more favorably, whether there are states where complaints are less likely to be resolved, and so on.

Let's start by finding the kind of company responses observed in the database:

In [1]: import numpy as np, pandas as pd, matplotlib.pyplot as plt
In [2]: data = pd.read_csv("Consumer_Complaints.csv",
   ...:                    low_memory=False, parse_dates=[8,9])
In [3]: print(data['Company response'].unique())
['Closed with non-monetary relief' 'In progress' 'Closed with explanation'
 'Closed with monetary relief' 'Closed' 'Untimely response'
 'Closed without relief' 'Closed with relief']

These are eight different categories and our target for deciding the fate of future complaints. Let's create a set of training data by gathering all complaints formulated during the year 2013, and keeping only the columns that we believe are relevant to the decision-making process:

  • The product and sub-product that originated the complaint.
  • The issue (but not the sub-issue) that consumers had with the product.
  • The state (but not the ZIP code) where the complaint was filed.
  • The method of submission of the complaint.
  • The company that offered the service.

The size of this training data will dictate which algorithm is to be used for classification.

Tip

Before the data can be processed, we need to encode nonnumerical labels so they can be properly treated by the different classifying algorithms. We do that with the class LabelEncoder from the module sklearn.preprocessing.
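As a quick illustration of how this encoder works (on a few made-up product labels, not the actual database), LabelEncoder learns the sorted set of distinct labels with fit, maps labels to integer codes with transform, and recovers the original labels with inverse_transform:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Mortgage', 'Credit card', 'Mortgage', 'Student loan'])
print(le.classes_)                       # sorted array of distinct labels
print(le.transform(['Student loan']))    # label -> integer code
print(le.inverse_transform([0]))         # integer code -> label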

We will then try to classify all complaints formulated during the year 2014:

In [4]: in_2013 = data['Date received'].map(lambda t: t.year==2013);
   ...: in_2014 = data['Date received'].map(lambda t: t.year==2014);
   ...: df = data[in_2013 | in_2014]; 
   ...: df['Year'] = df['Date received'].map(lambda t: t.year); 
   ...: irrelevant = ['Date received', 'Date sent to company',
   ...:               'Complaint ID', 'Timely response?',
   ...:               'Consumer disputed?', 'Sub-issue','ZIP code'];
   ...: df.drop(irrelevant, 1, inplace=True); 
   ...: df = df.dropna()
In [5]: from sklearn.preprocessing import LabelEncoder
In [6]: encoder = {}
In [7]: for column in df.columns:
   ...:     if df[column].dtype != 'int':
   ...:          le = LabelEncoder()
   ...:          le.fit(df[column].unique())
   ...:          df[column] = le.transform(df[column])
   ...:          encoder[column] = le
   ...:
In [8]: training = df[df.Year==2013]; 
   ...: target = training['Company response']; 
   ...: training.drop(['Company response', 'Year'], 1, inplace=True)
In [9]: test = df[df.Year==2014]; 
   ...: true_result = test['Company response']; 
   ...: test.drop(['Company response', 'Year'], 1, inplace=True)
In [10]: len(training)
Out[10]: 77100

Support vector classification

The training data here is not too big (anything under 100,000 samples is considered manageable). For this volume of training data, it is suggested that we employ support vector classification with a linear kernel.

Three flavors of this algorithm are coded as classes in the module sklearn.svm (for support vector machines): SVC, NuSVC, and a simplified version of SVC with linear kernel, LinearSVC, which is what we need:

In [11]: from sklearn.svm import LinearSVC
In [12]: clf = LinearSVC(); 
   ....: clf.fit(training, target)
Out[12]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)

We are ready to evaluate the performance of this classifier:

In [13]: clf.predict(test)==true_result
Out[13]:
0     False
2     False
3      True
4      True
6     False
7      True
9      True
11    False
12    False
13    False
14    False
15     True
16     True
20    False
21    False
...
101604     True
101607    False
101610     True
101611     True
101613     True
101614     True
101616     True
101617     True
101618     True
101620     True
101621     True
101622     True
101625     True
101626     True
101627     True
Name: Company response, Length: 65282, dtype: bool
In [18]: float(sum(_)) / float(len(_))
Out[18]: 0.7985509022395147

With this method, we correctly classify almost 80% of the complaints.
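As a sanity check, the same figure can be obtained with the helper accuracy_score from sklearn.metrics. This is just an equivalent sketch, assuming the clf, test, and true_result objects from the session above:

from sklearn.metrics import accuracy_score

# fraction of 2014 complaints whose company response was predicted correctly
print(accuracy_score(true_result, clf.predict(test)))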

Tip

In rare cases where this method does not work, we can always resort to plain SVC or its NuSVC variation with a carefully chosen kernel.
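A minimal sketch of that fallback, assuming the same training, target, and test frames as above; the radial basis function kernel and the gamma value are purely illustrative choices, and training a kernelized SVC on tens of thousands of samples is noticeably slower than LinearSVC:

from sklearn.svm import SVC, NuSVC

# kernelized support vector classification (slower, but more flexible)
clf_rbf = SVC(kernel='rbf', gamma=0.01)     # gamma chosen only for illustration
clf_rbf.fit(training, target)
predictions = clf_rbf.predict(test)

# NuSVC controls the fraction of support vectors through the nu parameter
clf_nu = NuSVC(nu=0.5, kernel='rbf')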

The power of a classifier lies in its applications. For instance, suppose we hold a conventional fixed mortgage from Bank of America in the state of South Carolina, we fear problems with the settlement process and costs, and we submit a complaint via the Web. What does the classifier tell us about our chances of having the matter settled?

In [19]: encoder['Product'].transform(['Mortgage'])[0]
Out[19]: 4
In [20]: encoder['Sub-product'].transform(['Conventional fixed mortgage'])[0]
Out[20]: 5
In [21]: encoder['Issue'].transform(['Settlement process and costs'])[0]
Out[21]: 27
In [22]: encoder['State'].transform(['SC'])[0]
Out[22]: 50
In [23]: encoder['Submitted via'].transform(['Web'])[0]
Out[23]: 5
In [24]: encoder['Company'].transform(['Bank of America'])[0]
Out[24]: 247
In [25]: clf.predict([4,5,27,50,5,247])
Out[25]: array([1])
In [26]: encoder['Company response'].inverse_transform(_)[0]
Out[26]: 'Closed with explanation'

A satisfactory outlook!

Trees

It is possible to create a decision tree illustrating a set of rules to facilitate the classification. In the scikit-learn toolkit, we have a class implemented for this purpose—DecisionTreeClassifier in the submodule sklearn.tree. Let's see it in action:

In [27]: from sklearn.tree import DecisionTreeClassifier
In [28]: clf = DecisionTreeClassifier().fit(training, target)
In [29]: clf.predict(test) == true_result
Out[29]:
0      True
2     False
3      True
4      True
6     False
7      True
9      True
11    False
12    False
13    False
14    False
15     True
16     True
20    False
21    False
...
101604     True
101607     True
101610     True
101611     True
101613     True
101614     True
101616     True
101617     True
101618     True
101620     True
101621     True
101622     True
101625     True
101626    False
101627     True
Name: Company response, Length: 65282, dtype: bool
In [30]: float(sum(_)) / len(_)
Out[30]: 0.7400661744431849

It looks like this simple classifier was successful in predicting the outcome of about 74% of the complaints in 2014.

Tip

It is possible to create a dot file readable with the Graphviz visualization software available at http://www.graphviz.org/:

In [31]: from sklearn.tree import export_graphviz
In [32]: export_graphviz(clf, out_file="tree.dot")

Opening this file in Graphviz gives us an impressive set of rules.

The following is a detail of the tree (too large to fit in these pages!):

[Figure: detail of the decision tree rendered with Graphviz]

We also have implementations of random forests and extremely randomized trees, both of them within the submodule sklearn.ensemble. The corresponding classes are called RandomForestClassifier and ExtraTreesClassifier, respectively.

In either of these cases, the classifier is coded exactly as for LinearSVC and the basic decision trees.
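For instance, a minimal sketch with the same training, target, test, and true_result objects as before; the number of trees is an arbitrary illustrative choice:

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)   # 100 trees, all CPUs
clf.fit(training, target)
success_rate = float(sum(clf.predict(test) == true_result)) / len(true_result)

# extremely randomized trees follow exactly the same pattern
clf = ExtraTreesClassifier(n_estimators=100, n_jobs=-1).fit(training, target)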

Naive Bayes

Similar results are obtained with Naive Bayes methods. In the module sklearn.naive_bayes, we have three implementations of this algorithm:

  • The class GaussianNB for the Gaussian Naive Bayes, where the likelihood of the features is assumed to be Gaussian.
  • The class BernoulliNB for Naive Bayes with data distributed according to multivariate Bernoulli distributions (each feature is assumed to be a binary-valued variable).
  • The class MultinomialNB for multinomially distributed data.
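A minimal sketch of the Gaussian flavor on our encoded data, assuming the training, target, test, and true_result objects from the classification session; keep in mind that the integer codes produced by LabelEncoder are categorical, so the Gaussian assumption is a rough approximation at best:

from sklearn.naive_bayes import GaussianNB

# fit the Gaussian Naive Bayes classifier and measure its success rate on 2014 data
clf = GaussianNB().fit(training, target)
success_rate = float(sum(clf.predict(test) == true_result)) / len(true_result)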

Nearest neighbors

We obtain comparably good results in this case by employing classification by nearest neighbors. This is essentially the same procedure we employed in the setting of computational geometry to answer the corresponding geometric query problem: we have coded our data as points in a Euclidean space of high dimension, and we can therefore translate those methods to this classification purpose.

The advantage in this case is that we don't have to use Euclidean distances for our computations. For instance, since the encoded features are categorical, and their numerical values carry no intrinsic meaning, it makes sense to impose the Hamming metric to measure distances between labels. We have a generalization of the nearest neighbors algorithm implemented as the class KNeighborsClassifier in the module sklearn.neighbors:

In [33]: from sklearn.neighbors import KNeighborsClassifier
In [34]: clf = KNeighborsClassifier(n_neighbors=8,metric='hamming');
   ....: clf.fit(training, target)
Out[34]:
KNeighborsClassifier(algorithm='auto', leaf_size=30,metric='hamming',
           n_neighbors=8, p=2, weights='uniform')
In [35]: clf.predict(test)==true_result
Out[35]:
0      True
2     False
3      True
4      True
6     False
7      True
9      True
11    False
12    False
13    False
14    False
15     True
16     True
20    False
21    False
...
101604     True
101607    False
101610     True
101611     True
101613     True
101614     True
101616     True
101617     True
101618     True
101620     True
101621     True
101622     True
101625     True
101626     True
101627     True
Name: Company response, Length: 65282, dtype: bool
In [36]: float(sum(_))/len(_)
Out[36]: 0.791274777120799

A success rate of more than 79%!
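To see what the Hamming metric is actually measuring, here is a small sketch on two hypothetical encoded complaints: the distance is simply the fraction of features on which the two rows disagree, regardless of how far apart the integer codes are numerically:

from scipy.spatial.distance import hamming

complaint_a = [4, 5, 27, 50, 5, 247]    # hypothetical encoded complaint
complaint_b = [4, 5, 27, 50, 5, 300]    # identical, except for the company
print(hamming(complaint_a, complaint_b))   # 1 feature out of 6 differs: 1/6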

Dimensionality reduction

Data often exhibits internal structure, but high dimension (roughly, a large number of columns) makes it difficult to extract and select this structure. Often, it is possible to perform smart projections of the data onto lower-dimensional manifolds and to analyze these projections in search of features. We refer to this technique as dimensionality reduction.

Note

For the following example, we decided to drop all dates containing any NaN. This significantly reduces the volume of the data, making the subsequent study and results simpler to understand. For a more elaborate and complete study, force all occurrences of NaN to be zeros instead: replace the method dropna() with fillna(0).

Let's observe how to profit from these processes with our running example. We gather all the daily complaints by product, and analyze the data:

In [37]: df = data.groupby(['Date received', 'Product']).size(); 
   ....: df = df.unstack().dropna()
In [38]: df.head()
Out[38]:
Product        Bank account or service  Consumer loan  Credit card  
Date received                                                       
2013-11-06                          66             14           41  
2013-11-07                          44             11           33  
2013-11-08                          49             11           36  
2013-11-09                           9              4           20  
2013-11-11                          15              4           23  
Product        Credit reporting  Debt collection  Money transfers  Mortgage  
Date received                                                                
2013-11-06                   62              129                2       153  
2013-11-07                   43               99                2       128  
2013-11-08                   44               83                8       113  
2013-11-09                   19               33                2        23  
2013-11-11                   32               68                2        46  
Product        Payday loan  Student loan 
Date received                            
2013-11-06               2            14 
2013-11-07               8            10 
2013-11-08              12             7 
2013-11-09               3             4 
2013-11-11               2            14 
In [39]: df.shape
Out[39]: (233, 9)

We may regard this data as 233 points in a space of 9 dimensions.

Principal component analysis

For a small dataset like this, and without any other prior information, one of the best dimensionality reduction procedures is to project the data onto a two-dimensional plane. However, not just any plane: we seek a projection that ensures that the projected data has the largest possible variance. We accomplish this with the information obtained from the eigenvalues and eigenvectors of a matrix that represents our data. This process is called principal component analysis (PCA).
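To make the role of eigenvalues and eigenvectors concrete, here is a bare-bones sketch of the same projection computed by hand with NumPy, assuming the DataFrame df built above; up to signs, the result should agree with the scikit-learn output further below:

import numpy as np

X = df.values.astype(float)
X = X - X.mean(axis=0)                          # center each column
C = np.cov(X, rowvar=0)                         # 9-by-9 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigenvalues in ascending order
components = eigenvectors[:, -2:][:, ::-1]      # two leading principal directions
projected = X.dot(components)                   # 233-by-2 projected data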

Tip

PCA is regarded as one of the most useful techniques in statistics. For an excellent survey of the theory (from the viewpoints of both linear algebra and statistics), coding techniques, and applications, the best resource is the second edition of Principal Component Analysis, written by I.T. Jolliffe and published by Springer in their Springer Series in Statistics in 2002.

We have an implementation in the scikit-learn toolkit, the class PCA, in the submodule sklearn.decomposition:

In [40]: from sklearn.decomposition import PCA
In [41]: model = PCA(n_components=2)
In [42]: model.fit(df)
Out[42]: PCA(copy=True, n_components=2, whiten=False)
In [43]: projected_df = model.transform(df)
In [44]: plt.figure(); 
   ....: plt.scatter(projected_df[:,0], projected_df[:,1]); 
   ....: plt.title('Principal Component Analysis scatterplot of df'); 
   ....: plt.show()

Observe how the data consists of two very well differentiated clusters of points—one of them considerably larger than the other—and some outliers. We will come back to this problem in the next section.

[Figure: Principal Component Analysis scatterplot of df]

Isometric mappings

We don't necessarily need to project onto hyperplanes. One neat trick is to assume that the data itself lies on a nonlinear submanifold and to obtain a representation of this object from the points lying on it. This gives us the flexibility to search for projections where the projected data satisfies relevant properties. For instance, if we require the projections to maintain geodesic distances among points (whenever possible), we achieve a so-called isometric mapping (isomap).

In the SciPy stack, we have an implementation of this method as the class Isomap in the submodule sklearn.manifold:

In [45]: from sklearn.manifold import Isomap
In [46]: model = Isomap().fit(df)
In [47]: isomapped_df = model.transform(df)
In [48]: plt.figure(); 
   ....: plt.scatter(isomapped_df[:,0], isomapped_df[:,1]); 
   ....: plt.title('Isometric Map scatterplot of df'); 
   ....: plt.show()

Although visually very different, this method also offers us two very clear clusters, one of them much larger than the other. The smaller cluster appears as a sequence of points clearly aligned:

[Figure: Isometric Map scatterplot of df]

Spectral embedding

Another possibility is to embed the data nonlinearly by applying spectral analysis to an affinity/similarity matrix. The results are of similar quality to those of the previous two examples:

In [49]: from sklearn.manifold import SpectralEmbedding
In [50]: model = SpectralEmbedding().fit(df)
In [51]: embedded_df = model.embedding_
In [52]: plt.figure(); 
   ....: plt.scatter(embedded_df[:,0], embedded_df[:,1]); 
   ....: plt.title('Spectral Embedding scatterplot of df'); 
   ....: plt.show()

In this case, the clusters are more clearly defined than in the previous examples.

[Figure: Spectral Embedding scatterplot of df]

Locally linear embedding

Another possible projection, similar in spirit to isometric maps, seeks to preserve distances within local neighborhoods: the locally linear embedding. We have an implementation through the class LocallyLinearEmbedding, again within the submodule sklearn.manifold:

In [53]: from sklearn.manifold import LocallyLinearEmbedding
In [54]: model = LocallyLinearEmbedding().fit(df)
In [55]: lle_df = model.transform(df)
In [56]: plt.figure(); 
   ....: plt.scatter(lle_df[:,0], lle_df[:,1]); 
   ....: plt.title('Locally Linear Embedding scatterplot of df'); 
   ....: plt.show()

Note the two tightly packed clusters and the two outliers.

[Figure: Locally Linear Embedding scatterplot of df]

Clustering

Clustering is similar in some sense to the problem of classification, yet more complex. When facing a dataset, we acknowledge the possibility of some hidden structure, in a way that allows us to predict the behavior of future data. We search for this structure by finding common patterns and gathering the data that conforms to each pattern into a cluster. For this reason, we also refer to this problem as data mining.

There are many different methods to perform clustering, depending on the volume of data, and the a priori information we have on the number of clusters. We are going to explore the following settings:

  • MeanShift.
  • Gaussian mixture models.
  • K-means.
  • Spectral clustering.

MeanShift

We employ the technique of mean shift clustering when the data does not exceed 10,000 points and we do not know a priori the number of clusters we need. Let's experiment with the running example from the section on dimensionality reduction, but remain oblivious to the two clusters suggested by all the projections; we will let mean shift clustering make that decision for us.

We have an implementation in the SciPy stack through the class MeanShift in the submodule sklearn.cluster of the scikit-learn toolkit. One of the ingredients of this algorithm is approximation with radial basis functions (discussed in Chapter 1, Numerical Linear Algebra), for which we need to provide an appropriate bandwidth. If not given one, the algorithm will try to estimate it from the data. This process can be very slow and expensive, so it is generally a good idea to perform the estimation ourselves and thus keep control of the resources. We can do so with the helper function estimate_bandwidth in the same submodule.

Tip

The classes and routines implementing clustering algorithms in the scikit-learn toolkit require the data to be fed as a NumPy array rather than a pandas DataFrame.

We can perform this switch easily with the DataFrame attribute .values.

In [57]: from sklearn.cluster import MeanShift, estimate_bandwidth
In [58]: bandwidth = estimate_bandwidth(df.values, n_samples=1000)
In [59]: model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
In [60]: model.fit(df.values)

At this point, the object model has successfully computed a series of labels and attached them to each point in the data, so they are properly clustered. We can allow some unclassifiable points to remain unclassified—we accomplish this by setting the optional Boolean flag cluster_all to False. By default, the algorithm forces every single piece of data into a cluster.
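A minimal variant of the previous fit with that flag switched off, reusing the bandwidth estimated above and the df DataFrame from the dimensionality reduction section; points that fall in no cluster are then reported with the label -1:

from sklearn.cluster import MeanShift

model_loose = MeanShift(bandwidth=bandwidth, bin_seeding=True, cluster_all=False)
model_loose.fit(df.values)
print(np.unique(model_loose.labels_))   # a label of -1 marks orphaned points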

Let's find the number of labels and visualize the result on top of one of the projections from the previous section, for quality purposes:

In [61]: np.unique(model.labels_)      # how many clusters?
Out[61]: array([0,  1,  2])
In [62]: plt.figure(); 
   ....: plt.scatter(isomapped_df[:,0], isomapped_df[:,1],
   ....:             c=model.labels_, s = 50 + 100*model.labels_); 
   ....: plt.title('MeanShift clustering of df\n Isometric Mapping '
   ....:           'scatterplot\n color/size indicates cluster'); 
   ....: plt.tight_layout(); 
   ....: plt.show()

Note how the two clear clusters are correctly computed and one outlier received its own cluster.

[Figure: MeanShift clustering of df, Isometric Mapping scatterplot; color/size indicates cluster]

Let's find out the significance of these clusters. First, the outlier:

In [63]: df[model.labels_ == 2]
Out[63]:
Product        Bank account or service  Consumer loan  Credit card  
Date received                                                       
2014-06-26                         117             19           89  
Product        Credit reporting  Debt collection  Money transfers  Mortgage  
Date received                                                                
2014-06-26                   85              159                5       420  
Product        Payday loan  Student loan 
Date received                            
2014-06-26               7            12 

What is so different about the amount of complaints filed on June 26th, 2014? Let us produce a plot for each cluster of dates, to see whether we can guess what the differences are about.

In [64]: fig = plt.figure(); 
   ....: ax1 = fig.add_subplot(211); 
   ....: ax2 = fig.add_subplot(212); 
   ....: df[model.labels_==0].plot(ax=ax1); 
   ....: df[model.labels_==1].plot(ax=ax2); 
   ....: plt.show()

We should get an output similar to the following:

[Figure: daily complaints by product for cluster 0 (top) and cluster 1 (bottom)]

Visually, it appears as if the clusters have been formed by separating dates with a high volume of complaints from dates with a low volume. But not only that: a closer inspection reveals that on dates from cluster 0, mortgages are unequivocally the number one reason for complaint, while on dates from cluster 1, complaints about mortgages are relegated to second or third position, always behind debt collection and payday loans:

In [65]: plt.figure(); 
   ....: df[model.labels_==0].sum(axis=1).plot(label='cluster 0'); 
   ....: df[model.labels_==1].sum(axis=1).plot(label='cluster 1'); 
   ....: plt.legend(); 
   ....: plt.show()

Indeed, that was the case:

[Figure: total daily complaints for cluster 0 and cluster 1]
In [66]: df[model.labels_==0].sum(axis=1).describe()
Out[66]:
count    190.000000
mean     528.605263
std       80.900075
min      337.000000
25%      465.000000
50%      537.000000
75%      585.000000
max      748.000000
dtype: float64
In [67]: df[model.labels_==1].sum(axis=1).describe()
Out[67]:
count     42.000000
mean     156.738095
std       56.182655
min       42.000000
25%      124.250000
50%      140.500000
75%      175.750000
max      335.000000
dtype: float64
In [68]: df[model.labels_==2].sum(axis=1)
Out[68]:
Date received
2014-06-26       913
dtype: float64

Note that the volume of complaints in the cluster labeled 1 never exceeds 335 daily complaints, while complaints filed on days from cluster 0 number between 337 and 748. On the outlier date, June 26th, 2014, there were 913 complaints.
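One quick way to back up the earlier claim about mortgages, reusing df and the labels from the fitted model, is to compare the average number of daily complaints per product within each cluster:

# mean daily complaints per product, cluster by cluster
print(df[model.labels_ == 0].mean())
print(df[model.labels_ == 1].mean())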

Gaussian mixture models

Gaussian mixture models are probabilistic models that make assumptions on the way the data has been generated, and the distributions it obeys. These algorithms approximate the parameters defining the involved distributions.

In its purest form, this method implements the expectation-maximization (EM) algorithm to fit the model. We access this implementation with the class GMM in the submodule sklearn.mixture of the scikit-learn toolkit. This implementation requires us to provide the number of desired clusters, though, and unlike other methods, it will try its hardest to split the data into as many clusters as requested, regardless of whether these artificial clusters make any logical sense.
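A minimal sketch of that basic usage on our DataFrame df, where we have to commit to a number of clusters in advance; the choice of two components and a full covariance matrix is only illustrative (later versions of scikit-learn rename this class GaussianMixture):

from sklearn.mixture import GMM

model = GMM(n_components=2, covariance_type='full')   # number of clusters fixed by us
model.fit(df)
labels = model.predict(df)          # cluster assignment for each date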

In order to perform clustering on relatively small amounts of data without previous knowledge of the number of clusters that we need, we may employ a variant of Gaussian mixture models that uses variational inference algorithms instead. We call this a variational Gaussian mixture. We have an implementation of this algorithm as the class VBGMM in the same submodule.

For this particular method, we do need to feed an upper bound of the number of clusters we expect, but the algorithm will compute the optimal number for us.

For instance, in our running example, which clearly shows two clusters, we are going to impose an upper bound of 30 and observe the behavior of the VBGMM algorithm:

In [69]: from sklearn.mixture import VBGMM
In [70]: model = VBGMM(n_components=30).fit(df)
In [71]: labels = model.predict(df)
In [72]: len(np.unique(labels))    # how many clusters?
Out[72]: 2

Only two clusters!

In [73]: a, b = np.unique(labels)
In [74]: sizes = 50 + 100 * (labels - a) / float(b-a)
In [75]: plt.figure(); 
   ....: plt.scatter(embedded_df[:,0], embedded_df[:,1],
   ....:             c=labels, s=sizes); 
   ....: plt.title('VBGMM clustering of df\n Spectral Embedding '
   ....:           'scatterplot\n color/size indicates cluster'); 
   ....: plt.tight_layout(); 
   ....: plt.show()

We should get an output similar to the following:

[Figure: VBGMM clustering of df, Spectral Embedding scatterplot; color/size indicates cluster]

Kmeans

If we previously know the number of clusters that we require, regardless of the amount of data, a good algorithm for clustering is Lloyd's algorithm (better known as the method of K-means).

In the module scipy.cluster.vq, we have an efficient set of routines for K-means clustering. A parallel-capable algorithm is implemented as the class KMeans in the submodule sklearn.cluster of the scikit-learn toolkit. For instance, if we require a partition of our data into four clusters, using all the CPUs of our computer, we could issue the following code from the toolkit:

In [76]: from sklearn.cluster import KMeans
In [77]: model = KMeans(n_clusters=4, n_jobs=-1).fit(df)
In [78]: plt.figure(); 
   ....: plt.scatter(isomapped_df[:,0], isomapped_df[:,1],
   ....:             c=model.labels_, s=10 + 100*model.labels_); 
   ....: plt.title('KMeans clustering of df\n Isometric Mapping '
   ....:           'scatterplot\n color/size indicates cluster'); 
   ....: plt.tight_layout(); 
   ....: plt.show()

We should get an output similar to the following:

[Figure: KMeans clustering of df, Isometric Mapping scatterplot; color/size indicates cluster]

Note how this artificial clustering still manages to categorize different dates by the volume of complaints received, no matter the product:

In [79]: plt.figure()
In [80]: for label in np.unique(model.labels_):
   ....:     if sum(model.labels_==label) > 1:
   ....:         totals = df[model.labels_==label].sum(axis=1)
   ....:         totals.plot(label=label)
   ....:
In [81]: plt.legend(); 
   ....: plt.show()

We should get an output similar to the following:

[Figure: total daily complaints for each KMeans cluster]

Tip

The cluster that is not represented in this plot is, as in our previous clustering analysis, the single date June 26th, 2014, when nearly a thousand complaints were received.

In the case of huge amounts of data (more than 10,000 points), we often use a variation of K-means that runs on randomly sampled subsets of the data at each iteration, to reduce computation time. This method is called mini-batch K-means, and it has been implemented as the class MiniBatchKMeans in the same submodule. The quality of the clustering is slightly worse than with pure K-means, but the process is significantly faster.
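A minimal sketch on our (admittedly small) DataFrame df, where the batch size of 100 samples per iteration is an illustrative choice:

from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=4, batch_size=100)
model.fit(df.values)
print(np.unique(model.labels_))     # four cluster labels, as requested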

Spectral clustering

By performing a low-dimensional spectral embedding of the data (with different metrics) prior to K-means, we are often able to tackle clustering when the previous methods fail to categorize the data in a meaningful way. We have an impressively clever implementation based on algebraic multigrid solvers in the scikit-learn toolkit, as the class SpectralClustering in the submodule sklearn.cluster.

Tip

To handle the algebraic multigrid solvers, it is highly recommended to have the package pyamg installed. This package was developed by Nathan Bell, Luke Olson, and Jacob Schroder, from the Department of Computer Science at the University of Illinois at Urbana-Champaign. This is not strictly necessary, but having it will speed up our computations immensely. The package can be downloaded in several formats from http://pyamg.org/ or installed as usual with a pip, easy_install, or conda command from the console.
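A minimal sketch on our DataFrame df, asking for the two clusters suggested by the earlier projections; the affinity and solver choices below are illustrative:

from sklearn.cluster import SpectralClustering

model = SpectralClustering(n_clusters=2, affinity='rbf')   # dense RBF affinity matrix
# with pyamg installed, pass eigen_solver='amg' to speed up the embedding step
model.fit(df.values)
print(np.unique(model.labels_))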
