We are going to focus on three kinds of problems: Classification, Dimensionality reduction, and Clustering. Each of these problems is used in both data mining and machine learning to draw conclusions about the data. Let's explain each of these settings in different sections.
Classification is an example of supervised learning. There is a set of training data with an attribute that classifies it in one of several categories. The goal is to find the value of that attribute for new data. For example, with our running database, we could use all the data from the year 2013 to figure out which financial complaints got solved positively for the customer, which ones got solved without relief, and which ones remained in progress. This will offer us good insight on, for instance, which companies are faster to respond to consumer complaints positively, if there are states where complaints are less likely to get resolved, and so on.
Let's start by finding the kind of company responses observed in the database:
In [1]: import numpy as np, pandas as pd, matplotlib.pyplot as plt In [2]: data = pd.read_csv("Consumer_Complaints.csv", ...: low_memory=False, parse_dates=[8,9]) In [3]: print data['Company response'].unique() ['Closed with non-monetary relief' 'In progress' 'Closed with explanation' 'Closed with monetary relief' 'Closed' 'Untimely response' 'Closed without relief' 'Closed with relief']
These are eight different categories and our target for deciding the fate of future complaints. Let's create a set of training data by gathering all complaints formulated during the year 2013, and keeping only the columns that we believe are relevant to the decision-making process:
The size of this training data will dictate which algorithm is to be used for classification.
We will then try to classify all complaints formulated during the year 2014:
In [4]: in_2013 = data['Date received'].map(lambda t: t.year==2013); ...: in_2014 = data['Date received'].map(lambda t: t.year==2014); ...: df = data[in_2013 | in_2014]; ...: df['Year'] = df['Date received'].map(lambda t: t.year); ...: irrelevant = ['Date received', 'Date sent to company', ...: 'Complaint ID', 'Timely response?', ...: 'Consumer disputed?', 'Sub-issue','ZIP code']; ...: df.drop(irrelevant, 1, inplace=True); ...: df = df.dropna() In [5]: from sklearn.preprocessing import LabelEncoder In [6]: encoder = {} In [7]: for column in df.columns: ...: if df[column].dtype != 'int': ...: le = LabelEncoder() ...: le.fit(df[column].unique()) ...: df[column] = le.transform(df[column]) ...: encoder[column] = le ...: In [8]: training = df[df.Year==2013]; ...: target = training['Company response']; ...: training.drop(['Company response', 'Year'], 1, inplace=True) In [9]: test = df[df.Year==2014]; ...: true_result = test['Company response']; ...: test.drop(['Company response', 'Year'], 1, inplace=True) In [10]: len(training) Out[10]: 77100
The training data here is not too big (anything less than 100,000 is considered manageable). For this volume of training data, it is suggested that we employ support vector classification with a linear kernel.
Three flavors of this algorithm are coded as classes in the module sklearn.svm
(for support vector machines): SVC, NuSVC, and a simplified version of SVC with linear kernel, LinearSVC, which is what we need:
In [11]: from sklearn.svm import LinearSVC In [12]: clf = LinearSVC(); ....: clf.fit(training, target) Out[12]: LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)
We are ready to evaluate the performance of this classifier:
In [13]: clf.predict(test)==true_result Out[13]: 0 False 2 False 3 True 4 True 6 False 7 True 9 True 11 False 12 False 13 False 14 False 15 True 16 True 20 False 21 False ... 101604 True 101607 False 101610 True 101611 True 101613 True 101614 True 101616 True 101617 True 101618 True 101620 True 101621 True 101622 True 101625 True 101626 True 101627 True Name: Company response, Length: 65282, dtype: bool In [18]: float(sum(_)) / float(len(_)) Out[18]: 0.7985509022395147
With this method, we correctly classify almost 80% of the complaints.
The power of a classifier lies in the applications. For instance, if we would like to purchase via web a conventional fixed mortgage from Bank of America in the state of South Carolina, and we fear problems with settlement process and cost, what does the classifier tell us about our chances of having the matter settled?
In [19]: encoder['Product'].transform(['Mortgage'])[0] Out[19]: 4 In [20]: encoder['Sub-product'].transform(['Conventional fixed mortgage'])[0] Out[20]: 5 In [21]: encoder['Issue'].transform(['Settlement process and costs'])[0] Out[21]: 27 In [19]: encoder['State'].transform(['SC'])[0] Out[19]: 50 In [23]: encoder['Submitted via'].transform(['Web'])[0] Out[23]: 5 In [24]: encoder['Company'].transform(['Bank of America'])[0] Out[24]: 247 In [25]: clf.predict([4,5,27,50,5,247]) Out[25]: array([1]) In [26]: encoder['Company response'].inverse_transform(_)[0] Out[26]: 'Closed with explanation'
It is possible to create a decision tree illustrating a set of rules to facilitate the classification. In the scikit-learn
toolkit, we have a class implemented for this purpose—DecisionTreeClassifier
in the submodule sklearn.tree
. Let's see it in action:
In [27]: from sklearn.tree import DecisionTreeClassifier In [28]: clf = DecisionTreeClassifier().fit(training, target) In [29]: clf.predict(test) == true_result Out[29]: 0 True 2 False 3 True 4 True 6 False 7 True 9 True 11 False 12 False 13 False 14 False 15 True 16 True 20 False 21 False ... 101604 True 101607 True 101610 True 101611 True 101613 True 101614 True 101616 True 101617 True 101618 True 101620 True 101621 True 101622 True 101625 True 101626 False 101627 True Name: Company response, Length: 65282, dtype: bool In [30]: float(sum(_)) / len(_) Out[30]: 0.7400661744431849
It looks like this simple classifier was successful in predicting the outcome of about 74% of the complaints in 2014.
It is possible to create a dot
file readable with the Graphviz visualization software available at http://www.graphviz.org/:
In [31]: from sklearn.tree import export_graphviz In [32]: export_graphviz(clf, out_file="tree.dot")
Opening this file in Graphviz gives us an impressive set of rules
The following is a detail of the tree (too large to fit in these pages!):
We also have implementations of random forests and extremely randomized trees, both of them within the submodule sklearn.ensemble
. The corresponding classes are called RandomForestClassifier
and ExtraTreesClassifier
, respectively.
In any of the previous cases, the coding of the classifier is exactly as in the cases of the SVC and the basic decision trees.
Similar results are obtained with Naive Bayes methods. In the module sklearn.naive_bayes
, we have three implementations of this algorithm:
GaussianNB
for the Gaussian Naive Bayes, where the likelihood of the features is assumed to be Gaussian.BernoulliNB
for Naive Bayes for data distributes according to multivariate Bernoulli distributions (each feature is assumed to be a binary valued variable).MultinomialNB
for multinomially distributed data.For an even better result for this case, we employ classification by nearest neighbors. This is exactly the same procedure we employed in the setting of computational geometry to perform the corresponding geometric query problem. In this setting, note how we have coded our data as points in a Euclidean space of high dimension, and we can thus translate those methods for this classification purpose.
The advantage in this case is that we don't necessarily have to use Euclidean distances for our computations. For instance, since the data is essentially different regardless of its numerical value, it makes sense to impose a Hamming metric to calculate distance between labels. We have a generalization of the nearest neighbors algorithm implemented as the class KNeighborsClassifier
in the module sklearn.neighbors
:
In [33]: from sklearn.neighbors import KNeighborsClassifier In [34]: clf = KNeighborsClassifier(n_neighbors=8,metric='hamming'); ....: clf.fit(training, target) Out[34]: KNeighborsClassifier(algorithm='auto', leaf_size=30,metric='hamming', n_neighbors=8, p=2, weights='uniform') In [35]: clf.predict(test)==true_result Out[35]: 0 True 2 False 3 True 4 True 6 False 7 True 9 True 11 False 12 False 13 False 14 False 15 True 16 True 20 False 21 False ... 101604 True 101607 False 101610 True 101611 True 101613 True 101614 True 101616 True 101617 True 101618 True 101620 True 101621 True 101622 True 101625 True 101626 True 101627 True Name: Company response, Length: 65282, dtype: bool In [36]: float(sum(_))/len(_) Out[36]: 0.791274777120799
Data often observes internal structure, but high dimension (number of columns, in a sense) makes it difficult to extract and select this internal structure. Often, it is possible to perform smart projections of this data on lower-dimensional manifolds, and analyze these projections for search of features. We refer to this technique as dimensionality reduction.
For the following example, we decided to drop all dates that contain any NaN
. This significantly reduced the volume of the data, making the subsequent study and results simpler to understand. For a more elaborate and complete study, force all occurrences of NaN
to be zeros—substitute the method dropna()
with fillna(0)
.
Let's observe how to profit from these processes with our running example. We gather all the daily complains by product, and analyze the data:
In [37]: df = data.groupby(['Date received', 'Product']).size(); ....: df = df.unstack().dropna() In [38]: df.head() Out[38]: Product Bank account or service Consumer loan Credit card Date received 2013-11-06 66 14 41 2013-11-07 44 11 33 2013-11-08 49 11 36 2013-11-09 9 4 20 2013-11-11 15 4 23 Product Credit reporting Debt collection Money transfers Mortgage Date received 2013-11-06 62 129 2 153 2013-11-07 43 99 2 128 2013-11-08 44 83 8 113 2013-11-09 19 33 2 23 2013-11-11 32 68 2 46 Product Payday loan Student loan Date received 2013-11-06 2 14 2013-11-07 8 10 2013-11-08 12 7 2013-11-09 3 4 2013-11-11 2 14 In [39]: df.shape Out[39]: (233, 9)
We may regard this data as 233
points in a space of 9
dimensions.
For this small kind of data, and without any other prior information, one of the best procedures of dimensionality reduction results in projecting over a two-dimensional plane. However, it's not just any plane—we seek one projection that ensures that the projected data has the largest possible variance. We accomplish this with the information we obtain from eigenvalues and eigenvectors of a matrix that represents our data. This process is called principal component analysis (PCA).
PCA is regarded as one of the most useful techniques in Statistical methods. For an amazing survey of theory (in both scopes of Linear Algebra and Statistics), coding techniques, and applications, the best resource is the second edition of the book Principal Component Analysis, written by I.T. Jolliffe and published by Springer in their Springer Series in Statistics in 2002.
We have an implementation in the scikit-learn
toolkit, the class PCA
, in the submodule sklearn.decomposition
:
In [40]: from sklearn.decomposition import PCA In [41]: model = PCA(n_components=2) In [42]: model.fit(df) Out[42]: PCA(copy=True, n_components=2, whiten=False) In [43]: projected_df = model.transform(df) In [44]: plt.figure(); ....: plt.scatter(projected_df[:,0], projected_df[:,1]); ....: plt.title('Principal Component Analysis scatterplot of df'); ....: plt.show()
Observe how the data consists of two very well differentiated clusters of points—one of them considerably larger than the other—and some outliers. We will come back to this problem in the next section.
We don't necessarily need to project on hyperplanes. One neat trick is to assume that the data itself lies on a nonlinear submanifold, and obtain a representation of this object with the points on it. This gives us flexibility to search for projections where the projected data satisfies relevant properties. For instance, if we require the projections to maintain geodesic distance among points (whenever possible), we achieve a so-called isometric mapping (isomap).
In the SciPy stack, we have an implementation of this method as the class Isomap
in the submodule sklearn.manifold
:
In [45]: from sklearn.manifold import Isomap In [46]: model = Isomap().fit(df) In [47]: isomapped_df = model.transform(df) In [48]: plt.figure(); ....: plt.scatter(isomapped_df[:,0], isomapped_df[:,1]); ....: plt.title('Isometric Map scatterplot of df'); ....: plt.show()
Although visually very different, this method also offers us two very clear clusters, one of them much larger than the other. The smaller cluster appears as a sequence of points clearly aligned:
Another possibility is to embed the data nonlinearly by applying spectral analysis on an affinity/similarity matrix. The results carry similar quality as the previous two examples:
In [49]: from sklearn.manifold import SpectralEmbedding In [50]: model = SpectralEmbedding().fit(df) In [51]: embedded_df = model.embedding_ In [52]: plt.figure(); ....: plt.scatter(embedded_df[:,0], embedded_df[:,1]); ....: plt.title('Spectral Embedding scatterplot of df'); ....: plt.show()
In this case, the clusters are more clearly defined than in the previous examples.
Another possible projection, similar in some sense to isometric maps, seeks to preserve the distance within local neighborhoods—the locally linear embedding. We have an implementation through the class LocallyLinearEmbedding
, again within the submodule sklearn.manifold
:
In [53]: from sklearn.manifold import LocallyLinearEmbedding In [54]: model = LocallyLinearEmbedding().fit(df) In [55]: lle_df = model.transform(df) In [56]: plt.figure(); ....: plt.scatter(lle_df[:,0], lle_df[:,1]); ....: plt.title('Locally Linear Embedding scatterplot of df'); ....: plt.show()
Note the extremely conglomerated two clusters, and two outliers.
Clustering is similar in some sense to the problem of classification, yet more complex. When facing a dataset, we acknowledge the possibility of having some hidden structure, in a way that will allow us to predict the behavior of future data. Searching for this structure is performed by finding common patterns and gathering data conforming to those patterns in different clusters. For this reason, we also refer to this problem as data mining.
There are many different methods to perform clustering, depending on the volume of data, and the a priori information we have on the number of clusters. We are going to explore the following settings:
We employ the technique of MeanShift clustering when the data does not exceed 10,000 points, and we do not know a priori the number of clusters we need. Let's experiment with the running example from the section on dimensionality reduction, but we will remain oblivious from the two clusters suggested by all projections. We will let the mean shift clustering take that decision for us.
We have an implementation in the SciPy stack through the class MeanShift
in the submodule sklearn.cluster
of the scikit-learn
toolkit. One of the ingredients of this algorithm is approximation with radial basis functions (discussed in Chapter 1, Numerical Linear Algebra), for which we need to provide an appropriate bandwidth. The algorithm, if not provided with one, will try to estimate from the data. This process could potentially be very slow and expensive, and it is generally a good idea to do estimation by ourselves, so we can control resources. We can do so with the helper function estimate_bandwidth
, in the same submodule.
In [57]: from sklearn.cluster import MeanShift, estimate_bandwidth In [58]: bandwidth = estimate_bandwidth(df.values, n_samples=1000) In [59]: model = MeanShift(bandwidth=bandwidth, bin_seeding=True) In [60]: model.fit(df.values)
At this point, the object model
has successfully computed a series of labels and attached them to each point in the data, so they are properly clustered. We can allow some unclassifiable points to remain unclassified—we accomplish this by setting the optional Boolean flag cluster_all
to False
. By default, the algorithm forces every single piece of data into a cluster.
Let's find the number of labels and visualize the result on top of one of the projections from the previous section, for quality purposes:
In [61]: np.unique(model.labels_) # how many clusters? Out[61]: array([0, 1, 2]) In [62]: plt.figure(); ....: plt.scatter(isomapped_df[:,0], isomapped_df[:,1], ....: c=model.labels_, s = 50 + 100*model.labels_); ....: plt.title('MeanShift clustering of df Isometric Mapping ....: scatterplot color/size indicates cluster'); ....: plt.tight_layout(); ....: plt.show()
Note how the two clear clusters are correctly computed and one outlier received its own cluster.
Let's find out the significance of these clusters. First, the outlier:
In [63]: df[model.labels_ == 2] Out[63]: Product Bank account or service Consumer loan Credit card Date received 2014-06-26 117 19 89 Product Credit reporting Debt collection Money transfers Mortgage Date received 2014-06-26 85 159 5 420 Product Payday loan Student loan Date received 2014-06-26 7 12
What is so different in the amount of complaints produced on July 26th, 2014? Let us produce a plot with the each cluster of dates, to see if we may guess what the differences are about.
In [64]: fig = plt.figure(); ....: ax1 = fig.add_subplot(211); ....: ax2 = fig.add_subplot(212); ....: df[model.labels_==0].plot(ax=ax1); ....: df[model.labels_==1].plot(ax=ax2); ....: plt.show()
We should get an output similar to the following:
Visually, it appears as if the cluster has been formed by gathering dates with high volume of complaints, versus low volume of complaints. But not only that: a closer inspection reveals that on dates from cluster 0, mortgages are unequivocally the number one reason for complaint. On the other hand, for dates in cluster 1, complaints on mortgages get relegated to a second or third position, always behind debt collection and payday loans:
In [65]: plt.figure(); ....: df[model.labels_==0].sum(axis=1).plot(label='cluster 0'); ....: df[model.labels_==1].sum(axis=1).plot(label='cluster 1'); ....: plt.legend(); ....: plt.show()
In [66]: df[model.labels_==0].sum(axis=1).describe() Out[66]: count 190.000000 mean 528.605263 std 80.900075 min 337.000000 25% 465.000000 50% 537.000000 75% 585.000000 max 748.000000 dtype: float64 In [67]: df[model.labels_==1].sum(axis=1).describe() Out[67]: count 42.000000 mean 156.738095 std 56.182655 min 42.000000 25% 124.250000 50% 140.500000 75% 175.750000 max 335.000000 dtype: float64 In [68]: df[model.labels_==2].sum(axis=1) Out[68]: Date received 2014-06-26 913 dtype: float64
Note that the volume of complaints in the cluster labeled 1 does not go over 335 daily complaints. Complaints formulated on days from the zero cluster are all between 337 and 748. On the outlier date—July 26th, 2014—there were 913 complaints.
Gaussian mixture models are probabilistic models that make assumptions on the way the data has been generated, and the distributions it obeys. These algorithms approximate the parameters defining the involved distributions.
In its purest form, this method implements the
expectation-maximization (EM) algorithm in order to fit the model. We access this implementation with the class GMM
in the submodule sklearn.mixture
of the scikit-learn
toolkit. This implementation requires us to provide with the number of desired clusters, though. And unlike other methods, it will try its hardest to categorize the data into as many clusters required, no matter whether these artificial clusters make any logical sense.
In order to perform clustering on relatively small amounts of data without previous knowledge of the number of clusters that we need, we may employ a variant of Gaussian mixture models that use variational inference algorithms instead. We call this a variational Gaussian mixture. We have an implementation of this algorithm as the class VBGMM
in the same submodule.
For this particular method, we do need to feed an upper bound of the number of clusters we expect, but the algorithm will compute the optimal number for us.
For instance, in our running example— which clearly shows two clusters—we are going to impose an upper bound of 30, and observe the behavior of the VBGMM
algorithm:
In [69]: from sklearn.mixture import VBGMM In [70]: model = VBGMM(n_components=30).fit(df) In [71]: labels = model.predict(df) In [72]: len(np.unique(labels)) # how many clusters? Out[72]: 2
Only two clusters!
In [73]: a, b = np.unique(labels) In [74]: sizes = 50 + 100 * (labels - a) / float(b-a) In [75]: plt.figure(); ....: plt.scatter(embedded_df[:,0], embedded_df[:,1], ....: c=labels, s=sizes); ....: plt.title('VBGMM clustering of df Spectral Embedding ....: scatterplot color/size indicates cluster'); ....: plt.tight_layout(); ....: plt.show()
We should get an output similar to the following:
If we previously know the number of clusters that we require, regardless of the amount of data, a good algorithm for clustering is Lloyd's algorithm (better known as the method of K-means).
In the module scipy.cluster.vq
we have an efficient set of routines for k-means clustering. A parallel-capable algorithm is implemented as the class KMeans
in the submodule sklearn.cluster
of the scikit-learn
toolkit. For instance, if we require on our data a partition into four clusters, using all CPUs of our computer, we could issue from the toolkit the following code:
In [76]: from sklearn.cluster import KMeans In [77]: model = KMeans(n_clusters=4, n_jobs=-1).fit(df) In [78]: plt.figure(); ....: plt.scatter(isomapped_df[:,0], isomapped_df[:,1], ....: c=model.labels_, s=10 + 100*model.labels_); ....: plt.title('KMeans clustering of df Isometric Mapping ....: scatterplot color/size indicates cluster'); ....: plt.tight_layout(); ....: plt.show()
We should get an output similar to the following:
Note how this artificial clustering still manages to categorize different dates by the volume of complaints received, no matter the product:
In [79]: plt.figure() In [80]: for label in np.unique(model.labels_): ....: if sum(model.labels_==label) > 1: ....: object = df[model.labels_==label].sum(axis = 1) ....: object.plot(label=label) ....: In [81]: plt.legend(); ....: plt.show()
We should get an output similar to the following:
In case of huge amounts of data (more than 10,000 points), we often use a variation of K-means that runs on randomly sampled subsets of the data on different iterations, to reduce computation time. This method is called
Mini-batch Kmeans, and it has been implemented as the class MiniBatchKMeans
in the same submodule. The quality of the clustering is slightly worse as compared to when using pure K-means, but the process is significantly faster.
By performing a low-dimension spectral embedding of the data (with different metrics) prior to a K-means, we are often able to tackle clustering when any of the previous methods fail to categorize data in a meaningful way. We have an amazingly clever implementation based on algebraic multigrid solvers in the scikit-learn
toolkit, as the class SpectralClustering
in the submodule sklearn.cluster
.
To handle the algebraic multigrid solvers, it is highly recommended to have the package pyamg
installed. This package was developed by Nathan Bell, Luke Olson, and Jacob Schroder, from the department of computer science at the University of Illinois at Urbana-Champaign. This is not strictly necessary, but doing so will speed up our computations immensely. The package can be downloaded in several formats from http://pyamg.org/ or installed as usual with either a pip
, easy_install
or conda
command from the console.
3.17.79.20