Chapter 7. Nonlinear Featurization via K-Means Model Stacking

PCA is very useful when the data lies in a linear subspace like a flat pancake. But what if the data forms a more complicated shape?1 A flat plane (linear subspace) can be generalized to a manifold (nonlinear subspace), which can be thought of as a surface that gets stretched and rolled in various ways.2

If a linear subspace is a flat sheet of paper, then a rolled up sheet of paper is a simple example of a nonlinear manifold. Informally, this is called a Swiss roll (see Figure 7-1). Once rolled, a 2D plane occupies 3D space. Yet it is essentially still a 2D object. In other words, it has low intrinsic dimensionality, a concept we’ve already touched upon in “Intuition”. If we could somehow unroll the Swiss roll, we’d recover the 2D plane. This is the goal of nonlinear dimensionality reduction, which assumes that the manifold is simpler than the full dimension it occupies and attempts to unfold it.

Figure 7-1. The Swiss roll, a nonlinear manifold

The key observation is that even when a big manifold looks complicated, the local neighborhood around each point can often be well approximated with a patch of flat surface. In other words, the patches make it possible to encode global structure using local structure.3 Nonlinear dimensionality reduction is also called nonlinear embedding or manifold learning. Nonlinear embeddings are useful for aggressively compressing high-dimensional data into low-dimensional data. They are often used for visualization in two or three dimensions.

The goal of feature engineering, however, isn’t so much to make the feature dimensions as low as possible, but to arrive at the right features for the task. In this chapter, the right features are those that represent the spatial characteristics of the data.

Clustering algorithms are usually not presented as techniques for local structure learning. But they in fact enable just that. Points that are close to each other (where “closeness” can be defined by a chosen metric) belong to the same cluster. Given a clustering, a data point can be represented by its cluster membership vector. If the number of clusters is smaller than the original number of features, then the new representation will have fewer dimensions than the original; the original data is compressed into a lower dimension. We will unpack this idea in this chapter.
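
As a first, minimal sketch of that representation (assuming scikit-learn; this is not one of the chapter’s numbered examples), cluster IDs can be one-hot encoded into a membership vector:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from sklearn.preprocessing import OneHotEncoder

>>> points = np.random.rand(1000, 10)                        # 1,000 points in 10 dimensions
>>> cluster_ids = KMeans(n_clusters=5).fit_predict(points)   # one cluster ID per point
>>> membership = OneHotEncoder().fit_transform(cluster_ids[:, np.newaxis])
>>> membership.shape                                         # compressed to 5 dimensions
(1000, 5)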

Compared to nonlinear embedding techniques, clustering may produce more features. But if the end goal is feature engineering instead of visualization, this is not a problem.

We will illustrate the idea of local structure learning with a common clustering algorithm called k-means. It is simple to understand and implement. Instead of nonlinear manifold reduction, it is more apt to say that k-means performs nonlinear manifold feature extraction. Used correctly, it can be a powerful tool in our feature engineering repertoire.

k-Means Clustering

k-means is a clustering algorithm. Clustering algorithms group data points depending on how they are laid out in space. They are unsupervised in that they do not require any sort of label—it’s the algorithm’s job to infer cluster labels based solely on the geometry of the data itself.

A clustering algorithm depends on a metric—a measurement of closeness between data points. The most popular metric is the Euclidean distance or Euclidean metric. It comes from Euclidean geometry and measures the straight-line distance between two points. It should feel very normal to us because this is the distance we see in everyday physical reality.

The Euclidean distance between two vectors x and y is the ℓ2 norm of x – y. (See “ℓ2 Normalization” for more on the ℓ2 norm.) In math speak, it is usually written as ‖x – y‖₂ or just ‖x – y‖.

k-means establishes a hard clustering, meaning that each data point is assigned to one and only one cluster. The algorithm learns to position the cluster centers such that the total sum of squared Euclidean distances between each data point and its cluster center is minimized. For those who like to read math instead of words, here is the objective function:

$$\min_{C_1, \ldots, C_k,\; \mu_1, \ldots, \mu_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Each cluster Ci contains a subset of data points. The center of cluster i is equal to the average of all the data points in the cluster:

$$\mu_i = \frac{1}{n_i} \sum_{x \in C_i} x$$

where ni denotes the number of data points in cluster i.
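
To connect the formulas to code, here is a quick sketch (ours, not one of the book’s numbered examples) showing that this objective is exactly what scikit-learn’s KMeans reports as inertia_:

>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> pts = np.random.rand(500, 2)
>>> km_demo = KMeans(n_clusters=4).fit(pts)
>>> # Sum of squared distances between each point and its assigned centroid
>>> objective = np.sum((pts - km_demo.cluster_centers_[km_demo.labels_]) ** 2)
>>> np.allclose(objective, km_demo.inertia_)
True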

Figure 7-2 shows k-means at work on two different, randomly generated datasets. The data in (a) is generated from random Gaussian distributions with the same variance but different means. The data in (c) is generated uniformly at random. These toy problems are very simple to solve, and k-means does a good job. (The results could be sensitive to the number of clusters, which must be given to the algorithm.)

Figure 7-2. k-means examples demonstrating how the clustering algorithm partitions space

The code for this example is found in Example 7-1.

Example 7-1. Code to generate k-means examples
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import make_blobs

>>> import matplotlib.pyplot as plt

>>> n_data = 1000
>>> seed = 1
>>> n_clusters = 4

# Generate random Gaussian blobs and run k-means
>>> blobs, blob_labels = make_blobs(n_samples=n_data, n_features=2, 
...                                 centers=n_clusters, random_state=seed)
>>> clusters_blob = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(blobs)

# Generate data uniformly at random and run k-means
>>> uniform = np.random.rand(n_data, 2)
>>> clusters_uniform = KMeans(n_clusters=n_clusters, 
...                           random_state=seed).fit_predict(uniform)

# Matplotlib incantations for visualizing results
>>> figure = plt.figure()
>>> plt.subplot(221)
>>> plt.scatter(blobs[:, 0], blobs[:, 1], c=blob_labels, cmap='gist_rainbow')
>>> plt.title("(a) Four randomly generated blobs", fontsize=14)
>>> plt.axis('off')

>>> plt.subplot(222)
>>> plt.scatter(blobs[:, 0], blobs[:, 1], c=clusters_blob, cmap='gist_rainbow')
>>> plt.title("(b) Clusters found via K-means", fontsize=14)
>>> plt.axis('off')

>>> plt.subplot(223)
>>> plt.scatter(uniform[:, 0], uniform[:, 1])
>>> plt.title("(c) 1000 randomly generated points", fontsize=14)
>>> plt.axis('off')

>>> plt.subplot(224)
>>> plt.scatter(uniform[:, 0], uniform[:, 1], c=clusters_uniform, cmap='gist_rainbow')
>>> plt.title("(d) Clusters found via K-means", fontsize=14)
>>> plt.axis('off')

Clustering as Surface Tiling

Common applications of clustering assume that there are natural clusters to be found; i.e., there are regions of dense data scattered in an otherwise empty space. In these situations, there is a notion of the correct number of clusters, and people have invented clustering indices that measure the quality of data groupings in order to select k.

However, when data is spread out fairly uniformly like in Figure 7-2(c), there is no longer a correct number of clusters. In this case, the role of a clustering algorithm is vector quantization, i.e., partitioning the data into a finite number of chunks. The number of clusters can be selected based on acceptable approximation error when using quantized vectors instead of the original ones.
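
One rough way to pick the number of clusters in this regime (a sketch of our own, not from the book) is to sweep over k and watch the mean quantization error, i.e., the average distance from each point to its assigned centroid:

>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> data = np.random.rand(1000, 2)
>>> for k in [10, 20, 50, 100]:
...     km = KMeans(n_clusters=k, random_state=1).fit(data)
...     # KMeans.transform returns the distance to every centroid; keep the closest
...     mean_error = km.transform(data).min(axis=1).mean()
...     print(k, mean_error)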

Visually, this usage of k-means can be thought of as covering the data surface with patches, as in Figure 7-3. This is indeed what we get if we run k-means on a Swiss roll dataset.

Figure 7-3. Conceptual local patches on the Swiss roll from a clustering algorithm

Example 7-2 uses scikit-learn to generate a noisy dataset on the Swiss roll, cluster it with k-means, and visualize the clustering results using Matplotlib. The data points are colored according to their cluster IDs.

Example 7-2. k-means on the Swiss roll
>>> from mpl_toolkits.mplot3d import Axes3D
>>> from sklearn import manifold, datasets

# Generate a noisy Swiss roll dataset
>>> X, color = datasets.make_swiss_roll(n_samples=1500)

# Approximate the data with 100 k-means clusters
>>> clusters_swiss_roll = KMeans(n_clusters=100, random_state=1).fit_predict(X)

# Plot the dataset with k-means cluster IDs as the color
>>> fig2 = plt.figure()
>>> ax = fig2.add_subplot(111, projection='3d')
>>> ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=clusters_swiss_roll, cmap='Spectral')

In this example, we generated 1,500 points at random on the Swiss roll surface, and asked k-means to approximate it with 100 clusters. We pulled the number 100 out of a hat because it seems like a fairly large number to cover a fairly small space. The result (Figure 7-4) looks nice; the clusters are indeed very local and different sections of the manifold are mapped to different clusters. Great! Are we done?

Figure 7-4. Approximating a Swiss roll dataset using k-means with 100 clusters

The problem is that if we pick a k that is too small, then the results won’t be so nice from a manifold learning perspective. Figure 7-5 shows the output of k-means on the Swiss roll with 10 clusters. We can clearly see data from very different sections of the manifold being mapped to the same clusters (e.g., the yellow, purple, green, and magenta clusters—see, we told you the illustrations are best viewed in color!).

Figure 7-5. k-means on the Swiss roll with 10 clusters

If the data is distributed uniformly throughout the space, then picking the right k boils down to a sphere-packing problem. In d dimensions, one could fit roughly (1/r)^d spheres of radius r. Each k-means cluster is a sphere, and the radius is the maximum error of representing points in that sphere with the centroid. So, if we are willing to tolerate a maximum approximation error of r per data point, then the number of clusters is O((1/r)^d), where d is the dimension of the original feature space of the data.
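
To get a feel for how quickly that estimate grows with the dimension d, here is a back-of-the-envelope calculation (our own illustration) for a tolerated per-point error of r = 0.1:

>>> # Tolerating a per-point error of r = 0.1 requires on the order of (1/r)**d clusters
>>> for d in [2, 3, 5, 10]:
...     print(d, 10 ** d)
... 
2 100
3 1000
5 100000
10 10000000000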

Uniform distribution is the worst-case scenario for k-means. If data density is not uniform, then we will be able to represent more data with fewer clusters. In general, it is difficult to tell how data is distributed in high-dimensional space. One can be conservative and pick a larger k, but it can’t be too large, because k will become the number of features for the next modeling step.

k-Means Featurization for Classification

When using k-means as a featurization procedure, a data point can be represented by its cluster membership (a sparse one-hot encoding of the cluster membership categorical variable; see “One-Hot Encoding”), which we now illustrate.

If a target variable is also available, then we have the choice of giving that information as a hint to the clustering procedure. One way to incorporate target information is to simply include the target variable as an additional input feature to the k-means algorithm. Since the objective is to minimize the total Euclidean distance over all input dimensions, the clustering procedure will attempt to balance similarity in the target value as well as in the original feature space. The target values can be scaled to get more or less attention from the clustering algorithm. Larger differences in the target will produce clusters that pay more attention to the classification boundary.

k-Means Featurization

Clustering algorithms analyze the spatial distribution of data. Therefore, k-means featurization creates a compressed spatial index of the data which can be fed into the model in the next stage.  This is an example of model stacking.

Example 7-3 shows a simple k-means featurizer. It is defined as a class object that can be fitted to training data and transform any new data.

Example 7-3. k-means featurizer
>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> class KMeansFeaturizer:
...     """Transforms numeric data into k-means cluster memberships.
...     
...     This transformer runs k-means on the input data and converts each data point
...     into the ID of the closest cluster. If a target variable is present, it is 
...     scaled and included as input to k-means in order to derive clusters that
...     obey the classification boundary as well as group similar points together.
...     """
... 
...     def __init__(self, k=100, target_scale=5.0, random_state=None):
...         self.k = k
...         self.target_scale = target_scale
...         self.random_state = random_state
...         
...     def fit(self, X, y=None):
...         """Runs k-means on the input data and finds centroids.
...         """
...         if y is None:
...             # No target variable, just do plain k-means
...             km_model = KMeans(n_clusters=self.k, 
...                               n_init=20, 
...                               random_state=self.random_state)
...             km_model.fit(X)
... 
...             self.km_model = km_model
...             self.cluster_centers_ = km_model.cluster_centers_
...             return self
... 
...         # There is target information. Apply appropriate scaling and include
...         # it in the input data to k-means.            
...         data_with_target = np.hstack((X, y[:,np.newaxis]*self.target_scale))
... 
...         # Build a pre-training k-means model on data and target
...         km_model_pretrain = KMeans(n_clusters=self.k, 
...                                    n_init=20, 
...                                    random_state=self.random_state)
...         km_model_pretrain.fit(data_with_target)
... 
...         # Run k-means a second time to get the clusters in the original space
...         # without target info. Initialize using centroids found in pre-training.
...         # Go through a single iteration of cluster assignment and centroid 
...         # recomputation.
...         km_model = KMeans(n_clusters=self.k, 
...                           init=km_model_pretrain.cluster_centers_[:,:-1],  # drop the target column
...                           n_init=1, 
...                           max_iter=1)
...         km_model.fit(X)
... 
...         self.km_model = km_model
...         self.cluster_centers_ = km_model.cluster_centers_
...         return self
...         
...     def transform(self, X, y=None):
...         """Outputs the closest cluster ID for each input data point.
...         """
...         clusters = self.km_model.predict(X)
...         return clusters[:,np.newaxis]
... 
...     def fit_transform(self, X, y=None):
...         self.fit(X, y)
...         return self.transform(X, y)

Example 7-4 illustrates the difference between clustering with and without target information: we apply the featurizer to a synthetic dataset generated using scikit-learn’s make_moons function and plot the Voronoi diagram of the cluster boundaries.

Example 7-4. k-means featurization with and without target hints
>>> from scipy.spatial import Voronoi, voronoi_plot_2d
>>> from sklearn.datasets import make_moons

>>> training_data, training_labels = make_moons(n_samples=2000, noise=0.2)
>>> kmf_hint = KMeansFeaturizer(k=100, target_scale=10).fit(training_data, 
...                                                         training_labels)
>>> kmf_no_hint = KMeansFeaturizer(k=100, target_scale=0).fit(training_data, 
...                                                           training_labels)

>>> def kmeans_voronoi_plot(X, y, cluster_centers, ax):
...    """Plots the Voronoi diagram of the k-means clusters overlaid with the data"""
...    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Set1', alpha=0.2)
...    vor = Voronoi(cluster_centers)
...    voronoi_plot_2d(vor, ax=ax, show_vertices=False, alpha=0.5)
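
The helper can then be applied to each featurizer’s centroids to produce the two panels of Figure 7-6; the calls might look something like this (a sketch; the exact plotting code that generated the figure may differ):

>>> fig = plt.figure()
>>> ax_hint = fig.add_subplot(211)
>>> ax_hint.set_title('K-means with target hint')
>>> kmeans_voronoi_plot(training_data, training_labels, kmf_hint.cluster_centers_, ax_hint)
>>> ax_no_hint = fig.add_subplot(212)
>>> ax_no_hint.set_title('K-means without target hint')
>>> kmeans_voronoi_plot(training_data, training_labels, kmf_no_hint.cluster_centers_, ax_no_hint)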

Figure 7-6 shows a comparison of the results. The two moons of the dataset are colored according to their class labels. The bottom panel shows the clusters trained without target information. Notice that a number of clusters span the empty space between the two classes. The top panel shows that when the clustering algorithm is given target information, the cluster boundaries align much better along class boundaries.

Let’s test the effectiveness of k-means features for classification. Example 7-5 applies logistic regression on the input data augmented with k-means cluster features. It compares the results against the support vector machine with radial basis function kernel (RBF SVM), k-nearest neighbors (kNN), random forest (RF), and gradient boosting tree (GBT) classifiers. RF and GBT are popular nonlinear classifiers with state-of-the-art performance. RBF SVM is a reasonable nonlinear classifier for Euclidean space. kNN classifies each data point according to a majority vote among its k nearest neighbors.

Figure 7-6. k-means clusters with (top panel) and without (bottom panel) using target class information

The default input data to the classifiers consists of the 2D coordinates of each data point. Logistic regression is also given the cluster membership features (labeled “LR with k-means” in Figure 7-7). As a baseline, we also try logistic regression on just the 2D coordinates (labeled “LR”).

Example 7-5. Classification with k-means cluster features
>>> import scipy.sparse
>>> import sklearn.metrics
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

### Generate some test data from the same distribution as training data
>>> test_data, test_labels = make_moons(n_samples=2000, noise=0.3)

### Use the k-means featurizer to generate cluster features
>>> training_cluster_features = kmf_hint.transform(training_data)
>>> test_cluster_features = kmf_hint.transform(test_data)

### Form new input features with cluster features
>>> training_with_cluster = scipy.sparse.hstack((training_data, 
...                                              training_cluster_features))
>>> test_with_cluster = scipy.sparse.hstack((test_data, test_cluster_features))

### Build the classifiers
>>> lr_cluster = LogisticRegression(random_state=seed).fit(training_with_cluster, 
...                                                        training_labels)
>>> classifier_names = ['LR',
...                     'kNN',
...                     'RBF SVM',
...                     'Random Forest',
...                     'Boosted Trees']
>>> classifiers = [LogisticRegression(random_state=seed),
...                KNeighborsClassifier(5),
...                SVC(gamma=2, C=1),
...                RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
...                GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, 
...                                           max_depth=5)]
>>> for model in classifiers:
...     model.fit(training_data, training_labels)

### Helper function to evaluate classifier performance using ROC
>>> def test_roc(model, data, labels):
...     if hasattr(model, "decision_function"):
...         predictions = model.decision_function(data)
...     else:
...         predictions = model.predict_proba(data)[:,1]
...     fpr, tpr, _ = sklearn.metrics.roc_curve(labels, predictions)
...     return fpr, tpr

### Plot results
>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> fpr_cluster, tpr_cluster = test_roc(lr_cluster, test_with_cluster, test_labels)
>>> plt.plot(fpr_cluster, tpr_cluster, 'r-', label='LR with k-means')

>>> for i, model in enumerate(classifiers):
...     fpr, tpr = test_roc(model, test_data, test_labels)
...     plt.plot(fpr, tpr, label=classifier_names[i])
    
>>> plt.plot([0, 1], [0, 1], 'k--')
>>> plt.legend()

Figure 7-7 shows the receiver operating characteristic (ROC) curves of each of the classifiers when evaluated on the test set. A ROC curve shows the trade-off between true positives and false positives as we vary the classification decision boundary. (See Zheng [2015] for more details.) A good classifier should quickly reach a high true positive rate and a low false positive rate, so curves that rise sharply toward the upper-left corner are good.

Figure 7-7. ROCs of k-means + logistic regression versus nonlinear classifiers and plain logistic regression on the synthetic two-moons dataset

Our plot shows that logistic regression performs much better with cluster features than without. In fact, with cluster features, the linear classifier performs just as well as nonlinear classifiers. One minor caveat is that in this toy example, we did not tune the hyperparameters for any of the models. There may be performance differences once the models are fully tuned, but at least this shows that it is possible for LR with k-means to be on a par with nonlinear classifiers. This is a nice result because linear classifiers are much cheaper to train than nonlinear classifiers. Lower computation cost allows us to try more models with different features in the same period of time, which increases the chance of ending up with a much better model.

Alternative Dense Featurization

Instead of one-hot cluster membership, a data point can also be represented by a dense vector of its inverse distance to each cluster center. This retains more information than simple binary cluster assignment, but the representation is now dense. There is a trade-off here. One-hot cluster membership results in a very lightweight, sparse representation, but one might need a larger k to represent data of complex shapes. Inverse distance representation is dense, which could be more expensive for the modeling step, but one might be able to get away with a smaller k.

A compromise between sparse and dense is to retain inverse distances for only p of the closest clusters. But now p is an extra hyperparameter to tune. (Can you understand why feature engineering requires so much fiddling?) There is no free lunch. 
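
As a rough sketch of what such a dense featurizer might look like (our own helper, not the book’s; it relies on scikit-learn’s KMeans.transform, which returns the distance from each point to every centroid):

>>> def inverse_distance_features(km_model, X, p=None, eps=1e-6):
...     """Dense features: inverse distance from each point to each of the k centroids.
...     If p is given, keep only the p closest clusters per point and zero out the rest.
...     """
...     distances = km_model.transform(X)   # shape (n_samples, k)
...     inv = 1.0 / (distances + eps)       # eps guards against division by zero
...     if p is not None:
...         # Zero out all but the p largest inverse distances (the p closest clusters)
...         farthest = np.argsort(inv, axis=1)[:, :-p]
...         np.put_along_axis(inv, farthest, 0.0, axis=1)
...     return inv

>>> dense_features = inverse_distance_features(kmf_hint.km_model, training_data, p=10)

With k = 100 and p = 10, each row keeps only 10 nonzero entries, a reasonable middle ground between the two extremes.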

Pros, Cons, and Gotchas

Using k-means to turn spatial data into features is an example of model stacking, where the input to one model is the output of another. Another example of stacking is to use the output of a decision tree–type model (random forest or gradient boosting tree) as input to a linear classifier. Stacking has become an increasingly popular technique in recent years. Nonlinear classifiers are expensive to train and maintain. The key intuition with stacking is to push the nonlinearities into the features and use a very simple, usually linear model as the last layer. The featurizer can be trained offline, which means that one can use expensive models that require more computation power or memory but generate useful features. The simple model at the top level can be quickly adapted to the changing distributions of online data. This is a great trade-off between accuracy and speed, and this strategy is often used in applications like targeted advertising that require fast adaptation to changing data distributions.
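
As a rough sketch of that tree-based variant (our own illustration, not code from the book): RandomForestClassifier.apply returns the leaf index that each tree assigns to a point, and those indices can be one-hot encoded and fed to a linear model.

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.linear_model import LogisticRegression

>>> rf = RandomForestClassifier(n_estimators=50, max_depth=5).fit(training_data, training_labels)
>>> leaf_encoder = OneHotEncoder(handle_unknown='ignore')
>>> leaf_features = leaf_encoder.fit_transform(rf.apply(training_data))   # one column per tree
>>> lr_leaf = LogisticRegression().fit(leaf_features, training_labels)
>>> test_leaf_features = leaf_encoder.transform(rf.apply(test_data))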

Key Intuition for Model Stacking

Use sophisticated base layers (often with expensive models) to generate good (often nonlinear) features, combined with a simple and fast top-layer model. This often strikes the right balance between model accuracy and speed.

Compared to using a nonlinear classifier, k-means stacked with logistic regression is cheaper to train and store. Table 7-1 details the training and prediction complexity, in both computation and memory, for a number of machine learning models. n denotes the number of data points, d the number of (original) features, and k the number of clusters.

Table 7-1. Complexity of ML models
Model                          | Time              | Space
-------------------------------|-------------------|------------------
k-means training               | O(nkd)^a          | O(kd)
k-means predict                | O(kd)             | O(kd)
LR + cluster features training | O(n(d+k))         | O(d+k)
LR + cluster features predict  | O(d+k)            | O(d+k)
RBF SVM training               | O(n^2 d)          | O(n^2)
RBF SVM predict                | O(sd)             | O(sd)
GBT training                   | O(nd 2^m t)       | O(nd + 2^m t)
GBT predict                    | O(2^m t)          | O(2^m t)
kNN training                   | O(1)              | O(nd)
kNN predict                    | O(nd + k log n)   | O(nd)

^a Streaming k-means can be done in time O(nd(log k + log log n)), which is much faster than O(nkd) for large k.

For k-means, the training time is O(nkd) because each iteration involves computing the d-dimensional distance between every data point and every centroid (k). We optimistically assume that the number of iterations is not a function of n, though this may not be true in all cases. Prediction requires computing the distance between the new data point and each of the k centroids, which is O(kd). The storage space requirement is O(kd), for the coordinates of the k centroids.

Logistic regression training and prediction are linear in both the number of data points and feature dimensions. RBF SVM training is expensive because it involves computing the kernel matrix for every pair of input data. RBF SVM prediction is less expensive than training; it is linear in the number of support vectors s and the feature dimension d. GBT training and prediction are linear in data size and the size of the model (t trees, each with at most 2^m leaves, where m is the maximum depth of the tree). A naive implementation of kNN requires no training time at all because the training data itself is essentially the model. The cost is paid at prediction time, where the input must be evaluated against each of the original training points and partially sorted to retrieve the k closest neighbors.

Overall, k-means + LR is the only combination that is linear (with respect to the size of training data, O(nd), and model size, O(kd)) at both training and prediction time. The complexity is most similar to that of GBT, which has costs that are linear in the number of data points, the feature dimension, and the size of the model (O(2^m t)). It is hard to say whether k-means + LR or GBT will result in a smaller model—it depends on the spatial characteristics of the data.

Potential for Data Leakage

Those who remember our caution regarding data leakage (see “Guarding against data leakage”) might ask whether including the target variable in the k-means featurization step would cause such a problem. The answer is “yes,” though not as much as in the case of bin counting. If we use the same dataset for learning the clusters and building the classification model, then information about the target will have leaked into the input variables. As a result, accuracy evaluations on the training data will probably be overly optimistic, but the bias will go away when evaluating on a hold-out validation set or test set. Furthermore, the leakage will not be as bad as in the case of bin-counting statistics (see “Bin Counting”), because the lossy compression of the clustering algorithm will have abstracted away some of that information. To be extra careful about preventing leakage, hold out a separate dataset for deriving the clusters, just like in the case of bin counting.
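
A minimal sketch of that extra-careful approach (our own illustration, reusing the two-moons data and the featurizer from earlier in the chapter) splits the training data so that the clusters are derived on one part and the classifier is trained on the other:

>>> from sklearn.model_selection import train_test_split

>>> X_cluster, X_model, y_cluster, y_model = train_test_split(
...     training_data, training_labels, test_size=0.5, random_state=seed)
>>> kmf_safe = KMeansFeaturizer(k=100, target_scale=10).fit(X_cluster, y_cluster)
>>> model_features = scipy.sparse.hstack((X_model, kmf_safe.transform(X_model)))
>>> lr_safe = LogisticRegression(random_state=seed).fit(model_features, y_model)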

k-means featurization is useful for real-valued, bounded numeric features that form clumps of dense regions in space. The clumps can be of any shape, because we can just increase the number of clusters to approximate them. (Unlike in the classic clustering setup, we are not concerned with discovering the “true” number of clusters; we only need to cover them.)

k-means cannot handle feature spaces where the Euclidean distance does not make sense—i.e., weirdly distributed numeric variables or categorical variables. If the feature set contains those variables, then there are several ways to handle them:

  1. Apply k-means featurization only on the real-valued, bounded numeric features (a minimal sketch follows this list).
  2. Define a custom metric to handle multiple data types and use the k-medoids algorithm. (k-medoids is analogous to k-means but allows for arbitrary distance metrics.)
  3. Convert categorical variables to binning statistics (see “Bin Counting”), then featurize them using k-means.
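
Here is that minimal sketch of the first option (the DataFrame df and its column names are hypothetical):

>>> # df is a hypothetical pandas DataFrame with mixed column types and a numeric target;
>>> # only the bounded numeric columns are passed to the featurizer
>>> numeric_cols = ['age', 'income', 'tenure_days']   # hypothetical column names
>>> kmf_mixed = KMeansFeaturizer(k=50).fit(df[numeric_cols].values, df['label'].values)
>>> df['cluster_id'] = kmf_mixed.transform(df[numeric_cols].values).ravel()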

Combined with techniques for handling categorical variables and time series, k-means featurization can be adapted to handle the kind of rich data that often appears in customer marketing and sales analytics. The resulting clusters can be thought of as user segments, which are very useful features for the next modeling step.

Summary

This chapter illustrated the concept of model stacking using a somewhat unconventional approach: combining supervised k-means with a simple linear classifier. k-means is usually used as an unsupervised modeling method to find dense clusters of data points in feature space. Here, however, k-means is optionally given the class labels as input. This helps k-means to find clusters that better align with the boundary between classes. 

Deep learning, which we will discuss in the next chapter, takes model stacking to a whole new level by layering neural networks on top of one another. Two recent winners of the ImageNet Large Scale Visual Recognition Challenge involved 13 and 22 layers of neural networks. They take advantage of the availability of lots of labeled training images and look for combinations of pixels that yield good image features. The technique in this chapter separately trains the k-means featurizer from the linear classifier. But it’s possible to jointly optimize the featurizer and the classifier. As we shall see, deep learning training takes the latter route.

Bibliography

Dunning, Ted. The man is a walking encyclopedia of data science. He is a frequent speaker at industry events, and likes beer and nice people. Buy him a beer and talk to him. You won’t regret it.

Zheng, Alice. Evaluating Machine Learning Models. Sebastopol, CA: O’Reilly Media, 2015.

1 This chapter is inspired by a conversation with Ted Dunning, active Apache contributor and noted author. The stacking example came directly from Ted, and he provided many helpful comments in the course of writing. If one could have coauthors for individual chapters, Ted would be a coauthor for this one.

2 We use the words “surface” and “manifold” interchangeably in this chapter. The analogy works well for two-dimensional manifolds embedded in a three-dimensional space, but it breaks down beyond three dimensions. A high-dimensional manifold does not conform to our usual notion of a “surface.” Some of the more outlandish manifolds have holes, and some loop back onto themselves in a way that would never happen in the real physical world (e.g., M.C. Escher’s endless waterfall). Most data models assume nice manifolds, not the crazy ones.

3 This is a tried-and-true idea in mathematics. For instance, the derivative of a function measures the speed of change at each point. Globally, the function may do all sorts of weird things. But locally, it can be approximated by a linear function of the derivative. If we know the derivative at each point, then calculus allows us to more or less recover the entire original function.
