Clustering

Clustering divides a dataset into groups of similar items, called clusters. It is an unsupervised learning task, since we typically don't have any labels. In most realistic cases the problem is too complex for us to find the best possible division into clusters; however, we can usually find a decent approximation. Clustering analysis requires a distance function, which indicates how close items are to each other. A common choice is the Euclidean distance, the straight-line distance as the crow flies. Another common choice is the taxicab (Manhattan) distance, which measures distance in city blocks. Clustering was first used in the 1930s by social science researchers, who worked without modern computers.
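As a quick illustration of these two distance functions, the following sketch computes both for a pair of two-dimensional points; the point values are made up for the example:

>>> import numpy as np
>>> a = np.array([1.0, 2.0])
>>> b = np.array([4.0, 6.0])
>>> # Euclidean distance: square root of the summed squared differences
>>> print(np.sqrt(np.sum((a - b) ** 2)))   # 5.0
>>> # Taxicab (Manhattan) distance: sum of the absolute differences
>>> print(np.sum(np.abs(a - b)))           # 7.0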

Clustering can be hard or soft. In hard clustering, an item belongs to exactly one cluster, while in soft clustering, an item can belong to multiple clusters with varying probabilities. In this book, I use only hard clustering.
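To make the distinction concrete, here is a small sketch contrasting the two kinds of assignment in scikit-learn: KMeans produces a single hard label per item, while a Gaussian mixture model produces a probability per cluster. The toy data is made up for the example, and GaussianMixture is shown only for contrast; it is not used in this recipe:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from sklearn.mixture import GaussianMixture
>>> X = np.array([[0.0], [0.2], [5.0], [5.3]])   # toy 1-D data
>>> # Hard clustering: each item gets exactly one cluster label
>>> print(KMeans(n_clusters=2, n_init=10).fit_predict(X))
>>> # Soft clustering: each item gets a probability for each cluster
>>> print(GaussianMixture(n_components=2).fit(X).predict_proba(X))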

We can also have items that do not belong to any cluster. These items are considered outliers or anomalies, or simply noise. A cluster can also be part of another cluster, which can in turn be an item in a higher-level cluster; if we have such a cluster hierarchy, we speak of hierarchical clustering.

There are more than 100 clustering algorithms; the most widely used of these is the k-means algorithm, which assigns data points to k clusters. The clustering problem cannot be solved directly, but we can apply heuristics that achieve an acceptable result. The k-means algorithm attempts to find the best clusters for a dataset, given a number of clusters; we are supposed to either know this number or find it through trial and error. In this recipe, I evaluate the clusters through the Within Set Sum of Squared Errors (WSSSE) method, also known as the Within Cluster Sum of Squares (WCSS). This metric sums, over all points, the squared distance between each point and the centroid of its assigned cluster. The k-means algorithm iterates between two steps, not counting the (usually random) initialization of the k centroids:

  1. Assign each data point to the cluster whose centroid is closest.
  2. Recalculate each cluster center as the mean of the coordinates of the points assigned to it.

The algorithm stops when the cluster assignments become stable; a minimal sketch of these two steps is shown below.
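The following NumPy sketch of the k-means loop is only illustrative: the data, the variable names, and the fixed iteration count are made up for the example, and it is not the implementation used later in this recipe, where scikit-learn's KMeans does the work:

>>> import numpy as np
>>> X = np.random.rand(100, 2)   # made-up 2-D data
>>> k = 3
>>> # Random initialization: pick k data points as the initial centroids
>>> centroids = X[np.random.choice(len(X), k, replace=False)]
>>> for _ in range(10):   # fixed count for brevity; real code stops when assignments stabilize
        # Step 1: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
>>> # WSSSE: sum of squared distances between points and their assigned centroids
>>> assignments = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
>>> print(((X - centroids[assignments]) ** 2).sum())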

We will use scikit-learn's KMeans class, described in the following table:

Constructor parameter   Default   Example values   Description
n_clusters              8         3, 36            The number of clusters to determine
max_iter                300       200, 37          Maximum number of iterations for a single run
n_init                  10        8, 10            Number of times to rerun the algorithm with different seeds
tol                     1e-4      1e-3, 1e-2       Value regulating the stopping conditions
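For instance, a KMeans estimator with non-default settings for these parameters could be constructed as follows (the specific values are only illustrative):

>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=3, max_iter=200, n_init=10, tol=1e-3)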

The average complexity is given by O(k n T), where k is the number of clusters, n is the number of samples, and T is the number of iterations. The following code applies clustering and displays a scatter plot of the actual labels and the cluster labels:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> from nltk.corpus import names
>>> from nltk.stem import WordNetLemmatizer
>>> from sklearn.cluster import KMeans
>>> import matplotlib.pyplot as plt

>>> def letters_only(astr):
        return astr.isalpha()

>>> cv = CountVectorizer(stop_words="english", max_features=500)
>>> groups = fetch_20newsgroups()
>>> cleaned = []
>>> all_names = set(names.words())
>>> lemmatizer = WordNetLemmatizer()

>>> # Lowercase and lemmatize each word, dropping non-alphabetic tokens and names
>>> for post in groups.data:
        cleaned.append(' '.join([
            lemmatizer.lemmatize(word.lower())
            for word in post.split()
            if letters_only(word)
            and word not in all_names]))

>>> transformed = cv.fit_transform(cleaned)
>>> km = KMeans(n_clusters=20)
>>> km.fit(transformed)
>>> labels = groups.target
>>> plt.scatter(labels, km.labels_)
>>> plt.xlabel('Newsgroup')
>>> plt.ylabel('Cluster')
>>> plt.show()

Refer to the following figure for the end result:

[Figure: scatter plot of the newsgroup label (x axis) versus the assigned cluster label (y axis).]
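The WSSSE of a fitted model is exposed by scikit-learn as the inertia_ attribute of KMeans, the sum of squared distances from the samples to their assigned centroids. As a sketch of the trial-and-error approach to choosing the number of clusters mentioned earlier (not part of the original recipe's code), we can compare it across a few candidate values of k; it always decreases as k grows, so we look for a point of diminishing returns rather than the minimum:

>>> print(km.inertia_)   # WSSSE of the 20-cluster model fitted above
>>> for k in (10, 20, 30):
        print(k, KMeans(n_clusters=k).fit(transformed).inertia_)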