Plotting clusters

Now, let's plot the clusters. We will start with Principal Component Analysis (PCA), since it is good at capturing the global structure of the data. Then, we will use t-Distributed Stochastic Neighbor Embedding (t-SNE, exposed as TSNE in scikit-learn), which is good at capturing the relationships between neighboring points. Let's get started:

  1. Let's start by creating the model again:
from sklearn.cluster import MiniBatchKMeans
clusters = MiniBatchKMeans(n_clusters=4, init_size=1024, batch_size=2048, random_state=20).fit_predict(text)
  2. Let's plot both graphs. First, we will plot using PCA and then using t-SNE. Use the following code to do so:
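import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

max_label = max(clusters)
# Sample 3,000 documents so that the projections run quickly
max_items = np.random.choice(range(text.shape[0]), size=3000, replace=True)
# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix from todense()
sample = text[max_items, :].toarray()
pca = PCA(n_components=2).fit_transform(sample)
# Reduce to 50 dimensions with PCA before running t-SNE
tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(sample))

# Plot a random 300 of the sampled documents
idx = np.random.choice(range(pca.shape[0]), size=300, replace=True)
label_subset = clusters[max_items]
# hsv is cyclic (0.0 and 1.0 are both red), so divide by max_label + 1 to keep cluster colors distinct
label_subset = [cm.hsv(i / (max_label + 1)) for i in label_subset[idx]]

f, ax = plt.subplots(1, 2, figsize=(14, 6))
ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
ax[0].set_title('Generated PCA Cluster Plot')

ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
ax[1].set_title('Generated TSNE Cluster Plot')
plt.show()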

The output of the preceding code is as follows:

Each color represents one cluster. In the preceding code, we sampled 3,000 documents for faster processing and plotted a random 300 of them using scatter plots. For the PCA plot, we projected the sample onto two components. For the t-SNE plot, we first reduced the sample to 50 dimensions with PCA and then ran t-SNE on the result.
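The dimension flow described above can be checked on synthetic data. The following is a minimal sketch rather than the original code: the `make_blobs` matrix is a hypothetical stand-in for the TF-IDF matrix `text`, and the sample size of 200 and the names `sample_dense`, `pca_2d`, `reduced_50`, and `tsne_2d` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical stand-in for the TF-IDF matrix: 500 documents, 60 features.
dense, _ = make_blobs(n_samples=500, n_features=60, centers=4, random_state=20)
text = csr_matrix(np.abs(dense))

# Cluster the full matrix, as in step 1.
clusters = MiniBatchKMeans(n_clusters=4, n_init=3, random_state=20).fit_predict(text)

# Draw a sample of rows (without replacement here, to avoid duplicates).
rng = np.random.default_rng(20)
sample = rng.choice(text.shape[0], size=200, replace=False)
sample_dense = text[sample, :].toarray()
labels = clusters[sample]  # cluster label for each sampled row

# PCA plot coordinates: project straight to two components.
pca_2d = PCA(n_components=2, random_state=20).fit_transform(sample_dense)

# t-SNE plot coordinates: reduce to 50 dimensions first, then embed in 2D.
reduced_50 = PCA(n_components=50, random_state=20).fit_transform(sample_dense)
tsne_2d = TSNE(n_components=2, random_state=20).fit_transform(reduced_50)

print(pca_2d.shape, reduced_50.shape, tsne_2d.shape)
```

Both projections end up with one 2D point per sampled document, so the same `labels` array can color either scatter plot.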

You can learn more about TSNE from the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

Note that these plots do not show which keywords characterize each cluster. To see that, we need to plot a word cloud for each cluster.
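Before drawing word clouds, a text-only first look at the keywords is possible by averaging the term weights within each cluster. This is a minimal sketch under stated assumptions, not the method used in the text: `top_terms_per_cluster` is a hypothetical helper, and the four terms and the small weight matrix are made-up toy data standing in for a real TF-IDF matrix and its vocabulary.

```python
import numpy as np

def top_terms_per_cluster(matrix, labels, terms, n_terms=3):
    """Return {cluster label: terms with the highest mean weight in that cluster}."""
    matrix = np.asarray(matrix, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for label in np.unique(labels):
        # Average the weight of every term over the documents in this cluster.
        mean_weights = matrix[labels == label].mean(axis=0)
        # Indices of the largest mean weights, highest first.
        best = np.argsort(mean_weights)[::-1][:n_terms]
        result[int(label)] = [terms[i] for i in best]
    return result

# Toy data (hypothetical): 4 documents, 4 terms, 2 clusters.
terms = ["price", "battery", "screen", "delivery"]
weights = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.0, 0.1, 0.3],
    [0.0, 0.7, 0.6, 0.1],
    [0.1, 0.9, 0.5, 0.0],
])
labels = np.array([0, 0, 1, 1])

keywords = top_terms_per_cluster(weights, labels, terms, n_terms=2)
print(keywords)  # {0: ['price', 'delivery'], 1: ['battery', 'screen']}
```

With a real sparse TF-IDF matrix, the rows would need converting to a dense array per cluster (or the mean computed on the sparse matrix directly) before ranking.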
