Latent semantic analysis (LSA) is a feature extraction tool. It is helpful for text, and it is a combination of three steps that we have already learned in this book:
- A tfidf vectorization
- A PCA (SVD in this case to account for the sparsity of text)
- Row normalization
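Before wiring these steps into a pipeline, they can be sketched manually. This is a minimal sketch on a hypothetical three-sentence toy corpus (the variable names and documents below are illustrative, not from our dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
import numpy as np

docs = ['the staff was friendly',
        'great location near the beach',
        'friendly staff and great breakfast']

# step 1: tf-idf vectorization (produces a sparse matrix)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
# step 2: SVD, which works directly on the sparse matrix
svd_matrix = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf_matrix)
# step 3: row normalization, so each document has unit (l2) norm
lsa_matrix = Normalizer().fit_transform(svd_matrix)

print(lsa_matrix.shape)                    # (3, 2)
print(np.linalg.norm(lsa_matrix, axis=1))  # every row now has unit norm
```

Chaining them by hand like this is exactly what the pipeline below automates.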
We can create a scikit-learn pipeline to perform LSA:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
svd = TruncatedSVD(n_components=10)  # will extract 10 "topics"
normalizer = Normalizer()  # will give each document a unit norm
lsa = Pipeline(steps=[('tfidf', tfidf), ('svd', svd), ('normalizer', normalizer)])
Now, we can fit and transform our sentences data, like so:
lsa_sentences = lsa.fit_transform(sentences)
lsa_sentences.shape
(118151, 10)
We have 118151 rows and 10 columns; the 10 columns come from the 10 extracted PCA/SVD components. We can now apply KMeans clustering to our lsa_sentences, as follows:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=10)
cluster.fit(lsa_sentences)
Note that we have chosen 10 both for the number of SVD components and for the number of KMeans clusters. This is not required; generally, you may extract more components in the SVD module. With 10 clusters, we are essentially saying: we think there are 10 topics that people are talking about; please assign each sentence to one of those topics.
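If you want something more principled than guessing the number of clusters, one common approach (not used in this chapter's example) is to compare silhouette scores for a few candidate values of k. This is a hedged sketch on small synthetic stand-in data, not our review dataset:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# synthetic stand-in data: 300 points drawn from 4 blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, in [-1, 1]
    print(k, round(scores[k], 3))
```

You would then pick the k with the highest (or a competitively high) silhouette score.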
The output is as follows:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
Let's time our fit and predict for our original document-term matrix of shape (118151, 280901) and then for our latent semantic analysis of shape (118151, 10) to see the differences:
- First, the original dataset:
%%timeit
# time it takes to cluster on the original document-term matrix of shape (118151, 280901)
cluster.fit(tfidf_transformed)
This gives us:
- We will also time the prediction of KMeans:
%%timeit
# also time the prediction phase of the KMeans clustering
cluster.predict(tfidf_transformed)
This gives us:
- Now, the LSA:
%%timeit
# time how long it takes to cluster after latent semantic analysis of shape (118151, 10)
cluster.fit(lsa_sentences)
This gives us:
- We can see that fitting on the LSA dataset is over 80 times faster than fitting on the original tfidf dataset. Now, let's time the prediction of the clustering with LSA, like so:
%%timeit
# also time the prediction phase of the KMeans clustering after LSA was performed
cluster.predict(lsa_sentences)
We can see that the LSA dataset is over four times faster than predicting on the original tfidf dataset.
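As an aside, %%timeit only works in a notebook; the same kind of wide-versus-narrow comparison can be timed in a plain script with time.perf_counter. This sketch uses synthetic stand-ins (a wide sparse matrix versus a narrow dense one), not our actual hotel-review data, so the absolute numbers will differ:

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# stand-in for a document-term matrix: wide and very sparse
wide = sparse_random(2000, 5000, density=0.001, random_state=rng, format='csr')
# stand-in for an LSA matrix: narrow and dense
narrow = rng.rand(2000, 10)

timings = {}
for name, X in (('wide', wide), ('narrow', narrow)):
    km = KMeans(n_clusters=10, n_init=1, random_state=0)
    start = time.perf_counter()
    km.fit(X)
    timings[name] = time.perf_counter() - start
    print(name, round(timings[name], 3), 'seconds')
```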
- Now, let's transform the texts into a cluster-distance space, where each row represents an observation and each column represents the distance to a cluster centroid, like so:
cluster.transform(lsa_sentences).shape
(118151, 10)
predicted_cluster = cluster.predict(lsa_sentences)
predicted_cluster
The output gives us:
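It is worth confirming what transform and predict are doing here: transform returns each row's distance to every centroid, and predict simply picks the closest one. A quick sanity check on toy points (illustrative data, not our reviews):

```python
import numpy as np
from sklearn.cluster import KMeans

# four toy points forming two obvious groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

distances = km.transform(X)  # shape (4, 2): one column of distances per cluster
labels = km.predict(X)

# predict is the argmin over the distance columns
print(np.array_equal(labels, distances.argmin(axis=1)))  # True
```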
- We can now get the distribution of topics, as follows:
# Distribution of "topics"
import pandas as pd
pd.Series(predicted_cluster).value_counts(normalize=True)

# create DataFrame of texts and predicted topics
texts_df = pd.DataFrame({'text': sentences, 'topic': predicted_cluster})
texts_df.head()

print("Top terms per cluster:")
original_space_centroids = svd.inverse_transform(cluster.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
terms = lsa.steps[0][1].get_feature_names()
for i in range(10):
    print("Cluster %d:" % i)
    print(', '.join([terms[ind] for ind in order_centroids[i, :5]]))
    print()
- This gives us each topic with a list of the most interesting phrases (according to our TfidfVectorizer):
Top terms per cluster:
Cluster 0: good, breakfast, breakfast good, room, great
Cluster 1: hotel, recommend, good, recommend hotel, nice hotel
Cluster 2: clean, room clean, rooms, clean comfortable, comfortable
Cluster 3: room, room clean, hotel, nice, good
Cluster 4: great, location, breakfast, hotel, stay
Cluster 5: stay, hotel, good, enjoyed stay, enjoyed
Cluster 6: comfortable, bed, clean comfortable, bed comfortable, room
Cluster 7: nice, room, hotel, staff, nice hotel
Cluster 8: hotel, room, good, great, stay
Cluster 9: staff, friendly, staff friendly, helpful, friendly helpful
We can see the top terms by cluster, and some of them make a lot of sense. For example, cluster 1 seems to be about how people would recommend this hotel to their family and friends, while cluster 9 is about the staff and how they are friendly and helpful. In order to complete this application, we want to be able to predict new reviews with topics.
Now, we can try to predict the cluster of a new review, like so:
# topic prediction
print(cluster.predict(lsa.transform(['I definitely recommend this hotel'])))
print(cluster.predict(lsa.transform(['super friendly staff. Love it!'])))
The output gives us cluster 1 for the first prediction and cluster 9 for the second prediction, as follows:
Cool! Cluster 1 corresponds to the following:
Cluster 1: hotel, recommend, good, recommend hotel, nice hotel
Cluster 9 corresponds to the following:
Cluster 9: staff, friendly, staff friendly, helpful, friendly helpful
Looks like Cluster 1 is recommending a hotel and Cluster 9 is more staff-centered. Our predictions appear to be fairly accurate!
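To tie everything together, here is a self-contained toy version of the whole flow on a hypothetical four-sentence corpus (the sentences and cluster count are made up for illustration): the same LSA pipeline, a KMeans fit, and a prediction on an unseen review.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

sentences = ['the staff was friendly and helpful',
             'friendly helpful staff at the desk',
             'great location near the city center',
             'the location is great for sightseeing']

lsa = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                      ('svd', TruncatedSVD(n_components=2, random_state=0)),
                      ('normalizer', Normalizer())])
lsa_sentences = lsa.fit_transform(sentences)

cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit(lsa_sentences)

# an unseen staff-themed review should land in the same cluster
# as the staff-themed training sentences
new_label = cluster.predict(lsa.transform(['such friendly staff']))[0]
same_topic = cluster.predict(lsa.transform([sentences[0]]))[0]
print(new_label == same_topic)
```

The key point, as in the hotel-review example above, is that lsa.transform projects new text into the same latent space before cluster.predict assigns it a topic.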