Latent semantic analysis 

Latent semantic analysis (LSA) is a feature extraction tool. It is helpful for text and consists of the following three steps, all of which we have already learned in this book (a minimal sketch applying the steps by hand follows the list):

  • A tfidf vectorization
  • A PCA (SVD in this case to account for the sparsity of text)
  • Row normalization 
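
To make these three steps concrete, here is a minimal sketch that applies them one at a time to a tiny, made-up corpus; the toy sentences and the small n_components value are for illustration only and are not part of the hotel-review workflow:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

# hypothetical toy corpus, used only to illustrate the three steps
toy_corpus = ['the room was clean', 'the staff were friendly', 'clean room and friendly staff']

# step 1: tfidf vectorization produces a sparse document-term matrix
tfidf_matrix = TfidfVectorizer().fit_transform(toy_corpus)

# step 2: truncated SVD (a PCA suited to sparse matrices) extracts latent components
svd_matrix = TruncatedSVD(n_components=2).fit_transform(tfidf_matrix)

# step 3: row normalization gives each document a unit norm
lsa_matrix = Normalizer().fit_transform(svd_matrix)

print(lsa_matrix.shape)  # (3, 2): three documents, two latent components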

We can create a scikit-learn pipeline to perform LSA:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
svd = TruncatedSVD(n_components=10)  # will extract 10 "topics"
normalizer = Normalizer()  # will give each document a unit norm

lsa = Pipeline(steps=[('tfidf', tfidf), ('svd', svd), ('normalizer', normalizer)])

Now, we can fit and transform our sentences data, like so:

lsa_sentences = lsa.fit_transform(sentences)

lsa_sentences.shape

(118151, 10)

We have 118151 rows and 10 columns. These 10 columns come from the 10 extracted PCA/SVD components. We can now apply a KMeans clustering to our lsa_sentences, as follows:

from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=10)

cluster.fit(lsa_sentences)
We are assuming that the reader has basic familiarity with clustering. For more information on clustering and how clustering works, please refer to Principles of Data Science by Packt: https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science

It should be noted that we have chosen 10 both for the number of KMeans clusters and for the number of SVD components. This is not necessary; generally, you may extract more columns in the SVD module. With the 10 clusters, we are basically saying: we think there are 10 topics that people are talking about, so please assign each sentence to one of those topics. A short sketch after the following output shows one way to sanity-check these choices.

The output is as follows:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
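
Neither 10 components nor 10 clusters is a magic number. As a sanity check (not part of the original recipe), one might look at how much variance the SVD components explain and how cluster quality changes for a few candidate values of k; the sketch below assumes the lsa pipeline and lsa_sentences array fitted above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# how much of the tfidf variance the 10 SVD components capture
print(lsa.named_steps['svd'].explained_variance_ratio_.sum())

# compare cluster quality for a few candidate values of k on a random sample of rows
sample = np.random.RandomState(0).choice(len(lsa_sentences), 5000, replace=False)
for k in (5, 10, 15):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(lsa_sentences[sample])
    print(k, silhouette_score(lsa_sentences[sample], labels))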

Let's time our fit and predict for our original document-term matrix of shape (118151, 280901) and then for our latent semantic analysis of shape (118151, 10) to see the differences:

  1. First, the original dataset:
%%timeit
# time it takes to cluster on the original document-term matrix of shape (118151, 280901)
cluster.fit(tfidf_transformed)

This gives us:

1 loop, best of 3: 4min 15s per loop
  2. We will also time the prediction phase of KMeans:
%%timeit
# also time the prediction phase of the Kmeans clustering
cluster.predict(tfidf_transformed)

This gives us:

10 loops, best of 3: 120 ms per loop
  3. Now, the LSA:
%%timeit
# time the time to cluster after latent semantic analysis of shape (118151, 10)
cluster.fit(lsa_sentences)

This gives us:

1 loop, best of 3: 3.6 s per loop
  4. We can see that fitting on the LSA dataset is roughly 70 times faster than fitting on the original tfidf dataset. Suppose we time the prediction of the clustering with LSA, like so:
%%timeit
# also time the prediction phase of the Kmeans clustering after LSA was performed
cluster.predict(lsa_sentences)

This gives us:

10 loops, best of 3: 34 ms per loop

We can see that predicting on the LSA dataset is roughly 3.5 times faster than predicting on the original tfidf dataset.

  5. Now, let's transform the texts to a cluster-distance space, where each row represents an observation and each column holds the distance to one of the 10 cluster centroids, like so (a short sketch after this step's output shows one use of this space):
# distances from each document to each of the 10 cluster centroids
cluster.transform(lsa_sentences).shape
(118151, 10)

# hard cluster assignment for each document
predicted_cluster = cluster.predict(lsa_sentences)
predicted_cluster

The output gives us:

array([2, 2, 2, ..., 2, 2, 6], dtype=int32)
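
As a side note (not part of the original recipe), the cluster-distance space can also be used to gauge how close each document sits to its assigned topic; this sketch assumes the cluster model and lsa_sentences array from the previous steps:

# distance from every document to every centroid, shape (118151, 10)
distances = cluster.transform(lsa_sentences)

# distance to the assigned (nearest) centroid; smaller means a more "typical" document
nearest_distance = distances.min(axis=1)
print(nearest_distance.mean(), nearest_distance.max())
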
  6. We can now get the distribution of topics, as follows:
import pandas as pd

# distribution of "topics"
pd.Series(predicted_cluster).value_counts(normalize=True)

# create a DataFrame of texts and predicted topics
texts_df = pd.DataFrame({'text': sentences, 'topic': predicted_cluster})

texts_df.head()

print "Top terms per cluster:"
original_space_centroids = svd.inverse_transform(cluster.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
terms = lsa.steps[0][1].get_feature_names()
for i in range(10):
    print "Cluster %d:" % i
    print ', '.join([terms[ind] for ind in order_centroids[i, :5]])
    print 

lsa.steps[0][1]
  7. This gives us each topic with a list of the most interesting phrases (according to our TfidfVectorizer):
Top terms per cluster:
Cluster 0:
good, breakfast, breakfast good, room, great

Cluster 1:
hotel, recommend, good, recommend hotel, nice hotel

Cluster 2:
clean, room clean, rooms, clean comfortable, comfortable

Cluster 3:
room, room clean, hotel, nice, good

Cluster 4:
great, location, breakfast, hotel, stay

Cluster 5:
stay, hotel, good, enjoyed stay, enjoyed

Cluster 6:
comfortable, bed, clean comfortable, bed comfortable, room

Cluster 7:
nice, room, hotel, staff, nice hotel

Cluster 8:
hotel, room, good, great, stay

Cluster 9:
staff, friendly, staff friendly, helpful, friendly helpful

We can see the top terms by cluster, and some of them make a lot of sense. For example, cluster 1 seems to be about how people would recommend this hotel to their family and friends, while cluster 9 is about the staff and how friendly and helpful they are. To complete this application, we want to be able to assign topics to new reviews.

Now, we can try to predict the cluster of a new review, like so:

# topic prediction
print(cluster.predict(lsa.transform(['I definitely recommend this hotel'])))

print(cluster.predict(lsa.transform(['super friendly staff. Love it!'])))

The output gives us cluster 1 for the first prediction and cluster 9 for the second prediction, as follows:

[1]
[9]

Cool! Cluster 1 corresponds to the following:

Cluster 1:
hotel, recommend, good, recommend hotel, nice hotel

Cluster 9 corresponds to the following:

Cluster 9:
staff, friendly, staff friendly, helpful, friendly helpful

Looks like Cluster 1 is recommending a hotel and Cluster 9 is more staff-centered. Our predictions appear to be fairly accurate!
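
To tie the pieces together, a small helper such as the hypothetical predict_topic function below could make these predictions reusable; it is only a sketch and assumes the lsa pipeline, the cluster model, and the terms and order_centroids arrays computed above:

# assumes lsa, cluster, terms, and order_centroids from the steps above
def predict_topic(review):
    """Return the predicted cluster id and its top terms for a single review string."""
    topic = cluster.predict(lsa.transform([review]))[0]
    top_terms = [terms[ind] for ind in order_centroids[topic, :5]]
    return topic, top_terms

print(predict_topic('I definitely recommend this hotel'))
print(predict_topic('super friendly staff. Love it!'))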
