How to implement LSI using sklearn

We will illustrate the application of LSI using the BBC article data introduced in the last chapter because it is small enough to permit quick training and allows us to compare topic assignments to the category labels. See the latent_semantic_indexing notebook for additional implementation details:

  1. We begin by loading the documents and creating a train set and a (stratified) test set of 50 articles (a minimal loading-and-splitting sketch follows the vectorization code below).
  2. Then, we vectorize the data using TfidfVectorizer to obtain weighted DTM counts, filtering out words that appear in fewer than 1% or more than 25% of the documents, as well as generic stopwords, to obtain a vocabulary of around 2,900 words:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=.25, min_df=.01,
                             stop_words='english',
                             binary=False)
train_dtm = vectorizer.fit_transform(train_docs.article)
test_dtm = vectorizer.transform(test_docs.article)
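For step 1, a minimal sketch of loading and splitting the data; the file name bbc_articles.csv and the column names (category, heading, article) are assumptions, so see the notebook for the actual loading code:

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical layout: one row per article with its label, heading, and text
docs = pd.read_csv('bbc_articles.csv')

# stratify on the label so the 50-article test set mirrors the
# category distribution of the full corpus
train_docs, test_docs = train_test_split(docs, test_size=50,
                                         stratify=docs.category,
                                         random_state=42)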
  3. We use sklearn's TruncatedSVD class, which only computes the k largest singular values, to reduce the dimensionality of the document-term matrix. The deterministic arpack algorithm delivers an exact solution, but the default randomized implementation is more efficient for large matrices (a sketch of the arpack variant follows the output below).
  4. We compute five topics to match the five categories; they explain only 5.4% of the total DTM variance, so a larger number of components would be reasonable:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, n_iter=5, random_state=42)
svd.fit(train_dtm)
svd.explained_variance_ratio_
array([0.00187014, 0.01559661, 0.01389952, 0.01215842, 0.01066485])
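To obtain the exact solution mentioned in step 3, we can swap in the arpack solver; this is a sketch of the alternative configuration, not the notebook's settings:

# algorithm='arpack' computes an exact truncated SVD via scipy;
# the default 'randomized' solver approximates it and scales better
svd_exact = TruncatedSVD(n_components=5, algorithm='arpack', random_state=42)
svd_exact.fit(train_dtm)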
  5. LSI identifies a new orthogonal basis for the document-term matrix that reduces the rank to the number of desired topics.
  6. The .transform() method of the trained svd object projects the documents into the new topic space. This reduces the dimensionality of the document vectors and corresponds to the U_TΣ_T transformation illustrated before (the test documents are projected the same way, as sketched after the output below):
train_doc_topics = svd.transform(train_dtm)
train_doc_topics.shape
(2175, 5)
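The same fitted svd object projects the held-out articles; a one-line sketch using the test DTM computed earlier:

test_doc_topics = svd.transform(test_dtm)
test_doc_topics.shape  # (50, 5): 50 test articles, five topics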

  7. We can sample an article to view its location in the topic space. We draw a Politics article that is most (positively) associated with topics 1 and 2:
from numpy.random import randint
topic_labels = [f'Topic {i}' for i in range(1, 6)]  # assumed label names
i = randint(0, len(train_docs))
pd.concat([train_docs.iloc[i, :2], pd.Series(train_doc_topics[i], index=topic_labels)])
Category    Politics
Heading     What the election should really be about?
Topic 1     0.33
Topic 2     0.18
Topic 3     0.12
Topic 4     0.02
Topic 5     0.06
  8. The topic assignments for this sample align with the average topic weights for each category illustrated next (Politics is the leftmost). They illustrate how LSI expresses the k topics as directions in a k-dimensional space (the notebook includes a projection of the average topic assignments per category into two-dimensional space).
  9. Each category is clearly defined, and the test assignments match the train assignments. However, the weights are both positive and negative, making it more difficult to interpret the topics; a sketch of computing the per-category averages follows:
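A minimal sketch of the per-category averages, assuming the category labels live in a column named category and reusing the hypothetical topic_labels list from the sampling step:

# average the document-topic weights within each category; each row
# approximates that category's direction in the five-dimensional topic space
train_topics = pd.DataFrame(train_doc_topics,
                            columns=topic_labels,
                            index=train_docs.category)
avg_topic_weights = train_topics.groupby(level='category').mean()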

  10. We can also display the words that are most closely associated with each topic (in absolute terms). The topics appear to capture some semantic information but are not differentiated. A sketch of how to display the top words follows:
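A minimal sketch, assuming the fitted vectorizer and svd objects from above (get_feature_names_out requires scikit-learn 1.0 or later):

import numpy as np

# svd.components_ holds one row of word loadings per topic; ranking by
# absolute loading surfaces the most strongly associated words
words = np.array(vectorizer.get_feature_names_out())
for topic, loadings in enumerate(svd.components_, 1):
    top = np.abs(loadings).argsort()[::-1][:10]
    print(f'Topic {topic}:', ', '.join(words[top]))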