How to implement LDA using gensim

gensim is a specialized NLP library with a fast LDA implementation and many additional features. We will also use it in the next chapter on word vectors (see the latent_dirichlet_allocation_gensim notebook for details).

It facilitates the conversion of the document-term matrix (DTM) produced by sklearn into gensim data structures as follows:

import pandas as pd
from gensim.matutils import Sparse2Corpus

# documents_columns=False because sklearn stores documents in rows
train_corpus = Sparse2Corpus(train_dtm, documents_columns=False)
test_corpus = Sparse2Corpus(test_dtm, documents_columns=False)

# map token IDs to terms (newer sklearn versions use get_feature_names_out())
id2word = pd.Series(vectorizer.get_feature_names()).to_dict()
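
For reference, the train_dtm and test_dtm matrices and the fitted vectorizer come from an earlier sklearn vectorization step. A minimal sketch of that step, assuming docs_train and docs_test hold the raw training and test documents (the variable names and vectorizer settings here are only illustrative), could look like this:

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative settings; docs_train and docs_test are assumed lists of raw document strings
vectorizer = CountVectorizer(max_df=.5, min_df=5, stop_words='english')
train_dtm = vectorizer.fit_transform(docs_train)  # learn the vocabulary on the training set only
test_dtm = vectorizer.transform(docs_test)        # reuse that vocabulary for the test set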

Gensim's LdaModel class exposes numerous settings, shown here with their defaults:

LdaModel(corpus=None,
         num_topics=100,
         id2word=None,
         distributed=False,
         chunksize=2000,            # no. of docs per training chunk
         passes=1,                  # no. of passes through the corpus during training
         update_every=1,            # no. of docs to iterate through per update
         alpha='symmetric',         # a-priori belief on document-topic distribution
         eta=None,                  # a-priori belief on word probability
         decay=0.5,                 # % of lambda forgotten when a new doc is examined
         offset=1.0,                # controls slow-down of the first few iterations
         eval_every=10,             # how often to estimate log perplexity (costly)
         iterations=50,             # max. iterations when inferring topic distributions
         gamma_threshold=0.001,     # min. change in gamma to continue iterating
         minimum_probability=0.01,  # filter topics with lower probability
         random_state=None,
         ns_conf=None,
         minimum_phi_value=0.01,    # lower bound on term probabilities
         per_word_topics=False,     # compute the most likely topics per word
         callbacks=None,
         dtype=<class 'numpy.float32'>)

Gensim also provides an LdaMulticore class for parallel training that uses Python's multiprocessing features and can speed up training considerably.
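
A minimal sketch of this parallel variant, reusing the corpus and dictionary created above (the choice of four worker processes is only illustrative):

from gensim.models import LdaMulticore

lda_parallel = LdaMulticore(corpus=train_corpus,
                            num_topics=5,
                            id2word=id2word,
                            workers=4)  # defaults to all available cores minus one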

Model training just requires instantiating the LdaModel object as follows:

lda = LdaModel(corpus=train_corpus,
               num_topics=5,
               id2word=id2word)
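
To get a quick impression of the result, we can, for instance, list the highest-weighted terms per topic and compute the per-word likelihood bound on the held-out documents; this is just an illustrative use of LdaModel's standard inspection methods:

# Five highest-weighted terms for each of the five topics
for topic_id, terms in lda.show_topics(num_topics=5, num_words=5, formatted=False):
    print(topic_id, [term for term, _ in terms])

# Per-word likelihood bound on the held-out set (higher, i.e. closer to zero, is better)
print(lda.log_perplexity(test_corpus))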

Topic coherence measures whether the words in a topic tend to co-occur. It sums a score over each distinct pair of top-ranked words. The score is the log of the probability that a document containing at least one instance of the higher-ranked word also contains at least one instance of the lower-ranked word.
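
Concretely, if w_1, ..., w_N denote a topic's top-ranked words (from most to least probable), the UMass coherence computed below can be written, in a standard formulation, as

\[
C = \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + 1}{D(w_j)}
\]

where D(w_j) is the number of documents containing w_j, D(w_i, w_j) is the number containing both words, and the +1 avoids taking the log of zero.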

Large negative values indicate words that rarely co-occur; values closer to zero indicate that the words co-occur more often. The model's top_topics() method computes this coherence score and returns the most important words per topic:

coherence = lda.top_topics(corpus=train_corpus, coherence='u_mass')

We can display the results as follows:

# topic_labels is assumed to hold one label per topic, e.g. ['Topic 1', ..., 'Topic 5']
topic_coherence = []
topic_words = pd.DataFrame()

for t in range(len(coherence)):
    label = topic_labels[t]
    topic_coherence.append(coherence[t][1])  # coherence score of topic t
    df = pd.DataFrame(coherence[t][0],       # list of (probability, term) pairs
                      columns=[(label, 'prob'), (label, 'term')])
    df[(label, 'prob')] = df[(label, 'prob')].apply(lambda x: '{:.2%}'.format(x))
    topic_words = pd.concat([topic_words, df], axis=1)

topic_words.columns = pd.MultiIndex.from_tuples(topic_words.columns)
pd.set_option('expand_frame_repr', False)
print(topic_words.head())
pd.Series(topic_coherence, index=topic_labels).plot.bar();

This shows the following top words for each topic:

Topic 1               Topic 2             Topic 3              Topic 4               Topic 5
Probability  Term     Probability  Term   Probability  Term    Probability  Term     Probability  Term
0.55%        online   0.90%        best   1.04%        mobile  0.64%        market   0.94%        labour
0.51%        site     0.87%        game   0.98%        phone   0.53%        growth   0.72%        blair
0.46%        game     0.62%        play   0.51%        music   0.52%        sales    0.72%        brown
0.45%        net      0.61%        won    0.48%        film    0.49%        economy  0.65%        election
0.44%        used     0.56%        win    0.48%        use     0.45%        prices   0.57%        united

And the corresponding coherence scores, which highlight the decay of topic quality (at least in part due to the relatively small dataset):

Decay of topic quality