LDA architecture

In the LDA architecture, there are M documents, each containing N words, which are processed through the black strip labeled LDA. The model delivers X topics, each of which is a cluster of words. Every topic has a distribution of words, denoted by psi. Finally, it also produces a distribution of topics over each document, denoted by phi.

The following diagram illustrates LDA:

With regard to the alpha and beta hyperparameters: alpha represents the document-topic concentration and beta represents the topic-word concentration. The higher the value of alpha, the more topics each document is made up of; the higher the value of beta, the more words each topic is made up of. Both can be tweaked based on domain knowledge.
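Although the recipe that follows keeps gensim's defaults for these priors, it can help to see where they are passed in. The following is a minimal, self-contained sketch on a made-up toy corpus; the values 0.5 and 0.1 are illustrative only, and note that gensim calls the beta prior eta:

import gensim
from gensim import corpora

toy_docs = [["cricket", "bat", "ball", "score"],
            ["python", "code", "bug", "test"],
            ["cricket", "score", "python", "code"]]
toy_dict = corpora.Dictionary(toy_docs)
toy_bow = [toy_dict.doc2bow(doc) for doc in toy_docs]

# Higher alpha -> each document mixes more topics;
# higher eta (beta) -> each topic spreads its weight over more words
toy_lda = gensim.models.LdaModel(toy_bow, num_topics=2, id2word=toy_dict,
                                 alpha=0.5, eta=0.1, passes=10)
print(toy_lda.print_topics(num_words=3))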

LDA iterates through each word of every document, assigning and adjusting its topic. A new topic X is assigned to the word on the basis of the product of two probabilities: p1 = p(topic t | document d), the proportion of the words in document d that are currently assigned to topic t, and p2 = p(word w | topic t), the proportion of assignments to topic t, across all documents, that come from the word w.
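As a rough illustration of this rule (with made-up count matrices, not part of the example that follows), a new topic for a single word can be drawn in proportion to p1 * p2:

import numpy as np

np.random.seed(0)
n_docs, n_topics, vocab_size = 2, 3, 5
# doc_topic[d, t]: words in document d currently assigned to topic t
doc_topic = np.random.randint(1, 10, size=(n_docs, n_topics))
# topic_word[t, w]: assignments of word w to topic t across all documents
topic_word = np.random.randint(1, 10, size=(n_topics, vocab_size))

d, w = 0, 2                                      # the word being re-sampled
p1 = doc_topic[d] / doc_topic[d].sum()           # p(topic t | document d)
p2 = topic_word[:, w] / topic_word.sum(axis=1)   # p(word w | topic t)
weights = p1 * p2
new_topic = np.random.choice(n_topics, p=weights / weights.sum())
print("re-assigned topic:", new_topic)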

Over a sufficient number of passes, stable topic-word and document-topic distributions are attained.

Let's look at how it's executed in Python:

  1. In this step, we load the fetch_20newsgroups dataset, which comes with sklearn:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
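As a quick sanity check (not part of the original steps), we can confirm how many posts were fetched and preview one of them; the train subset of 20 Newsgroups holds a little over 11,000 posts:

print(len(documents))        # number of posts fetched
print(documents[0][:200])    # first 200 characters of the first post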
  2. In this step, we will clean the dataset. To do that, the stopwords and WordNetLemmatizer functions are required, so the relevant libraries must be loaded, as follows:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
  3. Ensure that you have downloaded the following dictionaries:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
  4. Here, a clean function is created that lowercases the words, removes the stopwords, keeps only words longer than three characters, strips punctuation, and finally lemmatizes each word, as follows:
stop = set(stopwords.words('english'))
punc = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase, drop stopwords and words of length 3 or less
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop and len(i) > 3])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in punc)
    # Lemmatize each remaining word
    lemmatized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return lemmatized

doc_clean = [clean(doc).split() for doc in documents]
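To verify the cleaning, it can be useful to compare a raw post with its cleaned, tokenized form; this check is optional:

print(documents[1][:100])   # raw text
print(doc_clean[1][:10])    # cleaned, lemmatized tokens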
  5. Now, we have to make the document-term matrix with the help of the gensim library. This library will also enable us to carry out LDA:
import gensim
from gensim import corpora
  6. A document-term matrix based on a bag of words is created here:
# Note: despite its name, corpus is a gensim Dictionary mapping each token to an integer id
corpus = corpora.Dictionary(doc_clean)
# Each document becomes a sparse list of (token_id, token_count) pairs
doc_term_matrix = [corpus.doc2bow(doc) for doc in doc_clean]
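An optional peek, mapping the ids back to tokens via the dictionary, makes the encoding concrete:

print(doc_term_matrix[0][:5])
print([(corpus[token_id], count) for token_id, count in doc_term_matrix[0][:5]])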
  7. Here, a similar matrix is created with the help of TF-IDF:
from gensim import models
tfidf = models.TfidfModel(doc_term_matrix)
corpus_tfidf = tfidf[doc_term_matrix]
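The TF-IDF corpus has the same sparse shape, but the raw counts are replaced by TF-IDF weights; an optional peek at the first document shows the difference:

first_doc_tfidf = corpus_tfidf[0]   # indexable because doc_term_matrix is a plain list
print(first_doc_tfidf[:5])          # (token_id, tfidf_weight) pairs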
  8. Let's set up the model with the TF-IDF matrix. The number of topics has been set to 10:
lda_model1 = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=corpus, passes=2, workers=2)
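Once trained, the model can also report the per-document topic mixture, which corresponds to the phi distribution described earlier. This optional peek is not part of the original steps:

# Topic mixture of the first document: (topic_id, probability) pairs
print(lda_model1.get_document_topics(corpus_tfidf[0]))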
  9. Let's take a look at the topics along with their words:
print(lda_model1.print_topics(num_topics=5, num_words=5))

The output is as follows:

  10. A similar exercise will be done for the bag-of-words matrix; later, we will compare the two:
lda_model2 = gensim.models.LdaMulticore(doc_term_matrix, num_topics=10, id2word=corpus, passes=2, workers=2)

print(lda_model2.print_topics(num_topics=5, num_words=5))

We get the following output:
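To compare the TF-IDF and bag-of-words models with something more objective than eyeballing the printed topics, one option (an addition to the original recipe) is gensim's CoherenceModel, where a higher c_v score is generally better:

from gensim.models import CoherenceModel

coh_tfidf = CoherenceModel(model=lda_model1, texts=doc_clean,
                           dictionary=corpus, coherence='c_v').get_coherence()
coh_bow = CoherenceModel(model=lda_model2, texts=doc_clean,
                         dictionary=corpus, coherence='c_v').get_coherence()
print("TF-IDF model coherence:", coh_tfidf)
print("Bag-of-words model coherence:", coh_bow)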
