We will begin the modeling phase using the following steps:
- We will start by importing the dataset and processing the text using the following lines of code:
import cc.mallet.types.*;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.*;
import cc.mallet.topics.*;

import java.util.*;
import java.util.regex.*;
import java.io.*;

public class TopicModeling {

  public static void main(String[] args) throws Exception {

    String dataFolderPath = "data/bbc";
    String stopListFilePath = "data/stoplists/en.txt";
- We will then create a default pipeline object as previously described:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);
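As a side note, the token pattern above keeps only unbroken runs of Unicode letters, digits, and underscores. The following standalone sketch (not part of the chapter's pipeline; the class name and sample sentence are made up for illustration) shows what this regex extracts from raw text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPatternDemo {
    public static void main(String[] args) {
        // Same pattern as in the pipeline: Unicode letters, digits, underscores
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        String text = "Mr. Blair's party won 45% of the vote!";
        List<String> tokens = new ArrayList<>();
        Matcher m = tokenPattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        // Punctuation is dropped; "Blair's" splits into "Blair" and "s"
        System.out.println(tokens);
        // prints [Mr, Blair, s, party, won, 45, of, the, vote]
    }
}
```

Note that the apostrophe splits possessives into two tokens; the stop-word list applied later typically removes such stray fragments.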
- Next, we will initialize the folderIterator object:
FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);
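The TxtFilter class passed to the iterator is not shown in this snippet. A minimal sketch, assuming it simply accepts files with a .txt extension, could look as follows:

```java
import java.io.File;
import java.io.FileFilter;

// Minimal sketch of a filter that accepts only .txt files,
// so the iterator skips any other files in the data folders
public class TxtFilter implements FileFilter {
    @Override
    public boolean accept(File file) {
        return file.getName().toLowerCase().endsWith(".txt");
    }
}
```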
- We will now construct a new instance list with the pipeline that we want to use to process the text:
InstanceList instances = new InstanceList(pipeline);
- We will process each instance provided by the iterator:
instances.addThruPipe(folderIterator);
- Now let's create a model with five topics using the cc.mallet.topics.ParallelTopicModel class, which implements a simple threaded LDA model. LDA (latent Dirichlet allocation) is a common method for topic modeling that uses a Dirichlet distribution to estimate the probability that a selected topic generates a particular document. We will not dive deep into the details in this chapter; the reader is referred to the original paper by D. Blei et al. (2003).
The class is instantiated with parameters alpha and beta, which can be broadly interpreted as follows:
- A high alpha value means that each document is likely to contain a mixture of most of the topics, rather than any single topic specifically. A low alpha value places fewer such constraints on documents, which means a document is more likely to contain a mixture of just a few, or even only one, of the topics.
- A high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically; while a low value means that a topic may contain a mixture of just a few of the words.
In our case, we initially keep both parameters low (alpha_t = 0.01, beta_w = 0.01) as we assume topics in our dataset are not mixed much and there are many words for each of the topics:
int numTopics = 5;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 0.01, 0.01);
- We will add instances to the model, and since we are using parallel implementation, we will specify the number of threads that will run in parallel, as follows:
model.addInstances(instances);
model.setNumThreads(4);
- We will now run the model for a selected number of iterations. Each iteration improves the estimation of the internal LDA parameters. For testing, we can use a small number of iterations, for example, 50, while in real applications 1,000 or 2,000 iterations should be used. Finally, we will call the void estimate() method, which will actually build the LDA model:
model.setNumIterations(1000);
model.estimate();
The model outputs the following result:
0  0,06654  game england year time win world 6
1  0,0863   year 1 company market growth economy firm
2  0,05981  people technology mobile mr games users music
3  0,05744  film year music show awards award won
4  0,11395  mr government people labour election party blair
[beta: 0,11328]
<1000> LL/token: -8,63377
Total time: 45 seconds
LL/token indicates the model's log-likelihood divided by the total number of tokens, that is, how likely the data is given the model. Increasing (less negative) values mean the model is improving.
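To make the metric concrete, here is a small standalone illustration, with made-up per-token probabilities rather than values from the model above, of how a per-token log-likelihood is computed: the sum of the log-probabilities assigned to each token, divided by the token count:

```java
public class LLPerTokenDemo {
    public static void main(String[] args) {
        // Hypothetical probabilities that some model assigns to four tokens
        double[] tokenProbs = {0.01, 0.002, 0.0005, 0.02};

        // Sum the natural logs of the per-token probabilities
        double logLikelihood = 0.0;
        for (double p : tokenProbs) {
            logLikelihood += Math.log(p);
        }

        // Normalize by the number of tokens; less negative = better fit
        double llPerToken = logLikelihood / tokenProbs.length;
        System.out.printf("LL/token: %.5f%n", llPerToken);
    }
}
```

Because probabilities lie below 1, each log term is negative, so LL/token is always negative; a model that assigns higher probabilities to the observed tokens yields a value closer to zero.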
The output also shows the top words describing each topic. The words correspond to initial topics really well:
- Topic 0: game, england, year, time, win, world, 6 ⇒ sport
- Topic 1: year, 1, company, market, growth, economy, firm ⇒ finance
- Topic 2: people, technology, mobile, mr, games, users, music ⇒ tech
- Topic 3: film, year, music, show, awards, award, won ⇒ entertainment
- Topic 4: mr, government, people, labour, election, party, blair ⇒ politics
There are still some words that don't make much sense, for instance, mr, 1, and 6. We could include them in the stop word list. Also, some words appear twice, for example, award and awards. This happened because we didn't apply any stemmer or lemmatization pipe.
In the next section, we'll check whether the model is any good.