Topic modeling for BBC news

As discussed earlier, the goal of topic modeling is to identify patterns in a text corpus that correspond to document topics. In this example, we will use a dataset originating from BBC news. This dataset is one of the standard benchmarks in machine learning research, and is available for non-commercial and research purposes.

The goal is to build a classifier that is able to assign a topic to an uncategorized document.

BBC dataset

Greene and Cunningham (2006) collected the BBC dataset to study a particular document-clustering challenge using support vector machines. The dataset consists of 2,225 documents from the BBC News website from 2004 to 2005, corresponding to stories collected from five topical areas: business, entertainment, politics, sport, and tech. The dataset can be downloaded from the following website:

http://mlg.ucd.ie/datasets/bbc.html

Download the raw text files under the Dataset: BBC section. You will also notice that the website contains an already-processed dataset, but for this example, we want to process the dataset ourselves. The ZIP archive contains five folders, one per topic. The actual documents are placed in the corresponding topic folder, as shown in the following screenshot:

BBC dataset

Now, let's build a topic classifier.

Modeling

Start by importing the dataset and processing the text:

import cc.mallet.types.*;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.*;
import cc.mallet.topics.*;
import cc.mallet.util.*; // for Randoms, used later to split the dataset

import java.util.*;
import java.util.regex.*;
import java.io.*;

public class TopicModeling {

  public static void main(String[] args) throws Exception {

String dataFolderPath = args[0];
String stopListFilePath = args[1];

Create a default pipeline as previously described:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);

Next, initialize folderIterator:

FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);
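
The preceding snippet assumes a TxtFilter class that accepts only plain text files. If it is not already defined from an earlier example, a minimal sketch could look like this (FileFilter comes from java.io, which is already imported):

// A minimal TxtFilter: accept only files whose names end in .txt
class TxtFilter implements FileFilter {
  public boolean accept(File file) {
    return file.toString().endsWith(".txt");
  }
}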

Construct a new instance list with the pipeline that we want to use to process the text:

InstanceList instances = new InstanceList(pipeline);

Finally, process each instance provided by the iterator:

instances.addThruPipe(folderIterator);

Now let's create a model with five topics using the cc.mallet.topics.ParallelTopicModel class, which implements a simple threaded Latent Dirichlet Allocation (LDA) model. LDA is a common method for topic modeling that uses the Dirichlet distribution to estimate the probability that a selected topic generates a particular document. We will not dive deep into the details in this chapter; the reader is referred to the original paper by D. Blei et al. (2003). Note that there is another algorithm in machine learning with the same acronym, Linear Discriminant Analysis (LDA); besides sharing the acronym, it has nothing in common with the LDA topic model.

The class is instantiated with parameters alpha and beta, which can be broadly interpreted, as follows:

  • A high alpha value means that each document is likely to contain a mixture of most of the topics, rather than any single topic specifically. A low alpha value imposes fewer such constraints on documents, which means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
  • A high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically; while a low value means that a topic may contain a mixture of just a few of the words.

In our case, we initially keep both parameters low (alpha_t = 0.01, beta_w = 0.01) as we assume topics in our dataset are not mixed much and there are many words for each of the topics:

int numTopics = 5;
ParallelTopicModel model = 
new ParallelTopicModel(numTopics, 0.01, 0.01);

Next, add the instances to the model and, as we are using the parallel implementation, specify the number of threads that will run in parallel, as follows:

model.addInstances(instances);
model.setNumThreads(4);

Run the model for a selected number of iterations. Each iteration improves the estimation of the internal LDA parameters. For testing, we can use a small number of iterations, for example, 50; in real applications, use 1,000 or 2,000 iterations. Finally, call the void estimate() method, which will actually build the LDA model:

model.setNumIterations(1000);
model.estimate();

The model outputs the following result:

0 0,06654  game england year time win world 6 
1 0,0863  year 1 company market growth economy firm 
2 0,05981  people technology mobile mr games users music 
3 0,05744  film year music show awards award won 
4 0,11395  mr government people labour election party blair 

[beta: 0,11328] 
<1000> LL/token: -8,63377

Total time: 45 seconds

LL/token indicates the model's log-likelihood divided by the total number of tokens, that is, how likely the data is given the model. Increasing values mean the model is improving.
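
If you prefer to read this value programmatically rather than from the console output, the following sketch reproduces it by dividing the model's overall log-likelihood by the total token count. It assumes the model and instances variables from earlier; modelLogLikelihood() is provided by ParallelTopicModel:

// Count all tokens in the corpus and normalize the model log-likelihood
long totalTokens = 0;
for (Instance instance : instances) {
  totalTokens += ((FeatureSequence) instance.getData()).getLength();
}
System.out.println("LL/token: "
  + model.modelLogLikelihood() / totalTokens);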

The output also shows the top words describing each topic. The words correspond to the initial topics quite well:

  • Topic 0: game, England, year, time, win, world, → sport
  • Topic 1: year, 1, company, market, growth, economy, firm → business
  • Topic 2: people, technology, mobile, mr, games, users, music → tech
  • Topic 3: film, year, music, show, awards, award, won → entertainment
  • Topic 4: mr, government, people, labour, election, party, blair → politics

There are still some words that don't make much sense, for instance, mr, 1, and 6. We could add them to the stop word list. Also, some words appear twice, for example, award and awards. This happened because we didn't apply any stemming or lemmatization pipe.
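
The top words can also be retrieved through the API instead of being read from the console. The following sketch, assuming the model, instances, and numTopics variables from earlier, prints the seven highest-ranked words for each topic using the model's sorted word lists:

// For each topic, look up the seven most probable words in the data alphabet
Alphabet dataAlphabet = instances.getDataAlphabet();
ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
for (int topic = 0; topic < numTopics; topic++) {
  StringBuilder sb = new StringBuilder("Topic " + topic + ": ");
  int rank = 0;
  for (IDSorter idCountPair : topicSortedWords.get(topic)) {
    if (rank++ >= 7) break;
    sb.append(dataAlphabet.lookupObject(idCountPair.getID())).append(" ");
  }
  System.out.println(sb.toString());
}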

In the next section, we'll take a look at how to check whether the model is any good.

Evaluating a model

As statistical topic modeling is unsupervised in nature, model selection is difficult. For some applications, there may be an extrinsic task at hand, such as information retrieval or document classification, for which performance can be evaluated. However, in general, we want to estimate the model's ability to generalize topics regardless of the task.

Wallach et al. (2009) introduced an approach that measures the quality of a model by computing the log probability of held-out documents under the model. Likelihood of unseen documents can be used to compare models—higher likelihood implies a better model.

First, let's split the documents into training and testing sets (that is, held-out documents), using 90% for training and 10% for testing:

// Split dataset
InstanceList[] instanceSplit = instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

Now, let's rebuild our model using only 90% of our documents:

// Rebuild the model from scratch and train it on the first 90% of documents
model = new ParallelTopicModel(numTopics, 0.01, 0.01);
model.addInstances(instanceSplit[0]);
model.setNumThreads(4);
model.setNumIterations(50);
model.estimate();

Next, initialize an estimator that implements Wallach's log probability of held-out documents, MarginalProbEstimator:

// Get estimator
MarginalProbEstimator estimator = model.getProbEstimator();

Note

An intuitive description of LDA is summarized by Annalyn Ng in her blog:

https://annalyzin.wordpress.com/2015/06/21/laymans-explanation-of-topic-modeling-with-lda-2/

To get deeper insight into the LDA algorithm, its components, and how it works, take a look at the original LDA paper by David Blei et al. (2003) at http://jmlr.csail.mit.edu/papers/v3/blei03a.html, or take a look at the summarized presentation by D. Santhanam of Brown University at http://www.cs.brown.edu/courses/csci2950-p/spring2010/lectures/2010-03-03_santhanam.pdf.

The class implements many estimators that require quite deep theoretical knowledge of how the LDA method works. We'll pick the left-to-right evaluator, which is appropriate for a wide range of applications, including text mining, speech recognition, and others. The left-to-right evaluator is implemented as the evaluateLeftToRight method, which returns a double and accepts the following parameters:

  • InstanceList heldOutDocuments: The held-out test instances
  • int numParticles: This algorithm parameter sets the number of particles used in the left-to-right sampling, where the default value is 10
  • boolean useResampling: This states whether to resample topics during left-to-right evaluation; resampling is more accurate, but leads to quadratic scaling in the length of documents
  • PrintStream docProbabilityStream: This is the file, or stdout, to which the inferred per-document log probabilities are written; pass null to skip this output

Let's run the estimator, as follows:

double loglike = estimator.evaluateLeftToRight(
  instanceSplit[1], 10, false, null);
System.out.println("Total log likelihood: "+loglike);

In our particular case, the estimator outputs the following log likelihood, which is only meaningful when compared against other models constructed with different parameters, pipelines, or data; the higher the log likelihood, the better the model:

Total time: 3 seconds
Topic Evaluator: 5 topics, 3 topic bits, 111 topic mask
Total log likelihood: -360849.4240795393

Now let's take a look at how to make use of this model.

Reusing a model

As we are usually not building models on the fly, it often makes sense to train a model once and use it repeatedly to classify new data.

Note that if you'd like to classify new documents, they need to go through the same pipeline as the training documents; the pipe must be the same for both training and classification. During training, the pipe's data alphabet is updated with each training instance. If you create a new pipe with the same steps, you won't get an equivalent pipeline, as its data alphabet is empty. Therefore, to use the model on new data, save and load the pipe along with the model and use this pipe to add new instances.

Saving a model

Mallet supports a standard method for saving and restoring objects based on serialization. We simply create a new instance of the ObjectOutputStream class and write the objects to files, as follows:

String modelPath = "myTopicModel";

//Save model
ObjectOutputStream oos = new ObjectOutputStream(
new FileOutputStream (new File(modelPath+".model")));
oos.writeObject(model);
oos.close();   
  
//Save pipeline
oos = new ObjectOutputStream(
new FileOutputStream (new File(modelPath+".pipeline")));
oos.writeObject(pipeline);
oos.close();

Restoring a model

Restoring a model saved through serialization is simply the inverse operation, using the ObjectInputStream class:

String modelPath = "myTopicModel";

//Load model
ObjectInputStream ois = new ObjectInputStream(
  new FileInputStream (new File(modelPath+".model")));
ParallelTopicModel model = (ParallelTopicModel) ois.readObject();
ois.close();   

// Load pipeline
ois = new ObjectInputStream(
  new FileInputStream (new File(modelPath+".pipeline")));
SerialPipes pipeline = (SerialPipes) ois.readObject();
ois.close();   
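
With both the model and the pipeline restored, classifying a new document amounts to pushing it through the same pipe and querying the model's topic inferencer. The following is a minimal sketch; the document text, instance name, and sampling parameters (100 iterations, thinning of 10, burn-in of 10) are illustrative assumptions rather than values prescribed by Mallet. If your pipeline rejects instances with a null target, supply a dummy label instead:

// Pipe the new, uncategorized document through the restored pipeline so it
// shares the model's alphabets; the target is left null because it is unknown
InstanceList newInstances = new InstanceList(pipeline);
newInstances.addThruPipe(new Instance(
  "A new, uncategorized news story ...", null, "new-doc", null));

// Infer the per-topic probabilities for the new document
TopicInferencer inferencer = model.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(
  newInstances.get(0), 100, 10, 10);
for (int topic = 0; topic < topicProbs.length; topic++) {
  System.out.println("Topic " + topic + ": " + topicProbs[topic]);
}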

We discussed how to build an LDA model to automatically classify documents into topics. In the next example, we'll look into another text mining problem—text classification.
