Modeling the topics of online news stories

To see how topic models perform on real data, we will look at two data sets containing articles originating from BBC News during the period of 2004-2005. The first data set, which we will refer to as the BBC data set, contains 2,225 articles that have been grouped into five topics. These are business, entertainment, politics, sports, and technology.

The second data set, which we will call the BBCSports data set, contains 737 articles, all of them about sports. These are grouped into five categories according to the type of sport being described. The five sports in question are athletics, cricket, football, rugby, and tennis. Our objective will be to see whether we can build topic models for each of these two data sets that group together articles from the same major topic.

Note

Both BBC data sets were presented in a paper by D. Greene and P. Cunningham, titled Producing Accurate Interpretable Clusters from High-Dimensional Data and published in the proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05) in October 2005.

The two data sets can be found at http://mlg.ucd.ie/datasets/bbc.html. When downloaded, each data set is a folder containing a few different files. We will use the variables bbc_folder and bbcsports_folder to store the paths of these folders on our computer.

Each folder contains three important files. The file with the extension .mtx contains a term document matrix in sparse form. Concretely, the rows of the matrix are terms that can be found in the articles and the columns are the articles themselves. An entry M[i,j] in this matrix contains the number of times the term corresponding to row i was found in the document corresponding to column j. A term document matrix is thus a transposed document term matrix, which we encountered in Chapter 8, Probabilistic Graphical Models. The matrix is stored in the Matrix Market format, in which each line corresponds to a nonempty cell in the matrix.
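To make the format concrete, the first few lines of such a file might look like the following (a hypothetical illustration rather than the actual contents of bbc.mtx; after the header, the first line gives the number of rows, columns, and nonzero entries, and each subsequent line gives a row index, a column index, and a count):

%%MatrixMarket matrix coordinate real general
9635 2225 286774
1 1 1
1 6 2
3 1 4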

Typically, when working with texts such as news articles, we would need to perform some preprocessing steps, such as stop-word removal, just as we did using the tm package in our sentiment analysis example in Chapter 8, Probabilistic Graphical Models. Fortunately for us, the articles in these data sets have already been processed: the terms have been stemmed, and stop words have been removed, as have any terms that appear fewer than three times.
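For readers starting from raw text rather than these preprocessed files, a rough sketch of equivalent preprocessing with the tm package might look like the following. This is only an illustration: raw_articles is a hypothetical character vector of article texts, the SnowballC package is needed for stemming, and keeping terms that appear in at least three documents is an approximation of the filtering described above:

> library("tm")
> library("SnowballC")
> # raw_articles is a hypothetical character vector of article texts
> corpus <- VCorpus(VectorSource(raw_articles))
> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> corpus <- tm_map(corpus, stemDocument)
> # keep only terms appearing in at least three documents
> dtm <- DocumentTermMatrix(corpus, 
         control = list(bounds = list(global = c(3, Inf))))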

In order to interpret the term document matrix, the files with the extension .terms contain the actual terms, one per line, which are the row names of the term document matrix. Similarly, the document names that are the column names of the term document matrix are stored in the files with the extension .docs.

We first create variables for the paths to the three files that we need for each data set:

> bbc_folder <- "~/Downloads/bbc/"
> bbcsports_folder <- "~/Downloads/bbcsport/"

> bbc_source <- paste(bbc_folder, "bbc.mtx", sep = "")
> bbc_source_terms <- paste(bbc_folder, "bbc.terms", sep = "")
> bbc_source_docs <- paste(bbc_folder, "bbc.docs", sep = "")

> bbcsports_source <- paste(bbcsports_folder, "bbcsport.mtx", sep = "")
> bbcsports_source_terms <- paste(bbcsports_folder, 
                                 "bbcsport.terms", sep = "")
> bbcsports_source_docs <- paste(bbcsports_folder, 
                                 "bbcsport.docs", sep = "")

In order to load data into R from a file in Matrix Market format, we can use the readMM() function from the Matrix R package. This function loads the data and stores it in a sparse matrix object. We can convert this into a term document matrix that the tm package can interpret using the as.TermDocumentMatrix() function from that package. Aside from the matrix object that is the first argument to this function, we also need to specify the weighting parameter. This parameter describes what the numbers in the original matrix correspond to. In our case, we have raw term frequencies, so we specify the value weightTf:

> library("tm")
> library("Matrix")
> bbc_matrix <- readMM(bbc_source)
> bbc_tdm <- as.TermDocumentMatrix(bbc_matrix, weightTf)

> bbcsports_matrix <- readMM(bbcsports_source)
> bbcsports_tdm <- as.TermDocumentMatrix(bbcsports_matrix, 
                                         weightTf)

Next, we load the terms and document identifiers from the two remaining files and use these to create appropriate row and column names respectively for the term document matrices. We can use the standard scan() function to read files with a single entry per line and load the entries into vectors. Once we have a term vector and a document identifier vector, we will use these to update the row and column names for the term document matrix. Finally, we'll transpose this matrix into a document term matrix as this is the format we will need for subsequent steps:

> bbc_rows <- scan(bbc_source_terms, what = "character")
Read 9635 items
> bbc_cols <- scan(bbc_source_docs, what = "character")
Read 2225 items
> bbc_tdm$dimnames$Terms <- bbc_rows
> bbc_tdm$dimnames$Docs <- bbc_cols
> (bbc_dtm <- t(bbc_tdm))
<<DocumentTermMatrix (documents: 2225, terms: 9635)>>
Non-/sparse entries: 286774/21151101
Sparsity           : 99%
Maximal term length: 24
Weighting          : term frequency (tf)
> bbcsports_rows <- scan(bbcsports_source_terms, what =  
                         "character")
Read 4613 items
> bbcsports_cols <- scan(bbcsports_source_docs, what =  
                         "character")
Read 737 items
> bbcsports_tdm$dimnames$Terms <- bbcsports_rows
> bbcsports_tdm$dimnames$Docs <- bbcsports_cols
> (bbcsports_dtm <- t(bbcsports_tdm))
<<DocumentTermMatrix (documents: 737, terms: 4613)>>
Non-/sparse entries: 85576/3314205
Sparsity           : 97%
Maximal term length: 17
Weighting          : term frequency (tf)

We now have the document term matrices for our two data sets ready. We can see that there are roughly twice as many terms in the BBC data set as in the BBCSports data set, and the latter also has about a third as many documents, so it is a much smaller data set. Before we build our topic models, we must also create the vectors containing the original topic classification of the articles. If we examine the document IDs, we can see that the format of each document identifier is <topic>.<counter>:

> bbc_cols[1:5]
[1] "business.001" "business.002" "business.003" "business.004" 
[5] "business.005"

> bbcsports_cols[1:5]
[1] "athletics.001" "athletics.002" "athletics.003" 
[4] "athletics.004" "athletics.005"

To create a vector with the correct topic assignments, we simply need to strip out the last four characters of each entry. If we then convert the result into a factor, we can see how many documents we have per topic:

> bbc_gold_topics <- sapply(bbc_cols, 
                           function(x) substr(x, 1, nchar(x) - 4))
> bbc_gold_factor <- factor(bbc_gold_topics)
> summary(bbc_gold_factor)
     business entertainment      politics         sport 
          510           386           417           511 
         tech 
          401 

> bbcsports_gold_topics <- sapply(bbcsports_cols, 
                           function(x) substr(x, 1, nchar(x) - 4))
> bbcsports_gold_factor <- factor(bbcsports_gold_topics)
> summary(bbcsports_gold_factor)
athletics   cricket  football     rugby    tennis 
      101       124       265       147       100

This shows that the BBC data set is fairly even in the distribution of its topics. In the BBCSports data set, however, we see that there are roughly twice as many articles on football as on each of the other four sports.

For each of our two data sets, we will now build some topic models using the package topicmodels. This is a very useful package as it allows us to use data structures created with the tm package to perform topic modeling. For each data set, we will build the following four different topic models:

  • LDA_VEM: This is an LDA model trained with the Variational Expectation Maximization (VEM) method. This method automatically estimates the α Dirichlet parameter vector.
  • LDA_VEM_α: This is an LDA model also trained with VEM; the difference here is that the α Dirichlet parameter vector is not estimated.
  • LDA_GIB: This is an LDA model trained with Gibbs sampling.
  • CTM_VEM: This is an implementation of the Correlated Topic Model (CTM) trained with VEM. Currently, the topicmodels package does not support training this model with Gibbs sampling.

To train an LDA model, the topicmodels package provides us with the LDA() function. We will use four key parameters for this function. The first of these specifies the document term matrix for which we want to build an LDA model. The second of these, k, specifies the target number of topics we want to have in our model. The third parameter, method, allows us to select which training algorithm to use. This is set to VEM by default, so we only need to specify this for our LDA_GIB model that uses Gibbs sampling.

Finally, there is a control parameter, which takes in a list of parameters that affect the fitting process. As there is an inherent random component involved in training topic models, we can specify a seed parameter in this list in order to make the results reproducible. This is also where we can specify whether we want to estimate the α Dirichlet parameter, and where we can include parameters for the Gibbs sampling procedure, such as the number of omitted Gibbs iterations at the start of the training procedure (burnin), the number of omitted in-between iterations (thin), and the total number of Gibbs iterations (iter). To train a CTM model, the topicmodels package provides us with the CTM() function, which has a syntax similar to that of the LDA() function.

Using this knowledge, we'll define a function that creates a list of four trained models given a particular document term matrix, the number of topics required, and the seed. For this function, we have used some standard values for the aforementioned training parameters with which the reader is encouraged to experiment, ideally after investigating the references provided for the two optimization methods:

compute_model_list <- function (k, topic_seed, myDtm){
  LDA_VEM <- LDA(myDtm, k = k, control = list(seed = topic_seed))
  LDA_VEM_a <- LDA(myDtm, k = k, control = list(estimate.alpha = 
                   FALSE, seed = topic_seed))
  LDA_GIB <- LDA(myDtm, k = k, method = "Gibbs", control = 
                 list(seed = topic_seed, burnin = 1000, thin = 
                 100, iter = 1000))
  CTM_VEM <- CTM(myDtm, k = k, control = list(seed = topic_seed, 
                 var = list(tol = 10^-4), em = list(tol = 10^-3)))
  return(list(LDA_VEM = LDA_VEM, LDA_VEM_a = LDA_VEM_a, 
         LDA_GIB = LDA_GIB, CTM_VEM = CTM_VEM))
}

We'll now use this function to train a list of models for the two data sets:

> library("topicmodels")
> k <- 5
> topic_seed <- 5798252
> bbc_models <- compute_model_list(k, topic_seed, bbc_dtm)
> bbcsports_models <- compute_model_list(k, topic_seed, 
                                         bbcsports_dtm)

To get a sense of how the topic models have performed, let's first see whether the five topics learned by each model correspond to the five topics to which the articles were originally assigned. Given one of these trained models, we can use the topics() function to get a vector of the most likely topic chosen for each document.

This function actually takes a second parameter, k, set to 1 by default, which returns the top k topics predicted by the model for each document. We only want one topic per document in this particular instance, but retrieving more is simply a matter of increasing k, as the short sketch below shows.
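For example, a call of the following form would return the two most likely topics for each of the first three documents (a quick sketch; the output is not reproduced here):

> topics(bbc_models$LDA_VEM, 2)[, 1:3]

Having found the most likely topic for each document, we can tabulate the predicted topics against the vector of labeled topics. These are the results for the LDA_VEM model for the BBC data set: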

> model_topics <- topics(bbc_models$LDA_VEM)
> table(model_topics, bbc_gold_factor)
            bbc_gold_factor
model_topics business entertainment politics sport tech
           1       11           174        2     0  176
           2        4           192        1     0  202
           3      483             3       10     0    7
           4        9            17      403     4   15
           5        3             0        1   507    1

Looking at this table, we can see that topic 5 corresponds almost exclusively to the sport category. Similarly, topics 4 and 3 seem to match the politics and business categories respectively. Unfortunately, topics 1 and 2 both contain a mixture of entertainment and technology articles, and as a result this model hasn't really succeeded in distinguishing between the categories that we want.

It should be clear that, in an ideal situation, each model topic would match exactly one gold topic (we often use the adjective gold to refer to the correct or labeled value of a particular variable; the term derives from the expression gold standard, which refers to a widely accepted standard). We can repeat this process on the LDA_GIB model, where the story is different:

> model_topics <- topics(bbc_models$LDA_GIB)
> table(model_topics, bbc_gold_factor)
            bbc_gold_factor
model_topics business entertainment politics sport tech
           1      471             2       12     1    5
           2        0             0        3   506    3
           3        9             4        1     0  371
           4       27            16      399     3    9
           5        3           364        2     1   13

Intuitively, we feel that this topic model is a better match to our original topics than the first, as evidenced by the fact that each model topic selects articles from primarily one gold topic.

A rough way to estimate the quality of the match between a topic model and our target vector of topics is to assume that the largest value in every row corresponds to the gold topic assigned to the model topic represented by that row. The total accuracy is then the sum of these maximum row values divided by the total number of documents. In the preceding example, for the LDA_GIB model, this number would be (471 + 506 + 371 + 399 + 364) / 2225 = 2111 / 2225 = 94.9 percent. The following function computes this value given a model and a vector of gold topics:

compute_topic_model_accuracy <- function(model, gold_factor) {
  model_topics <- topics(model)
  model_table <- table(model_topics, gold_factor)
  model_matches <- apply(model_table, 1, max)
  model_accuracy <- sum(model_matches) / sum(model_table)
  return(model_accuracy)
}

Using this notion of accuracy, let's see which model performs better in our two data sets:

> sapply(bbc_models, function(x) 
         compute_topic_model_accuracy(x, bbc_gold_factor))
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
0.7959551 0.7923596 0.9487640 0.6148315 
> sapply(bbcsports_models, function(x) 
         compute_topic_model_accuracy(x, bbcsports_gold_factor))
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
0.7924016 0.7788331 0.7856174 0.7503392

For the BBC data set, we see that the LDA_GIB model significantly outperforms the others and the CTM_VEM model is significantly worse than the LDA models. For the BBCSports data set, all the models perform roughly the same, but the LDA_VEM model is slightly better.

Another way to assess the quality of a model fit is to compute the log likelihood of the data under the model, remembering that the larger this value, the better the fit. We can do this with the logLik() function in the topicmodels package, which suggests that in both cases the best model is the LDA model trained with Gibbs sampling:

> sapply(bbc_models, logLik)
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
 -3201542  -3274005  -3017399  -3245828
> sapply(bbcsports_models, logLik)
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
-864357.7 -886561.9 -813889.7 -868561.9 

Model stability

It turns out that the random component of the optimization procedures involved in fitting these models often has a significant impact on the model that is trained. Put differently, if we use different random number seeds, the results can change appreciably.

Ideally, we would like our model to be stable, which is to say that we would like the effect of the initial conditions of the optimization procedure that are determined by a random number seed to be minimal. It is a good idea to investigate the effect of different seeds on our four models by training them on multiple seeds:

> seeded_bbc_models <- lapply(5798252 : 5798256, 
              function(x) compute_model_list(k, x, bbc_dtm))
> seeded_bbcsports_models <- lapply(5798252 : 5798256, 
              function(x) compute_model_list(k, x, bbcsports_dtm))

Here we used a sequence of five consecutive seeds and trained our models on both data sets five times. Having done this, we can investigate the accuracy of our models for the various seeds. If the accuracy of a method does not vary by a large degree across the seeds, we can infer that the method is quite stable and produces similar topic models (although, in this case, we are only considering the most prominent topic per document).

> seeded_bbc_models_acc <- sapply(seeded_bbc_models, 
  function(x) sapply(x, function(y) 
  compute_topic_model_accuracy(y, bbc_gold_factor)))
> seeded_bbc_models_acc
               [,1]      [,2]      [,3]      [,4]      [,5]
LDA_VEM   0.7959551 0.7959551 0.7065169 0.7065169 0.7757303
LDA_VEM_a 0.7923596 0.7923596 0.6916854 0.6916854 0.7505618
LDA_GIB   0.9487640 0.9474157 0.9519101 0.9501124 0.9460674
CTM_VEM   0.6148315 0.5883146 0.9366292 0.8026966 0.7074157

> seeded_bbcsports_models_acc <- sapply(seeded_bbcsports_models, 
  function(x) sapply(x, function(y) 
  compute_topic_model_accuracy(y, bbcsports_gold_factor)))
> seeded_bbcsports_models_acc
               [,1]      [,2]      [,3]      [,4]      [,5]
LDA_VEM   0.7924016 0.7924016 0.8616011 0.8616011 0.9050204
LDA_VEM_a 0.7788331 0.7788331 0.8426052 0.8426052 0.8914518
LDA_GIB   0.7856174 0.7978290 0.8073270 0.7978290 0.7761194
CTM_VEM   0.7503392 0.6309362 0.7435550 0.8995929 0.6526459

On both data sets, we can clearly see that Gibbs sampling results in a more stable model, and in the case of the BBC data set it is also the clear winner in terms of accuracy. Gibbs sampling generally tends to produce more accurate models; although it was not readily apparent on these data sets, it can also become significantly slower than the VEM methods once the data set becomes large.

The two LDA models trained with variational methods exhibit accuracies that vary within a roughly 10 percent range on both data sets. On both data sets, we also see that LDA_VEM is consistently better than LDA_VEM_a by a small amount; this method also produces the best average accuracy among all the models on the BBCSports data set. The CTM model is the least stable of all the models, exhibiting a high degree of variance on both data sets. Interestingly, though, the best accuracy achieved by the CTM model across the five seeds is only marginally worse than the best accuracy achieved by any of the other methods.

If we see that our model is not very stable across a few seeded iterations, we can set the nstart parameter during training, which specifies the number of random restarts used during the optimization procedure. To see how this works in practice, we have created a modified compute_model_list() function that we named compute_model_list_r(), which takes in an extra parameter, nstart.

The other difference is that the seed parameter now needs a vector of seeds with as many entries as the number of random restarts. To deal with this, we will simply create a suitably sized range of seeds starting from the one provided. Here is our new function:

compute_model_list_r <- function (k, topic_seed, myDtm, nstart) {
  seed_range <- topic_seed : (topic_seed + nstart - 1)
  LDA_VEM <- LDA(myDtm, k = k, control = list(seed = seed_range, 
                 nstart = nstart))
  LDA_VEM_a <- LDA(myDtm, k = k, control = list(estimate.alpha = 
                 FALSE, seed = seed_range, nstart = nstart))
  LDA_GIB <- LDA(myDtm, k = k, method = "Gibbs", control = 
                 list(seed = seed_range, burnin = 1000, thin = 
                 100, iter = 1000, nstart = nstart))
  CTM_VEM <- CTM(myDtm, k = k, control = list(seed = seed_range, 
                 var = list(tol = 10^-4), em = list(tol = 10^-3), 
                 nstart = nstart))
  return(list(LDA_VEM = LDA_VEM, LDA_VEM_a = LDA_VEM_a, 
              LDA_GIB = LDA_GIB, CTM_VEM = CTM_VEM))
}

We will use this function to create a new model list. Note that using random restarts means we are increasing the amount of time needed to train, so these next few commands will take some time to complete.

> nstart <- 5
> topic_seed <- 5798252
> nstarted_bbc_models_r <- 
        compute_model_list_r(k, topic_seed, bbc_dtm, nstart)
> nstarted_bbcsports_models_r <- 
        compute_model_list_r(k, topic_seed, bbcsports_dtm, nstart)
> sapply(nstarted_bbc_models_r, function(x) 
  compute_topic_model_accuracy(x, bbc_gold_factor))
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
0.7959551 0.7923596 0.9487640 0.9366292 
> sapply(nstarted_bbcsports_models_r, function(x) 
  compute_topic_model_accuracy(x, bbcsports_gold_factor))
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
0.9050204 0.8426052 0.7991859 0.8995929

Note that even after using only five random restarts, the accuracy of the models has generally improved. More importantly, we now see that using random restarts has overcome the fluctuations that the CTM model experienced, and as a result it now performs almost as well as the best model on each data set.

Finding the number of topics

In this predictive task, the number of different topics was known beforehand. This turned out to be very important because it is provided as an input to the functions that trained our models. The number of topics might not be known when we are using topic modeling as a form of exploratory analysis where our goal is simply to cluster documents together based on the similarity of their topics.

This is a challenging question and bears some similarity to the general problem of selecting the number of clusters when we perform clustering. One proposed solution to this problem is to perform cross-validation over a range of different numbers of topics. This approach will not scale well at all when the data set is large, especially since training a single topic model is already quite computationally intensive once we factor in issues such as random restarts.
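To make this idea concrete, here is a minimal sketch of such an approach for the BBC data set (not taken from the original text). We hold out a random subset of documents, fit a VEM LDA model for each of a few candidate values of k on the remainder, and compare the perplexity of each model on the held-out documents; the 80/20 split, the seed, and the candidate values of k are arbitrary choices made for illustration:

> set.seed(101)
> n_docs <- nrow(bbc_dtm)
> # hold out 20 percent of the documents for evaluation
> train_ind <- sample(n_docs, round(0.8 * n_docs))
> test_ind <- setdiff(seq_len(n_docs), train_ind)
> candidate_k <- c(2, 5, 10, 20)
> heldout_perplexity <- sapply(candidate_k, function(k) {
    lda_fit <- LDA(bbc_dtm[train_ind, ], k = k, 
                   control = list(seed = topic_seed))
    perplexity(lda_fit, newdata = bbc_dtm[test_ind, ])
  })
> names(heldout_perplexity) <- candidate_k

Lower perplexity on the held-out documents suggests a better choice for the number of topics, though even this sketch can take a long time to run on a large data set.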

Note

A paper that discusses a number of different approaches for estimating the number of topics in topic models is Reconceptualizing the classification of PNAS articles by Edoardo M. Airoldi and others. This appears in the Proceedings of the National Academy of Sciences, volume 107, 2010.

Topic distributions

We saw in the description of the generative process that we use a Dirichlet distribution to sample a multinomial distribution of topics. In the LDA_VEM model, the αk parameter vector is estimated. Note that in all cases, a symmetric distribution is used in this implementation so that we are only estimating the value of α, which is the value that all the αk parameters take on. For the LDA models, we can investigate which value of this parameter is used with and without estimation:

> bbc_models[[1]]@alpha
[1] 0.04893411
> bbc_models[[2]]@alpha
[1] 10
> bbcsports_models[[1]]@alpha
[1] 0.04037119
> bbcsports_models[[2]]@alpha
[1] 10

As we can see, when we estimate the value of α, we obtain a much lower value than the default, indicating that for both data sets the topic distribution of each document is expected to be peaky. We can use the posterior() function to view the distribution over topics for each document. For example, for the LDA_VEM model on the BBC data set, we find the following distributions of topics for the first few articles:

> options(digits = 4)
> head(posterior(bbc_models[[1]])$topics)
                     1         2      3         4         5
business.001 0.2700360 0.0477374 0.6818 0.0002222 0.0002222
business.002 0.0002545 0.0002545 0.9990 0.0002545 0.0002545
business.003 0.0003257 0.0003257 0.9987 0.0003257 0.0003257
business.004 0.0002153 0.0002153 0.9991 0.0002153 0.0002153
business.005 0.0337131 0.0004104 0.9651 0.0004104 0.0004104
business.006 0.0423153 0.0004740 0.9563 0.0004740 0.0004740

The following plot shows histograms of the posterior probability of the most likely topic under each of our four models. The LDA_VEM model assumes a very peaky distribution, whereas the other models have a wider spread. The CTM_VEM model also has a peak at very high probabilities but, unlike LDA_VEM, its probability mass is spread over a wide range of values. Note that the minimum possible probability for the most likely topic is 0.2 because we have five topics:

[Figure: Topic distributions. Histograms of the posterior probability of the most likely topic under each of the four models]
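A plot along these lines can be reproduced with a few lines of base R code (a sketch rather than the code used to produce the original figure; the 2 x 2 panel layout and the histogram settings are arbitrary choices):

> par(mfrow = c(2, 2))
> for (model_name in names(bbc_models)) {
    # maximum posterior topic probability for each document
    max_probs <- apply(posterior(bbc_models[[model_name]])$topics, 1, max)
    hist(max_probs, main = model_name, xlim = c(0, 1),
         xlab = "Posterior probability of most likely topic")
  }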

Another approach to estimating the smoothness of the topic distributions is to compute the model entropy. We will define this as the average entropy of all the topic distributions across the different documents. Smooth distributions will exhibit higher entropy than peaky distributions. To compute the entropy of our model, we will define two functions. The function compute_entropy() computes the entropy of a particular topic distribution of a document, and the compute_model_mean_entropy() function computes the average entropy across all the different documents in the model:

compute_entropy <- function(probs) {
  return(- sum(probs * log(probs)))
}

compute_model_mean_entropy <- function(model) {
  topics <- posterior(model)$topics
  return(mean(apply(topics, 1, compute_entropy)))
}

Using these functions, we can compute the average model entropies for the models trained on our two data sets:

> sapply(bbc_models, compute_model_mean_entropy)
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
0.3119491 1.2664310 1.2720891 0.8373708 
> sapply(bbcsports_models, compute_model_mean_entropy)
  LDA_VEM LDA_VEM_a   LDA_GIB   CTM_VEM 
0.3058856 1.3084006 1.3421798 0.7545975

These results are consistent with the preceding plots: the LDA_VEM model, which is the peakiest, has a much lower average entropy than the other models.

Word distributions

Just as in the previous section, where we looked at the distribution of topics across different documents, we are often also interested in understanding the most important terms in the documents assigned to a given topic. We can see the k most frequent terms for each topic of a model using the terms() function. This takes in a model and a number specifying how many of the most frequent terms we want retrieved. Let's see the ten most important words per topic in the LDA_GIB model of the BBC data set:

> GIB_bbc_model <- bbc_models[[3]]
> terms(GIB_bbc_model, 10)
      Topic 1   Topic 2   Topic 3     Topic 4  Topic 5 
 [1,] "year"    "plai"    "peopl"     "govern" "film"  
 [2,] "compani" "game"    "game"      "labour" "year"  
 [3,] "market"  "win"     "servic"    "parti"  "best"  
 [4,] "sale"    "against" "technolog" "elect"  "show"  
 [5,] "firm"    "england" "mobil"     "minist" "includ"
 [6,] "expect"  "first"   "on"        "plan"   "on"    
 [7,] "share"   "back"    "phone"     "sai"    "award" 
 [8,] "month"   "player"  "get"       "told"   "music" 
 [9,] "bank"    "world"   "work"      "peopl"  "top"   
[10,] "price"   "time"    "wai"       "public" "star" 

As we can see, given this list of word stems, one could easily guess which of the five topic labels we should assign to each topic. A very handy way to visualize frequent terms in a collection of documents is through a word cloud. The R package wordcloud is useful for creating these. The function wordcloud() allows us to specify a vector of terms followed by a vector of their frequencies, and this information is then used for plotting.

Unfortunately, we will have to do some manipulation on the document term matrices in order to get the word frequencies by topic so that we can feed them into this function. To that end, we've created our own function plot_wordcloud() as follows:

plot_wordcloud <- function(model, myDtm, index, numTerms) {
  
  model_terms <- terms(model, numTerms)
  model_topics <- topics(model)
  
  terms_i <- model_terms[,index]
  topic_i <- model_topics == index
  dtm_i <- myDtm[topic_i, terms_i]
  frequencies_i <- colSums(as.matrix(dtm_i))
  wordcloud(terms_i, frequencies_i, min.freq = 0)
}

Our function takes in a model, a document term matrix, the index of a topic, and the number of most frequent terms that we want to display in the word cloud. We begin by computing the most frequent terms for the model by topic, as we did earlier, as well as the most probable topic assignment for each document. Next, we subset the document term matrix so that we obtain only the cells involving the terms we are interested in and the documents corresponding to the topic whose index we passed in as a parameter.

From this reduced document term matrix, we sum over the columns to compute the frequencies of the most frequent terms and finally we can plot the word cloud. We've used this function to plot the word clouds for the topics in the BBC data set using the LDA_GIB model and 25 words per topic. This is shown here:

[Figure: Word distributions. Word clouds of the 25 most frequent terms for each topic of the LDA_GIB model on the BBC data set]
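For reference, the word cloud for a single topic, say the first one, can be produced with a call of the following form after loading the wordcloud package (repeating the call for the five topic indices produces the full set of plots):

> library("wordcloud")
> plot_wordcloud(GIB_bbc_model, bbc_dtm, 1, 25)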

LDA extensions

Topic models are an active area of research, and as a result several extensions of the LDA model have been proposed. We will briefly mention two of these. The first is the supervised LDA model, an implementation of which can be found in the lda R package. This offers a more direct way to model a response variable than the standard LDA method and would be a good next step to investigate for the application discussed in this chapter.

A second interesting extension is the author-topic model. It adds an extra step to the generative process to account for authorship information and is a good choice when building models that summarize or predict the writing habits and topics of authors.

Note

The standard reference for supervised LDA is the paper Supervised Topic Models by David M. Blei and Jon D. McAuliffe, published in 2007 in Advances in Neural Information Processing Systems. For the author-topic model, consult the paper titled The Author-Topic Model for Authors and Documents by Michal Rosen-Zvi and others, which appears in the proceedings of the 20th conference on Uncertainty in Artificial Intelligence.
