R packages for LDA

There are two main R packages that can be used for performing LDA on documents. One is the topicmodels package developed by Bettina Grün and Kurt Hornik, and the other is lda, developed by Jonathan Chang. Here, we describe both packages.

The topicmodels package

The topicmodels package is an interface to the C and C++ code developed by the authors of the papers on LDA and Correlated Topic Models (CTM) (references 7, 8, and 9 in the References section of this chapter). The main function in this package, LDA, is used to fit LDA models. It can be called as follows:

>LDA(X,K,method = "Gibbs",control = NULL,model = NULL,...)

Here, X is a document-term matrix, which can be generated using the tm package, and K is the number of topics. The method argument specifies the estimation method used for fitting; two methods are supported: Gibbs and VEM.
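When Gibbs sampling is used, the sampler can be tuned through the control argument. The following is an illustrative sketch; the burnin, iter, and thin settings are options of the Gibbs control class in topicmodels, and the values shown here are arbitrary, not recommendations:

>#illustrative Gibbs call; burnin/iter/thin values are arbitrary
>ldamodel <- LDA(X,10,method = "Gibbs",control = list(burnin = 1000,iter = 2000,thin = 100))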

Let's do a small example of building LDA models using this package. The dataset used is the Reuter_50_50 dataset from the UCI Machine Learning repository (references 10 and 11 in the References section of this chapter). The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/Reuter_50_50. For this exercise, we will only use documents from one directory, namely AaronPressman in the C50train directory. The required preprocessing can be done using the following R script; readers should have installed the tm and topicmodels packages before trying this exercise:

>library(topicmodels)
>library(tm)
>#creation of training corpus from reuters dataset
>dirsourcetrain <- DirSource(directory = "C:/…/C50/C50train/AaronPressman")
>xtrain <- VCorpus(dirsourcetrain)
>#remove extra white space
>xtrain <- tm_map(xtrain,stripWhitespace)
>#changing to lower case
>xtrain <- tm_map(xtrain,content_transformer(tolower))
>#removing stop words
>xtrain <- tm_map(xtrain,removeWords,stopwords("english"))
>#stemming the document
>xtrain <- tm_map(xtrain,stemDocument)
>#creating the Document-Term Matrix; LDA() accepts this object directly
>xtrain <- DocumentTermMatrix(xtrain)

The same set of steps can be used to create the test document-term matrix, xtest, from the /…/C50/C50test/ directory; a sketch follows.
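In this sketch, the author directory under C50test is an assumption (point DirSource at whichever test files you need), and the test matrix is restricted to the training vocabulary via the dictionary control option of tm so that its columns line up with the fitted model:

>#creation of test corpus, mirroring the training steps
>dirsourcetest <- DirSource(directory = "C:/…/C50/C50test/AaronPressman")
>xtest <- VCorpus(dirsourcetest)
>xtest <- tm_map(xtest,stripWhitespace)
>xtest <- tm_map(xtest,content_transformer(tolower))
>xtest <- tm_map(xtest,removeWords,stopwords("english"))
>xtest <- tm_map(xtest,stemDocument)
>#keep only terms from the training vocabulary so the columns match
>xtest <- DocumentTermMatrix(xtest,control = list(dictionary = Terms(xtrain)))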

Once we have the document-term matrices xtrain and xtest, the LDA model can be built and tested using the following R script:

>#training lda model
>ldamodel <- LDA(xtrain,10,method = "VEM")
>#computing perplexity on the training data (only available with the VEM method)
>perp <- perplexity(ldamodel)
>perp
[1] 407.3006

As a rough guideline, a perplexity of around 100 indicates a good fit; the value of about 407 obtained here suggests that we need to add more training data or tune the value of K, for example by comparing a few candidate values as shown below.
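A common way to choose K is to fit models for several candidate values and compare their perplexities. The candidate values in this sketch are arbitrary:

>#comparing training perplexity across a few candidate values of K
>sapply(c(5,10,20,50),function(k) perplexity(LDA(xtrain,k,method = "VEM")))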

Now let's use the trained LDA model to predict the topics on the test dataset:

>#extracting topics from test data
>postprob <- posterior(ldamodel,xtest)
>postprob$topics

Here, the test set contains only one file, namely 42764newsML.txt. Its posterior distribution over the 10 topics produced by the LDA model is shown in postprob$topics.
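The fitted model itself can also be inspected with the terms and topics functions of topicmodels, which report the most probable words per topic and the most likely topic for each training document; for example:

>#five most probable words in each of the 10 topics
>terms(ldamodel,5)
>#most likely topic for each training document
>topics(ldamodel)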

The lda package

The lda package was developed by Jonathan Chang; it implements a collapsed Gibbs sampling method for estimating the posterior. The package can be downloaded from the CRAN website at http://cran.r-project.org/web/packages/lda/index.html.

The main function in the package, lda.collapsed.gibbs.sampler, uses a collapsed Gibbs sampler to fit three different models: latent Dirichlet allocation (LDA), supervised LDA (sLDA), and the mixed-membership stochastic blockmodel (MMSB). It takes documents as input and returns point estimates of the latent parameters. It can be called in R as follows:

>lda.collapsed.gibbs.sampler(documents,K,vocab,num.iterations,alpha,eta,initial = NULL,burnin = NULL,compute.log.likelihood = FALSE,trace = 0L,freeze.topics = FALSE)

Here, documents is a list of length D, in which each element represents one document; K is the number of topics; vocab is a character vector specifying the vocabulary of words; and alpha and eta are the values of the Dirichlet hyperparameters.
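As a minimal sketch, assuming a toy two-document corpus and arbitrary hyperparameter values, the lexicalize helper of the lda package can build the documents and vocab inputs (each element of documents is a two-row integer matrix of word indices into vocab and their counts):

>library(lda)
>#toy corpus; lexicalize() produces the documents/vocab format the sampler expects
>rawdocs <- c("bayesian models for text data","topic models uncover latent topics in documents")
>corpus <- lexicalize(rawdocs,lower = TRUE)
>result <- lda.collapsed.gibbs.sampler(corpus$documents,K = 2,vocab = corpus$vocab,num.iterations = 100,alpha = 0.1,eta = 0.1)
>#three top-scoring words per topic
>top.topic.words(result$topics,3,by.score = TRUE)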
