Mining the news with R

In this section, we discuss news mining in R. We start with a successful document classification and then show how to collect news articles directly from R.

A successful document classification

In this section, we examine a particular dataset: a term-document matrix of 2,171 press articles containing the word flu in their title. The articles were found on LexisNexis using this search term in two newspapers, The New York Times and The Guardian, between January 1980 and May 2013. For copyright reasons, we cannot include the original articles here. They were preprocessed in a similar way to what we have seen before, using another piece of software, RapidMiner 5. In addition to the term-document matrix, the type of flu discussed (seasonal versus other, that is, avian and swine flu) is included in the first column of the data frame (the SEASONAL.FLU attribute). Articles that discussed both seasonal flu and other strains were coded as other (value 0). Terms were coded as present (1) or absent (0) in each article. Let's load the dataset (make sure it is in your working directory):

Strands = read.csv("StrandsPackt.csv", header = T)

Let's examine how many articles about seasonal flu we have here:

colSums(Strands[1])

The output is 777.
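
Since the class is coded as 0 or 1, the proportion of articles about seasonal flu is simply the mean of that column. The following is a quick check (a sketch; nothing beyond the Strands data frame is assumed):

# Proportion of articles about seasonal flu (the class is coded 0/1)
mean(Strands$SEASONAL.FLU)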

Our task here will be to predict whether the flu discussed in the article is seasonal flu or another type. Before we do this, we can examine the most frequent terms associated with each class. Remember, the first column is the class, which we exclude from the document frequencies:

Seasonal = subset(Strands, SEASONAL.FLU == 1)
FreqSeasonal = colSums(Seasonal)[-1]
head(FreqSeasonal[order(FreqSeasonal, decreasing=T)], 20)

The following output shows the top 20 terms by document frequency in the articles that discuss seasonal flu:

     year     peopl    vaccin influenza 
      513       460       431       380 
   diseas       com      http       www 
      362       359       357       357 
    nytim      week       get      time 
      356       354       340       324 
    state       url    season    report 
      314       311       302       295 
     viru   control      risk    center 
      295       284       281       272 

Let's examine the other strains:

Other = subset(Strands, SEASONAL.FLU == 0)
FreqOther = colSums(Other)[-1]
head(FreqOther[order(FreqOther, decreasing=T)], 20)

The output for the other strains follows. One can see that the terms with the highest document frequency differ between the classes, although some terms, such as influenza, vaccin, and viru, feature in both lists. The limited overlap between the most frequent terms gives us hope that we will be able to classify the instances:

    peopl      viru       com       www 
     1449      1281      1108      1108 
     case     nytim       url    vaccin 
     1104      1091      1044      1043 
   offici    infect    report influenza 
     1008      1000       975       964 
     time  outbreak     world    strain 
      918       915       867       864 
   govern   countri       dai       sai 
      845       836       835       828 
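
If you prefer to quantify the overlap between the two top-20 lists rather than compare them by eye, a short sketch reusing the FreqSeasonal and FreqOther vectors computed above (the Top.seasonal and Top.other names are introduced here only for convenience):

# Terms appearing in both top-20 lists
Top.seasonal = names(head(FreqSeasonal[order(FreqSeasonal, decreasing = T)], 20))
Top.other = names(head(FreqOther[order(FreqOther, decreasing = T)], 20))
intersect(Top.seasonal, Top.other)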

What we didn't mention before is that there are thousands of terms in the term-document matrix. We are faced with the curse of dimensionality, as plain logistic regression cannot handle more features than cases. We could turn to more complex analyses, such as ridge logistic regression, but let's be creative! We are going to repeat a logistic regression analysis, say, 100 times, each time with a different random subset of predictors (say, 300 terms). We will then aggregate the resulting probabilities to determine the class of each article. This is a form of ensemble learning, which we already encountered when discussing random forests. The usefulness of ensemble learning for strong learners, such as logistic regression, is debated, yet we will see that it works pretty well here.

We will first determine training instances and build matrices to store the predictions in the training and testing sets:

set.seed(1234)
TrainCases = sample(1:nrow(Strands), 1086)
TrainPredictions = matrix(ncol=1086, nrow=100)
TestPredictions = matrix(ncol=1085, nrow=100)
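
As an aside, the ridge alternative mentioned above would handle all the terms at once because it penalizes the coefficients. The following is a minimal sketch using the glmnet package (an assumption on my part; it is not used in the rest of this section), shown only for comparison:

# Sketch of the ridge alternative: alpha = 0 requests the ridge penalty and
# cv.glmnet() chooses its strength by cross-validation
library(glmnet)
x.train = as.matrix(Strands[TrainCases, -1])
y.train = Strands$SEASONAL.FLU[TrainCases]
ridge.fit = cv.glmnet(x.train, y.train, family = "binomial", alpha = 0)
ridge.probs = predict(ridge.fit, as.matrix(Strands[-TrainCases, -1]),
   type = "response", s = "lambda.min")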

We will now run the 100 logistic regression analyses. In each iteration, we first determine which terms to exclude (we only want 300 of them) and keep the remaining columns (the class and the 300 terms). From these, we create a training set and a testing set, fit the model, and store the predictions for the training and testing sets of the current iteration in the respective matrices. Be aware that running the loop will take some time:

for (i in 1:100) {
   UNUSED = sample(2:ncol(Strands), ncol(Strands)-300)
   Strands2 = Strands[,-UNUSED]
   StrandsTrain = Strands2[TrainCases,]
   StrandsTest = Strands2[-TrainCases,]
   model = glm(SEASONAL.FLU ~ ., data = StrandsTrain,
      family = "binomial")
   TrainPredictions[i,] = t(predict(model, StrandsTrain,
      type = "response"))
   TestPredictions[i,] = t(predict(model, StrandsTest,
      type = "response"))
}

Recent versions of R display warning messages: fitted probabilities numerically 0 or 1 occurred. Inspection of the estimates (not displayed) shows that this is not a problem here, but it can be a concern in other cases; it is the analyst's job to check that everything is fine. In the present case, the warning occurs because some terms never appear in one of the classes, leading to extremely small or large odds ratios.
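
You can verify this using the frequency vectors computed earlier; the following one-liner (a sketch) counts the terms that never occur in one of the two classes:

sum(FreqSeasonal == 0 | FreqOther == 0)   # terms absent from at least one class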

Now that we have this ready, we will compute the mean of the probability predictions for each article in the training and testing datasets and assign a classification:

PredsTrain = colMeans(TrainPredictions)
PredsTrain[PredsTrain < .5] = 0
PredsTrain[PredsTrain >= .5] = 1
PredsTest = colMeans(TestPredictions)
PredsTest[PredsTest < .5] = 0
PredsTest[PredsTest >= .5] = 1

Let's first examine the reliability of the classification on the training set, using the confusionMatrix() function from the caret package (load it with library(caret) if it is not already loaded; recent versions of caret expect factors, so wrap both arguments in factor() if you get an error):

confusionMatrix(PredsTrain,StrandsTrain$SEASONAL.FLU)

The following partial output shows that the classification was almost perfect; the accuracy is above 99 percent and the kappa value is about 0.99:

Confusion Matrix and Statistics
Reference
Prediction     0    1
         0   689    5
         1     1  391
                                        
Accuracy : 0.9945        
95% CI : (0.988, 0.998)
No Information Rate : 0.6354        
P-Value [Acc > NIR] : <2e-16        
                                        
Kappa : 0.9881        

We can be very happy about the classification of the training set. But what about the testing set? Let's have a look:

confusionMatrix(PredsTest,StrandsTest$SEASONAL.FLU)

The output shows that things also went quite well in the testing set; more than 86 percent of cases are correctly classified, and the kappa value is about 0.69, which is above the 0.65 threshold we set earlier. Of course, as is usually the case, the classification of the training set was more reliable.

Confusion Matrix and Statistics

Reference
Prediction    0    1
         0  655   98
         1   49  283
                                          
Accuracy : 0.8645          
95% CI : (0.8427, 0.8843)
No Information Rate : 0.6488          
P-Value [Acc > NIR] : < 2.2e-16       
                                          
Kappa : 0.6936          
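
If caret is not available, the same counts (and the accuracy) can be obtained with base R; a minimal sketch using table():

# Cross-tabulate predictions against the reference; accuracy is the
# proportion of cases on the diagonal
tab = table(Prediction = PredsTest, Reference = StrandsTest$SEASONAL.FLU)
tab
sum(diag(tab)) / sum(tab)   # overall accuracy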

Classifying documents can be tricky and depends as much on your data as on how imaginative you are in circumventing the limitations of the algorithms you use.

Extracting the topics of the articles

We have seen that we can reliably classify whether the articles discuss seasonal flu or not. It is therefore likely that the topics discussed in the articles are different. Wouldn't it be nice to be able to get a better understanding of this in a straightforward manner? It turns out we can!

We first separate the two types of articles (discussing Seasonal versus Non.Seasonal flu) and remove the attribute that makes this distinction:

Seasonal = subset(Strands, SEASONAL.FLU == 1)[,-1]
Non.Seasonal = subset(Strands, SEASONAL.FLU == 0)[,-1]

We now need to convert these data frames to tm document-term matrices. The as.wfm() and as.dtm() functions from the qdap package will help us do that:

install.packages("qdap"); library(qdap)
seasonal.tm <- as.dtm(as.wfm(t(Seasonal)))
non.seasonal.tm <- as.dtm(as.wfm(t(Non.Seasonal)))
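
A quick sanity check on the conversion (a sketch; nothing new is assumed): each document-term matrix should have one row per article and one column per term, matching the dimensions of the original data frames:

dim(seasonal.tm)       # should equal dim(Seasonal)
dim(non.seasonal.tm)   # should equal dim(Non.Seasonal)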

We also need to install the topicmodels package and load it:

install.packages("topicmodels"); library(topicmodels)

We will skip the explanation of the inner workings of the algorithm for lack of space and simply perform a basic analysis with the default parameters.

For each document-term matrix, we will arbitrarily generate two topics. For more information about how many topics to include, the interested reader can refer to the paper How many topics? Stability analysis for topic models, by Greene and colleagues (2014).

Topics.seasonal = LDA(seasonal.tm, 2)
Topics.non.seasonal = LDA(non.seasonal.tm, 2)
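
We picked two topics arbitrarily; if you would like a rough, data-driven comparison of candidate numbers of topics, the perplexity() function in topicmodels can serve as a first pass. This is a sketch, not part of the original analysis; lower perplexity indicates a better statistical fit, though not necessarily more interpretable topics:

# Fit LDA models with different numbers of topics and compare perplexity
candidate.k = 2:5
sapply(candidate.k, function(k) perplexity(LDA(seasonal.tm, k)))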

We can now examine the top 20 terms associated with each of the topics in both document-term matrices:

Terms.seasonal = terms(Topics.seasonal, 20)
Terms.non.seasonal = terms(Topics.non.seasonal, 20)
Terms.seasonal
Terms.non.seasonal

The output of both document-term matrices is displayed side by side in the following screenshot:


Terms associated with the topics in the seasonal and non-seasonal flu articles

For seasonal flu, Topic 1 is associated with terms related to seasonality, prevention in a concrete sense (for example, vaccin or shot), and counts of flu cases. Topic 2 is also associated with terms regarding seasonality, as well as with actions against the flu in a more abstract sense (for example, protect or recommend). In the non-seasonal articles, both topics relate to the spread of the virus: the first covers reports on the virus, including its strain, the number of cases, and prevention, while the second focuses on the spread itself, with terms such as outbreak, pandem, and spread. In short, there are both similarities and differences between the articles on seasonal and non-seasonal flu strains.

Collecting news articles in R from the New York Times article search API

Searching the news and comparing information can take a huge amount of time. In this section, we will show how to simplify the process using R. First, of course, we need to download the recent news related to our interest. The fall of the Euro is currently much debated. What is the current news on the topic? How can we use text mining to learn more about the topic? Let's find out!

The first step is to install and load the tm.plugin.webmining package that will help us in this task:

install.packages("tm.plugin.webmining")
library(tm.plugin.webmining)

The package can download news from several generic sources such as Yahoo! News, Google News, The New York Times, and Reuters. Due to space restrictions, we will examine only the news from The New York Times here.
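
For reference, downloading from one of the other sources follows the same pattern; the following sketch uses GoogleNewsSource(), which ships with tm.plugin.webmining (check the documentation of your package version, as web sources change over time):

# Sketch: a comparable download from Google News with the same query
GoogleNews <- WebCorpus(GoogleNewsSource("Euro"))
GoogleNews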

For The New York Times source, we first need to obtain a developer key from http://developer.nytimes.com/. Click on Request an API key (see the top panel of the following screenshot). You will first have to register on the website and sign in. After this, you will reach the API key registration page (see the bottom panel of the screenshot). Simply enter your details and select Issue a new key for the Article Search API. Your key will be displayed on the next screen; be sure to keep it safe. You are all set.


Registering a key on the NY Times article search API

We start by downloading the news related to the Euro. The following code will download 100 articles:

nytimes_appid = "YOUR_KEY_HERE"
NYtimesNews <- WebCorpus(NYTimesSource("Euro",
   appid = nytimes_appid))
NYtimesNews

The output shows that 100 articles were downloaded, as set by default:

<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 100

You might want to save the content of the articles. This can be done simply by using the writeCorpus() function. The following code will save individual text files to the specified path (here the M:/ drive; replace it with a directory of your choice):

writeCorpus(NYtimesNews, path = "M:/")

There are cases in which you might want to add newly published articles to your corpus. This can be done using the corpus.update() function. The following line of code updates the corpus once:

updated = corpus.update(NYtimesNews)
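
A quick way to check whether anything new came in is to compare the number of documents before and after the update (a small sketch):

length(NYtimesNews)   # number of documents before the update
length(updated)       # number of documents after the update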

We can now preprocess the text as we did before:

preprocessedNews = preprocess(NYtimesNews)

We also build the term-document matrix:

tdmNews = TermDocumentMatrix(preprocessedNews)
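
If you want to save your own term-document matrix so that it can be read back in a later session with dget(), the dput() function will do; a minimal sketch (the "tdmNews" file name matches the one used below):

# Write a text representation of the term-document matrix to the working
# directory; dget("tdmNews") will read it back later
dput(tdmNews, file = "tdmNews")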

I saved my term-document matrix so that you can follow the next steps with me and get the same results (results with your own corpus will differ, as you will not have retrieved exactly the same articles as I did). To load the file, type the following:

loaded_tdm = dget("tdmNews")

We can inspect the most frequent terms (those that occur at least 100 times) as follows:

findFreqTerms(loaded_tdm, lowfreq = 100)

The following output shows the list of terms that satisfy the query. We can notably see that europ, germani, and greec (Europe, Germany, and Greece) are among the top terms. Other terms related to finance are also present, such as bank, debt, and currenc:

 [1] "bank"     "countri"  "currenc"  "debt"     "econom"   "euro"
 [7] "europ"    "european" "germani"  "govern"   "greec"    "greek"
[13] "minist"   "new"      "percent"  "said"     "union"    "will"
[19] "zone"

What word associations can we observe in the data? Let's focus on the terms bank and greek. We will ask for correlations higher than 0.5 for bank and higher than 0.45 for greek:

Assocs = findAssocs(loaded_tdm, terms = c("bank", "greek"),
   corlimit = c(0.5, 0.45))

We can display the correlations in textual form by typing the following (the output is not displayed here):

Assocs

We can also examine these relationships visually, starting with bank:

barplot(Assocs[[1]], ylim = c(0,1))

A visualization of high term correlations related to "bank"
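
The same can be done for the second term; Assocs[[2]] holds the correlations for greek (a minimal sketch):

# Visualize the term correlations for "greek"
barplot(Assocs[[2]], ylim = c(0, 1))

This produces a similar bar plot for the terms correlated with greek.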
