In this section, we discuss news mining in R. We start with a document classification example and then discuss how to collect news articles directly from R.
In this section, we examine a dataset featuring a term-document matrix of 2,071 press articles containing the word flu in their title. The articles were retrieved from LexisNexis using this search term in two newspapers, The New York Times and The Guardian, between January 1980 and May 2013. For copyright reasons, we cannot include the original articles here. They were preprocessed in a similar way to what we have seen before, using another software package, RapidMiner 5. In addition to the term-document matrix, the type of flu discussed (seasonal versus other, that is, avian and swine flu) is included in the first column of the data frame (the SEASONAL.FLU attribute). Articles that discussed both seasonal flu and other strands were coded as other (value 0). Terms were coded as present (1) or absent (0) in each article. Let's load the dataset (make sure it is in your working directory):
Strands = read.csv("StrandsPackt.csv", header = TRUE)
Let's examine how many articles about seasonal flu we have here:
colSums(Strands[1])
The output is 777.
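The same count can be reproduced on any 0/1 class column; here is a minimal sketch with a made-up data frame (the toy values below are invented for illustration, not the real dataset):

```r
# Toy stand-in for the Strands data frame: first column is the class,
# remaining columns are binary term indicators
Toy = data.frame(SEASONAL.FLU = c(1, 0, 1, 1, 0),
                 vaccin       = c(1, 0, 1, 0, 0),
                 outbreak     = c(0, 1, 0, 0, 1))

# Summing a 0/1 column counts the articles coded 1 (here, seasonal flu)
colSums(Toy[1])

# table() gives both class counts at once
table(Toy$SEASONAL.FLU)
```

This works because summing a binary indicator is equivalent to counting its 1s.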
Our task here will be to predict whether the flu discussed in the article is seasonal flu or another type. Before we do this, we can examine the most frequent terms associated with each class. Remember, the first column is the class, which we exclude from the document frequencies:
Seasonal = subset(Strands, SEASONAL.FLU == 1)
FreqSeasonal = colSums(Seasonal)[-1]
head(FreqSeasonal[order(FreqSeasonal, decreasing = TRUE)], 20)
The following output shows the top 20 terms in document frequency in the articles that discuss seasonal flu:
     year     peopl    vaccin influenza
      513       460       431       380
   diseas       com      http       www
      362       359       357       357
    nytim      week       get      time
      356       354       340       324
    state       url    season    report
      314       311       302       295
     viru   control      risk    center
      295       284       281       272
Let's examine other strands:
Other = subset(Strands, SEASONAL.FLU == 0)
FreqOther = colSums(Other)[-1]
head(FreqOther[order(FreqOther, decreasing = TRUE)], 20)
The output for these strands follows. One can easily see that the terms with the highest document frequency differ between the classes. Yet, some terms are featured in both, such as influenza, vaccin, and viru. The low overlap of the most frequent terms gives us hope that we will be able to classify the instances:
   peopl     viru      com      www
    1449     1281     1108     1108
    case    nytim      url   vaccin
    1104     1091     1044     1043
  offici   infect   report influenza
    1008     1000      975      964
    time outbreak    world   strain
     918      915      867      864
  govern  countri      dai      sai
     845      836      835      828
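The overlap between two top-term lists can be checked directly with intersect(); a sketch on a toy binary matrix (the terms and counts below are invented):

```r
# Toy binary term matrix: first column is the class
Toy = data.frame(SEASONAL.FLU = c(1, 1, 0, 0),
                 vaccin       = c(1, 1, 1, 0),
                 season       = c(1, 1, 0, 0),
                 outbreak     = c(0, 0, 1, 1))

# Document frequencies per class, dropping the class column
FreqSeasonal = colSums(subset(Toy, SEASONAL.FLU == 1))[-1]
FreqOther    = colSums(subset(Toy, SEASONAL.FLU == 0))[-1]

# Terms among the top 2 of each class
TopSeasonal = names(head(sort(FreqSeasonal, decreasing = TRUE), 2))
TopOther    = names(head(sort(FreqOther, decreasing = TRUE), 2))

# Terms frequent in both classes
intersect(TopSeasonal, TopOther)
```

On the real data, comparing the two top-20 lists this way quantifies how much the classes share their vocabulary.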
What we didn't mention before is that there are thousands of terms included in the term-document matrix. We are faced with an example of the curse of dimensionality, as logistic regression cannot handle more features than cases. We could use more complex analyses, such as ridge logistic regression. But let's be creative! We are going to use plain logistic regression, repeating the analysis 100 times, each time with a different random set of 300 predictors. We will then aggregate the resulting probabilities to determine the class of each article. This is a form of ensemble learning, which we already encountered when discussing random forests. The usefulness of ensemble learning for strong learners such as logistic regression is debated, yet we will see that it works pretty well here.
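This random-subspace idea, averaging many logistic regressions each fit on a random subset of predictors, can be sketched on simulated data. Everything below (the sample sizes, the number of models, the subset size) is illustrative, not taken from the flu analysis:

```r
set.seed(42)
# Simulated data: 200 cases, 50 binary predictors,
# with the class driven by the first three predictors
n = 200; p = 50
X = matrix(rbinom(n * p, 1, 0.5), n, p)
y = as.numeric(X[, 1] + X[, 2] - X[, 3] + rnorm(n, sd = 0.5) > 0.5)
Dat = data.frame(y = y, X)

Preds = matrix(0, nrow = 20, ncol = n)
for (i in 1:20) {
  # Keep the class column plus a random subset of 10 predictors
  keep = c(1, sample(2:ncol(Dat), 10))
  fit = glm(y ~ ., data = Dat[, keep], family = "binomial")
  Preds[i, ] = predict(fit, Dat[, keep], type = "response")
}

# Average the probabilities across models, then threshold at 0.5
Final = as.numeric(colMeans(Preds) >= 0.5)
mean(Final == y)   # in-sample agreement of the ensemble
```

Models that happen to include the informative predictors pull the averaged probabilities toward the right class, while the uninformative models mostly add noise that cancels out.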
We will first determine training instances and build matrices to store the predictions in the training and testing sets:
set.seed(1234)
TrainCases = sample(1:nrow(Strands), 1086)
TrainPredictions = matrix(ncol = 1086, nrow = 100)
TestPredictions = matrix(ncol = 1085, nrow = 100)
We will now run the 100 logistic regression analyses. We will first determine which terms to exclude (we only want 300 terms), and then assign the columns we wish to keep (the class and the 300 terms). From these, we will create a training set and a testing set. We will then fit the model and assign the predictions for the testing and training sets for the current iteration in the respective matrices. Be aware that running the loop will take some time:
for (i in 1:100) {
  # Randomly pick the columns to drop, keeping the class and 300 terms
  UNUSED = sample(2:ncol(Strands), ncol(Strands) - 300)
  Strands2 = Strands[, -UNUSED]
  StrandsTrain = Strands2[TrainCases, ]
  StrandsTest = Strands2[-TrainCases, ]
  model = glm(SEASONAL.FLU ~ ., data = StrandsTrain, family = "binomial")
  TrainPredictions[i, ] = t(predict(model, StrandsTrain, type = "response"))
  TestPredictions[i, ] = t(predict(model, StrandsTest, type = "response"))
}
Recent versions of R display warning messages: fitted probabilities numerically 0 or 1 occurred. Inspection of the estimates (not displayed) shows that this is not a problem here, but it can be a concern in other cases; it is the analyst's job to check that everything works fine. In the present case, the warnings occur because some terms never appear in one of the classes, leading to extremely small or large odds ratios.
Now that we have this ready, we will compute the mean of the probability predictions for each article in the training and testing datasets and assign a classification:
PredsTrain = colMeans(TrainPredictions)
PredsTrain[PredsTrain < .5] = 0
PredsTrain[PredsTrain >= .5] = 1
PredsTest = colMeans(TestPredictions)
PredsTest[PredsTest < .5] = 0
PredsTest[PredsTest >= .5] = 1
Let's first examine the reliability of the classification on the training set, using the confusionMatrix() function from the caret package (recent versions of caret require factors):

confusionMatrix(factor(PredsTrain), factor(StrandsTrain$SEASONAL.FLU))
The following partial output shows that the classification was almost perfect; the accuracy is above 99 percent and the kappa value is about 0.98:
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 689   5
         1   1 391

               Accuracy : 0.9945
                 95% CI : (0.988, 0.998)
    No Information Rate : 0.6354
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.9881
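Accuracy and kappa can be recomputed by hand from the four cells of this matrix, which is a useful sanity check on what the statistics mean:

```r
# Cells of the training confusion matrix (rows: prediction, columns: reference)
cm = matrix(c(689, 1, 5, 391), nrow = 2)
n  = sum(cm)

# Observed agreement (accuracy): the diagonal over the total
p0 = sum(diag(cm)) / n

# Expected agreement under chance, from the row and column marginals
pe = sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled
kappa = (p0 - pe) / (1 - pe)
round(c(accuracy = p0, kappa = kappa), 4)   # 0.9945 and 0.9881
```

The result matches the caret output above, confirming how kappa discounts the agreement one would obtain by guessing from the marginals alone.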
We can be very happy about the classification of the training set. But what about the testing set? Let's have a look:
confusionMatrix(factor(PredsTest), factor(StrandsTest$SEASONAL.FLU))
The output shows that things also went quite well in the testing set; more than 86 percent of cases are correctly classified, and the kappa value is 0.69, which is above the 0.65 threshold we fixed. Of course, as in most cases, the classification of the training set had better reliability.
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 655  98
         1  49 283

               Accuracy : 0.8645
                 95% CI : (0.8427, 0.8843)
    No Information Rate : 0.6488
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.6936
Classifying documents can be tricky and depends as much on your data as on how imaginative you are in circumventing the limitations of the algorithms you use.
We have seen that we can reliably classify whether the articles discuss seasonal flu or not. It is therefore likely that the topics discussed in the articles are different. Wouldn't it be nice to be able to get a better understanding of this in a straightforward manner? It turns out we can!
We first separate the two types of articles (discussing seasonal versus non-seasonal flu) and remove the attribute that makes this distinction:
Seasonal = subset(Strands, SEASONAL.FLU == 1)[,-1]
Non.Seasonal = subset(Strands, SEASONAL.FLU == 0)[,-1]
We will need to convert these data frames to a tm document-term matrix. The as.wfm() and as.dtm() functions of the qdap package will help us do that:
install.packages("qdap"); library(qdap)
seasonal.tm <- as.dtm(as.wfm(t(Seasonal)))
non.seasonal.tm <- as.dtm(as.wfm(t(Non.Seasonal)))
We also need to install the topicmodels package and load it:
install.packages("topicmodels"); library(topicmodels)
We will skip the explanations of the inner workings of the algorithms for lack of space and simply perform a basic analysis with default parameters.
For each document-term matrix, we will arbitrarily generate two topics. For more information about how to choose the number of topics, the interested reader can refer to the paper How many topics? Stability analysis for topic models, by Greene and colleagues (2014).
Topics.seasonal = LDA(seasonal.tm, 2)
Topics.non.seasonal = LDA(non.seasonal.tm, 2)
We can now examine the top 20 terms associated with each of the topics in both document-term matrices:

Terms.seasonal = terms(Topics.seasonal, 20)
Terms.non.seasonal = terms(Topics.non.seasonal, 20)
Terms.seasonal
Terms.non.seasonal
The output of both document-term matrices is displayed side by side in the following screenshot:
For seasonal flu, Topic 1 is associated with terms related to seasonality, prevention in a concrete sense (for example, vaccin or shot), and counts of cases of the flu. Topic 2 is also associated with terms regarding seasonality, as well as with terms regarding the action that can be taken in a more abstract sense (for example, protect or recommend) with regard to the flu. In the non-seasonal articles, both topics are related to the spread of the virus, but the first topic discusses reports on the virus, including its strain, the number of cases, and prevention. The second topic focuses on the spread of the virus with terms such as outbreak, pandem, and spread. There are some similarities and differences in the articles on both seasonal and non-seasonal flu strands.
Searching the news and comparing information can take a huge amount of time. In this section, we will show how to simplify the process using R. First, of course, we need to download the recent news related to our interest. The fall of the Euro is currently much debated. What is the current news on the topic? How can we use text mining to learn more about the topic? Let's find out!
The first step is to install and load the tm.plugin.webmining package, which will help us in this task:
install.packages("tm.plugin.webmining") library(tm.plugin.webmining)
The package can download news from several generic sources such as Yahoo! News, Google News, The New York Times, and Reuters. Due to space restrictions, we will here examine the news from The New York Times only.
First, we need to obtain a developer key from http://developer.nytimes.com/. Click on Request an API key (see the top panel of the following screenshot). You will first have to register on the website and sign in. After this, you will reach the API key registration page (see the bottom panel of the screenshot). Simply enter your details and select Issue a new key for the Article Search API. Your key will be displayed on the next screen. Be sure to keep it safe. You are all set.
We start by downloading the news related to the Euro. The following code will download 100 articles:
nytimes_appid = "YOUR_KEY_HERE"
NYtimesNews <- WebCorpus(NYTimesSource("Euro", appid = nytimes_appid))
NYtimesNews
The output shows that 100 articles were downloaded, as set by default:
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 100
You might want to save the content of the articles. This can be done simply using the writeCorpus() function. The following code will save individual text files to the specified path:
writeCorpus(NYtimesNews, path = "M:/")
There are cases in which you might want to add newly published articles to your corpus. This can be done using the corpus.update() function; the following line of code could be used to do this once:
updated = corpus.update(NYtimesNews)
We can now preprocess the text as we did before:
preprocessedNews = preprocess(NYtimesNews)
We also build the term-document matrix:
tdmNews = TermDocumentMatrix(preprocessedNews)
I saved the term-document matrix so that you can follow along and get the same results (results with your own corpus will differ, as you will not have retrieved the same articles as I did). To load the file, type the following:
loaded_tdm = dget("tdmNews")
We can inspect the most frequent terms (those that occur over 100 times) like this:
findFreqTerms(loaded_tdm, low = 100)
The following output shows the list of terms that satisfy the query. We can notably see that Europe, Germany, and Greece (as the stems europ, germani, and greec) are among the top terms. Other terms related to finance are also present, such as bank, debt, and currenc:
 [1] "bank"     "countri"  "currenc"  "debt"     "econom"   "euro"
 [7] "europ"    "european" "germani"  "govern"   "greec"    "greek"
[13] "minist"   "new"      "percent"  "said"     "union"    "will"
[19] "zone"
What word associations can we observe in the data? Let's focus on the terms bank and greek. We will ask for correlations higher than 0.5 for bank and 0.45 for greek:
Assocs = findAssocs(loaded_tdm, terms = c("bank", "greek"), corlimit = c(0.5, 0.45))
We can display the correlations in textual form by typing (the output is not displayed here):
Assocs
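The associations reported by findAssocs() are Pearson correlations between term occurrence vectors; the same quantity can be obtained with cor() on a small matrix of counts (the documents and terms below are invented for illustration):

```r
# Toy term counts across 6 documents (rows: documents, columns: terms)
m = cbind(bank    = c(3, 0, 2, 0, 1, 0),
          bailout = c(2, 0, 1, 0, 1, 0),
          weather = c(0, 1, 0, 2, 0, 1))

# Pairwise Pearson correlations between the term vectors
round(cor(m), 2)

# Terms correlated with "bank" above a threshold, excluding "bank" itself
ac = cor(m)["bank", ]
names(ac[ac > 0.5 & names(ac) != "bank"])
```

Here bailout co-occurs with bank and therefore correlates strongly with it, while weather appears in different documents and does not.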
We can also examine the relationships for bank visually:
barplot(Assocs[[1]], ylim = c(0,1))