In the previous chapter, we looked at different ways of building and fitting models on structured data. Unfortunately, these otherwise extremely useful methods are of no use (yet) when dealing with, for example, a pile of PDF documents. Hence, the following pages will focus on methods for dealing with non-tabular data, such as text.
Text mining is the process of analyzing natural language text; in most cases from online content, such as emails and social media streams (Twitter or Facebook). In this chapter, we are going to cover the most widely used methods of the tm package. There are further types of unstructured data as well, such as images, audio, video, and non-digital content, which we cannot discuss for the time being.
A corpus is basically a collection of text documents that you want to include in the analysis. Use the getSources function to see the available options for importing a corpus with the tm package:
> library(tm)
> getSources()
[1] "DataframeSource" "DirSource"       "ReutersSource"   "URISource"
[5] "VectorSource"
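As a minimal sketch of the simplest of these sources, a corpus can be built directly from an in-memory character vector (the example sentences below are made up for illustration):

```r
## Hypothetical example: a tiny corpus from a character vector
library(tm)
docs   <- c("R is great for text mining.",
            "The tm package handles corpora.")
corpus <- Corpus(VectorSource(docs))
```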
So, we can import text documents from a data.frame, a vector, or directly from a uniform resource identifier with the URISource function. The latter stands for a collection of hyperlinks or file paths, although this is somewhat easier to handle with DirSource, which imports all the text documents found in the referenced directory on our hard drive. By calling the getReaders function in the R console, you can see the supported text file formats:
> getReaders()
[1] "readDOC"                 "readPDF"
[3] "readPlain"               "readRCV1"
[5] "readRCV1asPlain"         "readReut21578XML"
[7] "readReut21578XMLasPlain" "readTabular"
[9] "readXML"
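Combining a source with one of these readers, a directory of files can be loaded in one call. The following sketch assumes a local "reports" folder full of PDF files (the folder name is an assumption, and readPDF also requires a PDF extraction engine, such as the xpdf tools, to be installed):

```r
## Hypothetical example: importing all PDF files from a local directory
## NB: "reports" is a made-up path; adjust it to your own data
library(tm)
pdfs <- VCorpus(DirSource("reports", pattern = "\\.pdf$"),
                readerControl = list(reader = readPDF))
```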
So, there are some nifty functions to read and parse MS Word, PDF, plain text, or XML files, among a few other formats. The readers prefixed with Reut refer to the Reuters-21578 demo corpus that is bundled with the tm package.
But let's not stick to the factory default demo files! You can see the package examples in the vignette or reference manual. As we have already fetched some textual data in Chapter 2, Getting Data from the Web, let's see how we can process and analyze that content:
> res <- XML::readHTMLTable(paste0('http://cran.r-project.org/',
+     'web/packages/available_packages_by_name.html'),
+     which = 1)
The preceding command requires a live Internet connection and might take 15-120 seconds to download and parse the referenced HTML page. Please note that the content of the downloaded HTML file might differ from what is shown in this chapter, so be prepared for slightly different output in your R session compared to what we published in this book.
So, now we have a data.frame with more than 5,000 R package names and short descriptions. Let's build a corpus from the vector source of the package descriptions, so that we can parse them further and see the most important trends in package development:
> v <- Corpus(VectorSource(res$V2))
We have just created a VCorpus (in-memory) object, which currently holds 5,880 package descriptions:
> v
<<VCorpus (documents: 5880, metadata (corpus/indexed): 0/0)>>
As the default print method (see the preceding output) shows only a concise overview of the corpus, we need to use another function to inspect the actual content:
> inspect(head(v, 3))
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
A3: Accurate, Adaptable, and Accessible Error Metrics for Predictive
Models

[[2]]
<<PlainTextDocument (metadata: 7)>>
Tools for Approximate Bayesian Computation (ABC)

[[3]]
<<PlainTextDocument (metadata: 7)>>
ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux Balance Analysis with
this package
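Beyond inspect, individual documents can be addressed with double-bracket indexing; a sketch, assuming the v corpus built above and a recent version of tm, where content extracts the raw text of a document:

```r
## Extract the text of the first document in the corpus
## (content() is provided via the NLP package that tm builds on)
content(v[[1]])
```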
Here, we can see the first three documents in the corpus, along with some metadata. Up to this point, we have not done much more than in Chapter 2, Getting Data from the Web, where we visualized a wordcloud of the expressions used in the package descriptions. But that is exactly where the journey begins with text mining!