Chapter 7. Unstructured Data

In the previous chapter, we looked at different ways of building and fitting models on structured data. Unfortunately, these otherwise extremely useful methods are of no use (yet) when dealing with, for example, a pile of PDF documents. Hence, the following pages will focus on methods to deal with non-tabular data, such as:

  • Extracting metrics from a collection of text documents
  • Filtering and parsing natural language texts (NLP)
  • Visualizing unstructured data in a structured way

Text mining is the process of analyzing natural language text, in most cases drawn from online content such as emails and social media streams (Twitter or Facebook). In this chapter, we are going to cover the most commonly used methods of the tm package. Although there is a variety of further types of unstructured data, such as images, audio, video, and non-digital content, we cannot discuss those for the time being.

Importing the corpus

A corpus is basically a collection of text documents that you want to include in the analytics. Use the getSources function to see the available options to import a corpus with the tm package:

> library(tm)
> getSources()
[1] "DataframeSource" "DirSource"  "ReutersSource"   "URISource"
[2] "VectorSource"  

So, we can import text documents from a data.frame, a vector, or directly from a uniform resource identifier with the URISource function. The latter stands for a collection of hyperlinks or file paths, although this is somewhat easier to handle with DirSource, which imports all the textual documents found in the referenced directory on our hard drive. By calling the getReaders function in the R console, you can see the supported text file formats:

> getReaders()
[1] "readDOC"                 "readPDF"                
[3] "readPlain"               "readRCV1"               
[5] "readRCV1asPlain"         "readReut21578XML"       
[7] "readReut21578XMLasPlain" "readTabular"            
[9] "readXML"    

So, there are some nifty functions to read and parse MS Word, PDF, plain text, or XML files, among a few other file formats. The readers with Reut in their names refer to the Reuters-21578 demo corpus that is bundled with the tm package.
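For instance, if you have a folder of PDF files on your hard drive, you could combine DirSource with the readPDF reader to load them all at once. The following is a minimal sketch, not taken from the book's data: the pdfs folder name is a placeholder for your own directory, and readPDF needs a PDF extraction engine (such as the pdftools package or the xpdf command-line tools) installed on your machine:

> ## Hypothetical example: build a corpus from all PDF files found in a
> ## local 'pdfs' folder (requires a PDF engine such as pdftools or xpdf)
> pdfs <- Corpus(DirSource('pdfs', pattern = '\\.pdf$'),
+     readerControl = list(reader = readPDF()))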

But let's not stick to some factory default demo files! You can see the package examples in the vignette or reference manual. As we have already fetched some textual data in Chapter 2, Getting Data from the Web, let's see how we can process and analyze that content:

> res <- XML::readHTMLTable(paste0('http://cran.r-project.org/',
+                   'web/packages/available_packages_by_name.html'),
+               which = 1)

Tip

The preceding command requires a live Internet connection and could take 15-120 seconds to download and parse the referenced HTML page. Please note that the content of the downloaded HTML file might be different from what is shown in this chapter, so please be prepared for slightly different outputs in your R session, as compared to what we published in this book.
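If you would like to avoid repeated downloads while experimenting, one option is to cache the parsed table locally. A quick sketch, where pkgs.rds is an arbitrary file name of your choice:

> ## Optional: save the downloaded table so that later runs work offline
> saveRDS(res, 'pkgs.rds')
> ## ... and restore it in a new session without hitting the network
> res <- readRDS('pkgs.rds')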

So, now we have a data.frame with more than 5,000 R package names and short descriptions. Let's build a corpus from the vector source of the package descriptions (stored in the second column, V2), so that we can parse those further and see the most important trends in package development:

> v <- Corpus(VectorSource(res$V2))

We have just created a VCorpus (in-memory) object, which currently holds 5,880 package descriptions:

> v
<<VCorpus (documents: 5880, metadata (corpus/indexed): 0/0)>>

As the default print method (see the preceding output) shows only a concise overview of the corpus, we will need to use another function to inspect the actual content:

> inspect(head(v, 3))
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
A3: Accurate, Adaptable, and Accessible Error Metrics for
Predictive Models

[[2]]
<<PlainTextDocument (metadata: 7)>>
Tools for Approximate Bayesian Computation (ABC)

[[3]]
<<PlainTextDocument (metadata: 7)>>
ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux Balance
Analysis with this package

Here, we can see the first three documents in the corpus, along with some metadata. So far, we have not done much more than in Chapter 2, Getting Data from the Web, where we visualized a word cloud of the expressions used in the package descriptions. But that's exactly where the journey begins with text mining!
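Note that inspect only prints a summary of each document. If you need the underlying text of a single document as a plain character vector, one option (a sketch, not shown in the preceding output) is:

> ## Extract the raw text of the first document in the corpus
> content(v[[1]])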
