Data ingestion

Assuming the R console is ready, let's get started. To keep the tutorial manageable, we will work with a relatively small amount of data to demonstrate how models can be constructed:

> res <- XML::readHTMLTable(paste0('http://cran.r-project.org/',
+     'web/packages/available_packages_by_name.html'), which = 1)
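
If the download succeeds, res should be a data frame with two columns: the package names and their short descriptions (stored as V1 and V2, since the HTML table has no header row). A quick sanity check can be run as follows; the output is omitted here, as the exact package count changes over time:

> class(res)
> dim(res)
> head(res, 3)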

R comes with a number of functions for reading different types of files. In this tutorial, we are going to use the tm and XML packages. If you do not have the XML package installed, it can be installed with the following command:

install.packages("XML")

In order to see the supported text file formats, we can use the getReaders function:

> getReaders()
 [1] "readDataframe"           "readDOC"
 [3] "readPDF"                 "readPlain"
 [5] "readRCV1"                "readRCV1asPlain"
 [7] "readReut21578XML"        "readReut21578XMLasPlain"
 [9] "readTagged"              "readXML"

At the time of writing this book, the snippet downloaded 12,658 R package names along with their short descriptions. A new term to get familiar with is corpus, which is essentially a collection of text documents to be included in the analysis. We can use the getSources function to see the options available for importing a corpus with the tm package:

> library(tm)
Loading required package: NLP
> getSources()
[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

A corpus can be built from the package descriptions downloaded from the R package list using any of the preceding options. Since the descriptions already live in a character vector, we can go ahead and use VectorSource:

> v <- Corpus(VectorSource(res$V2))
> v
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 12658

This step created an in-memory corpus (a SimpleCorpus) object that currently holds 12,658 package descriptions. We can use the inspect and head functions to view the documents it contains. So, in order to see the first five documents in the corpus, we can run the following command:

> inspect(head(v, 5))
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5
[1] Accurate, Adaptable, and Accessible Error Metrics for Predictive Models
[2] Access to Abbyy Optical Character Recognition (OCR) API
[3] Tools for Approximate Bayesian Computation (ABC)
[4] Data Only: Tools for Approximate Bayesian Computation (ABC)
[5] Array Based CpG Region Analysis Pipeline
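
Individual documents can also be extracted from the corpus; for example, the raw text of the first description can be retrieved with as.character (output omitted):

> as.character(v[[1]])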