Reading data from HTML tables

In line with the traditional document formats of the World Wide Web, most text and data are served in HTML pages. We can often find interesting pieces of information in, for example, HTML tables, from which it's pretty easy to copy and paste the data into an Excel spreadsheet, save that to disk, and load it into R afterwards. But this takes time, it's boring, and can be automated anyway.

Such HTML tables can be easily generated with the help of the aforementioned API of the Customer Complaint Database. If we do not set the required output format, for which we used XML or JSON earlier, then the browser returns an HTML table instead, as you can see in the following screenshot:

[Screenshot: the Customer Complaint Database API response rendered as an HTML table in the browser]

Well, in the R console it's a bit more complicated: browsers send some non-default HTTP headers that curl does not, so the preceding URL would simply return a JSON list. To get HTML, we have to let the server know that we expect HTML output. To do so, simply set the appropriate HTTP header of the query:

> library(RCurl)  # getURL is provided by RCurl; u holds the API's base URL defined earlier
> doc <- getURL(paste0(u, '/25ei-6bcr/rows?max_rows=5'),
+   httpheader = c(Accept = "text/html"))
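
If everything went well, doc now holds the raw HTML source as a single character string. As a quick, hedged sanity check (the exact markup depends on the server, but a table tag should be present if we got HTML), we can look for one in the response:

> grepl('<table', doc, fixed = TRUE)  # TRUE if we indeed received HTML instead of JSON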

The XML package provides an extremely easy way to parse all the HTML tables from a document or specific nodes with the help of the readHTMLTable function, which returns a list of data.frames by default:

> library(XML)
> res <- readHTMLTable(doc)

To get only the first table on the page, we can either filter res afterwards or pass the which argument to readHTMLTable. The following two R expressions yield the very same result:

> df <- res[[1]]
> df <- readHTMLTable(doc, which = 1)
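
As a quick sanity check, assuming the preceding query succeeded, we can verify that the two approaches indeed produce the same object:

> identical(res[[1]], readHTMLTable(doc, which = 1))  # should return TRUE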

Reading tabular data from static Web pages

Okay, so far we have seen a bunch of variations on the same theme, but what if we cannot find a downloadable dataset in any popular data format? For example, one might be interested in the list of available R packages hosted on CRAN, which is available at http://cran.r-project.org/web/packages/available_packages_by_name.html. How do we scrape that? There is no need to call RCurl or to specify custom headers, nor even to download the file first; it's enough to pass the URL to readHTMLTable:

> res <- readHTMLTable('http://cran.r-project.org/web/packages/available_packages_by_name.html')

So readHTMLTable can fetch HTML pages directly; it then extracts all the HTML tables into data.frame R objects and returns a list of those. In the preceding example, we got a list with only one data.frame, holding all the package names and descriptions as its columns.
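
Before processing the table any further, it's worth a quick look at its size and content. A minimal sketch follows; no output is shown here, as the actual numbers depend on the current state of CRAN:

> pkgs <- res[[1]]
> dim(pkgs)         # the number of packages listed, and the two columns
> head(pkgs[, 1])   # the first few package names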

Well, simply inspecting this amount of textual information with, for example, the str function is not really enlightening. For a quick example of processing and visualizing this type of raw data, and to present the plethora of features available through R packages on CRAN, we can now create a word cloud of the package descriptions with some nifty functions from the wordcloud and tm packages:

> library(wordcloud)
Loading required package: Rcpp
Loading required package: RColorBrewer
> wordcloud(res[[1]][, 2])
Loading required package: tm

This short command results in the plot shown in the following screenshot, which displays the most frequent words found in the R package descriptions. The position of the words has no special meaning, but the larger the font size, the higher the frequency. Please see the technical description of the plot following the screenshot:

[Screenshot: word cloud of the most frequent terms in the CRAN package descriptions]

So we simply passed all the strings from the second column of the first list element to the wordcloud function, which automatically runs a few text-mining routines from the tm package on the text. You can find more details on this topic in Chapter 7, Unstructured Data. Then it renders the words with a relative size weighted by the number of occurrences in the package descriptions. It seems that R packages are indeed primarily targeted at building models and applying multivariate tests on data; a rough sketch of this processing pipeline follows below.
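
For the curious, here is a minimal sketch of roughly what happens under the hood when wordcloud is handed raw text, spelled out with the tm functions it relies on. This is an illustration under assumptions, not the package's exact internals, and the min.freq cutoff of 25 is an arbitrary value chosen here just to keep the plot readable:

> library(tm)
> corpus <- Corpus(VectorSource(as.character(res[[1]][, 2])))
> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeWords, stopwords('english'))
> tdm <- TermDocumentMatrix(corpus)
> # count how often each term occurs across all descriptions
> # (slam::row_sums avoids converting the sparse matrix to a dense one)
> freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
> wordcloud(names(freq), freq, min.freq = 25)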
