Scraping data from other online sources

Although the readHTMLTable function is very useful, sometimes the data is not structured in tables, but is available only as HTML lists. Let's demonstrate such a data format by checking all the R packages listed in the relevant CRAN Task View at http://cran.r-project.org/web/views/WebTechnologies.html, as you can see in the following screenshot:

[Screenshot: the Web Technologies CRAN Task View listing the related R packages]

So we have an HTML list of the package names, along with URLs pointing to CRAN or, in some cases, to GitHub repositories. To proceed, we first have to get acquainted a bit with the HTML source to see how we can parse it. You can do that easily in either Chrome or Firefox: just right-click on the CRAN packages heading at the top of the list and choose Inspect Element, as you can see in the following screenshot:

[Screenshot: inspecting the HTML source of the CRAN packages list in the browser's developer tools]

So we have the list of related R packages in an ul (unordered list) HTML tag, just after the h3 (level 3 heading) tag holding the CRAN packages string.

In short:

  • We have to parse this HTML file
  • Look for the third-level heading holding the search term
  • Get all the list elements from the subsequent unordered HTML list

This can be done with, for example, the XML Path Language (XPath), which provides a special syntax for selecting nodes in XML/HTML documents via queries.

Note

For more details and R-driven examples, see Chapter 4, XPath, XPointer, and XInclude of the book XML and Web Technologies for Data Sciences with R written by Deborah Nolan and Duncan Temple Lang in the Use R! series from Springer. Please see more references in the Appendix at the end of the book.

XPath can be rather ugly and complex at first glance. For example, the preceding list can be described with:

//h3[text()='CRAN packages:']/following-sibling::ul[1]/li

Let me elaborate a bit on this:

  1. We are looking for an h3 tag that has CRAN packages: as its text, so we are searching for a specific node in the whole document with this text content.
  2. The following-sibling axis then stands for all the subsequent nodes at the same hierarchy level as the chosen h3 tag.
  3. Filter to find only ul HTML tags.
  4. As we might have several of those, we select only the first of these siblings with the [1] index between the square brackets.
  5. Then we simply select all li tags (the list elements) inside that.

Let's try it in R:

> page <- htmlParse(file = 
+   'http://cran.r-project.org/web/views/WebTechnologies.html')
> res  <- unlist(xpathApply(doc = page, path =
+   "//h3[text()='CRAN packages:']/following-sibling::ul[1]/li",
+   fun  = xmlValue))

And we have the character vector of the related 118 R packages:

> str(res)
 chr [1:118] "acs" "alm" "anametrix" "AWS.tools" "bigml" ...

XPath is really powerful for selecting and searching for nodes in HTML documents, and so is xpathApply. The latter is the R wrapper around most of the XPath functionality in libxml, which makes the process rather quick and efficient. One might rather use xpathSApply instead, which tries to simplify the returned list of elements, just like sapply does compared to the lapply function. So we can also update our previous code to save the unlist call:

> res <- xpathSApply(page, path =
+   "//h3[text()='CRAN packages:']/following-sibling::ul[1]/li",
+   fun = xmlValue)

The attentive reader must have noticed that the returned object was a simple character vector, while the original HTML list also included the URLs of the aforementioned packages. Where and why did those vanish?

We can blame xmlValue for this result, which we passed instead of the default NULL as the evaluating function to extract the nodes from the original document in the xpathSApply call. This function simply extracts the raw text content of a node, dropping all child tags and attributes, which explains this behavior. What if we are rather interested in the package URLs?

Calling xpathSApply without a specified fun returns all the raw child nodes, which is of no direct help, and we shouldn't try to apply regular expressions to those. The help page of xmlValue points us to some similar functions that can be very handy with such tasks. Here we definitely want to use xmlAttrs:

> xpathSApply(page,
+   "//h3[text()='CRAN packages:']/following-sibling::ul[1]/li/a",
+   xmlAttrs, 'href')

Please note that an updated path was used here, where we now select all the a tags instead of their li parents. And, instead of the previously introduced xmlValue, we now call xmlAttrs with the 'href' extra argument. This simply extracts the href attributes of all the related a nodes.
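
If you prefer to ask for one specific attribute by name, the XML package also provides the xmlGetAttr function, which returns the value of a single named attribute of a node (with an optional default when the attribute is missing). A minimal alternative sketch, reusing the page object parsed above, might look like the following; the urls name is just an illustrative choice:

> urls <- xpathSApply(page,
+   "//h3[text()='CRAN packages:']/following-sibling::ul[1]/li/a",
+   xmlGetAttr, 'href')

If an a node happened to carry further attributes besides href, xmlGetAttr would still return only the href value, which makes the intent of the query more explicit.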

With these primitives, you will be able to fetch any publicly available data from online sources, although sometimes the implementation can end up being rather complex.

Note

On the other hand, please be sure to always consult the terms and conditions and other legal documents of all potential data sources, as fetching data is often prohibited by the copyright owner.

Besides the legal issues, it's also wise to think about fetching and crawling data from the technical point of view of the service provider. If you start to send a plethora of queries to a server without consulting its administrators beforehand, this action might be construed as a network attack and/or might result in an unwanted load on the servers. To keep it simple, always use a sane delay between your queries. This should be, for example, a 2-second pause between queries at a minimum, but it's better to check the Crawl-delay directive set in the site's robots.txt, which can be found in the root path if available. This file also contains other directives indicating whether crawling is allowed or limited. Most data provider sites also have some technical documentation on data crawling; please be sure to search for rate limits and throttling.
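
As a rough illustration of such a polite setup, here is a minimal sketch in base R: it looks for a Crawl-delay directive in the site's robots.txt (falling back to a 2-second pause when none is found) and sleeps before each request. The urls vector is a hypothetical placeholder, and the sketch assumes the site actually serves a robots.txt file:

> ## hypothetical vector of pages to fetch -- replace with your own URLs
> urls <- c('http://cran.r-project.org/web/views/WebTechnologies.html')
> ## default to a 2-second pause, or use the site's Crawl-delay if set
> robots <- readLines('http://cran.r-project.org/robots.txt')
> crawl  <- grep('^Crawl-delay:', robots, value = TRUE)
> delay  <- 2
> if (length(crawl) > 0)
+     delay <- as.numeric(sub('^Crawl-delay: *', '', crawl[1]))
> pages  <- lapply(urls, function(url) {
+     Sys.sleep(delay)  ## polite pause before each request
+     htmlParse(file = url)
+ })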

And sometimes we are simply lucky, in that someone else has already written the tricky XPath selectors or other interfaces, so we can load data from web services and homepages with the help of native R packages.
