With the growing amount of data available from web-based sources, it is increasingly important for machine learning projects to be able to access and interact with online services. R is able to read data from online sources natively, with some caveats. Firstly, by default, R cannot access secure websites (those using the https://
rather than the http://
protocol). Secondly, it is important to note that most web pages do not provide data in a form that R can understand. The data would need to be parsed, or broken apart and rebuilt into a structured form, before it can be useful. We'll discuss the workarounds shortly.
However, if neither of these caveats applies (that is, if data are already online on a nonsecure website and in a tabular form, like CSV, that R can understand natively), then R's read.csv()
and read.table()
functions will be able to access data from the Web just as if it were on your local machine. Simply supply the full URL for the dataset as follows:
> mydata <- read.csv("http://www.mysite.com/mydata.csv")
R also provides functionality to download other files from the Web, even if R cannot use them directly. For a text file, try the readLines()
function as follows:
> mytext <- readLines("http://www.mysite.com/myfile.txt")
For other types of files, the download.file()
function can be used. To download a file to R's current working directory, simply supply the URL and destination filename as follows:
> download.file("http://www.mysite.com/myfile.zip", "myfile.zip")
Beyond this base functionality, there are numerous packages that extend R's capabilities to work with online data. The most basic of them will be covered in the sections that follow. As the Web is massive and ever-changing, these sections are far from a comprehensive set of all the ways R can connect to online data. There are literally hundreds of packages for everything from niche projects to massive ones.
For the most complete and up-to-date list of packages, refer to the regularly updated CRAN Web Technologies and Services task view at http://cran.r-project.org/web/views/WebTechnologies.html.
The RCurl
package by Duncan Temple Lang supplies a more robust way of accessing web pages by providing an R interface to the curl (client for URLs) utility, a command-line tool to transfer data over networks. The curl program acts much like a programmable web browser; given a set of commands, it can access and download the content of nearly anything available on the Web. Unlike R, it can access secure websites as well as post data to online forms. It is an incredibly powerful utility.
Precisely because it is so powerful, a complete curl tutorial is outside the scope of this chapter. Instead, refer to the online RCurl
documentation at http://www.omegahat.org/RCurl/.
After installing the RCurl
package, downloading a page is as simple as typing:
> library(RCurl)
> packt_page <- getURL("https://www.packtpub.com/")
This will save the full text of the Packt Publishing homepage (including all the web markup) into the R character object named packt_page
. As shown in the following lines, this is not very useful:
> str(packt_page, nchar.max=200)
 chr "<!DOCTYPE html> <html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\"> <head> <title>Packt Publishing | Technology Books, eBooks & Videos</title> <script> data"| __truncated__
The first 200 characters of the page look like nonsense because websites are written using Hypertext Markup Language (HTML), which combines the page text with special tags that tell web browsers how to display the text. The <title>
and </title>
tags here surround the page title, telling the browser that this is the Packt Publishing homepage. Similar tags are used to denote other portions of the page.
Though curl is the cross-platform standard for accessing online content, if you work with web data frequently in R, you may prefer the httr package by Hadley Wickham, which builds upon the foundation of RCurl to make it more convenient and R-like. We can see some of the differences immediately by attempting to download the Packt Publishing homepage using the httr
package's GET()
function:
> library(httr)
> packt_page <- GET("https://www.packtpub.com")
> str(packt_page, max.level = 1)
List of 9
 $ url        : chr "https://www.packtpub.com/"
 $ status_code: int 200
 $ headers    : List of 11
 $ all_headers: List of 1
 $ cookies    : list()
 $ content    : raw [1:58560] 3c 21 44 4f ...
 $ date       : POSIXct[1:1], format: "2015-05-24 20:46:40"
 $ times      : Named num [1:6] 0 0.000071 0.000079 ...
 $ request    : List of 5
Where the getURL()
function in RCurl
downloaded only the HTML, the GET()
function returns a list with site properties in addition to the HTML. To access the page content itself, we need to use the content()
function:
> str(content(packt_page, type="text"), nchar.max=200) chr "<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <title>Packt Publishing | Technology Books, eBooks & Videos</title> <script> data"| __truncated__
In order to use this data in an R program, it is necessary to process the page to transform it into a structured format like a list or data frame. Functions to do so are discussed in the sections that follow.
For detailed httr
documentation and tutorials, visit the project GitHub page at https://github.com/hadley/httr. The quickstart guide is particularly helpful to learn the base functionality.
Because the HTML tags of many web pages have a consistent structure, it is possible to write programs that look for the desired sections of a page and extract them, in order to compile them into a dataset. This practice of harvesting data from websites and transforming it into a structured form is known as web scraping.
Though it is frequently used, scraping should be considered a last resort for getting data from the Web. This is because any changes to the underlying HTML structure may break your code, requiring effort to fix it. Even worse, it may introduce unnoticed errors into your data. Additionally, many websites' terms of use agreements explicitly forbid automated data extraction, not to mention the fact that your program's traffic may overload their servers. Always check the site's terms before you begin your project; you may even find that the site offers its data freely via a developer agreement.
The rvest
package (a pun on the term "harvest") by Hadley Wickham makes web scraping a largely effortless process, assuming the data you want can be found in a consistent place within HTML.
Let's start with a simple example using the Packt Publishing homepage. We begin by downloading the page as before, using the html()
function in the rvest
package. Note that this function, when supplied with a URL, simply calls the GET()
function in Hadley Wickham's httr
package:
> library(rvest)
> packt_page <- html("https://www.packtpub.com")
Suppose we'd like to scrape the page title. Looking at the previous HTML code, we know that there is only one title per page wrapped within <title>
and </title>
tags. To pull the title, we supply the tag name to the html_node()
function, as follows:
> html_node(packt_page, "title") <title>Packt Publishing | Technology Books, eBooks & Videos</title>
This keeps the HTML formatting in place, including the <title>
tags and the &amp;
code, which is the HTML designation for the ampersand symbol. To translate this into plain text, we simply run it through the html_text()
function, as shown:
> html_node(packt_page, "title") %>% html_text() [1] "Packt Publishing | Technology Books, eBooks & Videos"
Notice the use of the %>%
operator. This is known as a pipe, because it essentially "pipes" data from one function to another. The use of pipes allows the creation of powerful chains of functions to process HTML data.
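To illustrate, the two steps used above can be written as a single chain. This is simply the earlier title-scraping example restated in piped form:
> # download the page, select the title node, and extract its text
> html("https://www.packtpub.com") %>%
    html_node("title") %>%
    html_text()
[1] "Packt Publishing | Technology Books, eBooks & Videos"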
The pipe operator is a part of the magrittr
package by Stefan Milton Bache and Hadley Wickham, installed by default with the rvest
package. The name is a play on René Magritte's famous painting of a pipe (you may recall seeing it in Chapter 1, Introducing Machine Learning). For more information on the project, visit its GitHub page at https://github.com/smbache/magrittr.
Let's try a slightly more interesting example. Suppose we'd like to scrape a list of all the packages on the CRAN machine learning task view. We begin in the same way as before, by downloading the HTML page using the html() function. Since we don't know how the page is structured, we'll also peek at the HTML by typing cran_ml
, the name of the R object we created:
> library(rvest)
> cran_ml <- html("http://cran.r-project.org/web/views/MachineLearning.html")
> cran_ml
Looking over the output, we find that one section appears to have the data we're interested in. Note that only a subset of the output is shown here:
<h3>CRAN packages:</h3>
  <ul>
    <li><a href="../packages/ahaz/index.html">ahaz</a></li>
    <li><a href="../packages/arules/index.html">arules</a></li>
    <li><a href="../packages/bigrf/index.html">bigrf</a></li>
    <li><a href="../packages/bigRR/index.html">bigRR</a></li>
    <li><a href="../packages/bmrm/index.html">bmrm</a></li>
    <li><a href="../packages/Boruta/index.html">Boruta</a></li>
    <li><a href="../packages/bst/index.html">bst</a></li>
    <li><a href="../packages/C50/index.html">C50</a></li>
    <li><a href="../packages/caret/index.html">caret</a></li>
The <h3>
tags imply a header of size 3, while the <ul>
and <li>
tags refer to the creation of an unordered list and list items, respectively. The data elements we want are surrounded by <a>
tags, which are hyperlink anchor tags that link to the CRAN page for each package.
With this knowledge in hand, we can scrape the links much like we did previously. The one exception is that, because we expect to find more than one result, we need to use the html_nodes()
function to return a vector of results rather than html_node()
, which returns only a single item:
> ml_packages <- html_nodes(cran_ml, "a")
Let's peek at the result using the head()
function:
> head(ml_packages, n = 7)
[[1]]
<a href="../packages/nnet/index.html">nnet</a>

[[2]]
<a href="../packages/RSNNS/index.html">RSNNS</a>

[[3]]
<a href="../packages/rpart/index.html">rpart</a>

[[4]]
<a href="../packages/tree/index.html">tree</a>

[[5]]
<a href="../packages/rpart/index.html">rpart</a>

[[6]]
<a href="http://www.cs.waikato.ac.nz/~ml/weka/">Weka</a>

[[7]]
<a href="../packages/RWeka/index.html">RWeka</a>
As we can see in result [[6]], it looks like links to some other projects slipped in. This is because some packages are hyperlinked to additional websites; in this case, the RWeka
package is linked to both CRAN and its homepage. To exclude these results, you might chain this output to another function that could look for the /packages
string in the hyperlink.
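One rough way to do this, sketched here, is to pull each link's href attribute with the html_attr() function and keep only the links containing the /packages string; the filtering pattern is an assumption about how the task view's links are structured:
> # extract the href attribute of every anchor tag
> ml_links <- html_attr(ml_packages, "href")
> # keep only the anchors whose links point to CRAN package pages
> ml_packages_cran <- ml_packages[grepl("/packages", ml_links)]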
These are simple examples that merely scratch the surface of what is possible with the rvest
package. Using the pipe functionality, it is possible to look for tags nested within tags or specific classes of HTML tags. For these types of complex examples, refer to the package documentation.
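As a single small taste of what is possible, html_nodes() also accepts CSS-style selectors, so nested searches can be written compactly. The selector below is merely an illustrative guess at the task view's structure, not something taken from the documentation:
> # select only the anchor tags nested inside list items, then extract
> # the package names as plain text
> cran_ml %>% html_nodes("li a") %>% html_text()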
XML is a plaintext, human-readable, structured markup language upon which many document formats have been based. It employs a tagging structure in some ways similar to HTML, but is far stricter about formatting. For this reason, it is a popular online format to store structured datasets.
The XML
package by Duncan Temple Lang provides a suite of R functionality based on the popular C-based libxml2
parser to read and write XML documents. It is the grandfather of XML parsing packages in R and is still widely used.
Information on the XML
package, including simple examples to get you started quickly, can be found on the project's website at http://www.omegahat.org/RSXML/.
Recently, the xml2
package by Hadley Wickham has surfaced as an easier and more R-like interface to the libxml2
library. The rvest
package, which was covered earlier in this chapter, utilizes xml2
behind the scenes to parse HTML. Moreover, rvest
can be used to parse XML as well.
The xml2
GitHub page is found at https://github.com/hadley/xml2.
Because parsing XML is so closely related to parsing HTML, the exact syntax is not covered here. Please refer to these packages' documentation for examples.
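That said, to give a sense of how closely the workflow mirrors the HTML examples above, here is a minimal xml2 sketch; the URL and the //record XPath expression are hypothetical:
> library(xml2)
> # read_xml() parses an XML document much like html() parses a web page
> my_xml <- read_xml("http://www.mysite.com/mydata.xml")
> # nodes are then located using XPath expressions
> my_records <- xml_find_all(my_xml, "//record")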
Online applications communicate with one another using web-accessible functions known as Application Programming Interfaces (APIs). These interfaces act much like a typical website; they receive a request from a client via a particular URL and return a response. The difference is that a normal website returns HTML meant for display in a web browser, while an API typically returns data in a structured form meant for processing by a machine.
Though it is not uncommon to find XML-based APIs, perhaps the most common API data structure today is JavaScript Object Notation (JSON). Like XML, it is a standard, plaintext format, most often used for data structures and objects on the Web. The format has become popular recently due to its roots in browser-based JavaScript applications, but despite this pedigree, its utility is not limited to the Web. The ease with which JSON data structures can be understood by humans and parsed by machines makes it an appealing data structure for many types of projects.
JSON is based on a simple {key: value}
format. The { }
brackets denote a JSON object, and the key
and value
parameters denote a property of the object and that property's value. An object can have any number of properties, and the properties themselves may be objects. For example, a JSON object for this book might look something like this:
{ "title": "Machine Learning with R", "author": "Brett Lantz", "publisher": { "name": "Packt Publishing", "url": "https://www.packtpub.com" }, "topics": ["R", "machine learning", "data mining"], "MSRP": 54.99 }
This example illustrates the data types available to JSON: numeric, character, array (surrounded by [
and ]
characters), and object. Not shown are the null
and Boolean (true
or false
) values. The transmission of these types of objects from application to application, and from application to web browser, is what powers many of the most popular websites.
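For completeness, a small hypothetical addition to the book object above shows the two omitted value types:
{
  "in_print": true,
  "out_of_print_date": null
}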
For details on the JSON format, go to http://www.json.org/.
Public-facing APIs allow programs like R to systematically query websites to retrieve results in the JSON format, using packages like RCurl
and httr
. Though a full tutorial on using web APIs is worthy of a separate book, the basic process relies on only a couple of steps—it's the details that are tricky.
Suppose we wanted to query the Google Maps API to locate the latitude and longitude of the Eiffel Tower in France. We first need to review the Google Maps API documentation to determine the URL and parameters needed to make this query. We then supply this information to the httr
package's GET()
function, adding a list of query parameters in order to apply the search address:
> library(httr)
> map_search <- GET("https://maps.googleapis.com/maps/api/geocode/json",
                    query = list(address = "Eiffel Tower"))
By typing the name of the resulting object, we can see some details about the request:
> map_search
Response [https://maps.googleapis.com/maps/api/geocode/json?address=Eiffel%20Tower]
  Status: 200
  Content-Type: application/json; charset=UTF-8
  Size: 2.34 kB
{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "Eiffel Tower",
               "short_name" : "Eiffel Tower",
               "types" : [ "point_of_interest", "establishment" ]
            },
            {
...
To access the resulting JSON, which the httr
package parsed automatically, we use the content()
function. For brevity, only a handful of lines are shown here:
> content(map_search)
$results[[1]]$formatted_address
[1] "Eiffel Tower, Champ de Mars, 5 Avenue Anatole France, 75007 Paris, France"

$results[[1]]$geometry
$results[[1]]$geometry$location
$results[[1]]$geometry$location$lat
[1] 48.85837

$results[[1]]$geometry$location$lng
[1] 2.294481
To access these contents individually, simply refer to them using list syntax. The names are based on the JSON objects returned by the Google API. For instance, the entire set of results is in an object appropriately named results
and each result is numbered. In this case, we will access the formatted address property of the first result, as well as the latitude and longitude:
> content(map_search)$results[[1]]$formatted_address
[1] "Eiffel Tower, Champ de Mars, 5 Avenue Anatole France, 75007 Paris, France"

> content(map_search)$results[[1]]$geometry$location$lat
[1] 48.85837

> content(map_search)$results[[1]]$geometry$location$lng
[1] 2.294481
These data elements could then be used in an R program as desired.
On the other hand, if you would like to do a conversion to and from the JSON format outside the httr
package, there are a number of packages that add this functionality.
The rjson
package by Alex Couture-Beil was one of the earliest packages to allow R data structures to be converted back and forth from the JSON format. The syntax is simple. After installing the rjson
package, to convert from an R object to a JSON string, we use the toJSON()
function. Notice that the quote characters have been escaped using the \"
notation:
> library(rjson)
> ml_book <- list(book_title = "Machine Learning with R",
                  author = "Brett Lantz")
> toJSON(ml_book)
[1] "{\"book_title\":\"Machine Learning with R\",\"author\":\"Brett Lantz\"}"
To convert a JSON string into an R object, use the fromJSON()
function. Quotation marks in the string need to be escaped, as shown:
> ml_book_json <- "{ "title": "Machine Learning with R", "author": "Brett Lantz", "publisher": { "name": "Packt Publishing", "url": "https://www.packtpub.com" }, "topics": ["R", "machine learning", "data mining"], "MSRP": 54.99 }" > ml_book_r <- fromJSON(ml_book_json)
This results in a list structure in a form much like the original JSON:
> str(ml_book_r)
List of 5
 $ title    : chr "Machine Learning with R"
 $ author   : chr "Brett Lantz"
 $ publisher:List of 2
  ..$ name: chr "Packt Publishing"
  ..$ url : chr "https://www.packtpub.com"
 $ topics   : chr [1:3] "R" "machine learning" "data mining"
 $ MSRP     : num 55
Recently, two new JSON packages have arrived on the scene. The first, RJSONIO
, by Duncan Temple Lang was intended to be a faster and more extensible version of the rjson
package, though they are now virtually identical. A second package, jsonlite
, by Jeroen Ooms has quickly gained prominence as it creates data structures that are more consistent and R-like, especially when working with data from web APIs. Which of these packages you use is a matter of preference; all three are virtually identical in practice, as they each implement a fromJSON()
and toJSON()
function.
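As a quick sketch of how interchangeable they are, the same conversion could be performed with jsonlite, assuming the ml_book_json string defined earlier is still in memory (the object name here is arbitrary):
> library(jsonlite)
> # parse the JSON string; jsonlite simplifies arrays into R vectors and,
> # where the structure allows, nested records into data frames
> ml_book_jsonlite <- fromJSON(ml_book_json)
> # convert back to a JSON string, with indentation for readability
> toJSON(ml_book_jsonlite, pretty = TRUE)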
For more information on the potential benefits of the jsonlite
package, see: Ooms J. The jsonlite package: a practical and consistent mapping between JSON data and R objects. 2014. Available at: http://arxiv.org/abs/1403.2805