Chapter 2. Getting Data from the Web

It happens pretty often that we want to use data in a project that is not yet available in our databases or on our disks, but can be found on the Internet. In such situations, one option might be to get the IT department or a data engineer at our company to extend our data warehouse to scrape, process, and load the data into our database as shown in the following diagram:

Getting Data from the Web

On the other hand, if we have no ETL system (to Extract, Transform, and Load data) or simply cannot wait a few weeks for the IT department to implement our request, we are on our own. This is pretty standard for data scientists, as most of the time we are developing prototypes that can later be turned into products by software developers. To this end, a variety of skills are required in the daily routine, including the following topics that we will cover in this chapter:

  • Downloading data programmatically from the Web
  • Processing XML and JSON formats
  • Scraping and parsing data from raw HTML sources
  • Interacting with APIs

Although being a data scientist was referred to as the sexiest job of the 21st century (Source: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/), most data science tasks have nothing to do with data analysis. Worse, sometimes the job seems to be boring, or the daily routine requires just basic IT skills and no machine learning at all. Hence, I prefer to call this role a data hacker instead of a data scientist, which also means that we often have to get our hands dirty.

For instance, scraping and scrubbing data is surely the least sexy part of the analysis process, but it's one of the most important steps; it is often said that around 80 percent of data analysis time is spent cleaning data. There is no sense in running the most advanced machine learning algorithm on junk data, so be sure to take the time to get useful and tidy data from your sources.

Note

This chapter will also rely on extensive use of browser developer tools alongside some R packages. These include Chrome DevTools or Firebug in Firefox. Although the steps to use these tools will be straightforward and also shown in screenshots, it's definitely worth mastering them for future use; therefore, I suggest checking out a few tutorials on these tools if you plan to fetch data from online sources regularly. Some starting points are listed in the References section of the Appendix at the end of the book.

For a quick overview and a collection of relevant R packages for scraping data from the Web and to interact with Web services, see the Web Technologies and Services CRAN Task View at http://cran.r-project.org/web/views/WebTechnologies.html.

Loading datasets from the Internet

The most obvious task is to download datasets from the Web and load those into our R session in two manual steps:

  1. Save the datasets to disk.
  2. Read those with standard functions, such as read.table, or for example foreign::read.spss to import SPSS sav files (a short sketch follows this list).
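
For reference, here is a minimal sketch of that manual workflow; the URL and file names are placeholders, and the foreign package is only needed for SPSS files:

> ## step 1: save the dataset to disk (placeholder URL and file name)
> download.file('http://example.com/dataset.csv', 'dataset.csv')
> ## step 2: read it with a standard import function
> df <- read.table('dataset.csv', sep = ',', header = TRUE)
> ## or, for an SPSS file, something like:
> df <- foreign::read.spss('dataset.sav', to.data.frame = TRUE)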

But we can often save some time by skipping the first step and loading the flat text data files directly from the URL. The following example fetches a comma-separated file from the Americas Open Geocode (AOG) database at http://opengeocode.org, which contains the government, national statistics, geological information, and post office websites for the countries of the world:

> str(read.csv('http://opengeocode.org/download/CCurls.txt'))
'data.frame':  249 obs. of  5 variables:
 $ ISO.3166.1.A2                  : Factor w/ 248 levels "AD" ...
 $ Government.URL                 : Factor w/ 232 levels ""  ...
 $ National.Statistics.Census..URL: Factor w/ 213 levels ""  ...
 $ Geological.Information.URL     : Factor w/ 116 levels ""  ...
 $ Post.Office.URL                : Factor w/ 156 levels ""  ...

In this example, we passed a hyperlink to the file argument of read.csv (a wrapper around read.table), which downloaded the text file before processing it. The url function, used by read.table in the background, supports the HTTP and FTP protocols and can also handle proxies, but it has its own limitations. For example, apart from a few exceptions on Windows, url does not support Hypertext Transfer Protocol Secure (HTTPS), which is often a must for accessing Web services that handle sensitive data.
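
Just to illustrate what happens behind the scenes with a plain HTTP source, the same connection can also be opened explicitly with url and passed to the parser; this is only a sketch, reusing the AOG file from above:

> con  <- url('http://opengeocode.org/download/CCurls.txt')
> urls <- read.csv(con)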

Note

HTTPS is not a separate protocol alongside HTTP, but instead HTTP over an encrypted SSL/TLS connection. While HTTP is considered insecure because packets travel unencrypted between the client and server, HTTPS encrypts the traffic and relies on signed, trusted certificates, so third parties cannot eavesdrop on sensitive information.

In such situations, it's wise, and used to be the only reasonable option, to install and use the RCurl package, which is an R client interface to curl: http://curl.haxx.se. Curl supports a wide variety of protocols and URI schemes and handles cookies, authentication, redirects, timeouts, and even more.

For example, let's check the U.S. Government's open data catalog at http://catalog.data.gov/dataset. Although the general site can be accessed without SSL, most of the generated download URLs follow the HTTPS URI scheme. In the following example, we will fetch the Comma Separated Values (CSV) file of the Consumer Complaint Database from the Consumer Financial Protection Bureau, which can be accessed at http://catalog.data.gov/dataset/consumer-complaint-database.

Note

This CSV file contains metadata on around a quarter of a million complaints about financial products and services received since 2011. Please note that the file is around 35-40 megabytes, so downloading it might take some time, and you would probably not want to reproduce the following example on a mobile or limited Internet connection. If the getURL function fails with a certificate error (this might happen on Windows), please provide the path of the certificate manually by options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))), or try the more recently published curl package by Jeroen Ooms or httr (an RCurl front-end) by Hadley Wickham, described later.
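
If you go with one of those alternatives, a minimal sketch with Jeroen Ooms' curl package might look like the following (assuming the package is installed); the httr approach is shown later in this chapter:

> library(curl)
> url <- 'https://data.consumerfinance.gov/api/views/x94z-ydhh/rows.csv?accessType=DOWNLOAD'
> ## open a connection over HTTPS and parse it directly
> df  <- read.csv(curl(url))
> ## or download to disk first: curl_download(url, 'complaints.csv')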

Let's see the distribution of these complaints by product type after fetching and loading the CSV file directly from R:

> library(RCurl)
Loading required package: bitops
> url <- 'https://data.consumerfinance.gov/api/views/x94z-ydhh/rows.csv?accessType=DOWNLOAD'
> df  <- read.csv(text = getURL(url))
> str(df)
'data.frame':  236251 obs. of  14 variables:
 $ Complaint.ID        : int  851391 851793 ...
 $ Product             : Factor w/ 8 levels ...
 $ Sub.product         : Factor w/ 28 levels ...
 $ Issue               : Factor w/ 71 levels "Account opening ...
 $ Sub.issue           : Factor w/ 48 levels "Account status" ...
 $ State               : Factor w/ 63 levels "","AA","AE",..
 $ ZIP.code            : int  14220 64119 ...
 $ Submitted.via       : Factor w/ 6 levels "Email","Fax" ...
 $ Date.received       : Factor w/ 897 levels  ...
 $ Date.sent.to.company: Factor w/ 847 levels "","01/01/2013" ...
 $ Company             : Factor w/ 1914 levels ...
 $ Company.response    : Factor w/ 8 levels "Closed" ...
 $ Timely.response.    : Factor w/ 2 levels "No","Yes" ...
 $ Consumer.disputed.  : Factor w/ 3 levels "","No","Yes" ...
> sort(table(df$Product))

      Money transfers         Consumer loan              Student loan 
                  965                  6564                      7400 
      Debt collection      Credit reporting   Bank account or service 
                24907                 26119                     30744 
          Credit card              Mortgage 
                34848                104704

Although it's nice to know that most complaints were received about mortgages, the point here was to use curl to download the CSV file from an HTTPS URI and then pass the content to the read.csv function (or any other parser we discussed in the previous chapter) as text.

Note

Besides GET requests, you can easily interact with RESTful API endpoints via POST, DELETE, or PUT requests as well by using the postForm function from the RCurl package, or the httpDELETE, httpPUT, or httpHEAD functions; see details about the httr package later.
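
As a rough sketch of what such requests look like, the following example sends two dummy form fields via POST to the httpbin.org testing service (a placeholder for a real API endpoint), and then issues a DELETE request against the same service:

> library(RCurl)
> postForm('https://httpbin.org/post', name = 'foo', value = 'bar')
> httpDELETE('https://httpbin.org/delete')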

Curl can also help to download data from a secured site that requires authorization. The easiest way to do so is to log in to the site in a browser, save the cookie to a text file, and then pass its path to the cookiefile option in getCurlHandle. You can also specify useragent among other options, as sketched below. Please see http://www.omegahat.org/RCurl/RCurlJSS.pdf for more details and an overall (and very useful) overview of the most important RCurl features.
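
A minimal sketch of that setup follows; cookies.txt, the user agent string, and the URL are all placeholders:

> handle <- getCurlHandle(cookiefile = 'cookies.txt',
+                         useragent  = 'Mozilla/5.0 (compatible; R)')
> csv    <- getURL('https://example.com/protected/data.csv', curl = handle)
> df     <- read.csv(text = csv)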

Although curl is extremely powerful, the syntax and the numerous options with the technical details might be way too complex for those without a decent IT background. The httr package is a simplified wrapper around RCurl with some sane defaults and much simpler configuration options for common operations and everyday actions.

For example, cookies are handled automatically by sharing the same connection across all requests to the same website; error handling is much improved, which means easier debugging if something goes wrong; and the package comes with various helper functions to, for instance, set headers, use proxies, and easily issue GET, POST, PUT, DELETE, and other methods. What's more, it also handles authentication in a much more user-friendly way, along with OAuth support.
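
To give a feel for this simpler syntax, here is a minimal httr sketch that re-fetches the Consumer Complaint Database CSV used above; the custom User-Agent header is only there to illustrate the helper functions:

> library(httr)
> resp <- GET('https://data.consumerfinance.gov/api/views/x94z-ydhh/rows.csv?accessType=DOWNLOAD',
+             user_agent('R (httr)'))
> stop_for_status(resp)
> df <- read.csv(text = content(resp, as = 'text'))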

Note

OAuth is the open standard for authorization with the help of intermediary service providers. This simply means that the user does not have to share actual credentials, but can rather delegate rights to access some of the information stored at the service provider. For example, one can authorize Google to share one's real name, e-mail address, and so on with a third party without disclosing any other sensitive information or any need for passwords. Most generally, OAuth is used for password-less login to various Web services and APIs. For more information, please see Chapter 14, Analyzing the R Community, where we will use OAuth with Twitter to authorize the R session for fetching data.
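
Just to show how the pieces fit together, here is a heavily abridged sketch of httr's OAuth 1.0 helpers with placeholder Twitter credentials; the full, working workflow is covered in Chapter 14, Analyzing the R Community:

> library(httr)
> app   <- oauth_app('twitter', key = 'YOUR_CONSUMER_KEY',
+                    secret = 'YOUR_CONSUMER_SECRET')
> token <- oauth1.0_token(oauth_endpoints('twitter'), app)
> GET('https://api.twitter.com/1.1/account/verify_credentials.json',
+     config(token = token))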

But what if the data is not available for download as CSV files?
