9
Scraping the Web

Having learned much about the basics of the architecture of the Web, we now turn to data collection in practice. In this chapter, we address three main aspects of web scraping with R. The first is how to retrieve data from the Web in different scenarios (Section 9.1). Recall Figure 1.4. The first part of the chapter looks at the stage where we try to get resources from servers into R. The principal technology to deal with in this step is HTTP. We offer a set of real-life scenarios that demonstrate how to use libcurl to gather data in various settings. In addition to examples based on HTTP or FTP communication, we introduce the use of web services (web application programming interfaces [APIs]) and a related authentication standard, OAuth. We also offer a solution for the problem of scraping dynamic content that we described in Chapter 6. Section 9.1.9 provides an introduction to Selenium, a browser automation tool that can be used to gather content from JavaScript-enriched pages.

The second part of the chapter turns to strategies for extracting information from gathered resources (Section 9.2). We are already familiar with the necessary technologies: regular expressions (Chapter 8) and XPath (Chapter 4). From a technology-based perspective, this corresponds to the second column of Figure 1.4. In this part we shed light on these techniques from a more practical perspective, providing a stylized sketch of the strategies and discussing their advantages and disadvantages. We also consider APIs once more. They are an ideal case of automated web data collection as they offer a seamless integration of the retrieval and extraction stages.

Whatever the level of difficulty of scraping information from the Web, the cycle of scraping remains almost always identical. The following tasks are part of most scraping exercises:

  1. Information identification
  2. Choice of strategy
  3. Data retrieval
  4. Information extraction
  5. Data preparation
  6. Data validation
  7. Debugging and maintenance
  8. Generalization

The art of scraping lies in cleverly combining and redefining these tasks, and we can only sketch out some basic principles, either theoretically (as in Section 9.2) or by examples in the set of retrieval scenarios and case studies. In the end, questions such as “Is automation efficient?,” “Is R the right tool for my web data collection work?,” and “Is my data source of choice reliable in the long run?” are project-specific and lack a generally helpful answer.

The third part of this chapter addresses an important, but sometimes disregarded aspect of web scraping. It deals with the question of how to behave nicely on the Web as a web scraper. We are convinced that the abundance of online data is something positive and opens up new ways for understanding human interactions. Whether collecting these data is inherently positive depends in no small part on (a) the behavior of data gatherers and (b) on the purpose for which data are collected. The latter point is entirely up to you. For the former point, we offer some basic advice in Section 9.3. We discuss legal implications of web scraping, show how to take robots.txt, an informal standard for web crawler behavior, into account, and offer a practical guideline for friendly web-scraping practice.

We conclude the chapter with a glimpse of ongoing efforts for giving R more interfaces with web data and on lighthouses of web scraping more generally (Section 9.4).

A final remark before we get started: This chapter is mostly about how to build special-purpose web scrapers. In our definition scrapers are programs that grab specific content from web pages. Such information could be telephone data (see Chapter 15), data on products (see Chapter 16), or political behavior (see Chapter 12). Spiders (or crawlers or web robots), in contrast, are programs that grab and index entire pages and move around the Web following every link they can find. Most scraping work involves a spidering component. In order to extract content from webpages, we usually first download them as a whole and then continue with the extraction part. In general, however, we disregard scenarios in which the goal is to wander through the Web without a specific data collection target.

9.1 Retrieval scenarios

For the following scenarios of web data retrieval, we rely on the following set of R packages which were introduced in the first part of the book. We assume that you have loaded them for the exercises. We will indicate throughout the chapter whenever we make use of additional packages.


R> library(RCurl)
R> library(XML)
R> library(stringr)

9.1.1 Downloading ready-made files

The first way to get data from the Web is almost too banal to be considered here and actually not a case of web scraping in the narrower sense. In some situations, you will find data of interest ready for download in TXT, CSV, or any other plain-text/spreadsheet or binary format like PDF, XLS, or JPEG. R can still prove useful for such simple tasks, as (a) the data acquisition process remains reproducible in principle and (b) it may save a considerable amount of time. We picked two common examples to illustrate the benefits of using R in scenarios like these.

9.1.1.1 CSV election results data

The Maryland State Board of Elections at http://www.elections.state.md.us/ provides a rich data resource on past elections. We identified a set of comma-separated value spreadsheets that comprise information on state-, county-, and precinct-level election results for the 2012 Presidential election in Maryland in one of the page's subdirectories at http://www.elections.state.md.us/elections/2012/election_data/index.html. The targeted files are accessible via “General” hyperlinks. Suppose we want to download these files for analyses.

The links to the CSV files are scattered across several tables on the page. We are only interested in some of the documents, namely those that contain the raw election results for the general election. The page provides data on the primaries and on ballot questions, too. In order to retrieve the desired files, we want to proceed in three steps.

  1. We identify the links to the desired files.
  2. We construct a download function.
  3. We execute the downloads.

The XML package provides a neat function to identify links in an HTML document—getHTMLLinks(). We introduce this and other convenience functions from the package in greater detail in Section 9.1.4.

We use getHTMLLinks() to extract all the URLs and external file names in the HTML document that we first assign to the object url. The list of links in links comprises more entries than we are interested in, so we apply the regular expression _General.csv to retrieve the subset of external file names that point to the general election result CSVs. Finally, we store the file names in a list to be able to apply a download function to this list in the next step.


R> url <- "http://www.elections.state.md.us/elections/2012/election_data/index.html"

R> links <- getHTMLLinks(url)

R> filenames <- links[str_detect(links, "_General.csv")]

R> filenames_list <- as.list(filenames)
R> filenames_list[1:3]
[[1]]
[1] "http://www.elections.state.md.us/elections/2012/election_data/State_Legislative_Districts_2012_General.csv"

[[2]]
[1] "http://www.elections.state.md.us/elections/2012/election_data/Allegany_County_2012_General.csv"

[[3]]
[1] "http://www.elections.state.md.us/elections/2012/election_data/Allegany_By_Precinct_2012_General.csv"

Next, we set up a function to download all the files and call the function downloadCSV(). The function wraps around the base R function download.file(), which is perfectly sufficient to download URLs or other files in standard scenarios. Our function has three arguments. filename refers to each of the entries in the filenames_list object. baseurl specifies the source path of the files to be downloaded. Along with the file names, we can thus construct the full URL of each file. We do this using str_c() and feed the result to the download.file() function. The second argument of the function is the destination on our local drive. We determine a folder where we want to store the CSV files and add the file name parameter. We tweak the download by adding (1) a condition which ensures that the file download is only performed if the file does not already exist in the folder using the file.exists() function and (2) a pause of 1 second between each file download. We will motivate these tweaks later in Section 9.3.3.


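A minimal sketch of such a function, following the description above (the dir.create() call is our addition to make sure the target folder exists), might look as follows:

downloadCSV <- function(filename, baseurl, folder) {
    # create the target folder if it does not exist yet
    dir.create(folder, showWarnings = FALSE)
    fileurl <- str_c(baseurl, filename)
    # download only if the file is not already stored locally
    if (!file.exists(str_c(folder, "/", filename))) {
        download.file(fileurl, destfile = str_c(folder, "/", filename))
        Sys.sleep(1)  # pause of 1 second between downloads
    }
}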

We apply the function to the list of CSV file names filenames_list using l_ply() from the plyr package. The function takes a list as main argument and passes each list element as argument to the specified function, in our case downloadCSV(). We can pass further arguments to the function. For baseurl we identify the path where all CSVs are located. With folder we select the local folder where we want to store the files.


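A call along these lines could look as follows; the base URL is taken from the address of the index page and is an assumption on our part:

R> library(plyr)
R> l_ply(filenames_list, downloadCSV,
         baseurl = "http://www.elections.state.md.us/elections/2012/election_data/",
         folder = "elec12_maryland")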

To check the results, we consider the number of downloaded files and the first couple of entries.


R> length(list.files("./elec12_maryland"))
[1] 68
R> list.files("./elec12_maryland")[1:3]
[1] "Allegany_By_Precinct_2012_General.csv"
[2] "Allegany_County_2012_General.csv"
[3] "Anne_Arundel_By_Precinct_2012_General.csv"

Sixty-eight CSV files have been added to the folder. We could now proceed with an analysis by importing the files into R using read.csv(). The web scraping task is thus completed and could easily be replicated with data on other elections stored on the website.

9.1.1.2 PDF legislative district maps

download.file() frequently does not provide the functionality we need to download files from certain sites. In particular, download.file() does not support data retrieval via HTTPS by default and is not capable of dealing with cookies or many other advanced features of HTTP. In such situations, we can switch to RCurl’s high-level functions which can easily handle problems like these—and offer further useful options.

As a showcase we try to retrieve PDF files of the 2012 Maryland legislative district maps, complementing the voting data from above. The maps are available at the Maryland Department of Planning's website: http://planning.maryland.gov/Redistricting/2010/legiDist.shtml.1 The targeted PDFs are accessible in a three-column table at the bottom right of the screen and named “1A,” “1B,” and so on. We reuse the download procedure from above, but specify a different base URL and regular expression to detect the desired files.


R> url <- "http://planning.maryland.gov/Redistricting/2010/legiDist.shtml"
R> links <- getHTMLLinks(url)
R> filenames <- links[str_detect(links, "2010maps/Leg/Districts_")]
R> filenames_list <- str_extract_all(filenames, "Districts.+pdf")

The download function downloadPDF() now relies on getBinaryURL(). We allow for the use of a curl handle. We cannot specify a destination file in the getBinaryURL() function, so we store the raw data in a pdffile object first and then pass it to writeBin(). This function writes the PDF files to the specified folder. The other components of the function remain the same.


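A sketch of downloadPDF(), mirroring downloadCSV() but using getBinaryURL() and writeBin() as described:

downloadPDF <- function(filename, baseurl, folder, handle) {
    dir.create(folder, showWarnings = FALSE)
    fileurl <- str_c(baseurl, filename)
    if (!file.exists(str_c(folder, "/", filename))) {
        # fetch the file as raw bytes and write them to disk
        pdffile <- getBinaryURL(fileurl, curl = handle)
        writeBin(pdffile, str_c(folder, "/", filename))
        Sys.sleep(1)
    }
}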

We execute the function with a handle that adds a User-Agent and a From header field to every call and keeps the connection alive. We could specify further options if we had to deal with cookies or other HTTP specifics.

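The call might look as follows. The header values are placeholders, the base URL of the map files is an assumption, and keeping the connection alive is libcurl's default behavior, so no extra option is needed:

R> handle <- getCurlHandle(httpheader = c(From = "your@email.com"),
                           useragent = str_c(R.version$version.string,
                                             R.version$platform, sep = ", "))
R> l_ply(filenames_list, downloadPDF,
         baseurl = "http://planning.maryland.gov/PDF/Redistricting/2010maps/Leg/",
         folder = "elec12_maryland_maps",
         handle = handle)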

Again, we examine the results by checking the number of files in the folder and the first couple of results.


R> length(list.files("./elec12_maryland_maps"))
[1] 68

R> list.files("./elec12_maryland_maps")[1:3]
[1] "Districts_10.pdf" "Districts_11.pdf" "Districts_12.pdf"

Everything seems to have worked out fine—68 PDF files have been downloaded. The bottom line of this exercise is that downloading plain-text or binary files from a website is one of the easiest tasks. The core tools are download.file() and RCurl’s high-level functions. getHTMLLinks() from the XML package often does a good job of identifying the links to single files, especially when they are scattered across a document.

9.1.2 Downloading multiple files from an FTP index

We have introduced an alternative network protocol to HTTP for pure file transfer, the File Transfer Protocol (FTP) in Section 5.3.2. Downloading files from FTP servers is a rewarding task for data wranglers because FTP servers host files, nothing else. We do not have to care about getting rid of HTML layout or other unwanted information. Again, RCurl is well-suited to fetch files via FTP.

Let us have a look at the CRAN FTP server to see how this works. The server has the URL ftp://cran.r-project.org/. It stores a lot of R-related data, including older R versions, CRAN task views, and all CRAN packages. Say we want to download all CRAN task view HTML files for closer inspection. They are stored at ftp://cran.r-project.org/pub/R/web/views/. Our downloading strategy is similar to the one in the last scenario.

  1. We identify the desired files.
  2. We construct a download function.
  3. We execute the downloads.

In order to load the FTP directory list into R, we assign the URL to ftp. Next, we save the list of file names to the object ftp_files with getURL().2 By setting the libcurl option dirlistonly to TRUE, we ensure that only the file names are fetched, but no further information about file size or creation date.


R> ftp <- "ftp://cran.r-project.org/pub/R/web/views/"
R> ftp_files <- getURL(ftp, dirlistonly = TRUE)

It is sometimes the case that the default FTP mode in libcurl, extended passive (EPSV), does not work with some FTP servers. In this case, we have to add the ftp.use.epsv = FALSE option. In our example, we have successfully downloaded the list of files and stored it in a character vector, ftp_files. The information is still cluttered with line feed and carriage return representations, however, and still contains CTV files.


R> ftp_files
[1] "Bayesian.ctv\r\nBayesian.html\r\nChemPhys.ctv\r\nChemPhys.html\r\n..."

To get rid of the line feeds and carriage returns, we use them as splitting pattern for str_split(). We also apply a regular expression to select only the HTML files with str_extract_all().


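A sketch of these two steps:

R> ftp_files <- str_split(ftp_files, "\r\n")[[1]]
R> filenames_html <- unlist(str_extract_all(ftp_files, ".+\\.html"))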

An equivalent, but more elegant way to get only the HTML files would be


R> filenames_html <- getURL(ftp, customrequest = "NLST *.html")
R> filenames_html <- str_split(filenames_html, "\r\n")[[1]]

This way we pass the FTP command NLST *.html to our function. This returns a list of file names in the FTP directory that end in .html. We thus exploit the libcurl option customrequest that allows changing the request method and do not have to extract the HTML files ex post.3

In the last step, we construct a function downloadFTP() that fetches the desired files from the FTP server and stores them in a specified folder. It basically follows the syntax of the downloadPDF() function from the previous section.

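A sketch of such a function, analogous to downloadPDF():

downloadFTP <- function(filename, baseurl, folder, handle) {
    dir.create(folder, showWarnings = FALSE)
    fileurl <- str_c(baseurl, filename)
    if (!file.exists(str_c(folder, "/", filename))) {
        content <- getURL(fileurl, curl = handle)
        write(content, str_c(folder, "/", filename))
        Sys.sleep(1)
    }
}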

We set up a handle that disables FTP-extended passive mode and download the CRAN task view HTML documents to the cran_tasks folder.

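A sketch of this step:

R> handle <- getCurlHandle(ftp.use.epsv = FALSE)
R> l_ply(filenames_html, downloadFTP,
         baseurl = "ftp://cran.r-project.org/pub/R/web/views/",
         folder = "cran_tasks",
         handle = handle)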

A quick inspection of our newly created folder reveals that the files were successfully downloaded.


It is also possible to upload data to an FTP server. As we do not have any rights to upload content to the CRAN server, we offer a fictional example.


R> ftpUpload(what = "example.txt", to = "ftp://example.com/",
             userpwd = "username:password")

To get a taste of the good old FTP times where there was no more than just data and directories, visit http://www.search-ftps.com/ or http://www.filesearching.com/ to search for existing archives. What you will find might occasionally be content of dubious quality, however.

9.1.3 Manipulating URLs to access multiple pages

We usually care little about the web addresses of the sites we visit. Sometimes we might access a web page by entering a URL into our browser, but more frequently we come to a site through a search engine. Either way, once we have accessed a particular site we move around by clicking on links, but do not take note of the fact that the URL changes when accessing the various sites on the same server. We already know that directories on a web server are comparable to the folders on our local hard drive. Once we realize that the directories of a website follow a specific system, we can exploit this fact in web scraping by manipulating the URL of a site. Compared with other retrieval strategies, URL manipulation is a “quick and dirty” approach, as we usually do not care about the internal mechanisms that create URLs (e.g., GET forms).

Imagine we would like to collect all press releases from the organization Transparency International. Check out the organization's press releases under the heading “News” at http://www.transparency.org/news/pressreleases/. Now select the year 2011 from the drop-down menu. Notice how the statement year/2011 is appended to the URL. We can apply this observation and call up the press releases from 2010 by changing the figure in the URL. As expected, the browser now displays all press releases from 2010, starting with releases in late December. Notice how the webpage displays 10 hits for each search. Click on “Next” at the bottom of the page. We find that the URL is appended with the statement P10. Apparently, we are able to select specific results pages by using multiples of 10. Let us try this by choosing the fourth page of the 2010 press releases by selecting the directory http://www.transparency.org/news/pressreleases/year/2010/P30. In fact, we can wander through the pages by manipulating the URL instead of clicking on HTML buttons.

Now let us capitalize on these insights and implement them in a small scraper. We proceed in five steps.

  1. We identify the running mechanism in the URL syntax.
  2. We retrieve links to the running pages.
  3. We download the running pages.
  4. We retrieve links to the entries on the running pages.
  5. We download the single entries.

We begin by constructing a function that returns a list of URLs for every page in the index. We have already identified the running mechanism in the URL syntax—a P and a multiple of 10 is attached to the base URL for every page other than the first one. To know how many of these pages exist, we retrieve the total number of pages from the bottom line on the base page, which reads “Page x of X”. “X” is the total number of pages. We fetch the number with the XPath command //div[@id='Page']/strong[2] and use the result (total_pages) to construct a vector add_url with string additions to the base URL. The first entries are stored on the base URL page, which does not need an addition. Therefore, we construct X − 1 snippets to be added to the base URL. We multiply this number by 10 and store it in max_url, as the index runs from 10 to (X − 1) * 10 in steps of 10 rather than from 1 to X − 1. We then merge the sequence 10, 20, …, max_url with /P and store the resulting snippets in the object add_url.


Next, we construct the full URLs and put them in a list. To fetch entries from the first page as well, we add the base URL to the list. Everything is wrapped into a function getPageURLs() that returns the URLs of single index pages as a list.

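A sketch of getPageURLs(), following the description above:

getPageURLs <- function(url) {
    baseurl <- htmlParse(url)
    # total number of index pages ("Page x of X")
    total_pages <- as.numeric(xpathSApply(baseurl,
                                          "//div[@id='Page']/strong[2]",
                                          xmlValue))
    max_url <- (total_pages - 1) * 10
    add_url <- str_c("/P", seq(10, max_url, 10))
    urls_list <- as.list(str_c(url, add_url))
    # the first results are stored on the base URL itself
    urls_list[length(urls_list) + 1] <- url
    return(urls_list)
}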

Applying the function yields


R> url <- "http://www.transparency.org/news/pressreleases/year/2010"
R> urls_list <- getPageURLs(url)
R> urls_list[1:3]
[[1]]
[1] "http://www.transparency.org/news/pressreleases/year/2010/P10"

[[2]]
[1] "http://www.transparency.org/news/pressreleases/year/2010/P20"

[[3]]
[1] "http://www.transparency.org/news/pressreleases/year/2010/P30"

In the third step, we construct a function to download each index page. The function takes the returned list from getPageURLs(), extracts the file names, and writes the HTML pages to a local folder.

Notice that we have to add a file name for the base URL index manually because the regular expression "/P.+", which identifies the file names, does not apply here. This is done in the fourth line of the function. As usual, the download is conducted with getURL().

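A sketch of such a download function; the check in the fourth line assigns a file name to the base URL, for which the regular expression returns NA:

downloadPages <- function(pageurl, folder) {
    dir.create(folder, showWarnings = FALSE)
    page_name <- str_c(str_extract(pageurl, "/P.+"), ".html")
    if (is.na(str_extract(pageurl, "/P.+"))) { page_name <- "/base.html" }
    if (!file.exists(str_c(folder, page_name))) {
        content <- getURL(pageurl)
        write(content, str_c(folder, page_name))
        Sys.sleep(1)
    }
}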

We perform the download with l_ply(), passing each element of the list of index page URLs to the download function.

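A sketch of the call, using the folder name that reappears below:

R> l_ply(urls_list, downloadPages, folder = "tp_index_2010")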

Sixteen files have been downloaded. Now we parse the downloaded index files to identify the links to the individual press releases. The getPressURLs() function works as follows. First, we parse the documents into a list. We retrieve all links in the documents using getHTMLLinks(). Finally, we extract only those links that refer to one of the press releases. To do so, we apply the regular expression "http.+/pressrelease/", which uniquely identifies the releases, and store them in a list.

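A sketch of getPressURLs():

getPressURLs <- function(folder) {
    pages_parsed <- lapply(str_c(folder, "/", dir(folder)), htmlParse)
    urls <- unlist(lapply(pages_parsed, getHTMLLinks))
    press_urls <- urls[str_detect(urls, "http.+/pressrelease/")]
    return(as.list(press_urls))
}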

Applying the function we retrieve a list of links to roughly 150 press releases.


R> press_urls_list <- getPressURLs(folder = "tp_index_2010")
R> length(press_urls_list)
[1] 152

The press releases are downloaded in the last step. The function works similarly to the one that downloaded the index pages. Again, we first retrieve the file names of the press releases based on the full URLs. We apply the rather nasty regular expression [^/][[:alnum:]_.]+$. We download the press release files with getURL() and store them in the created folder.

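A sketch of such a function:

downloadPress <- function(url, folder) {
    dir.create(folder, showWarnings = FALSE)
    file_name <- str_extract(url, "[^/][[:alnum:]_.]+$")
    if (!file.exists(str_c(folder, "/", file_name))) {
        content <- getURL(url)
        write(content, str_c(folder, "/", file_name))
        Sys.sleep(1)
    }
}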

We apply the function to the list of press release URLs.

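A possible call; the folder name is our choice:

R> l_ply(press_urls_list, downloadPress, folder = "tp_press_2010")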

All 152 files have been downloaded successfully. To process the press releases, we would have to parse them similarly to the getPressURLs() function and extract the text. Moreover, to accomplish the task that was specified at the beginning of the section, we would also have to generalize the functions to loop over the years on the website, but the underlying ideas do not change.

In scenarios where the range of URLs is not as clear as in the example described above, we can make use of the url.exists() function from the RCurl package. It works analogously to file.exists() and indicates whether a given URL exists, that is, whether the server responds without an error.

In many web scraping exercises, we can apply URL manipulation to easily access all the sites that we are interested in. The downside of this type of access to a website is that we need a fairly intimate knowledge of the website and of its directories in order to perform URL manipulations. This is to say that URL manipulation cannot be used to write a crawler for multiple websites, as the specific manipulations must be developed for each website.

9.1.4 Convenient functions to gather links, lists, and tables from HTML documents

The XML package provides powerful tools for parsing XML-style documents. Yet it offers more commands that considerably ease information extraction tasks in the web-scraping workflow. The functions readHTMLTable(), readHTMLList(), and getHTMLLinks() help extract data from HTML tables, lists, and internal as well as external links. We illustrate their functionality with a Wikipedia article on Niccolò Machiavelli, an “Italian historian, politician, diplomat, philosopher, humanist, and writer” (Wikipedia 2014).

The first function we will inspect is getHTMLLinks(), which serves to extract links from HTML documents. To illustrate the flexibility of the convenience functions, we prepare several objects. The first object stores the URL for the article (mac_url), the second stores the source code (mac_source), the third stores the parsed document (mac_parsed), and the fourth and last object (mac_node) holds only one node of the parsed document, namely the <p> node that includes the introductory text.


R> mac_url <- "http://en.wikipedia.org/wiki/Machiavelli"
R> mac_source <- readLines(mac_url, encoding = "UTF-8")
R> mac_parsed <- htmlParse(mac_source, encoding = "UTF-8")
R> mac_node <- mac_parsed["//p"][[1]]

All of these representations of an HTML document (URL, source code, parsed document, and a single node) can be used as input for getHTMLLinks() and the other convenience functions introduced in this section.


R> getHTMLLinks(mac_url)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"

R> getHTMLLinks(mac_source)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"

R> getHTMLLinks(mac_parsed)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"

R> getHTMLLinks(mac_node)[1:3]
[1] "/wiki/Help:IPA_for_Italian" "/wiki/Renaissance_humanism"
[3] "/wiki/Renaissance"

We can also supply XPath expressions to restrict the returned links to specific subsets, for example, only those links of class extiw.

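A sketch of such a restricted call (output omitted):

R> getHTMLLinks(mac_source, xpQuery = "//a[@class='extiw']/@href")[1:3]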

getHTMLLinks() retrieves links from HTML as well as names of external files. We already made use of the latter feature in Section 9.1.1. An extension of getHTMLLinks() is getHTMLExternalFiles(), designed to extract only links that point to external files which are part of the document. Let us use the function along with its xpQuery parameter. We restrict the set of returned links to those mentioning Machiavelli to hopefully find a URL that links to a picture.

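A sketch of this query; the XPath expression is our own formulation:

R> xpath <- "//img[contains(@src, 'Machiavelli')]/@src"
R> getHTMLExternalFiles(mac_source, xpQuery = xpath)[1:3]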

The first three results look promising; they all point to image files stored on the Wikimedia servers.

The next convenient function is readHTMLList() and, as the name already suggests, it extracts list elements (see Section 2.3.7). Browsing through the article we find that under Discourses on Livy several citations from the work are pooled as an unordered list that we can easily extract. Note that the function returns a list object where each element corresponds to a list in the HTML document. As the citations are the tenth list within the HTML—we figured this out by eyeballing the output of readHTMLList()—we use the index operator [[10]].


R> readHTMLList(mac_source)[[10]][1:3]
[1] "\"In fact, when there is combined under the same constitution a prince, a nobility, and the power of the people, then these three powers will watch and keep each other reciprocally in check.\" Book I, Chapter II"
[2] "\"Doubtless these means [of attaining power] are cruel and destructive of all civilized life, and neither Christian, nor even human, and should be avoided by every one. In fact, the life of a private citizen would be preferable to that of a king at the expense of the ruin of so many human beings.\" Bk I, Ch XXVI"
[3] "\"Now, in a well-ordered republic, it should never be necessary to resort to extra-constitutional measures. ...\" Bk I, Ch XXXIV"

The last function of the XML package we would like to introduce at this point is readHTMLTable(), a function to extract HTML tables. Not only does the function locate tables within the HTML document, but it also transforms them into data frames. As before, the function extracts all tables and stores them in a list. Whenever the extracted HTML tables carry information that can be used as a name, they are stored as named list items. Let us first get an overview of the tables by listing the table names.

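One way to inspect the table names, assuming we store the full list of tables in an object called tables:

R> tables <- readHTMLTable(mac_source, stringsAsFactors = FALSE)
R> names(tables)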

There are ten tables; two of them are labeled. Let us extract the last one to retrieve personal information on Machiavelli.

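The last table can then be accessed by its position (or, equivalently, by its name):

R> tables[[length(tables)]]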

A powerful feature of readHTMLList() and readHTMLTable() is that we can define individual element functions using the elFun argument. By default, the function applied to each list item (<li>) and each cell of the table (<td>), respectively, is xmlValue(), but we can specify other functions that take XML nodes as arguments. Let us use another HTML table to demonstrate this feature. The first table of the article gives an overview of Machiavelli's personal information and, in the seventh and eighth rows, lists persons and schools of thought that have influenced him in his thinking as well as those that were influenced by him.


R> readHTMLTable(mac_source, stringsAsFactors = F)[[1]][7:8, 1]
[1] "Influenced by\nXenophon, Plutarch, Tacitus, Polybius, Cicero, Sallust, Livy, Thucydides"
[2] "Influenced\nPolitical Realism, Bacon, Hobbes, Harrington, Rousseau, Vico, Edward Gibbon, David Hume, John Adams, Cuoco, Nietzsche, Pareto, Gramsci, Althusser, T. Schelling, Negri, Waltz, Baruch de Spinoza, Denis Diderot, Carl Schmitt"

In the HTML file, the names of philosophers and schools of thought are also linked to the corresponding Wikipedia articles, but this information gets lost by relying on the default element function. Let us replace the default function by one that is designed to extract links— getHTMLLinks(). This allows us to extract all links for influential and influenced thinkers.

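A sketch of such a call, applying getHTMLLinks() to every cell of the first table and keeping only the two rows of interest:

R> influences <- readHTMLTable(mac_source, elFun = getHTMLLinks,
                               stringsAsFactors = FALSE)[[1]][7:8, ]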

Extracting links, tables, and lists from HTML documents are ordinary tasks in web scraping practice. These functions save a lot of time that we would otherwise have to spend on constructing suitable XPath expressions, and they help keep our code tidy.

9.1.5 Dealing with HTML forms

Forms are a classical feature of user–server interaction via HTTP on static websites. They vary in size, layout, input type, and other parameters—just think about all the search bars you have used, the radio buttons you have selected, the check marks you have set, the user names and passwords typed in, and so on. Forms are easy to handle with a graphical user interface like a browser, but a little more difficult when they have to be disentangled in the source code. In this section, we will cover the general approach to master forms with R. In the end you should be able to recognize forms, determine the method used to pass the inputs, the location where the information is sent, and how to specify options and parameters for sending data to the servers and capture the result.

We will consider three different examples throughout this section to learn how to prepare your R session, approach forms in general, use the HTTP GET method to send forms to the server, use POST with url-encoded or multipart body, and let R automatically generate functions that use GET or POST with adequate options to send form data.

Filling out forms in the browser and handling them from within R differs in many respects, because much of the work that is usually done by the browser in the background has to be specified explicitly. Using a browser, we

  1. fill out the form,
  2. push the submit, ok, start, or similar button,
  3. let the browser execute the action specified in the source code of the form and send the data to the server,
  4. and let the browser receive the returned resources after the server has evaluated the inputs.

In scraping practice, things get a little more complicated. We have to

  1. recognize the forms that are involved,
  2. determine the method used to transfer the data,
  3. determine the address to send the data to,
  4. determine the inputs to be sent along,
  5. build a valid request and send it out, and
  6. process the returned resources.

In this section, we use functions from the RCurl, XML, stringr, and the plyr packages. Furthermore, we specify an object that captures debug information along the way so that we can check for details if something goes awry (see Section 5.4.3 for details). Additionally, we specify a curl handle with a set of default options—cookiejar to enable cookie management, followlocation to follow page redirections which may be triggered by the POST command, and autoreferer to automatically set the Referer request header when we have to follow a location redirect. Finally, we specify the From and User-Agent header manually to stay identifiable:

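A sketch of this setup; the From and User-Agent values are placeholders:

R> info <- debugGatherer()
R> handle <- getCurlHandle(debugfunction = info$update,
                           verbose = TRUE,
                           cookiejar = "",
                           followlocation = TRUE,
                           autoreferer = TRUE,
                           httpheader = c(From = "your@email.com"),
                           useragent = str_c(R.version$version.string,
                                             R.version$platform, sep = ", "))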

Another preparatory step is to define a function that translates lists of XML attributes into data frames. This will come in handy when we are going to evaluate the attributes of HTML form elements of parsed HTML documents. The function we construct is called xmlAttrsToDF() and takes two arguments. The first argument supplies a parsed HTML document and the second an XPath expression specifying the nodes from which we want to collect the attributes. The function extracts the nodes’ attributes via xpathApply() and xmlAttrs() and transforms the resulting list into a data frame while ensuring that attribute names do not get lost and that each attribute value is stored in a separate column:

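A sketch of the helper function; rbind.fill() from the plyr package aligns data frames with differing columns:

xmlAttrsToDF <- function(parsedHTML, xpath) {
    x <- xpathApply(parsedHTML, xpath, xmlAttrs)
    x <- lapply(x, function(attrs) as.data.frame(t(attrs),
                                                 stringsAsFactors = FALSE))
    do.call(rbind.fill, x)
}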

9.1.5.1 GETting to grips with forms

To present how to generally approach forms and specifically how to handle forms that demand HTTP GET, we use WordNet. WordNet is a service provided by Princeton University at http://wordnetweb.princeton.edu/perl/webwn. Researchers at Princeton have built up a database of synonyms for English nouns, verbs, and adjectives. They offer their data as an online service. The website relies on an HTML form to gather the parameters and send a request for synonyms—see Princeton University (2010a) for further details and Princeton University (2010b) for the license.

Let us browse to the page and type in a word, for example, data. Hitting the Search WordNet button results in a change to the URL which now contains 13 parameters.


We have been redirected to another page, which informs us that data is a noun and that it has two semantic meanings.

From the fact that the URL is extended with a query string when submitting our search term we can infer that the form uses the HTTP GET method to send the data to the server. But let us verify this conclusion. To briefly recap the relevant facts from Chapter 2: HTML forms are specified with the help of <form> nodes and their attributes. The <form> nodes’ attributes define the specifics of the data transfer from client to server. <input> nodes are nested in <form> nodes and define the kind of data that needs to be supplied to the form.

We can either use the view-source feature of a browser to check out the attributes of the form nodes, or we can use R to get the information. This time we do the latter. First, we load the page into R and parse it.

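A sketch of this step, reusing the handle defined above:

R> url <- "http://wordnetweb.princeton.edu/perl/webwn"
R> html_form <- getURL(url, curl = handle)
R> parsed_form <- htmlParse(html_form)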

Let us have a look at the form node attributes to learn the specifics of sending data to the server. We use the xmlAttrsToDF() that we have set up above for this task.

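A sketch of the call (output omitted):

R> xmlAttrsToDF(parsed_form, "//form")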

There are two HTML forms on the page, one called f and the other change. The first form submits the search terms to the server while the second takes care of submitting further options on the type and range of data being returned. For the sake of simplicity, we will ignore the second form.

With regard to the specifics of sending the data, the attribute values tell us that we should use the HTTP method GET (method) and send it to webwn (action) which is the location of the form we just downloaded and parsed. The enctype parameter with value multipart/form-data comes as a bit of a surprise. It refers to how content is encoded in the body of the request. As GET explicitly does not use the body to transport data, we disregard this option.

The next task is to get the list of input parameters. When GET is used to send data, we can easily spot the parameters sent to the server by inspecting the query string added to the URL. But those parameters might only be a subset of all possible parameters. We therefore use xmlAttrsToDF() again to get the full set of inputs and their attributes.

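A sketch of the call for the search form named f (output omitted):

R> xmlAttrsToDF(parsed_form, "//form[@name='f']//input")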

As suggested by the long query string added to the URL after searching for our first search term, we get a list of 13 input nodes. Recall that there was only one input field on the page—the text field where we specified the search term. Inspecting the inputs reveals that 11 of the input fields are of type hidden, that is, input fields which cannot be manipulated by the user. Moreover, input fields of type submit are hidden from user manipulation as well, so there is only one parameter left for us to take care of. It turns out that the other parameters are used for submitting options to the server and have nothing to do with the actual search. To make simple search requests, the s parameter is sufficient.

Combining the information on HTTP method, request location, and parameters, we can now build an adequate request by using one of RCurl's form functions. As the HTTP method to send data to the server is GET, we use getForm(). Since the location to which we send the request remains the same, we can reuse the URL we used before. As parameter we only supply the s parameter with a value equal to the search term that we want to get synonyms for.


R> html_form_res <- getForm(uri = url, curl = handle, s = "data")
R> parsed_form_res <- htmlParse(html_form_res)
R> xpathApply(parsed_form_res, "//li", xmlValue)
[[1]]
[1] "S: (n) data, information (a collection of facts from which conclusions may be drawn) \"statistical data\""

[[2]]
[1] "S: (n) datum, data point (an item of factual information derived from measurement or research) "

Let us also have a look at the header information we supply by inspecting the information stored in the info object with the debugGatherer() function and reset it afterwards.


R> cat(str_split(info$value()["headerOut"], "\r\n")[[1]])
GET /perl/webwn HTTP/1.1
Host: wordnetweb.princeton.edu
Accept: */*
from: [email protected]
user-agent: R version 3.0.2 (2013-09-25), x86_64-w64-mingw32

GET /perl/webwn?s=data HTTP/1.1
Host: wordnetweb.princeton.edu
Accept: */*
from: [email protected]
user-agent: R version 3.0.2 (2013-09-25), x86_64-w64-mingw32
R> info$reset()

We find that the requests for fetching the form information and sending the form data are nearly identical, except that in the latter case the query string ?s=data is appended to the requested resource.

The same could have been achieved by supplying a URL with appended query string and a call to getURL():


R> url <- "http://wordnetweb.princeton.edu/perl/webwn?s=data"
R> html_form_res <- getURL(url = url, curl = handle)

9.1.5.2 POSTing forms

Forms that use the HTTP method POST are in many respects identical to forms that use GET. The key difference between the two methods is that with POST, the information is transferred in the body of the request. There are two common styles for transporting data in the body, either url-encoded or multipart. While the former is efficient for text data, the latter is better suited for sending files. Thus, depending on the purpose of the form, one or the other POST style is expected. The next two sections will show how to handle POST forms in practice. The first example deals with a url-encoded body and the second one showcases sending multipart data.

POST with url-encoded body

In the first example, we use a form from http://www.read-able.com. The website offers a service that evaluates the readability of webpages and texts. As before, we use the precomposed handle to retrieve the page and directly parse and save it.


R> url <- "http://read-able.com/"
R> form <- htmlParse(getURL(url = url, curl = handle))

Looking for <form> nodes reveals that there are two forms in the document. An examination of the site reveals that the first is used to supply a URL in order to evaluate a webpage's readability, while the second form allows inputting text directly.

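A sketch of the inspection (output omitted):

R> xmlAttrsToDF(form, "//form")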

There is no enctype specified in the attributes of the second form, so we expect the server to accept both encoding styles. Because url-encoded bodies are more efficient for text data, we will use this style to send the data.

An inspection of the second form's input fields indicates that there seem to be no inputs other than the submit button.


R> xmlAttrsToDF(form, "//form[2]//input")

Looking at the entire source code of the form, we find that there is a textarea node that gathers text to be sent to the server.

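One way to print the form's source for inspection (output omitted):

R> xpathApply(form, "//form[2]")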

Its name attribute is directInput which serves as parameter name for sending the text. Let us use a famous quote about data found at http://www.goodreads.com/ to check its readability.


R> sentence <- "\"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.\" -- Arthur Conan Doyle, Sherlock Holmes"

We send it to the read-able server for evaluation. Within the call to postForm() we set style to "POST" for a url-encoded transmission of the data.

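A sketch of the request; the path check.php is taken from the request header shown further below:

R> res <- postForm(uri = str_c(url, "check.php"),
                   curl = handle,
                   style = "POST",
                   directInput = sentence)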

Most of the results are presented as HTML tables as shown below.

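The tables can be extracted from the response, for example, with readHTMLTable() (output omitted):

R> readHTMLTable(htmlParse(res))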

All in all, with a Grade Level of 6.6, 12- to 13-year-old children should be able to understand Sherlock Holmes’ dictum. Let us check out the header information that was sent to the server.


R> cat(str_split(info$value()["headerOut"], "\r\n")[[1]])
GET / HTTP/1.1
Host: read-able.com
Accept: */*
from: [email protected]
user-agent: R version 3.0.2 (2013-09-25), x86_64-w64-mingw32

POST /check.php HTTP/1.1
Host: read-able.com
Accept: */*
from: [email protected]
user-agent: R version 3.0.2 (2013-09-25), x86_64-w64-mingw32
Content-Length: 277
Content-Type: application/x-www-form-urlencoded

The second header confirms that the data have been sent via POST, using the following url-encoded body.4


R> cat(str_split(info$value()["dataOut"], "\r\n")[[1]])
directInput=%22It%20is%20a%20capital%20mistake%20to%20theorize%20
before%20one%20has%20data%2E%20Insensibly%20one%20begins%20to%20
twist%20facts%20to%20suit%20theories%2C%20instead%20of%20theories%20
to%20suit%20facts%2E%22%20%2D%2D%20Arthur%20Conan%20Doyle%2C%20
Sherlock%20Holmes

R> info$reset()

POST with multipart-encoded body

The second example considers a POST with a multipart-encoded body. Fix Picture (http://www.fixpicture.org/) is a web service to transform image files from one format to another. In our example we will transform a picture from PNG format to PDF.

Let us begin by retrieving a picture in PNG format and saving it to our disk.


R> url <- "r-datacollection.com/materials/http/sky.png"
R> sky <- getBinaryURL(url = url, curl = handle)
R> writeBin(sky, "sky.png")

Next, we collect the main page of Fix Picture including the HTML form.


R> url <- "http://www.fixpicture.org/"
R> form <- htmlParse(getURL(url = url, curl = handle))

We check out the attributes of the form nodes.

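A sketch of the inspection (output omitted):

R> xmlAttrsToDF(form, "//form")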

We find that there is only one form on the page. The form expects data to be sent with POST and a multipart-encoded body. The list of possible inputs is extensive, as we can not only transform the picture from one format to another but also flip and rotate it, restrict it to grayscale, or choose the quality of the new format. For the sake of simplicity, we restrict ourselves to a simple transformation from one format to another.

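A sketch of how to list the inputs (output omitted):

R> xmlAttrsToDF(form, "//input")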

The important input is the image. The upload-file value for the class attribute in one of the <input> nodes suggests that we supply the file's content under this name.

There is no input node for selecting the format of the output file. Inspecting the source code reveals that a select node is enclosed in the form. Select elements allow choosing between several options which are supplied as option nodes.


The name attribute of the select node indicates under which name (format) the value listed within the option nodes should be sent to the server.

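A sketch of how to inspect the select node and its options (output omitted):

R> xmlAttrsToDF(form, "//select")
R> xpathSApply(form, "//select/option", xmlValue)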

Disregarding all other possible options, we are ready to send the data along with parameters to the server. For RCurl to read the file and send it to the server, we have to use RCurl’s fileUpload() function that takes care of providing the correct information for the underlying libcurl library. The following code snippet sends the data to the server.

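A sketch of the request; we assume here that the form posts back to the page's own URL (check the form's action attribute), while the parameter names image and format are confirmed by the multipart body shown further below:

R> res <- postForm(uri = url,
                   curl = handle,
                   style = "HTTPPOST",
                   image = fileUpload(filename = "sky.png",
                                      contentType = "image/png"),
                   format = "pdf")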

The result is not the transformed file itself but another HTML document from which we extract the link to the file.


R> doc <- htmlParse(res)
R> link <- str_c(url, xpathApply(doc, "//a/@href", as.character)[[1]])

We download the transformed file and write it to our local drive.


R> resImage <- getBinaryURL(link, curl = handle)
R> writeBin(resImage, "sky.pdf", useBytes = TRUE)

The result is the PNG picture transformed to PDF format. Last but not least let us have a look at the multipart body with the data that have been sent via POST:


R> cat(str_split(info$value()["dataOut"], "\r\n")[[1]])
----------------------------30059d14e820
Content-Disposition: form-data; name="image"; filename="sky.png"
Content-Type: image/png

[[BINARY DATA]]
----------------------------30059d14e820
Content-Disposition: form-data; name="format"

pdf
----------------------------30059d14e820--

The [[BINARY DATA]] snippet indicates binary data that cannot be properly displayed with text. Finally, we reset the info slot again.


R> info$reset()

9.1.5.3 Automating form handling—the RHTMLForms package

The tools we have introduced in the previous paragraphs can be adapted to specific cases to handle form interactions. One shortcoming is that the interaction requires a lot of manual labor and inspection of the source code. One attempt to automate some of the necessary steps is the RHTMLForms package (Temple Lang et al. 2012). It was designed to automatically create functions that fill out forms, select the appropriate HTTP method to send data to the server, and retrieve the result. The RHTMLForms package is not hosted on CRAN. You can install it by supplying the location of the repository.


R> install.packages("RHTMLForms", repos = "http://www.omegahat.org/R", type = "source")
R> library(RHTMLForms)

The basic procedure of RHTMLForms works as follows:

  1. We use getHTMLFormDescription() on the URL where the HTML form is located and save its results in an object—let us call it forms.
  2. We use createFunction() on the first item of the forms object and save the result in another object, say formFunction.
  3. formFunction() takes input fields as arguments, sends them to the server, and returns the result.

Let us go through this process using WordNet again. We start by gathering the form description information and creating the form function.

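A sketch of these two steps:

R> url <- "http://wordnetweb.princeton.edu/perl/webwn"
R> forms <- getHTMLFormDescription(url)
R> formFunction <- createFunction(forms[[1]])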

Having created formFunction(), we use it to send form data to the server and retrieve the results.


R> html_form_res <- formFunction(s = "data", .curl = handle)
R> parsed_form_res <- htmlParse(html_form_res)
R> xpathApply(parsed_form_res, "//li", xmlValue)
[[1]]
[1] "S: (n) data, information (a collection of facts from which conclusions may be drawn) \"statistical data\""

[[2]]
[1] "S: (n) datum, data point (an item of factual information derived from measurement or research) "

Let us have a look at the function we just created.

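Printing the object displays the generated function (output omitted here):

R> formFunction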

Although it might look intimidating at first, it is easier than it looks because most of the options are for internal use. The options are set automatically and we can disregard them—.reader, .formDescription, elements, .url, url, and .cleanArgs. We are already familiar with some of the options like .curl and .opts. In fact, when looking at the call to formFunction() above you will notice that the same handle was used as before and that info was updated as expected. That is because under the hood of these functions all requests are made with the RCurl functions getForm() and postForm(), so that we can expect .opts and .curl to work in the same way as when using pure RCurl functions.

The last set of options are the names of the inputs we fill in and send to the server. In our case, createFunction() correctly recognized o0 to o8 and h as inputs that need not be manipulated by users. The elements argument stores their default values. In contrast to the input that stores the search term—s—which got an argument with the same name, createFunction() did not create arguments for formFunction() that allow specifying values for o0, o1, and so on, as they are not necessary for the request.

The RHTMLForms package might sound like it simplifies interactions with HTML forms to a great extent. While it is true that we save some of the actual coding, the interactions still require a fairly intimate knowledge of the form in order to be able to interact with it. This is to say that it is difficult to interact sensibly with a form if you do not know the type of input and output for a form.

9.1.6 HTTP authentication

Not all places on the Web are accessible to everyone. We have learned in Section 5.2.2 that HTTP offers authentication techniques which restrict content from unauthorized users, namely basic and digest authentication. Performing basic authentication with R is straightforward with the RCurl package.

As a short example, we try to access the “solutions” section at www.r-datacollection.com/materials/solutions. When trying to access the resources with our browser, we are confronted with a login form (see Figure 9.1). In R we can pass username and password to the server with libcurl’s userpwd option. Base64 encoding is performed automatically.


Figure 9.1 Screenshot of HTTP authentication mask at http://www.r-datacollection.com/materials/solutions


R> url <- "www.r-datacollection.com/materials/solutions"
R> cat(getURL(url, userpwd = "teacher:sesame", followlocation = TRUE))
solutions coming soon

The userpwd option also works for digest authentication, and we do not have to manually deal with nonces, algorithms, and hash codes—libcurl takes care of these things on its own.

To avoid storing passwords in the code, it can be convenient to put them in the .Rprofile file, as R reads it automatically with every start (see Nolan and Temple Lang 2014, p. 295).


R> options(RDataCollectionLogin = "teacher:sesame")

We can retrieve and use the password using getOption().


R> getURL(url, userpwd = getOption("RDataCollectionLogin"),
          followlocation = TRUE)

9.1.7 Connections via HTTPS

The secure transfer protocol HTTPS (see Section 5.3.1) is becoming increasingly common. In order to retrieve content from servers via HTTPS, we can draw on libcurl/RCurl, which support SSL connections. In fact, we do not have to care much about the encryption and SSL negotiation details, as they are handled by libcurl in the background by default.

Let us consider an example. The Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan provides access to a huge archive of social science data. We are interested in just a tiny fraction of it—some meta-information on survey variables. At https://www.icpsr.umich.edu/icpsrweb/ICPSR/ssvd/search, the ICPSR offers a fielded search for variables. The search mask allows us to specify variable label, question text or category label, and returns a list of results with some snippets of information. What makes this page a good exercise is that it has to be accessed via HTTPS, as the URL in the browser reveals. In principle, connecting to websites via HTTPS can be just as easy as this


R> url <- "https://www.icpsr.umich.edu/icpsrweb/ICPSR/ssvd/search"
R> getURL(url)
Error: SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Setting up a successful connection does not seem to always be straightforward. The error message states that the server certificate signed by a trusted certificate authority (CA)—necessary to prove the server's identity—could not be verified. This error could indicate that the server should not be trusted because it is not able to provide a valid proof of its identity. In this case, however, the reason for this error is different and we can easily remedy the problem. What libcurl tries to do when connecting to a server via HTTPS is to access the locally stored library of CA signatures to validate the server's first response. On some systems—ours included—libcurl has difficulties finding the relevant file (cacert.pem) on the local drive. We therefore have to specify the path to the file manually and hand it to our gathering function with the argument cainfo. We can either supply the directory where our browser stores its library of certificates or use the file that comes with the installation of RCurl.

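A sketch of this workaround, using the certificate bundle shipped with RCurl:

R> signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
R> res <- getURL(url, cainfo = signatures)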

Alternatively, we can update the bundle of CA root certificates. A current version can be accessed at http://curl.haxx.se/ca/cacert.pem. In cases where validation of the server certificate still fails, we can prevent libcurl from trying to validate the server altogether. This is done with the ssl.verifypeer argument (see Nolan and Temple Lang 2014, p. 300).


R> res <- getURL(url, ssl.verifypeer = FALSE)

This might be a potentially risky choice if the server is in fact not trustworthy. After all, it is the primary purpose of HTTPS to provide means to establish secure connections to a verified server.

Returning to the example, we examine the GET form with which we can query the ICPSR database. The action parameter reveals that the GET refers to /icpsrweb/ICPSR/ssvd/variables. The <input> elements are variableLabel, questionText, and categoryLabel. We re-specify the target URL in u_action and set up a curl handle. It stores the CA signatures and can be used across multiple requests. Finally, we formulate a getForm() call searching for questions that contain ‘climate change’ in their label, and extract the number of results from the query.

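A sketch of these steps; the XPath needed to extract the number of results depends on the page structure and is omitted here:

R> u_action <- "https://www.icpsr.umich.edu/icpsrweb/ICPSR/ssvd/variables"
R> handle <- getCurlHandle(cainfo = signatures)
R> res <- getForm(u_action, curl = handle, variableLabel = "climate change")
R> parsed_res <- htmlParse(res)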

This is a minimal evaluation of our search results. We could easily extract more information on the single questions and query other question specifics, too.

9.1.8 Using cookies

Cookies are used to allow HTTP servers to re-recognize clients, because HTTP itself is a stateless protocol that treats each exchange of request and response as though it were the first (see Section 5.2.1). With RCurl and its underlying libcurl library, cookie management with R is quite easy. All we have to do is to turn it on and keep it running across several requests with the use of a curl handle—setting and sending the right cookie at the right time is managed in the background.

In this section, we draw on functions from the packages RCurl, XML, and stringr for HTTP client support, HTML parsing, and XPath queries as well as convenient text manipulation. Furthermore, we create an object info that logs information on exchanged information between our client and the servers we connect to. We also create a handle that is used throughout this section.

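A sketch of this setup; the From and User-Agent values are placeholders:

R> info <- debugGatherer()
R> handle <- getCurlHandle(cookiejar = "",
                           followlocation = TRUE,
                           autoreferer = TRUE,
                           debugfunction = info$update,
                           verbose = TRUE,
                           httpheader = c(From = "your@email.com"),
                           useragent = str_c(R.version$version.string,
                                             R.version$platform, sep = ", "))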

The most important option for this section is the first argument in the handle, cookiejar = "". Specifying the cookiejar option even without supplying a file name for the jar—a place to store cookie information in—activates cookie management by the handle. The two options to follow (followlocation and autoreferer) are nice-to-have options that preempt problems which might occur due to redirections to other resources. The remaining options are known from above.

The general approach for using cookies with R is to rely on RCurl’s cookie management by reusing a handle with activated cookie management, like the one specified above, in subsequent requests.

9.1.8.1 Filling an online shopping cart

Although cookie support is most likely needed for accessing webpages that require logins in practice, the following example illustrates cookies with a bookshop shopping cart at Biblio, a page that specializes in finding and ordering used, rare, and out-of-print books.

Let us browse to http://www.biblio.com/search.php?keyisbn=data and put some books into our cart. For the sake of simplicity, the query string appended to the URL already issues a search for books with data as keyword. Each time we select a book for our cart by clicking on the add to cart button, we are redirected to the cart (http://www.biblio.com/cart.php). We can go back to the search page, select another book and add it to the cart.

To replicate this from within R, we first define the URL leading to the search results page (search_url) as well as the URL leading to the shopping cart (cart_url) for later use.


R> search_url <- "www.biblio.com/search.php?keyisbn=data"
R> cart_url <- "www.biblio.com/cart.php"

Next, we download the search results page and directly parse and save it in search_page.


R> search_page <- htmlParse(getURL(url = search_url, curl = handle))

Adding items to the shopping cart is done via HTML forms.

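Reusing the xmlAttrsToDF() helper from Section 9.1.5, the form of the first result box and its inputs can be inspected like this (output omitted):

R> xmlAttrsToDF(search_page, "//div[@class='order-box'][position()<2]/form")
R> xmlAttrsToDF(search_page, "//div[@class='order-box'][position()<2]/form/input")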

We extract the book IDs to later add items to the cart.


R> xpath <- "//div[@class='order-box'][position()<4]/form/input[@name='bid']/@value"
R> bids <- unlist(xpathApply(search_page, xpath, as.numeric))
R> bids
[1] 652559100 453475278 468759385

Now we add the first three items from the search results page to the shopping cart by sending the necessary information (bid, add, and int) to the server. Notice that by passing the same handle to the request via the curl option, we automatically add received cookies to our requests.

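A sketch of these requests; the parameter values mirror the query string visible in the request headers shown further below:

R> for (i in seq_along(bids)) {
       res <- getForm(uri = cart_url, curl = handle,
                      bid = bids[i], add = 1, int = "keyword_search")
   }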

Finally, we retrieve the shopping cart and check out the items that have been stored.


R> cart <- htmlParse(getURL(url = cart_url, curl = handle))
R> clean <- function(x) str_replace_all(xmlValue(x), "(\t)|(\n)", "")
R> cat(xpathSApply(cart, "//div[@class='title-block']", clean))

DATA
by Hill, Anthony (ed)
Developing Language Through Design and Technology
by DATA
Guide to Design and technology Resources
by DATA

As expected, there are three items stored in the cart. Let us consider again the headers sent with our requests and received from the server. We first issued a request that did not contain any cookies.


R> cat(str_split(info$value()["headerOut"], "\r\n")[[1]][1:13])

GET /search.php?keyisbn=data HTTP/1.1
Host: www.biblio.com
Accept: */*
from: [email protected]
user-agent: R version 3.0.2 (2013-09-25), x86_64-w64-mingw32

The server responded with the prompt to set two cookies—one called vis, the other variation.


R> cat(str_split(info$value()["headerIn"], "\r\n")[[1]][1:14])

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 06 Mar 2014 10:27:23 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=60
Set-Cookie: vis=language%3Ade%7Ccountry%3A6%7Ccurrency%3A9%7Cvisitor%3AVrCZ...; expires=Tue, 05-Mar-2019 10:27:21 GMT; path=/; domain=.biblio.com; httponly
Set-Cookie: variation=res_a; expires=Fri, 07-Mar-2014 10:27:21 GMT; path=/; domain=.biblio.com; httponly
Vary: User-Agent,Accept-Encoding
Expires: Fri, 07 Mar 2014 10:27:23 GMT
Cache-Control: max-age=86400
Cache-Control: no-cache

Our client responded with a new request, now containing the two cookies.


R> cat(str_split(info$value()["headerOut"], "\r\n")[[1]][1:13])

GET /cart.php?bid=652559100&add=1&int=keyword%5Fsearch HTTP/1.1
Host: www.biblio.com
Accept: */*
Cookie: variation=res_a; vis=language%3Ade%7Ccountry%3A6%7Ccurrency%3A9%7Cvisitor%3AVrCZz...
from: [email protected]
user-agent: R version 3.0.2 (2013-09-25), x86_64-w64-mingw32

If we had failed to supply the cookies, our shopping cart would have remained empty. The following request is identical to the request made above—we use the same handle and code—except that we use cookielist = "ALL" to reset all cookies collected so far.


R> cart <- htmlParse(getURL(url = cart_url, curl = handle,
                            cookielist = "ALL"))
R> clean <- function(x) str_replace_all(xmlValue(x), "(\t)|(\n)", "")
R> cat(xpathSApply(cart, "//div[@class='title-block']", clean))

Consequently, the cart is returned empty because without cookies the server has no way of knowing which actions, like adding items to the shopping cart, have been taken so far.

9.1.8.2 Further tricks of the trade

The approach from above—define and use a handle with enabled cookie management and let RCurl and libcurl take care of further details of HTTP communication—will be sufficient in most cases. Nevertheless, sometimes more control of the specifics is needed. In the following we will go through some further features in handling cookies with RCurl.

We have specified cookiejar = "" in the previous section to activate automatic cookie management. If a file name is supplied to this option, for example, cookiejar = "cookies.txt", all cookies are stored in this file whenever cookielist = "FLUSH" is specified as an option to an RCurl function using the handle, or via curlSetOpt().


R> handle <- getCurlHandle(cookiejar = "cookies.txt")
R> res <- getURL("http://httpbin.org/cookies/set?k1=v1&k2=v2", curl = handle)
R> handle <- curlSetOpt(cookielist = "FLUSH", curl = handle)

An example of a cookie file looks as follows:


R> readLines("cookies.txt")
[1] "# Netscape HTTP Cookie File"
[2] "# http://curl.haxx.se/rfc/cookie_spec.html"
[3] "# This file was generated by libcurl! Edit at your own risk."
[4] ""
[5] "httpbin.org\tFALSE\t/\tFALSE\t0\tk2\tv2"
[6] "httpbin.org\tFALSE\t/\tFALSE\t0\tk1\tv1"

We can use the information in the file to get a set of initial cookies using the cookiefile option.


R> new_handle <- getCurlHandle(cookiefile = "cookies.txt")

Besides writing collected cookies to a file, we can also clear the list of cookies collected so far with cookielist = "ALL".

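Mirroring the FLUSH call from above, such a reset might look like this:

R> handle <- curlSetOpt(cookielist = "ALL", curl = handle)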

Last but not least, although RCurl and libcurl handle cookies set via HTTP reliably, cookies that are set by other technologies, for example by JavaScript, have to be provided manually. We can do this by supplying the cookie option with the exact contents of the cookies.

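A minimal sketch of such a request; the cookie names and values are made up for illustration:

R> res <- getURL("http://httpbin.org/cookies",
                 cookie = "sessionid=abc123; lang=en",
                 curl = handle)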

9.1.9 Scraping data from AJAX-enriched webpages with Selenium/Rwebdriver

We learned in Chapter 6 that accessing particular information in webpages may be impeded when a site employs methods for dynamic data requests, especially through XHR objects. We illustrated that in certain situations this problem can be circumvented with Web Developer Tools, which can reveal the target source from which AJAX-enriched webpages query their information. Unfortunately, this approach is not a universal solution to all extraction problems that involve dynamic data requests. For one, the source may not be as easily spotted as in the stylized examples we introduced, but may require time-consuming investigation of the respective code and considerably more knowledge about JavaScript and the XHR object. Another problem that renders this approach infeasible is that AJAX frequently does not access a specific data source directly but only interacts with an intermediate server-side scripting language like PHP. PHP allows evaluating queries and sending requests to a database, for example a MySQL database (see Chapter 7), and then feeds the returned data back to the AJAX callback function and into the DOM tree. Such an approach effectively conceals the target source and eliminates the option of accessing it directly.

In this section, we introduce a generalized approach to cope with dynamically rendered webpages by means of browser control. The idea is the following: Instead of bypassing web browsers, we include them directly in the scraping process and leverage their ability to interpret JavaScript and to apply changes to the live DOM tree. Essentially, this means that all communication with a webpage is routed through a web browser session to which we send and from which we receive information. There are numerous programs which allow such an approach. Here, we introduce the Selenium/Webdriver framework for browser automation (Selenium Project 2014a,b) and its implementation in R via the Rwebdriver package. We start by presenting the problems posed by such pages with the help of a running example. We then turn to illustrating the basic idea behind Selenium/Webdriver, explain how to install the Rwebdriver package, and show how to direct commands to the browser from the R command line. Using the running example, we discuss the implemented methods and how we can leverage them for web scraping.

9.1.9.1 Case study: Federal Contributions Database

As a running example, we try to obtain data from a database on financial contributions to US parties and candidates. The data were originally collected and published by OpenSecrets.org under a non-restrictive license (Center for Responsive Politics 2014). A sample of the data has been fed to a database that can be accessed at http://r-datacollection.com/materials/selenium/dbQuery.html. As always, we start by trying to learn the structure of the page and the way it requests and handles the information of interest. The tools of choice for this task are the browser-implemented Web Developer Tools introduced in Section 6.3. Let us go through the following steps:

  1. Open a new browser window and go to http://r-datacollection.com/materials/selenium/dbQuery.html. In the Network tab of your Web Developer Tools you should spot that opening the page has triggered requests of three additional files: dbQuery.html which includes the front end HTML code as well as the auxiliary JavaScript code, jquery-1.8.0.min.js which is the jQuery library, and bootstrap.css, a style sheet. The visual display of the page should be more or less similar to the one shown in Figure 9.2.
  2. Choose input values from the scroll-down menus and click the submit button. Upon clicking, your Network tab should indicate the request of a file named getEntry.php?y=2013&m=01&r=&p=&t=T or similar, depending on the values you have picked.
  3. Take a look again at the page view to ensure that an HTML table has been created at the lower end of the page. While it is not directly obvious where this information comes from, usually a request to a PHP file is employed to fetch information from an underlying MySQL database using the parameter value pairs transmitted in the URL to construct the query to the database. This complicates extraction matters, since working directly with the database is usually not possible because we do not have the required access information and are thus restricted to working with the retrieved output from the PHP file.
images

Figure 9.2 The Federal Contributions database

9.1.9.2 Selenium and the Rwebdriver package

Selenium/Webdriver is an open-source suite of software with the primary purpose of providing a coherent, cross-platform framework for testing applications that run natively in the browser. In the development of web applications, testing is a necessary step to establish expected functionality of the application, minimize potential security and accessibility issues, and guarantee reliability under increased user traffic. Before the creation of Selenium this kind of testing had been carried out manually—a tedious and error-prone undertaking. Selenium solves this problem by providing drivers to control browser behavior such as clicks, scrolls, swipes, and text inputs. This enables programmatic approaches to the problem by using a scripting language to characterize sequences of user behaviors and report if the application fails.

Selenium's capability to drive interactions with the webpage through the browser is of more general use beyond testing purposes. Since it allows us to remote-control the browser, we can work with and request information directly from the live DOM tree, that is, the document as it is rendered in the browser window. Accessing Selenium functionality from within R is possible via the Rwebdriver package. It is available from a GitHub repository and can be installed with the install_github() function from the devtools package (Wickham and Chang 2013).


R> library(devtools)
R> install_github(repo = "Rwebdriver", username = "crubba")
Getting started with Selenium Webdriver

Using Selenium requires initiating the Selenium Java Server. The server is responsible for launching and killing browsers as well as receiving and interpreting the browser commands. The communication with the server from inside the programming environment works via simple HTTP messages. To get the server up and running, the Selenium server file needs to be downloaded from http://docs.seleniumhq.org/download/ to the local file system. The server file follows the naming convention (selenium-server-standalone-<version-number>.jar).5 In order to initiate the server, open the system prompt, change the directory to the jar-file location, and execute the file.

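At the prompt, this boils down to a single Java call. As a sketch, the same command can also be issued from within R; this assumes that Java is installed, that the working directory contains the jar file, and that you substitute the version placeholder with the file name you downloaded.

R> system("java -jar selenium-server-standalone-<version-number>.jar",
          wait = FALSE)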

The console output should resemble the one printed in Figure 9.3. The server is now initiated and waits for commands. The system prompt may be minimized and we can turn our attention back to the R console. Here, we first load the Rwebdriver as well as the XML package.

images

Figure 9.3 Initializing the Selenium Java Server


R> library(Rwebdriver)
R> library(XML)

The first step is to create a new browser window. This can be accomplished through the start_session() function which requires passing the address of the Java server—by default http://localhost:4444/wd/hub/. Additionally, we pass firefox to the browser argument to instruct the server to produce a Firefox browser window.


R> start_session(root = "http://localhost:4444/wd/hub/",
                 browser = "firefox")

Once the command is executed, the Selenium API opens a new Firefox window to which we can now direct browser requests.

Using Selenium for web scraping

We now return to the running example and explore some of Selenium's capabilities. Note that we are not introducing all functionality of the package but focus our attention on the functions most commonly used in the web scraping process. For a full list of implemented methods, see Table 9.1.

Let us assume we wish to access the database through its introductory page at http://www.r-datacollection.com/materials/selenium/intro.html. To direct the browser to a specific webpage, we can use post.url() with specified url parameter.


R> post.url(url = "http://www.r-datacollection.com/materials/selenium/intro.html")

The browser should respond and display the intro webpage. When a page forwards the browser to another page, it can be helpful to retrieve the current browser URL, since this may differ from the one that was specified in the query. We can obtain the information through the get.url() command.


R> get.url()

[1] "http://r-datacollection.com/materials/selenium/intro.html"

The returned output is a standard character vector. To pull the page title of the browser window, use page_title().


R> page_title()

[1] "The Federal Contributions Database"

To arrive at the form for querying the database, we need to perform a click on the enter button at the bottom right. Performing clicks with Selenium requires a two-step process. First, we need to create an identifier for the button element. Selenium allows specifying such an identifier through multiple ways. Since we already know how to work with XPath expressions (see Chapter 4), we will employ this method. By using the Web Developer Tools, we can obtain the following XPath expression for the button element, /html/body/div/div[2]/form/input. When we pass the XPath expression as a string to the element_xpath_find() function, we are returned the corresponding element ID from the live DOM. Let us go ahead and save the ID in a new object called buttonID.


R> buttonID <- element_xpath_find(value = "/html/body/div/div[2]/form/input")

The second step is to actually perform the left-mouse click on the identified element. For this task, we make use of element_click(), and pass buttonID as the ID argument.


R> element_click(ID = buttonID)

This causes the browser to change the page to the one displayed in Figure 9.2. Additionally, you might have observed a pop-up window opening upon clicking the button. The occurrence of pop-ups introduces a small complication, since they cause Selenium to switch the focus of its active window to the newly opened pop-up. To return focus to the database page, we need to first obtain all active window handles using window_handles().


R> allHandles <- window_handles()

To change the focus back to the database window, you can use the window_change() function and pass the window handle that corresponds to the correct window. In this case, it is the first element in allHandles.6


R> window_change(allHandles[1])

Now that we have accessed the database page, we can start to query information from it. Let us try to fetch contribution records for Barack Obama from January 2013. To accomplish this task, we first change the value in the Year field. This requires obtaining the ID for the year input field. From the Web Developer Tools we learn that the following XPath expression is appropriate: '//*[@id="yearSelect"]'. At the same time, we save the IDs for the month and the recipient input fields.


R> yearID <- element_xpath_find(value = '//*[@id="yearSelect"]')
R> monthID <- element_xpath_find(value = '//*[@id="monthSelect"]')
R> recipID <- element_xpath_find(value = '//*[@id="recipientSelect"]')

In order to change the year, we perform a mouse click on the year field by executing element_click() with the appropriate ID argument.


R> element_click(yearID)

Next, we need to pass the keyboard input that we wish to enter into the database field. Since we are interested in records from the year 2013, we use the keys() function with the first argument set to the correct term.


R> keys("2013")

We proceed in a similar fashion for the other fields.7


R> element_click(monthID)
R> keys("January")
R> element_click(recipID)
R> keys("Barack Obama")

We can now send the query to the database with a click on the submit button. Again, we first identify the button using XPath and pass the corresponding ID element to the clicking function.


R> submitID <- element_xpath_find(value = '//*[@id="yearForm"]/div/button')
R> element_click(submitID)

This action should have resulted in a new HTML table being displayed at the bottom of the page. To obtain this information, we can extract the underlying HTML code from the live DOM tree and search the code for a table.


R> pageSource <- page_source()
R> moneyTab <- readHTMLTable(pageSource, which = 1)

With a few last processing steps, we can bring the information into a displayable format.

images

Table 9.1 Overview of Selenium methods (v.0.1)

Command Arguments Output
start_session() root, browser Creates a new session
quit_session() Closes session
status() Queries the server's current status
active_sessions() Retrieves information on active sessions
post.url() url Opens new url
get.url() Receives URL from current webpage
element_find() by, value Finds elements by method and the value
element_xpath_find() value Finds elements corresponding to XPath string value
element_ptext_find() value Finds elements corresponding to text string value
element_css_find() value Finds elements corresponding to CSS selector string value
element_click() ID, times, button Clicks on element ID
element_clear() ID Clears input value from element ID’s text field
page_back() times One page backward
page_forward() times One page forward
page_refresh() Refreshes current webpage
page_source() Receives HTML source string
page_title() Receives webpage title string
window_handle() Returns handle of the active window
window_handles() Returns all window handles in current session
window_change() handle Changes focus to window with handle
window_close() handle Closes window with handle
get_window_size() handle Returns vector of current window size
post_window_size() size, handle Posts a new window size for window handle
get.window_position() handle Returns x,y coordinates of window handle
post_window_position() position, handle Changes coordinates of window handle
keys() terms Posts keyboard term values
Concluding remarks

The web scraping process laid out in this section departs markedly from the techniques and tools we have previously outlined. As we have seen, Selenium provides a powerful framework for working with dynamically rendered webpages when simple HTTP-based approaches fail. It helps to keep in mind that this flexibility comes at a cost: the browser itself takes some time to receive the request, process it, and render the page. This can slow down the extraction process drastically, and we therefore advise using Selenium only for tasks where other tools are unfit. We oftentimes find Selenium most helpful for describing transitions between multiple webpages and for posting clicks and keyboard commands to a browser window, but once we encounter stable URLs, we switch back to the R-based HTTP methods outlined previously for speed.

Besides the Rwebdriver package there is a package called Relenium which resembles the package introduced in this chapter (Ramon and de Villa 2013). Although Relenium provides a more straightforward initiation process of the Selenium server, it has, at the time of writing, a more limited functionality.

9.1.10 Retrieving data from APIs

We have mentioned APIs in passing when introducing XML, JSON, and other fundamentals. Generally, APIs encompass tools which enable programmers to connect their software with “something else.” They are useful when programming software that relies on external software or hardware because the developers do not have to go into the details of its mechanics.

When we talk about APIs in this book, we refer to web services or (web) APIs, that is, interfaces with web applications. We treat the terms “API” and “web service” synonymously, although the term API encompasses a much larger body of software. The reason why APIs are of importance for web data collection tasks is that more and more web applications offer APIs which allow retrieving and processing data. With the rise of Web 2.0, where web APIs provided the basis for many applications, application providers recognized that data on the Web are interesting for many web developers. As APIs help make products more popular and might, in the end, generate more advertising revenues, the availability of APIs has rapidly increased.

The general logic of retrieving data from web APIs is simple. We illustrate it in Figure 9.4. The API provider sets up a service that grants access to data from the application or to the application itself. The API user accesses the API to gather data or communicate with the application. It may be necessary to write wrapper software for convenient data exchange with the web service. Wrappers are functions that handle details of API access and data transformation, for example, from JSON to R objects. The modus operandi of APIs varies—we briefly discuss the popular standards REST and SOAP further below. APIs provide data in various formats. JSON has probably become the most popular data exchange format of modern web APIs, but XML is still frequently used, and other formats such as HTML, images, CSVs, and binary data files are possible as well.

images

Figure 9.4 The mechanics of web APIs

APIs are implemented for developers and thus must be understandable to humans. Therefore, an extensive documentation of features, functions, and parameters is often part of an API. It gives programmers an overview of the content and form of information an API provides, and what information it expects, for example, via queries.

Standardization of APIs helps programmers familiarize themselves with the mechanics of an API quickly. There are several API standards or styles, the more popular ones being REST and SOAP. It is important to note that in order to tap web services with R, we often do not have to have any deeper knowledge of these techniques—either because others have already programmed a handy interface to these APIs or because our knowledge about HTTP, XML, and JSON suffices to understand the documentation of an API and to retrieve the information we are looking for. We therefore consider them just briefly.

REST stands for Representational State Transfer (Fielding 2000). The core idea behind REST is that resources are referenced (e.g., via URLs) and representations of these resources are exchanged. Representations are actual documents like an HTML, XML, or JSON file. One might think of a conversation on Twitter as a resource, and this resource could be represented with JSON code or equally valid representations in other formats. This sounds just like what the World Wide Web is supposed to be—and in fact one could say that the World Wide Web itself conforms to the idea of REST. The development of REST is closely linked to the design of HTTP, as the standard HTTP methods GET, POST, PUT, and DELETE are used for the transfer of representations. GET is the usual method when the goal is to retrieve data. To simplify matters, the difference between a GET request of a REST API and a GET request our browser puts to a server when asking for web content is that (a) parameters are often well-documented and (b) the response is simply the content, not any layout information. POST, PUT, and DELETE are methods that are implemented when the user needs to create, update, and delete content, respectively. This is useful for APIs that are connected to personal accounts, such as APIs from social media platforms like Facebook or Twitter. Finally, a RESTful API is an API that conforms to the REST constraints. The constraints include the existence of a base URL to which query parameters can be added, a certain representation (JSON, XML,...), and the use of standard HTTP methods.

Another web service standard we sometimes encounter is SOAP, originally an acronym for Simple Object Access Protocol. As the technology is rather difficult to understand and implement, it is currently being gradually superseded by REST. SOAP-based services are frequently offered in combination with a WSDL (Web Service Description Language) document that fully describes all the possible methods as well as the data structures that are expected and returned by the service. WSDL documents themselves are based on XML and can therefore be interpreted by XML parsers. The resulting advantage of SOAP-based web services is that users can automatically construct API call functions for their software environment based on the WSDL, as the API's functionality is fully described in the document. For more information on working with SOAP in R, see Nolan and Temple Lang (2014, Chapter 12). The authors provide the SSOAP package that helps work with SOAP and R (Temple Lang 2012b) by transforming the rules documented in a WSDL document into R functions and classes.8 Generating wrapper functions on-the-fly has the advantage that programs can easily react to API changes. However, as the SOAP technology is becoming increasingly uncommon, we focus on REST-based services in this section.

Using a RESTful API with R can be very simple and not very different from what we have learned so far regarding ordinary GET requests. As a toy example we consider Yahoo's Weather RSS Feed, which is documented at http://developer.yahoo.com/weather/. It provides information on current weather conditions at any given place on Earth as well as a five-day forecast in the form of an RSS file, that is, an XML-style document (see Section 3.4.3). The feed basically delivers the data part of what is offered at http://weather.yahoo.com/. We could use the API to generate our own forecasts or to build an R-based weather gadget. According to the Terms of Use in the documentation, the feeds are provided free of charge for personal, non-commercial uses.

Making requests to the feed is pretty straightforward when studying the documentation. All we have to specify is the location for which we want to get feedback from the API (the w parameter) and the preferred degrees unit (Fahrenheit or Celsius; the u parameter). The location parameter requires a WOEID code, the Where On Earth Identifier. It is a 32-bit identifier that is unique for every geographic entity (see http://developer.yahoo.com/geo/geoplanet/guide/concepts.html). From a manual search on the Yahoo Weather application, we find that the WOEID of Hoboken, New Jersey, is 2422673. Calling the feed is simply done using the HTTP GET syntax. We already know how to do this in R. We specify the API's base URL and make a GET request to the feed, providing the w parameter with the WOEID and the u parameter for degrees in Celsius.


R> feed_url <- "http://weather.yahooapis.com/forecastrss"
R> feed <- getForm(feed_url, .params = list(w = "2422673", u = "c"))

As the retrieved RSS feed is basically just XML content, we can parse it with XML’s parsing function.


R> parsed_feed <- xmlParse(feed)

The original RSS file is quite spacious, so we only provide the first and last couple of lines.

images

images

We can process the parsed XML object using standard XPath expressions and convenience functions from the XML package. As an example, we extract the values of current weather parameters which are stored in a set of attributes.

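A sketch of such a query, using the local-name() workaround for the feed's namespaces that we also rely on further below:

R> xpathSApply(parsed_feed, "//*[local-name()='condition']", xmlAttrs)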

We also build a small data frame that contains the forecast statistics for the next 5 days.

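One way to assemble such a data frame from the attributes of the <forecast> elements (a sketch):

R> forecasts <- xpathSApply(parsed_feed, "//*[local-name()='forecast']",
                            xmlAttrs)
R> forecast_df <- as.data.frame(t(forecasts), stringsAsFactors = FALSE)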

Processing the result from a REST API query is entirely up to us if no R interface to a web service exists. We could also construct convenient wrapper functions for the API calls. Packages exist for some web services which offer convenience functions to pass R objects to the API and get back R objects. Such functions are not too difficult to create once you are familiar with an API's logic and the data technology that is returned. Let us try to construct such a wrapper function for the Yahoo Weather Feed example.

There are always many ways to specify wrapper functions for existing web services. We want to construct a command that takes a place's name as main argument and gives back the current weather conditions or a forecast for the next few days. We have seen above that the Yahoo Weather Feed needs a WOEID as input. To manually search for the corresponding WOEID of a place and then feed it to the function seems rather inconvenient, so we want to automate this part of the work as well. Indeed, there is another API that does this work for us. At http://developer.yahoo.com/geo/geoplanet/ we find a set of RESTful APIs subsumed under the label Yahoo! GeoPlanet which offer a range of services. One of these services returns the WOEID of a specific place.

http://where.yahooapis.com/v1/places.q('northfield%20mn%20usa')?appid=[yourappidhere]

The URL contains the query parameter appid. We have to obtain an app ID to be able to use this service. Many web services require registration and sometimes even involve a sophisticated authentication process (see next section). In our case we just have to register for the Yahoo Developer Network to obtain an ID. We register our application named RWeather at Yahoo. After providing the information, we get the ID and can add it to our API query. In order to be able to reuse the ID without having to store it in the code, we save it in the R options:9


R> options(yahooid = "t.2cnduc0BqpWb7qmlc14vEk8sbL7LijbHoKS.utZ0")

The call to the WOEID API is as follows. We start with the base URL and add the place we are looking for in the URL's placeholder between the parentheses. The sprintf() function is useful because it allows pasting text within another string. We just have to mark the string placeholder with %s.


R> baseurl <- "http://where.yahooapis.com/v1/places.q('%s')"
R> woeid_url <- sprintf(baseurl, URLencode("Hoboken, NJ, USA"))

Notice also that we have to encode the place name with URL encoding (see Section 5.1.2).

http://where.yahooapis.com/v1/places.q('Hoboken,%20NJ,%20USA')

Next we formulate a GET call to the API. We add our Yahoo app ID which we retrieve from the options. The service returns an XML document which we directly parse into an object named parsed_woeid.


R> parsed_woeid <- xmlParse(getForm(woeid_url, appid = getOption("yahooid")))

The XML document itself looks as follows.

images

There are several WOEIDs stored in the document, one for the country, one for the state, and one for the town itself. We can extract the town WOEID with an XPath query on the retrieved XML file. Note that the document comes with namespaces. We access the <locality1> element where the WOEID is stored with the XPath expression //*[local-name()='locality1'], which addresses the document's local names.


R> woeid <- xpathSApply(parsed_woeid, "//*[local-name()='locality1']", xmlAttrs)[2,]
R> woeid
    woeid
"2422673"

Voilà, we have retrieved the corresponding WOEID. Recall that our goal was to construct one function which returns the results of a query to Yahoo's Weather Feed in a useful R format. We have seen that such a function has to wrap around not only one, but two APIs—the WOEID returner and the actual Weather Feed. The result of our efforts, a function named getWeather(), is displayed in Figure 9.5.

images

Figure 9.5 An R wrapper function for Yahoo's Weather Feed

The wrapper function splits into five parts. The first reports errors if the function's arguments ask—to determine if current weather conditions or a forecast should be reported—and temp—to set the reported degrees Celsius or Fahrenheit—are wrongly specified. The second part (get woeid) replicates the call to the WOEID API which we have considered in detail above. The third part (get weather feed) uses the WOEID and makes a call to Yahoo's Weather Feed. The fourth part (get current conditions) is evaluated if the user asks for the current weather conditions at a given place. We have stored some condition parameters in a data frame conds and input the results into a single sentence—not very useful if we want to post-process the data, but handy if we just want to know what the weather is like at the moment.10 If a forecast is requested, the function's fifth part is activated. It constructs a data frame from the forecasts in the XML document and returns it, along with a short message.
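A condensed sketch of such a wrapper is given below; it assumes that the Yahoo app ID is stored in the yahooid option as above, and it simplifies the error messages and output formatting relative to the version in Figure 9.5.

getWeather <- function(place = "New York", ask = "current", temp = "c") {
    ## part 1: argument checks
    if (!ask %in% c("current", "forecast")) {
        stop("Wrong ask parameter. Choose either 'current' or 'forecast'.")
    }
    if (!temp %in% c("c", "f")) {
        stop("Wrong temp parameter. Choose either 'c' or 'f'.")
    }
    ## part 2: get WOEID for the place
    base_url <- "http://where.yahooapis.com/v1/places.q('%s')"
    woeid_url <- sprintf(base_url, URLencode(place))
    parsed_woeid <- xmlParse(getForm(woeid_url, appid = getOption("yahooid")))
    woeid <- xpathSApply(parsed_woeid, "//*[local-name()='locality1']",
                         xmlAttrs)[2, ]
    ## part 3: get weather feed
    feed_url <- "http://weather.yahooapis.com/forecastrss"
    parsed_feed <- xmlParse(getForm(feed_url,
                                    .params = list(w = woeid[1], u = temp)))
    ## part 4: report current conditions
    if (ask == "current") {
        conds <- data.frame(t(xpathSApply(parsed_feed,
                     "//*[local-name()='condition']", xmlAttrs)),
                     stringsAsFactors = FALSE)
        location <- xpathSApply(parsed_feed,
                     "//*[local-name()='location']", xmlAttrs)
        cat("The weather in ", paste(location, collapse = ", "), " is ",
            tolower(conds$text), ". Current temperature is ", conds$temp,
            " degrees ", toupper(temp), ".\n", sep = "")
    }
    ## part 5: return forecast data frame
    if (ask == "forecast") {
        forecast <- as.data.frame(t(xpathSApply(parsed_feed,
                        "//*[local-name()='forecast']", xmlAttrs)),
                        stringsAsFactors = FALSE)
        cat("Weather forecast for ", place, ":\n", sep = "")
        return(forecast)
    }
}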

Let us try out the function. First, we ask for the current weather conditions in San Francisco.


R> getWeather(place = "San Francisco", ask = "current", temp = "c")
The weather in San Francisco, CA, United States is cloudy. Current
temperature is 9 degrees C.

This call was successful. Note that Yahoo's Weather API is tolerant concerning the definition of the place. If place names are unique, we do not have to specify the state or country. If the place is not unique (e.g., “Springfield”), the API automatically picks a default option. Next, we want to retrieve a forecast for the weather in San Francisco.

images

We could easily expand the function by adding further parameters or returning more useful R objects. This example served to demonstrate how REST-based web services work in general and how easy it is to tap them from within R. There are many more useful APIs on the Web. At http://www.programmableweb.com/apis we get an overview of thousands of web APIs. Currently, there are more than 11,000 web APIs listed as well as over 7,000 mashups, that is, applications which make use of and combine existing content from APIs. We provide some additional advice on finding useful data sources, including APIs, in Section 9.4.

9.1.11 Authentication with OAuth

Many web services are open to anybody. In some cases, however, APIs require the user to register and provide an individual key when making a request to the web service. Authentication is used to trace data usage and to restrict access. Related to authentication is authorization. Authorization means granting an application access to authentication details. For example, if you use a third-party twitter client on your mobile device, you have authorized the app to use your authentication details to connect to your Twitter account. We have learned about HTTP authentication methods in Section 5.2.2. APIs often require more complex authentication via a standard called OAuth.

OAuth is an important authorization standard serving a specific scenario. Imagine you have an account on Twitter and regularly use it to inform your friends about what is currently on your mind and to stay up to date about what is going on in your network. To stay tuned when you are on the road, you use Twitter on your mobile phone. As you are not satisfied with the standard functions the official Twitter application offers, you rely on a third-party client app (e.g., Tweetbot), an application that has been programmed by another company and that offers additional functionality. In order to let the app display the tweets of people you follow and give yourself the opportunity to tweet, you have to grant it some of your rights on Twitter. What you should never want to do is to hand out your access information, that is, login name and password, to anybody—not even the Twitter client. This is where OAuth comes into play. OAuth differs from other authentication techniques in that it distinguishes between the following three parties:

  • The service or API provider. The provider implements OAuth for his service and is responsible for the website/server which the other parties access.
  • The data owners. They own the data and control which consumer (see next party) is granted access to the data, and to what extent.
  • The data consumer or client. This is the application which wants to make use of the owner's data.

When we are working with R, we usually take two of the roles. First, we are data owners when we want to authorize access to data from our own accounts of whatever web service. Second, we are data consumers because we program a piece of R software that should be authorized to access data from the API.

OAuth currently exists in two flavors, OAuth 1.0 and OAuth 2.0 (Hammer-Lahav 2010; Hardt 2012). They differ in terms of complexity, comfort, and security.11 However, there have been controversies over whether OAuth 2.0 is indeed more secure and useful than its predecessor.12 As users, we usually do not have to choose between the two standards; hence, we do not go into more detail here. OAuth's official website can be found at http://oauth.net/. More information, including a beginner's guide and tutorials, is available at http://hueniverse.com/oauth/.

How does authorization work in the OAuth framework? First of all, OAuth distinguishes between three types of credentials: client credentials (or consumer key and secret), temporary credentials (or request token and secret), and token credentials (or access token and secret). Credentials serve as a proof for legitimate access to the data owner's information at various stages of the authorization process. Client credentials are used to register the application with the provider. Client credentials authenticate the client. When we use R to tap APIs, we usually have to start with registering an application at the provider's homepage which we could call “My R-based program” or similar. In the process of registration, we retrieve client credentials, that is, a consumer key and secret that is linked with our (and only our) application. Temporary credentials prove that an application's request for access tokens is executed by an authorized client. If we set up our application to access data from a resource owner (e.g., our own Twitter account), we have to obtain those temporary credentials, that is, a request token and secret, first. If the resource owner agrees that the application may access his/her data (or parts of it), the application's temporary credentials are authorized. They now can be exchanged for token credentials, that is, an access token and secret. For future requests to the API, the application now can use these access credentials and the user does not have to provide his/her original authentication information, that is, username and password, for this task.

The fact that several different types of credentials are involved in OAuth authorization practice makes it clear that this is a more complicated process that encompasses several steps. Fortunately, we can rely on R software that facilitates OAuth registration. The ROAuth package (Gentry and Lang 2013) provides a set of functions that help specify registration requests from within R. A simplified OAuth registration interface is provided by the httr package (Wickham 2012). We illustrate OAuth authentication in R with the commands from the httr package.

oauth_endpoint() is used to define OAuth endpoints at the provider side. Endpoints are URLs that can be requested by the application to gain tokens for various steps of the authorization process. These include an endpoint for the request token—the first, unauthenticated token—and the access token to exchange the unauthenticated for the authenticated token.

oauth_app() is used to create an application. We usually register an application manually at the API provider's website. After registration we obtain a consumer key and secret. We copy and paste both into R. The oauth_app() function simply bundles the consumer key and secret into a list that can be used to request the access credentials. While the consumer key has to be specified in the function, we can let the function fetch the consumer secret automatically from the R environment by placing it there as the APPNAME_CONSUMER_SECRET environment variable. The benefit of this approach is that we do not have to store the secret in our R code.

oauth1.0_token() and oauth2.0_token() are used to exchange the consumer key and secret (stored in an object created with the oauth_app() function) for the access key and secret. The function tries to retrieve these credentials from the access endpoint specified with oauth_endpoint().

Finally, sign_oauth1.0() and sign_oauth2.0() are used to create a signature from the received access token. This signature can be added to API requests from the registered application—we do not have to pass our username and password.

We demonstrate by example how OAuth registration is done using Facebook's Graph API. The API grants access to publicly available user information and—if granted by the user— selected private information. The use of the API requires that we have a Facebook account. We first have to register an application which we want to grant access to our profile. We go to https://developers.facebook.com and sign in using our Facebook authentication information. Next, we create a new application by clicking on Apps and Create a new app. We have to provide some basic information and pass a check to prove that we are no robot. Now, our application RDataCollectionApp is registered. We go to the app's dashboard to retrieve information on the app, that is, the App ID and the App secret. In OAuth terms, these are consumer key and consumer secret.

Next, we switch to R to obtain the access key. Using httr’s functionality, we start by defining Facebook's OAuth endpoints. This works with the oauth_endpoint() function.

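A sketch of this step; the authorization and access token URLs below are the endpoints Facebook documented at the time of writing:

R> facebook <- oauth_endpoint(
       authorize = "https://www.facebook.com/dialog/oauth",
       access = "https://graph.facebook.com/oauth/access_token")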

We bundle the consumer key and secret of our app in one object with the oauth_app() function. Note that we have previously dumped the consumer secret in the R environment with Sys.setenv(FACEBOOK_CONSUMER_SECRET = "3983746230hg8745389234...") to keep this confidential information out of the R code. oauth_app() automatically retrieves it from the environment and writes it to the new fb_app object.


R> fb_app <- oauth_app("facebook", "485980054864321")

Now we have to exchange the consumer credentials for the access credentials. Facebook's Graph API uses OAuth 2.0, so we have to use the oauth2.0_token() function. However, before we execute it, we have to do some preparations. First, we add a website URL to our app account in the browser. We do this in the Settings section by adding a website and specifying a site URL. Usually this should work with the URL http://localhost:1410/, but you can also call oauth_callback() to retrieve the callback URL for the OAuth listener, a web server that listens for the provider's OAuth feedback.13 Second, we define the scope of permissions to apply for. A list of possible permissions can be found at https://developers.facebook.com/docs/facebook-login/permissions/. We pick some of those and write them into the permissions object.


R> permissions <- "user_birthday, user_hometown, user_location,
       user_status, user_checkins, friends_birthday, friends_hometown,
       friends_location, friends_relationships, friends_status,
       friends_checkins, publish_actions, read_stream, export_stream"

Now we can ask for the access credentials. Again, we use httr’s oauth2.0_token() command to perform OAuth 2.0 negotiations.


R> fb_token <- oauth2.0_token(facebook, fb_app, scope = permissions,
       type = "application/x-www-form-urlencoded")

starting httpd help server ... done
Waiting for authentication in browser...
Authentication complete.

During the function call we approve the access in the browser. The authentication process is successful. We use the received access credentials to generate a signature object.


R> fb_sig <- sign_oauth2.0(fb_token)

We are now ready to access the API from within R. Facebook's web service provides a large range of functions. Fortunately, there is an R package named Rfacebook that makes the API easily accessible (Barberá 2014). For example, we can access publicly available data from Facebook users with

images

The package also allows us to access information about our personal network.

images

The package provides many more useful functions. For a more detailed tutorial, check out http://pablobarbera.com/blog/archives/3.html.

9.2 Extraction strategies

We have learned several methods to gather data from the Web. There are three standard procedures: scraping pages with HTTP and extracting information with regular expressions, extracting information via XPath queries, and gathering data using APIs. They should usually be preferred in ascending order (i.e., scraping with regular expressions is least preferable and gathering data via an API is most preferable), but there will be situations where one of the approaches is not applicable or where some of the techniques have to be combined. It thus makes sense to become familiar with all of them.

In the following, we offer a general comparison between the different approaches. Each scraping scenario is different, so some of the advantages or disadvantages of a method may not apply for your task. Besides, as always, there is more than one way to skin a cat.

If data on a site are not provided for download in ready-made files or via an API, scraping them off the screen is often the only alternative. With regular expressions and XPath queries we have introduced two strategies to extract information from HTML or XML code. We continue discussing both techniques according to some practical criteria which become relevant in the process of web scraping, like robustness, complexity, flexibility, or general power. Note that these elaborations primarily target static HTML/XML content.

9.2.1 Regular expressions

Figure 9.6 provides a schema of the scraping procedure with regular expressions. In step ①, we identify information on the site that follows a general pattern. Whether we scrape the data with regular expressions or with another approach depends on our intuition about whether the information can actually be generalized to a regular expression. In some cases, the data can be described by means of a regular expression, but the pattern cannot be distinguished from other, irrelevant content on the page. For example, if the important information is wrapped in <b> tags, it can be difficult to distinguish it from other information marked with <b> tags. If data need to be retained in their context, regular expressions also have a rough ride.

images

Figure 9.6 Scraping with regular expressions

Step ② is to download the websites. Many of the methods described in Section 9.1 might prove useful here. Additionally, regular expressions can already be of help in this step. They could be used to assemble a list of URLs to be downloaded, or for URL manipulation (see Section 9.1.3).

In step ③, the downloaded content is imported into R. When pursuing a purely regex-based scraping strategy, this is done by simply reading the content as character data with readLines() or similar functions. When importing the textual data, we have to be exceptionally careful with the encoding scheme used for the original document, as we want to avoid applying regular expressions to get rid of encoding errors. If you start using str_replace() in order to get ä, ó, or ç, you have most likely forgotten to specify the encoding argument in readLines() or the parsing function (see Section 8.3). Incidentally, regular expressions do not make use of an HTML or XML DOM, so we do not need to parse the documents. In fact, documents parsed with htmlParse() or xmlParse() cannot be accessed with regular expressions directly. If we use regular expressions in combination with an XPath approach, we first parse the document, extract information with XPath queries, and finally modify the retrieved content with regular expressions.
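A minimal sketch of this import step (the file name is hypothetical):

R> raw_html <- readLines("press-release.html", encoding = "UTF-8", warn = FALSE)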

Step ④ is the crucial one for successful web scraping with regular expressions. One has to develop one or more regular expressions which extract the relevant information. The syntax of regular expressions as implemented in R often allows a set of different solutions. The problem is that these solutions may not differ in the outcome they produce for a certain sample of text to be regexed, but they can make a difference for new data. This makes debugging very complex. There are some useful tools which help make regex development more convenient, for example, http://regex101.com/ or http://www.regexplanet.com/.14 These pages offer instant feedback on given regular expressions and sample text input, which makes the process of regex programming more interactive. In general, we follow one of two strategies in regular expression programming. The first is to start with a special case and to work toward a more general solution that captures every piece. For example, only one bit of information is matched with a regular expression—this is the information itself, as characters match themselves. The second strategy is to start with a general expression and introduce restrictions or exceptions that limit the number of matched strings to the desired sample. One could label the first approach the “inductive” and the second one the “deductive” approach. The “deductive” approach is probably more efficient because it starts at an abstract level—and regular expressions are often meant to be abstract—but it usually requires more knowledge about regular expressions. Another feasible strategy, which could be located between the two, is to start with several rather different pieces of information to be matched and find the picklock that fits all of them.
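To illustrate the two strategies with a made-up example, suppose we want to extract four-digit years: the “inductive” route starts from a literal match and generalizes it, while the “deductive” route starts from an abstract pattern.

R> library(stringr)
R> x <- c("Founded in 1999", "Revised 2014 edition", "No year given")
R> str_extract(x, "1999")
[1] "1999" NA     NA
R> str_extract(x, "[[:digit:]]{4}")
[1] "1999" "2014" NA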

As soon as the regular expression is programmed, extracting the information is the next step (⑤). As shown in Chapter 8, the stringr package (Wickham 2010) is enormously useful for this purpose, possibly in combination with apply-like functions (the native R functions or those provided by the plyr package (Wickham 2011)) for efficient looping over documents.

In the last step ⑥, the code has to be debugged. It is likely that applying the regular expressions to the full sample of strings reveals further problems, like false positives, that is, matched information that should not have been matched, or false negatives, that is, information that should have been matched but was not. It is sometimes necessary to split documents or delete certain parts before regexing them to exclude a bunch of false positives a priori.

9.2.1.1 Advantages of regular expressions for web scraping

What are the advantages of scraping with regular expressions? In the opinion of many seasoned web scrapers, there are not too many. Nevertheless, we think that there are circumstances under which a purely regex-based approach may be superior to any other strategy.

Regular expressions do not operate on context-defining parts of a document. This can be an advantage over an XPath strategy when the XML or HTML document is malformed. When DOM parsers fail, regular expressions, ignorant as they are of DOM structure, continue to search for information. Moreover, to retrieve information from a heterogeneous set of documents, regular expressions can deal with them as long as they can be converted to a plain-text format. Generally, regular expressions are powerful for parsing unstructured text.

String patterns can be the most efficient way to identify and extract content in a document. Imagine a situation where you want to scrape a list of URLs which are scattered across a document and which share a common string feature like a running index. It is possible to identify these URLs by searching for anchor tags, but you would have to sort out the URLs you are looking for in a second step by means of a regular expression.
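A small, made-up sketch of this situation: the report URLs share a running index and can be pulled out with a single pattern.

R> library(stringr)
R> html <- '<a href="report-2001.pdf">2001</a>
            <a href="about.html">About</a>
            <a href="report-2002.pdf">2002</a>'
R> str_extract_all(html, "report-[[:digit:]]{4}\\.pdf")[[1]]
[1] "report-2001.pdf" "report-2002.pdf"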

Regular expression scraping can be faster than XPath-based scraping, especially when documents are large and parsing the whole DOM consumes a lot of time. However, the speed argument cannot be generalized, and the construction of regular expressions or XPath queries is also an aspect of speed. And after all, there are usually more important arguments than speed.

Finally, regular expressions are a useful instrument for data-cleansing purposes as they enable us to get extracted information in the desired shape.

9.2.1.2 Disadvantages of regular expressions for web scraping

As soon as pieces of information in a document are connected, and should remain so after harvesting, regular expressions are stretched to their limits. Data without context are often rather uninteresting. Think back to the introductory example from Chapter 8. It was already a complex task to extract telephone numbers from an unstructured document, but matching the corresponding names is often hardly possible if a document does not follow a specific structure. When scraping information from web pages, sticking to regular expressions as a standard scraping tool means ignoring the virtue that sites are hierarchically, or sometimes even semantically, structured by construction. Markup is structure, and while it is possible to exploit markup with regular expressions, elements which are anchored in the DOM can usually be extracted more efficiently by means of XPath queries.

Besides, regular expressions are difficult to master. Building regular expressions is a brain-teaser. It is sometimes very challenging to identify and then formulate the patterns of information we need to extract. In addition, due to their complexity it is hard to read what is going on in a regular expression. This makes it hard to debug regex scraping code when one has not looked at the scraper for a while.

Many scraping tasks cannot be solved with regular expressions because the content to be scraped is simply too heterogeneous. This means that it cannot be abstracted and formulated as a generalized string pattern. The structure of XML/HTML documents is inherently hierarchical. Sometimes this hierarchy implies different levels of observations in your final dataset. It can be a very complex task to disentangle this information with regular expressions alone. If regular expressions make use of nodes that structure the document, the regex strategy soon becomes very fragile: incremental changes in the document structure can break the code. We have observed that such errors can be fixed more easily when working with XPath.

The usefulness of regular expressions depends not least on the kind of information one is looking for. If content on websites can be abstracted by means of a general string pattern, regular expressions probably should be used, as they are rather robust toward changes in the page layout. And even if you prefer to work with XPath, regular expressions are still an important tool in the process. First, when parsing fails, regular expressions can constitute a “last line of defense.” Second, when content has been scraped, the desired information is often not available in atomic pieces but is still raw text. Regular expressions are extraordinarily useful for data-cleaning tasks, such as string replacements, string deletions, and string splits.
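A brief, made-up sketch of such a cleaning step:

R> library(stringr)
R> prices <- c("  $1,200.00 ", "$ 350.75")
R> as.numeric(str_replace_all(prices, "[^0-9.]", ""))
[1] 1200.00  350.75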

In the third part of the book, we provide an application that relies mainly on regular expressions to scrape data from web resources. In Chapter 13, we try to convert an unstructured table—a “format” we sometimes encounter on the Web—into an appropriate R data structure.

9.2.2 XPath

Although the specifics of scraping with XPath are different from regex scraping, the road maps are quite similar. We have sketched the path of XPath scraping in Figure 9.7. First, we identify the relevant information that is stored in an XML/HTML document and is therefore accessible with XPath queries (step ①). In order to identify the source of information, we can inspect the source code in our browser and rely on Web Developer Tools.

images

Figure 9.7 Scraping with XPath

Step ② is equivalent to the regular expressions scraping approach. We download the required resources to our local drive. In principle, we could bypass this step and instantly parse the document “from the webpage.” Either way, the content has to be fetched from the server, and by first storing it locally, we can repeat the parsing process without having to scrape documents multiple times.

In step ③, we parse the downloaded documents using the parsers from the XML package. We suggest addressing character-encoding issues in this step—the later we resolve potential encoding problems, the more difficult it gets. We have learned about different techniques of document subsetting and parsing (e.g., SAX parsing methods)—which method we choose depends on the requirements or restrictions of the data resources.

Next, we extract the actual information in step ④ by developing one or more XPath queries. The more often you work with XPath, the more intuitive this step becomes. For a start, we recommend two basic procedures. The first is to construct XPath expressions with SelectorGadget (see Section 4.3.3). It returns an expression that usually works, but is likely not the most efficient way to express what you want. The other strategy is the do-it-yourself method. We find it most intuitive to pursue a “backwards induction” approach here. This means that we start by defining where the actual information is located and develop the expression from there on, usually up the tree, until the expression uniquely identifies the information we are looking for. One could also label this a bottom-up search procedure—regardless of how we name it, it helps construct expressions that are slim and potentially more robust to major changes in the document structure.

Once we have constructed suitable XPath expressions, extracting the information from the documents (step ⑤) is easy. We apply the expression with adequate functions from the XML package. The most promising procedure is to use xpathSApply() in combination with one of the XML extractor functions (see Table 4.4). In practice, steps ④ and ⑤ are not distinct. Finding adequate XPath expressions is a continuous trial-and-error process and we frequently jump between expression construction and information extraction. Additionally, XML extractor functions often produce results that are not as clear-cut as we wish them to be, and bringing the pieces of information into shape takes more than one iteration. Imagine, for example, that we want to extract reviews from a shopping website. While each of these reviews could be stored in a leaf of the DOM, we may want to extract more information that is part of the text, either in a manifest (words, word counts) or latent (sentiments, classes) manner. We can draw upon regular expressions or more advanced text mining algorithms to gather information at this level.
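
Sticking with the review example, queries for steps ④ and ⑤ might look as follows. The node names and class attributes are hypothetical and would have to be adapted to the page at hand; parsed_doc is assumed to be a document parsed with htmlParse().

R> reviews <- xpathSApply(parsed_doc, "//div[@class='review']/p", xmlValue)
R> ratings <- xpathSApply(parsed_doc, "//div[@class='review']/span[@class='rating']", xmlValue)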

In the final step ⑥, we have to debug and maintain the code. Again, this is not literally the last step, but part of an iterative process.

9.2.2.1 Advantages of XPath

We have stressed that we prefer XPath over regular expressions for scraping content from static HTML/XML. XPath is the ideal counterpart for working with XML-style files, as it was explicitly designed for that purpose. This makes it a powerful, flexible, and robust instrument for accessing content in XML/HTML files that is, at the same time, comparatively easy to learn and write.

More specifically, the fact that XPath was designed for XML documents makes queries intuitive to write and read. This is all the more true when you compare it with regular expressions, which are defined on the basis of content, not context. As the context of a node follows a clearly defined structure, XPath queries are easier to formulate, especially for common cases.

XPath is an expressive language, as it allows us to describe a node of interest through a diverse set of characteristics. We can use a node's name, its numeric or textual features, its relation to other nodes, or its attributes and content. Single nodes can be uniquely defined. Additionally, working with XPath is efficient because it allows returning node sets with comparatively little code.
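
A single query can combine several of these characteristics. The following call is purely illustrative; the node names and the attribute condition are invented, and parsed_doc again stands for a parsed document. It selects link nodes that sit in the first <h2> of a teaser box and whose href attribute contains the string 'press':

R> xpathSApply(parsed_doc, "//div[@class='teaser']/h2[1]/a[contains(@href, 'press')]", xmlValue)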

As this strategy mainly relies on structural features of documents, XPath queries are robust to content changes. Certain content is fundamentally heterogeneous, such as press releases, customer reviews, and Wikipedia entries. As long as the fundamental architecture of a page remains the same, an XPath scraping strategy remains valid.

9.2.2.2 Disadvantages of XPath

Although XPath is generally superior for scraping tasks compared with regular expressions, there are situations where XPath scraping fails.

When the parser fails, that is, when it does not produce a valid representation of the document, XPath queries are essentially useless. While our browser may be tolerant toward broken HTML documents and still interpret them correctly, our R XML parser might not. If we work on non-XML-style data, XPath expressions are of no help either.

Complementary to the advantages of regular expression scraping for clearly defined patterns in a fragile environment, using XPath expressions to extract information is difficult when the context is highly variable, for example when the layout of a webpage is constantly altered.

9.2.3 Application Programming Interfaces

There is little doubt that gathering data from APIs/web services is the gold standard for web data collection. Scraping data from HTML websites is often a difficult endeavor. We first have to identify in which slots of the HTML tree the relevant data are stored and how to get rid of everything else that is not needed. APIs provide exactly the information we need, without any redundant information. They standardize the process of data dissemination, but also retain the provider's control over who accesses what data. Developers use different programming languages and use data for many different purposes. Web services deliver data in standardized formats that most programming languages can deal with.

We illustrate the data collection procedure with APIs and R in Figure 9.8. In step ①, we have to find an API and become familiar with the terms of use or limits and the available methods. Commercial APIs can be very restrictive or offer no data at all if you do not pay a monthly fee, so you should find out early what you get for which payment. And do not invest time for nothing—not all web services are well-maintained. Before you start to program wrappers around an existing API, check whether the API is regularly updated. The API directory at http://www.programmableweb.com/apis also indicates when services are deprecated or moved to another place.

Figure 9.8 Data collection with APIs

Steps ② and ③ are optional. Some web services require the users to register. Authentication or authorization methods can be quite different. Sometimes it suffices to register by email to obtain an individual key that has to be delivered with every request. Other ways of authentication are based on user/password combinations. Authentication via OAuth as described in Section 9.1.11 can be even more complex.

In step ④, we formulate a call to the API to request the resources. If we are lucky, we can draw upon an existing set of R functions that provide an interface to the API. In Section 9.4, we suggest some repositories that may already offer the piece of R software that helps work with a given API. However, as the number of available web services increases quickly, chances are that we have to program our own R wrapper. Wrappers are pieces of software that wrap around existing software—in our case, R functions that call an API and make the data we retrieve from it accessible for further work in R.
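
To illustrate what such a wrapper can look like, consider the following sketch. It assumes a fictitious REST service at http://api.example.com that expects a query and an API key and returns JSON; the function name, URL, and parameters are invented for the example.

R> library(RCurl)
R> library(stringr)
R> library(jsonlite)
R> getExampleData <- function(query, key) {
+      url <- str_c("http://api.example.com/search?q=", curlEscape(query), "&key=", key)
+      response <- getURL(url)    # request the resource via HTTP
+      fromJSON(response)         # convert the JSON answer into an R object
+  }
R> results <- getExampleData("data collection", key = "my-api-key")

A real wrapper would additionally check the HTTP status, handle paging, and respect the service's rate limits.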

In step ⑤, we process the incoming data. How we do this depends upon the data format delivered by the web service. In Chapter 3, we have learned how to use tools from the XML and jsonlite packages to parse XML and JSON data and eventually convert them to R objects. R packages which provide ready-made interfaces to web services (see Section 9.4) usually take care of this step and are therefore exceptionally handy to use.

As always, we should regularly check and debug our code (step ⑥). Be aware of the fact that API features and guidelines can change over time.

9.2.3.1 Advantages of working with APIs

The advantages of web services over the other techniques stem from the fact that tapping APIs is in fact not web scraping. Many of the disadvantages of screen scraping, such as malformed HTML, lack of robustness, or legal issues, do not apply to data collection with web services. As a result, we can draw upon clean data structures and have higher trust in the collection outcomes.

Further, by registering an application for an API we make an agreement between provider and user. In terms of stability, chances are higher that databases from maintained APIs are updated regularly. When scraping data from HTML, we are often less sure about this. Some APIs provide exclusive access to content which we could not otherwise access. In terms of transparency, as data access procedures are standardized across many computer languages, the data collection process of projects based on data from web services can be replicated in other software environments as well.

As the focus of web services is on the delivery of data, not layout, our code is generally more robust. Web services usually satisfy a certain demand, and we are often not the only ones interested in the data. If many people create interfaces to the API from various programming environments, we can benefit from this “wisdom of the crowds,” which adds robustness to our code.

To sum up, APIs provide important advantages which make them—if available—the source of choice for any project that involves automated online data collection.

9.2.3.2 Disadvantages of working with APIs

The fact that the overwhelming majority of resources on the Web are still not accessible via web APIs motivates large parts of this book. This is no drawback of web services as a tool for data collection per se but merely reflects the fact that there are more data sources on the Web that people would like to work with than data providers who are willing to offer neat access to their databases.

Although there are not many general disadvantages of using APIs for automated data collection, relying on web service infrastructure can have its own drawbacks. Data providers can decide to limit their API's functionality from one day to the next, as has happened with popular social media APIs.

From the R perspective, we have to acknowledge that we work in a software environment that is not naturally connected to the data formats which plop out of ordinary web services.

However, the advantages of web services often easily outweigh the disadvantages, and even more so because the potential disadvantages do not necessarily apply to every web service and some of the drawbacks can partly be attributed to the other approaches as well.

9.3 Web scraping: Good practice

9.3.1 Is web scraping legal?

In the disclaimer of the book (see p. xix), we noted a caveat concerning the unauthorized use or reproduction of somebody else's intellectual work. As Dreyer and Stockton (2013) put it: “Scraping inherently involves copying, and therefore one of the most obvious claims against scrapers is copyright infringement.” Our advice is to work as transparently as possible and document the sources of your data at any time. But even if one follows these rules, where is the line between legal scraping of public content and violations of copyright or other infringements of the law? Even for lawyers who are devoted to Internet issues the case of web crawling seems to be a difficult matter. Additionally, as the prevailing law varies across countries, we are unfortunately not able to give a comprehensive overview of what is legal in which context. To get an impression of what currently seems to be regarded as illegal, we offer some anecdotal evidence on past decisions. It should be clear, however, that you should not rely on any of these cases to justify your doings.

Most of the prominent legal cases involve commercial interests. The usual scenario is that one company crawls information from another company, processes, and resells it. In the classical case eBay v. Bidder's Edge,15 eBay successfully defended itself against the use of bots on their website. Bidder's Edge (BE), a company that aggregated auction listings, had used automated programs to crawl information from different auction sites. Users could then search listings on their webpage instead of posing many requests to each of the auction sites. According to the verdict,16

BE accessed the eBay site approximately 100,000 times a day. (...) eBay allege[d] that BE activity constituted up to 1.53% of the number of requests received by eBay, and up to 1.10% of the total data transferred by eBay during certain periods in October and November of 1999.

Further,

eBay allege[d] damages due to BE's activity totaling between $ 45,323 and $ 61,804 for a ten month period including seven months in 1999 and the first three months in 2000.

The defendant did not steal information that was not public to anyone, but harmed the plaintiff by causing a considerable amount of traffic on its servers. eBay also complained about the use of deep links, that is, links that directly refer to content that is stored somewhere “deeply” on the page. By using such links clients are able to circumvent the usual process of a website visit.

In another case, Associated Press v. Meltwater, scrapers' rights were also curtailed.17 Meltwater is a company that offers software which scrapes news information based on specific keywords. Clients can order summaries on topics which contain excerpts of news articles. Associated Press (AP) argued that Meltwater stole its content and needed a license before distributing it. The judge's argument in favor of AP was that Meltwater is a competitor of AP rather than an ordinary search engine like Google News. From a more distant perspective, it is hard to see the difference from other news-aggregating services like Google News (Essaid 2013b; McSherry and Opsahl 2013).

A case which was settled out of court was that of programmer Pete Warden, who scraped basic information from Facebook users’ profiles (Warden 2010). His idea was to use the data to offer services that help manage communication and networks across services. He described the process of scraping as “very easy” and in line with the robots.txt (see next section), an informal web bot guideline Facebook had put on its pages. After he had put a first visualization of the data on his blog, Facebook contacted him and pushed him to delete the data. According to Warden, “Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission” (Warden 2010).

In the tragic case of Aaron Swartz, the core of contention was scientific work, not commercial reuse. Swartz, who co-created RSS (see Section 3.4.3), Markdown, and Infogami (a predecessor of Reddit), was arrested in 2011 for having illegally downloaded millions of articles from the article archive JSTOR. The case was dismissed after Swartz’ suicide in January 2013 (United States District Court District of Massachusetts 2013).

In an interesting, thoughtful comment, Essaid (2013a) points out that court decisions on the issue of web scraping have changed direction several times over the last years, and there seem to be no clear criteria about what is allowed and what is not, not even within a single judicial system like that of the United States. Snell and Care (2013) deliver further anecdotal evidence and put court decisions in the context of legal theories.

The lesson to be learned from these disconcerting stories is that it is not clear which actions that can be subsumed under the “web scraping” label are actually illegal and which are not. In this book we focus on very targeted examples, and republishing content for a commercial purpose is a much more severe issue than just downloading pages and using them for research or analysis. Most of the litigation we came across involved commercial intentions. The Facebook v. Warden case has shown, however, that even following informal rules like those documented in the robots.txt does not guard against prosecution. But after all, as Francis Irving from ScraperWiki puts it, “Google and Facebook effectively grew up scraping,” and if there were significant restrictions on what data can be scraped, the Web would look very different today.18

In the next sections, we describe how to identify unofficial web scraping rules and how to behave in general to minimize the risk of being put in a difficult position.

9.3.2 What is robots.txt?

When you start harvesting websites for your own purposes, you are most likely only a small fish in the gigantic data ocean. Besides you, web robots (also named “crawlers,” “web spiders,” or just “bots”) are hunting for content. Not all of these automatic harvesters act malevolently. Without bots, essential services on the Web would not work. Search engines like Google or Yahoo use web robots to keep their indices up-to-date. However, maintainers of websites sometimes want to prevent at least some of their content from being crawled, for example, to keep their server traffic in check. This is what the robots.txt file is used for. This “Robots Exclusion Protocol” tells robots which information on the site may be harvested.

The robots.txt emerged from a discussion on a mailing list and was initiated by Martijn Koster (1994). The idea was to specify which information may or may not be accessed by web robots in a text file stored in the root directory of a website (e.g., www.r-datacollection.com/robots.txt). The fact that robots.txt does not follow an official standard has led to inconsistencies and uncontrolled extensions of its grammar. There is a set of rules, however, that is followed by most robots.txt files on the Web. Rules are listed bot by bot. A set of rules for the Googlebot robot could look as follows:

User-agent: Googlebot
Disallow: /images/
Disallow: /private/

This tells the Googlebot robot, specified in the User-agent field, not to crawl content from the subdirectories /images/ and /private/. Recall from Section 5.2.1 that we can use the User-Agent field to be identifiable. Well-behaved web bots are supposed to look for their name in the list of User-Agents in the robots.txt and obey the rules. The Disallow field can contain partial or full URLs. Rules can be generalized with the asterisk (*).

User-agent: *
Disallow: /private/

This means that any robot that is not explicitly recorded is disallowed from crawling the /private/ subdirectory. A general ban is formulated as

User-agent: *
Disallow: /

The single slash / encompasses the entire website. Several records are separated by one or more empty lines.

User-agent: Googlebot
Disallow: /images/

User-agent: *
Disallow: /private/

A frequently used extension of this basic set of rules is the use of the Allow field. As the name already states, such fields list directories which are explicitly allowed to be crawled. Combinations of Allow and Disallow rules enable webpage maintainers to exclude a directory as a whole from crawling, but to permit specific subdirectories or files within this directory to be crawled.

# the Allow path is illustrative
User-agent: *
Disallow: /images/
Allow: /images/public/

Another extension of the robots.txt standard is the Crawl-delay field, which asks crawlers to pause between requests for a certain number of seconds. In the following robots.txt, Googlebot is allowed to scrape everything except one directory, while all other bots may access everything but have to pause for 2 seconds between each request.19

# the blocked directory is illustrative
User-agent: Googlebot
Disallow: /private/

User-agent: *
Crawl-delay: 2

One problem of using robots.txt is that it can become quite voluminous for large websites with multiple subdirectories and files. In addition, the way some crawlers work makes them ignorant of the centralized robots.txt. A disaggregated alternative to robots.txt is the robots <meta> tag, which can be stored in the header of an HTML file.

<meta name="robots" content="noindex,nofollow">

A well-behaved robot will refrain from indexing a page that contains this <meta> tag because of the noindex value in the content attribute and will not try to follow any link on this page because of the nofollow value in the content attribute.

This book is not about web crawling, but focuses on retrieving content from specific sites with a specific purpose. But what if we still have to scrape information from several sites and do not want to manually inspect every single robots.txt file to program a well-behaved web scraper? For this purpose, we wrote a program that parses robots.txt by means of regular expressions and helps identify specific User-agents and corresponding rules of access. The program is displayed in Figure 9.9.

Figure 9.9 R code for parsing robots.txt files

The robotsCheck() program reads the robots.txt which is specified in the first argument, robotstxt. We can specify the bot or User-agent with the second argument, useragent. Further, the function can return allowed and disallowed directories or files. This is specified with the dirs parameter. We do not discuss this program in greater detail here, but it can easily be extended so that a robot stops scraping pages that are stored in one of the disallowed directories.
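
To give an impression of the logic behind robotsCheck(), the following stripped-down sketch shows how regular expressions can pull the relevant rules from a robots.txt file. It is not the original function from Figure 9.9, which handles multiple records, comments, and edge cases more carefully.

R> library(RCurl)
R> library(stringr)
R> robotsCheckSketch <- function(robotstxt, useragent = "*", dirs = "disallowed") {
+      txt <- getURL(robotstxt)
+      records <- unlist(str_split(txt, "User-agent: "))          # one chunk per user agent
+      record <- records[str_detect(records, fixed(useragent))][1]
+      field <- ifelse(dirs == "disallowed", "Disallow", "Allow")
+      rules <- unlist(str_extract_all(record, str_c(field, ": [^\n]+")))
+      paths <- str_trim(str_replace(rules, str_c(field, ": "), ""))
+      if (isTRUE(any(paths == "/"))) message("This bot is blocked from the site.")
+      paths
+  }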

We test the program on the robots.txt file of Facebook. First, we specify the link to the file.


R> facebook_robotstxt <- "http://www.facebook.com/robots.txt"

Next, we retrieve the list of directories that is prohibited from being crawled by any bot which is not otherwise listed. If we create our own bot, this is most likely the set of rules we have to obey.


R> robotsCheck(robotstxt = facebook_robotstxt, useragent = "*", dirs = "disallowed")
This bot is blocked from the site.

Facebook generally prohibits crawling from its pages. Just to see how the program works, we make another call for a bot named “Yeti.”

R> robotsCheck(robotstxt = facebook_robotstxt, useragent = "Yeti", dirs = "disallowed")

Facebook disallows the “Yeti” bot access to a set of directories. It is important to note that robots.txt has little to do with a firewall against robots or any other protection mechanism. It does not prevent a website from being crawled at all. Rather, it is advice from the website maintainer.

To the best of our knowledge, there is no law which explicitly states that the contents of robots.txt must not be disregarded. However, we strongly recommend that you keep an eye on it every time you work with a new website, stay identifiable, and in case of doubt contact the owner in advance.

If you want to learn more about web robots and how robots.txt works, the page http://www.robotstxt.org/ is a good start. It provides a more detailed explanation of the syntax and a useful collection of Frequently Asked Questions.

9.3.3 Be friendly!

Not everything that can be scraped should be scraped, and there are more and less polite ways of doing it. The programs you write should behave nicely, provide you with the data you need, and be efficient—in this order. If you want to gather data from a website or service, especially when the amount of data is considerable, we suggest that you try to stick to our etiquette manual for web scraping. It is shown in Figure 9.10.

Figure 9.10 An etiquette manual for web scraping

As soon as you have identified potentially useful data on the Web, you should look for an “official” way to gather the data. If you are lucky, the publisher provides ready-made files of the data which are free to download or offers an API. If an API is available, there is usually no reason to follow any of the other scraping strategies. APIs enable the provider to keep control over who retrieves which data, how much of it, and how often.

As described in Section 9.2.3, accessing an API from within R usually requires one or more wrapper functions which pose requests to the API and process the output. If such wrappers already exist, all you have to do is become familiar with the program and use it. Often, this requires the registration of an application (see Section 9.1.11). Be sure to document the purpose of your program. Many APIs restrict the user to a certain number of API calls per day or impose similar limits. These limits should generally be obeyed.

If there is no API, there might still be a more comfortable way of getting the data than scraping them. Depending on the type and structure of the data, it can be reasonable to assume that there is a database behind it. Virtually any data that you can access via HTML forms is likely to be stored in some sort of database or at least in prestructured XML files. Why not ask the proprietors of the data first whether they might grant you access to the database or files? The larger the amount of data you are interested in, the more valuable it is for both the providers and you to communicate your interests in advance. If you just want to download a few tables, however, bothering the website maintainer might be a little over the top.

Once you have decided that scraping the data directly from the page is the only feasible solution, you should consult the Robots Exclusion Protocol if there is one. The robots.txt is usually not meant to block individual requests to a site, but to prevent a webpage from being indexed by a search engine or other meta search applications. If you want to gather information from a page whose robots.txt documents a disallowance of web robot activity, you should reconsider your task. Do you plan to scrape data in a bot-like manner? Does your task have the potential to do the web server any harm? In case of doubt, get into contact with the page administrator or take a look at the terms of use, if there are any. Make sure that your plans involve no ill intent, and stay identifiable with an adequate use of the identifying HTTP header fields.

If what you are planning is neither illegal nor has the potential to harm the provider in any way, there are still some scraping dos and don'ts you should consider with care.

As an example, we construct a small scraping program step-by-step, implementing all techniques from the bouquet of friendly web scraping. Say we want to keep track of the 250 most popular movies as rated by users of the Internet Movie Database (IMDb). The ranking is published at http://www.imdb.com/chart/top. Although the techniques implemented in this example are a bit over the top as we do not actually scrape large amounts of data, the procedure is the same for more voluminous tasks.

Suppose we have already worked through the checklist of questions of Figure 9.10 (as of March 2014, there is no IMDb API) and have decided that there is no alternative to scraping the content to work with the data. An inspection of IMDb's robots.txt reveals that robots are officially allowed to work in the /chart subdirectory.

The standard scraping approach using the RCurl package would be something like

R> url <- "http://www.imdb.com/chart/top"
R> best_doc <- getURL(url)

The first rule is to stay identifiable. We have learned in Chapter 5 how this can be done. When sending requests via HTTP, we can use the User-agent and From header fields. Therefore, we respecify the GET request as


R> getURL(url, useragent = str_c(R.version$platform, R.version$version.string, sep = ", "), httpheader = c(from = "[email protected]"))

The second rule is to reduce traffic. To do so, we should accept compressed files. One can specify which content codings to accept via the Accept-Encoding header field. If we leave this field unspecified, the server delivers files in its preferred format. Therefore, we do not have to specify the preferred compression style, which would probably be gzip, manually. The XML parser which is used in the XML package can deal with gzipped XML documents. We do not have to respecify the parsing command—the xmlParse() function automatically detects compression and uncompresses the file first.

Another trick to reduce traffic is applicable if we scrape the same resources multiple times. What we can do is to check whether the resource has changed before accessing and retrieving it. There are several ways to do so. First, we can monitor the Last-Modified response header field and make a conditional GET request, that is, access the resource only if the file has been modified since the last access. We can make the call conditional by delivering an If-Modified-Since or, depending on the mechanics of the function, If-Unmodified-Since request header field. In the IMDb example, this works as follows. First, we define a curl handle with the debugGatherer() function to be able to track our HTTP communication. Because we want to modify the HTTP header along the way, we store the standard headers for identifying ourselves in an extra object so that we can reuse and extend it.


R> info <- debugGatherer()
R> httpheader <- list(from = "[email protected]", 'user-agent' = str_c(R.version$version.string, ", ", R.version$platform))
R> handle <- getCurlHandle(debugfunc = info$update, verbose = TRUE)

We define a new function getBest() that helps extract the best movies from the IMDb page.


R> getBest <- function(doc) readHTMLTable(doc)[[1]][, 1:3]

Applying the function results in a data frame of the top 250 movies. To be able to analyze it in a later step, we store it on our local drive in an .Rdata file called bestFilms.Rdata if it does not exist already.

R> # a minimal version of this step; the object name 'best' is illustrative
R> if (!file.exists("bestFilms.Rdata")) {
+      best <- getBest(htmlParse(best_doc))
+      save(best, file = "bestFilms.Rdata")
+  }

Now we want to update the file once in a while if and only if the IMDb page has been changed since the last time we updated the file. We do that by using the If-Modified-Since header field in the HTTP request.


R> httpheader$"If-Modified-Since" <- "Tue, 04 Mar 2014 10:00:00 GMT"
R> best_doc <- getURL(url, curl = handle, httpheader = httpheader)

It becomes a little bit more complicated if we want to use the time stamp of our .Rdata file's last update. For this we have to extract the date and supply it in the right format to the If-Modified-Since header field. As the extraction and transformation of the date into the format expected in HTTP requests is cumbersome, we solve the problem once and put it into two functions: httpDate() and file.date()—see Figure 9.11. You can download the functions from the book's webpage with


R> writeLines(str_replace_all(getURL("http://www.r-datacollection.com/materials/http/HTTPdate.r"), "\r", ""), "httpdate.r")

Figure 9.11 Helper functions for handling HTTP If-Modified-Since header field
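
The core logic of the two helpers is simple: file.date() reads a file's modification time, and httpDate() turns a POSIXct time stamp into the date format that HTTP expects. Minimal versions could look like the following sketch; the actual functions in HTTPdate.r may differ in details, for example in how they force an English locale for the weekday and month names.

R> file.date <- function(filename) {
+      file.info(filename)$mtime      # last modification time as POSIXct
+  }
R> httpDate <- function(time) {
+      format(time, format = "%a, %d %b %Y %H:%M:%S", tz = "GMT", usetz = TRUE)
+  }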

Let us source the functions into our session and extract the date of last modification for our best films data file with a call to file.date().


R> source("http://www.r-datacollection.com/materials/http/HTTPdate.r")

R> (last_mod <- file.date("bestFilms.Rdata"))
[1] "2014-03-11 15:00:31 CET"

Now we can pass the date to the If-Modified-Since header field by making use of httpDate().


R> httpheader$"If-Modified-Since" <- httpDate(last_mod)
R> best_doc <- getURL(url, curl = handle, httpheader = httpheader)

Via getCurlInfo() we can gather information on the last request and control the status code.


R> getCurlInfo(handle)$response.code
[1] 200

If the status code of the response equals 200, we extract new information and update our data file. If the server responds with the status code 304 for “not modified” we leave it as is.

R> # again a minimal sketch, reusing the illustrative object names from above
R> if (getCurlInfo(handle)$response.code == 200) {
+      best <- getBest(htmlParse(best_doc))
+      save(best, file = "bestFilms.Rdata")
+  }

Using the If-Modified-Since header is not without problems. First, it is not clear what the Last-Modified response header field actually means. We would expect the server to store the time the file was changed the last time. If the file contains dynamic content, however, the header field could also indicate the last modification of one of its component parts. In fact, in the example the IMDb website is always delivered with a current time stamp, so the file will always be downloaded—even if the ranking has not changed. We should therefore first monitor the updating frequency of the Last-Modified header field before adapting our scraper to it. Another problem can be that the server does not deliver a Last-Modified at all, even though HTTP/1.1 servers should provide it (see Fielding et al. 1999, Chapters 14.25, 14.28, and 14.29).

Another strategy is to retrieve only parts of a file. We can do this by specifying the libcurl option range, which allows defining a byte range. If we know, for example, that the information we need is always stored at the very beginning of a file, like a title, we could truncate the document and specify our request function with range = "1-100" to receive only the first 100 bytes of the document. The drawbacks of this approach are that not all servers support this feature and that we cut the document in two, which may leave it malformed and thus inaccessible with XPath.
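
A call along these lines could look as follows; whether it succeeds depends on the server honoring range requests.

R> title_part <- getURL(url, range = "1-100")   # only the first 100 bytes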

In another scenario, we might want to download specific files from an index of files, but only those which we have not downloaded so far. We implement a check of whether a file already exists on the local drive and start the download only if we have not already retrieved it. The following generic code snippet shows how to do this. Say we have generated a vector of HTML file names which are stored on a page like www.example.com with filenames <- c("page1.html", "page2.html", "page3.html"). We can initiate a download of the files that have not yet been downloaded with:

R> # a minimal sketch; the base URL is illustrative
R> for (filename in filenames) {
+      if (!file.exists(filename)) {
+          download.file(str_c("http://www.example.com/", filename), destfile = filename)
+      }
+  }

The file.exists() function checks if the file already exists. If not, we download it. To know in advance how many files are new, we can compare the two sets of file names—the ones to be downloaded and the ones that are already stored in our folder—like this


R> existing_files <- list.files("directory_of_scraped_files")
R> online_files <- vector_of_online_files
R> new.files <- setdiff(online_files, existing_files)

The list.files() function lists the names of files stored in a given directory. The setdiff() function compares the contents of two vectors and returns the asymmetric difference, that is, elements that are part of the first vector but not of the second. Note that these code snippets work properly only if we download websites that carry a unique identifier in the URL that remains constant over time, for example, a date, and if it is reasonable to assume that the content of these pages has not changed, while the set of pages has.

We also do not want to bother the server with too many requests in a short period of time. This is partly because many requests per second can bring smaller servers to their knees, and partly because webmasters are sensitive to crawlers and might block us if our scraper behaves this way. With R it is straightforward to restrict our scraper. We simply suspend execution between requests for a while. In the following sketch, a scraping function processes a stack of URLs and pauses for one second before each request, which is specified with the Sys.sleep() function.

R> # a minimal sketch; 'scrapePage()' and 'url_list' are illustrative names
R> scrapePage <- function(url) {
+      Sys.sleep(1)       # pause before each request
+      getURL(url)
+  }
R> pages <- lapply(url_list, scrapePage)

There is no official rule how often a polite scraper should access a page. As a rule of thumb we try to make no more than one or two requests per second if Crawl-delay in the robots.txt does not dictate more modest request behavior.

Finally, writing a modest scraper is not only a question of efficiency but also of politeness. There is often no reason to scrape pages daily or to repeat the same task over and over again. Although bandwidth costs have fallen over the years, server traffic still means real costs to website maintainers. Our last piece of advice for creating well-behaved web scrapers is therefore to make scrapers as efficient as possible. Practically, this means that if you have a choice between several formats, choose the lightweight one. If you have to scrape from an HTML page, it could prove useful to look for a “print version” or a “text only” version, which is often much lighter HTML than the fully designed page. This helps both you, in extracting the content, and the server that provides the resources. More generally, do not “overscrape” pages. Carefully select the resources you want to exploit, and leave the rest untouched. In addition, monitor your scraper regularly if you use it often. Webpage design can change quickly, rendering your scraping approach useless. A broken scraper may still consume bandwidth without any payoff.20

One final remark. We do not think that there is a reason to feel generally bad for scraping content from the Web. In all of the cases we present in this book this has nothing to do with stealing any private property or cheap copying of content. Ideally, processing scraped information comes with real added value.

9.4 Valuable sources of inspiration

Before starting to set up a scraping project, it is worthwhile to do some research on things others have done. This might help with specific problems, but the Internet is also full of more general inspirations for scraping applications and creative work with freely available data. In the following, we would like to point you to some resources and projects we find extraordinarily useful or inspiring.

The CRAN Task View on web technologies (http://cran.r-project.org/web/views/WebTechnologies.html) provides a very useful overview of what is possible with R in terms of accessing and parsing data from the Web. You will see that not all of the available packages are covered in this book, which is partly due to the fact that the community is currently very active in this field, and partly because we intentionally tried to focus on the most useful pieces of software. It might be a good exercise to set up an automated scraper that checks for updates of this site from time to time.

GitHub (https://github.com/) is a hosting service for software projects, or rather for users who publish their ongoing coding work. It is not restricted to any programming language, so one can find many users who publish R code. Hadley Wickham and Winston Chang have provided the handy CRAN package devtools (Wickham and Chang 2013), which makes it easy to install R software that is not published on CRAN but on GitHub using the install_github() function.
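
For example, a package hosted on GitHub (the repository name below is purely hypothetical) can be installed with:

R> library(devtools)
R> install_github("username/packagename")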

rOpenSci (http://ropensci.org/) is a fascinating project that aims at establishing convenient connections between R and existing science or science-related APIs. Their motto is nothing less than “Wrapping all science APIs.” This implies a philosophy of “meta-sharing”: The contributors to this project share and maintain software that helps access open science data. As we have shown in Section 9.1.10, maintenance of API access is indeed an important topic. The project's website provides R packages which serve as interfaces to several data repositories, full texts of journals, and altmetrics data. Some of the packages are also available on CRAN, some are stored on GitHub. To pick some examples, the rgbif package provides access to the Global Biodiversity Information Facility API which covers several thousand datasets on species and organisms (Chamberlain et al. 2013). The RMendeley package offers access to the personal Mendeley database (Boettiger and Temple Lang 2012). And with the rfishbase package it is possible to access the database from www.fishbase.org via R (Boettiger et al. 2014). Further, the site offers a potentially helpful overview of R packages that enable access to science APIs but that are not affiliated with rOpenSci: http://ropensci.org/related/index.html. It is well worth browsing this list to find R wrappers for APIs of popular sites such as Google Maps, the New York Times, the NHL Real Time Scoring System Database, and many more. All in all, the rOpenSci team works on an important goal for future scientific practice—the proliferation and accessibility of open data.

Large parts of what we can do with R and web scraping would likely not be possible without the work of the “Omega Project for Statistical Computing” at http://www.omegahat.org/. The project's core group is basically a Who's Who of R's core development team, with Duncan Temple Lang being its most diligent contributor. With the creation of packages like RCurl and XML, the project laid the foundation for R-Web communication. Today, the project makes available an impressive list of (not only) R-based software for interaction with web services and database systems. Not all of them are updated regularly or are of immediate use for standard web scraping tasks, but a look at the page is indispensable before any attempt to program a new interface to whatever web service. Chances are that it has already been done and published on this site. Many of the packages are also extensively discussed in an impressive new book by Nolan and Temple Lang (2014), which is well worth a read.

Summary

In this chapter, we demonstrated the practical use of the techniques from the book's first part—HTTP, HTML, regular expressions, and others—to retrieve information from webpages. Web scraping is more of a skill than a science. It is difficult to generalize web scraping practice, as every scenario is different. In the first part of this chapter, we picked some of the more common scenarios you might encounter when collecting data from the Web in an automated manner. If you felt overwhelmed by the vast amount of fundamental web technologies in the first part of the book, you might have been surprised how easy it is in many scenarios to gather data from the Web with R by relying on powerful network client interfaces like RCurl, convenient packages for string processing like stringr, and easy-to-implement parsing tools as provided by the XML package.

Regarding information extraction from web documents, we sketched three broad strategies: regular expression scraping, XPath scraping, and data collection with interfaces to web APIs. You will figure out for yourself which strategy serves your needs best in which scenarios as soon as you become more experienced in web scraping. Our description of the general procedure to automate data collection with each of the strategies may serve as a guideline. One intention of our discussion of the advantages and disadvantages of each strategy was, however, to clarify that there is no single best web scraping strategy, and it pays off to be familiar with each of the presented techniques.

We dedicated the last section of this chapter to an important topic, the good practice of web scraping. Collecting data from websites is nothing inherently evil—successful business models are based on massive online data collection and processing. However, there are formal and less formal rules that we can and should obey. We have outlined an etiquette that gives some rules of behavior when scraping the Web.

Having worked through this chapter you have learned the most important tools to gather data from the Web with R. We discuss some more tricks of the trade in Chapter 11. If you deal with text data, information extraction can be a more sophisticated matter. We present some technical advice on how to handle text in R and to estimate latent classes in texts in Chapter 10.

Further reading

Many of the tutorials and how-to guides for web scraping with R which can be found online are rather case-specific and do not help much to decide which technique to use, how to behave nicely, and so on. With regard to the foundations of R tools to tap web resources, the recently published book by Nolan and Temple Lang (2014) offers great detail, especially regarding the use of RCurl and other packages which are not published on CRAN but serve specific, yet potentially important tasks in web scraping. They also provide a more extensive view on REST, SOAP, and XML-RPC. If you want to learn more about web services that rely on the REST technology on the theoretical side, have a look at Richardson et al. (2013). Cerami (2002) offers a more general picture of web services.

During the writing of this book, we found some books on practical web scraping inspiring, interesting, or simply fun to read, and do not want to withhold them from you. “Webbots, Spiders, and Screen Scrapers” by Schrenk (2012) is a fun-to-read introduction to scraping and web bot programming with PHP and Curl. The focus is clearly on the latter, so if you are interested in web bots and spiders, this book might be a good start. “Spidering Hacks” by Hemenway and Calishain (2003) is a comprehensive collection of applications and case studies on various scraping tasks. Their scraping workhorse is Perl, but the described hacks serve as good inspiration for programming R-based scrapers, too. Finally, “Baseball Hacks” by Adler (2006) is practically a large case study on scraping and data science mostly based on Perl (for the scraping part) and R (for data analysis). If you find the baseball scenario entertaining, Adler's hands-on book is a good companion on your way into data science.

Problems

  1. What are important tools and strategies to build a scraper that behaves nicely on the Web?

  2. What is a good extraction strategy for HTML lists on static HTML pages? Explain your choice.

  3. Imagine you want to collect data on the occurrence of earthquakes on a weekly basis. Inform yourself about possible online data sources and develop a data collection strategy. Consider (1) an adequate scraping strategy, (2) a strategy for information extraction (if needed), and (3) friendly data collection behavior on the Web.

  4. Reconsider the CSV file download function in Section 9.1.1. Replicate the download procedure with the data files for the primaries of the 2010 Gubernatorial Election.

  5. Scraping data from Wikipedia, I: The Wikipedia article at http://en.wikipedia.org/wiki/List_of_cognitive_biases provides several lists of various types of cognitive biases. Extract the information stored in the table on social biases. Each of the entries in the table points to another, more detailed article on Wikipedia. Fetch the list of references from each of these articles and store them in an adequate R object.

  6. Scraping data from Wikipedia, II: Go to http://en.wikipedia.org/wiki/List_of_MPs_elected_in_the_United_Kingdom_general_election,_1992 and extract the table containing the MPs elected in the United Kingdom general election of 1992. Which party has the most Sirs?

  7. Scraping data from Wikipedia, III: Take a look at http://en.wikipedia.org/wiki/List_of_national_capitals_of_countries_in_Europe_by_area and extract the geographic coordinates of each European country capital. In a second step, visualize the capitals on a map. The code from the example in Chapter 1 might be of help.

  8. Write your own robots.txt file providing the following rules: (a) no Google bot is allowed to scrape your web site, and (b) scraping your /private folder is generally not allowed.

  9. Reconsider the R-based robots.txt parser in Figure 9.9. Use it as a starting point to construct a program that makes any of your scrapers follow the rules of the robots.txt on any site. The function has to fulfill the following tasks: (a) identification of the robots.txt on any given host if there is one, (b) check if a specific User-Agent is listed or not, (c) check if the path to be scraped is disallowed or not, and (d) adhere to the results of (a), (b), and (c). Consider scraping allowed if the robots.txt is missing.

  10. Google Search allows the user to tune her request with a set of parameters. Make use of these parameters and set up a program that regularly informs you about new search results for your name.

  11. Reconsider the Yahoo Weather Feed from Section 9.1.10.

    1. Check out the wrapper function displayed in Figure 9.5 and rebuild it in R.
    2. The API returns a weather code that has not been evaluated so far (see also the last column in the table on page 238). Read the API's documentation to figure out what the code stands for and implement the result in the feedback of the wrapper function.
  12. The CityBikes API at http://api.citybik.es/ provides free access to a global bike sharing network. Choose a bike sharing service in a city of your choice and build an R interface to it. The interface should enable the user to get information about the list of stations and the number of available bikes at each of the stations. For a more advanced extension of this API, implement a feature such that the function automatically returns the station closest to a given geo-coordinate.

  13. The New York Times provides a set of APIs at http://developer.nytimes.com/docs. In order to use them, you have to sign up for an API key. Construct an R interface to their best-sellers search API which can retrieve the current best-seller list and transform the incoming JSON data to an R data frame.

  14. Let us take another look at the Federal Contributions Database.

    1. Find out what happens when the window is not changed back from the pop-up window. Does the code still work?
    2. Write a script building on the code outlined above that downloads all contributions to Republican candidates from 2007 to 2011.
    3. Download all contributions from March 2011, but have the data returned in a plot. Try to extract the amount and party information from the plot.
  15. Apply Rwebdriver to other example files introduced in this book:

    1. fortunes2.html
    2. fortunes3.html
    3. rQuotes.php
    4. JavaScript.html

Notes
