Web scraping made easy with rvest

By now, I hope you agree on the importance of understanding something about HTML and HTTP. Now it's time to meet a query language of great importance in text mining: XPath. Imagine that you have to navigate through a web page. Looking at the page, you can easily locate the content you are aiming for. But how do you tell the computer to look at that same content?

XPath helps you find and retrieve only the specific content you are targeting on a web page. Let's say that you want to know all the packages available on CRAN, but calling available.packages() is too easy for you. You want to scrape this content directly from CRAN's web page using rvest.
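Before touching the CRAN page, it may help to see XPath in action on a tiny, made-up HTML fragment. The fragment and queries below are illustrative assumptions, not part of the CRAN example:

library(rvest)

# A hypothetical inline document, just to exercise the syntax
doc <- read_html('<html><body>
  <p class="intro">Hello</p>
  <table><tr><td><a href="pkg.html">pkg</a></td></tr></table>
</body></html>')

# '//a' finds every <a> node, no matter how deep it sits
html_nodes(doc, xpath = '//a')

# '/html/body//p' starts at the root, steps into <body>, then
# finds every <p> anywhere below it
html_nodes(doc, xpath = '/html/body//p')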

Head to the CRAN packages list (https://cran.r-project.org/web/packages/available_packages_by_name.html) and navigate it with your browser. You should see something similar to the following screenshot:

Figure 4.4: CRAN web page

Using Chrome's developer tools, we can get the address that points to the content we are after. Open the developer tools (Ctrl + Shift + I on Windows) and follow these steps:

  1. Click the Select an element in the page to inspect icon (or press Ctrl + Shift + C on Windows and Ubuntu).
  2. Hover the mouse over the area holding the content you want. Once the desired content is highlighted, left-click it.
  3. A line in the Elements tab will be highlighted; right-click it.
  4. Select Copy.
  5. Select Copy XPath.
  6. Here we get /html/body/table. Although this seems like a lot of steps, all of them are very simple. They are illustrated in the following screenshot:
Figure 4.5: Getting the XPath
  7. With the XPath in hand, we can draw up some R code, like this:
# Install rvest if it is missing, then load it
if(!require(rvest)){install.packages('rvest')}
library(rvest)

url <- 'https://cran.r-project.org/web/packages/available_packages_by_name.html'
xpath <- '/html/body/table//a'

# Parse the page and keep every <a> node matched by the XPath
dt <- read_html(url) %>%
  html_nodes(xpath = xpath)

The preceding code looks for the rvest package and installs it if it's missing. Next, we store the web page address (url) and the XPath (xpath) as strings. The latter string is a little tricky. The XPath I originally arrived at was /html/body/table/tbody/a. For some reason, rvest won't deal well with the tbody part, so I removed it from the original string, and what I got was /html/body/table//a.
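One plausible explanation: browsers insert a <tbody> element into the DOM even when the page source contains none, and Chrome's Copy XPath reflects that DOM, while rvest parses the raw source. A quick sketch to see the difference (the counts are illustrative and depend on the live page):

page <- read_html(url)

# The browser-flavored path likely matches nothing in the raw source
length(html_nodes(page, xpath = '/html/body/table/tbody//a'))

# The '//' step matches <a> nodes at any depth under <table>
length(html_nodes(page, xpath = '/html/body/table//a'))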

  8. After storing both the URL and the XPath, pipe the read_html() and html_nodes() functions from the rvest package to get our data (dt). Do not forget to pass the XPath string to its proper argument inside html_nodes(), the xpath argument. Let's check what we got:
head(dt)
# {xml_nodeset (6)}
# [1] <a href="../../web/packages/A3/index.html">A3</a>
# [2] <a href="../../web/packages/abbyyR/index.html">abbyyR</a>
# [3] <a href="../../web/packages/abc/index.html">abc</a>
# [4] <a href="../../web/packages/abc.data/index.html">abc.data</a>
# [5] <a href="../../web/packages/ABC.RAP/index.html">ABC.RAP</a>
# [6] <a href="../../web/packages/ABCanalysis/index.html">ABCanalysis</a>

As we can see, the package names sit between <a href="..."> and </a> tags.

  9. This is how hyperlinks are set in HTML. To strip these unwanted tags, we can apply a regular expression with the gsub() function:
# gsub() coerces the node set to character, then strips
# anything enclosed in angle brackets
dt <- dt %>%
  gsub(pattern = '<(.[^>]*)>',
       replacement = '')

Using a regular expression, gsub() cuts out all the text that falls inside <>.
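For what it's worth, rvest also offers a tag-free route: html_text() pulls the text content out of each node directly. The following is an alternative sketch, not the approach used above:

dt_alt <- read_html(url) %>%
  html_nodes(xpath = xpath) %>%
  html_text()

head(dt_alt)
# should line up with the gsub() result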

A pretty useful cheat sheet for regular expressions in R can be found on the RStudio website (https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf). Simply Google regex r cheat sheet if this link isn't helpful.

  10. If you examine the dt object again, you will see that only package names remain:
head(dt)
# [1] "A3" "abbyyR" "abc" "abc.data" "ABC.RAP" "ABCanalysis"
  11. We can then compare the packages we got by web scraping with the list that R gives, using the setdiff() function:
r_list <- available.packages()[,'Package']
setdiff(r_list, dt)  # in r_list but missing from dt
setdiff(dt, r_list)  # in dt but missing from r_list

The first line stores all the package names coming from the available.packages() function in an object called r_list, the list of packages known to R. The next two lines respectively return the packages that are in r_list but not in dt, and the packages that are in dt but not in r_list.
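If you prefer numbers over raw listings, a short sketch like the following summarizes the comparison (the actual counts depend on when you scrape the page):

length(intersect(r_list, dt))  # packages present in both lists
length(setdiff(r_list, dt))    # known to R but absent from the page
length(setdiff(dt, r_list))    # on the page but not in available.packages()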

If you care to install a package from r_list that is missing from dt, everything will go fine; that's because R will fetch it from a repository other than CRAN. On the other hand, if you try to install a package returned by setdiff(dt, r_list), it will not work. These packages come from older versions of R and might not be available for the current one.

This section was meant to give a quick overview of the general way text can be retrieved from the web using the rvest package. Along the way, XPath and alternative ways to manipulate strings were introduced. Next, we will see how to retrieve text from Twitter using R.
