Chapter 14. Web Scraping

R provides easy access to statistical computing and data analysis. Given a data set, we can transform the data and apply analytic models and numeric methods, using either flexible data structures or high-performance tools, as discussed in previous chapters.

However, the input data set is not always as readily available as the tables in a well-organized commercial database. Sometimes, we have to collect the data ourselves. Web content is an important source of data for a wide range of research fields. To collect (scrape or harvest) data from the Internet, we need appropriate techniques and tools. In this chapter, we'll introduce the basic knowledge and tools of web scraping, including:

  • Looking inside web pages
  • Learning CSS and XPath selectors
  • Analyzing HTML code and extracting data

Looking inside web pages

Web pages are made to present information. The following screenshot shows a simple web page located at data/simple-page.html that has a heading and a paragraph:

[Screenshot: data/simple-page.html rendered in a browser, showing a heading and a paragraph]

All modern web browsers support such web pages. If you open data/simple-page.html with any text editor, it will show the code behind the web page as follows:

<!DOCTYPE html> 
<html> 
<head> 
  <title>Simple page</title> 
</head> 
<body> 
  <h1>Heading 1</h1> 
  <p>This is a paragraph.</p> 
</body> 
</html> 

The preceding code is an example of HTML (HyperText Markup Language), the standard markup language for web pages. Unlike a programming language, whose code is ultimately translated into computer instructions, HTML describes the layout and content of a web page, and web browsers render the code into a page according to web standards.

Modern web browsers use the first line of the HTML, the document type declaration, to determine which standard to follow when rendering the web page. In this case, <!DOCTYPE html> indicates HTML5.

If you read through the code, you'll probably notice that HTML is nothing but a nested structure of tags such as <html>, <title>, <body>, <h1>, and <p>. Each tag begins with <tag> and is closed with </tag>.

In fact, these tags are not arbitrarily named, nor are they allowed to contain other arbitrary tags. Each has a specific meaning to the web browser and is only allowed to contain a subset of tags, or even none.

The <html> tag is the root element of every HTML document. It most commonly contains <head> and <body>. The <head> tag usually contains <title>, whose text is shown on the title bar and browser tabs, along with other metadata of the web page, while <body> determines the layout and contents of the page.
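To make the nesting concrete, the following is a minimal sketch in Python, using only the standard library's html.parser module, that prints the tags of the simple page indented by depth. The class name OutlineParser is hypothetical, and the sketch assumes every tag is explicitly closed, as in this page.

```python
from html.parser import HTMLParser

# The simple page shown earlier, embedded so the sketch is self-contained.
SIMPLE_PAGE = """<!DOCTYPE html>
<html>
<head>
  <title>Simple page</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>This is a paragraph.</p>
</body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth.

    Assumes every tag has a matching closing tag; void elements such as
    <br> would need extra handling.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed(SIMPLE_PAGE)

# Print one tag per line, indented by two spaces per nesting level.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The printed outline shows <head> and <body> one level inside <html>, with <title>, <h1>, and <p> one level deeper.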

In the <body> tag, tags can be nested more freely. The simple page only contains a level-1 heading (<h1>) and a paragraph (<p>) while the following web page contains a table with two rows and two columns:

[Screenshot: data/single-table.html rendered in a browser, showing the table]

The HTML code behind the web page is stored in data/single-table.html:

<!DOCTYPE html> 
<html> 
<head> 
  <title>Single table</title> 
</head> 
<body> 
  <p>The following is a table</p> 
  <table id="table1" border="1"> 
    <thead> 
      <tr> 
        <th>Name</th> 
        <th>Age</th> 
      </tr> 
    </thead> 
    <tbody> 
      <tr> 
        <td>Jenny</td> 
        <td>18</td> 
      </tr> 
      <tr> 
        <td>James</td> 
        <td>19</td> 
      </tr> 
    </tbody> 
  </table> 
</body> 
</html> 

Note that a <table> tag is structured row by row: <tr> represents a table row, <th> a table header cell, and <td> a table cell.

Also notice that an HTML element such as <table> may have additional attributes in the form of <table attr1="value1" attr2="value2">. The attributes are not arbitrarily defined. Instead, each has a specific meaning according to the standard. In the preceding code, id is the identifier of the table and border controls its border width.
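Because the table is organized row by row, its contents can be collected by walking through the tags in order. The following is a rough sketch in Python, using only the standard library, that reads the attributes of <table> and turns the body rows into records keyed by the header cells. The class name TableParser and the variable names are hypothetical, and the sketch assumes a single, well-formed table like the one above.

```python
from html.parser import HTMLParser

# The table from data/single-table.html, embedded for a self-contained sketch.
TABLE_HTML = """
<table id="table1" border="1">
  <thead>
    <tr><th>Name</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Jenny</td><td>18</td></tr>
    <tr><td>James</td><td>19</td></tr>
  </tbody>
</table>
"""


class TableParser(HTMLParser):
    """Collect the table's attributes, header cells, and body rows."""

    def __init__(self):
        super().__init__()
        self.table_attrs = {}
        self.header = []
        self.rows = []
        self._row = []     # cells of the row currently being read
        self._text = []    # text fragments of the current cell
        self._cell = None  # "th" or "td" while inside a cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_attrs = dict(attrs)  # e.g. id and border
        elif tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            text = "".join(self._text).strip()
            # Header cells go to the header, data cells to the current row.
            (self.header if tag == "th" else self._row).append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)


parser = TableParser()
parser.feed(TABLE_HTML)

# Pair each data row with the header names.
records = [dict(zip(parser.header, row)) for row in parser.rows]
print(parser.table_attrs)
print(records)
```

All values come back as character strings, so a column such as Age would still need to be converted to numbers before analysis.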

The following page looks different from the previous ones in that it shows some styling of contents:

[Screenshot: data/simple-products.html rendered in a browser, showing a styled product list]

If you take a look at its source code at data/simple-products.html, you'll find some new tags such as <div> (a section), <ul> (unordered list), <li> (list item), and <span> (an inline container, often used for applying styles); additionally, many HTML elements have an attribute called style to define their appearance:

<!DOCTYPE html> 
<html> 
<head> 
  <title>Products</title> 
</head> 
<body> 
  <h1 style="color: blue;">Products</h1> 
  <p>The following lists some products</p> 
  <div id="table1" style="width: 50px;"> 
    <ul> 
      <li> 
        <span style="font-weight: bold;">Product-A</span> 
        <span style="color: green;">$199.95</span> 
      </li> 
      <li> 
        <span style="font-weight: bold;">Product-B</span> 
        <span style="color: green;">$129.95</span> 
      </li> 
      <li> 
        <span style="font-weight: bold;">Product-C</span> 
        <span style="color: green;">$99.95</span> 
      </li> 
    </ul> 
  </div> 
</body> 
</html> 

Values in style are written in the form of property1: value1; property2: value2;. However, the styles of the list items are redundant: all product names share one style, and all product prices share another. The following HTML at data/products.html uses CSS (Cascading Style Sheets) instead to avoid redundant styling definitions:

<!DOCTYPE html> 
<html> 
<head> 
  <title>Products</title> 
  <style> 
    h1 { 
      color: darkblue; 
    } 
    .product-list { 
      width: 50px; 
    } 
    .product-list li.selected .name { 
      border: 1px blue solid; 
    } 
    .product-list .name { 
      font-weight: bold; 
    } 
    .product-list .price { 
      color: green; 
    } 
  </style> 
</head> 
<body> 
  <h1>Products</h1> 
  <p>The following lists some products</p> 
  <div id="table1" class="product-list"> 
    <ul> 
      <li> 
        <span class="name">Product-A</span> 
        <span class="price">$199.95</span> 
      </li> 
      <li class="selected"> 
        <span class="name">Product-B</span> 
        <span class="price">$129.95</span> 
      </li> 
      <li> 
        <span class="name">Product-C</span> 
        <span class="price">$99.95</span> 
      </li> 
    </ul> 
  </div> 
</body> 
</html> 

Note that we add <style> in <head> to declare a global stylesheet in the web page. We also switch style to class for content elements (div, li, and span) to use those pre-defined styles. The syntax of CSS is briefly introduced in the following code.

Match all <h1> elements:

h1 { 
  color: darkblue; 
} 

Match all elements with the product-list class:

.product-list { 
  width: 50px; 
} 

Match all elements with the product-list class, and then match all nested elements with the name class:

.product-list .name { 
  font-weight: bold; 
} 

Match all elements with the product-list class, then match all nested <li> elements with the selected class, and finally match all nested elements with the name class:

.product-list li.selected .name { 
  border: 1px blue solid; 
} 

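Note that an inline style attribute cannot express this kind of rule, because it applies only to the element that carries it and knows nothing about the classes of its ancestors. The following screenshot shows the rendered web page: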

[Screenshot: data/products.html rendered in a browser]

Each CSS entry consists of a CSS selector (for example, .product-list) to match HTML elements and the styles (for example, color: red;) to apply. CSS selectors are not only used to apply styling, but are also commonly used to extract contents from web pages so the HTML elements of interest are properly matched. This is an underlying technique behind web scraping.
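To illustrate how a selector can pick out elements for extraction, the following is a minimal sketch in Python using only the standard library. It supports only a small subset of CSS: steps made of an optional tag name plus class names, joined by the descendant (space) combinator. The names Node, TreeBuilder, matches_compound, and select are hypothetical, and a real scraper would normally rely on a full selector engine instead.

```python
from html.parser import HTMLParser

# The product list from data/products.html, embedded for a self-contained sketch.
PRODUCTS_HTML = """
<div id="table1" class="product-list">
  <ul>
    <li>
      <span class="name">Product-A</span>
      <span class="price">$199.95</span>
    </li>
    <li class="selected">
      <span class="name">Product-B</span>
      <span class="price">$129.95</span>
    </li>
    <li>
      <span class="name">Product-C</span>
      <span class="price">$99.95</span>
    </li>
  </ul>
</div>
"""


class Node:
    """An element with its tag, classes, parent, and directly contained text."""

    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.classes = set((attrs.get("class") or "").split())
        self.parent = parent
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a list of nodes in document order; each node knows its parent."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self.current)
        self.nodes.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if self.current is not None:
            self.current.text += data.strip()


def matches_compound(node, compound):
    """Match one step such as 'li.selected', '.name', or 'span'."""
    tag, *classes = compound.split(".")
    return (not tag or node.tag == tag) and set(classes) <= node.classes


def select(nodes, selector):
    """Return nodes matching a descendant-only selector like '.a li.b .c'."""
    steps = selector.split()
    result = []
    for node in nodes:
        if not matches_compound(node, steps[-1]):
            continue
        # Walk up the ancestors, consuming the remaining steps right to left.
        remaining = steps[:-1]
        ancestor = node.parent
        while remaining and ancestor is not None:
            if matches_compound(ancestor, remaining[-1]):
                remaining = remaining[:-1]
            ancestor = ancestor.parent
        if not remaining:
            result.append(node)
    return result


builder = TreeBuilder()
builder.feed(PRODUCTS_HTML)

names = [n.text for n in select(builder.nodes, ".product-list .name")]
prices = [n.text for n in select(builder.nodes, ".product-list .price")]
selected = [n.text for n in select(builder.nodes, ".product-list li.selected .name")]
print(names, prices, selected)
```

The same selectors that style the product names and prices are reused here to extract them, which is the idea that web scraping builds on.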

CSS is much richer than demonstrated in the preceding code. For web scraping, we use the following examples to show the most commonly used CSS selectors:

Syntax                  Match
*                       All elements
h1, h2, h3              <h1>, <h2>, <h3>
#table1                 <* id="table1">
.product-list           <* class="product-list">
div#container           <div id="container">
div a                   <div><a> and <div><p><a>
div > a                 <div><a> but not <div><p><a>
div > a.new             <div><a class="new">
ul > li:first-child     First <li> in <ul>
ul > li:last-child      Last <li> in <ul>
ul > li:nth-child(3)    3rd <li> in <ul>
p + *                   Next element of <p>
img[title]              <img> with title attribute
table[border=1]         <table border="1">

At each level, a selector can combine a tag name, an #id, one or more .class names, and [attribute] conditions in the form tag#id.class[attr], and each of these parts is optional. For more information on CSS selectors, visit https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors. To learn more about HTML tags, visit http://www.w3schools.com/tags/.
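The difference between the descendant and child combinators is easy to check on a small document. The following sketch uses Python's standard xml.etree.ElementTree module, whose path expressions are XPath-like rather than CSS, so each CSS selector is paired with a roughly equivalent path as an assumption of this illustration. The sample markup is made up for the example and must be well-formed XML for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment used only for this illustration.
PAGE = """
<html><body>
  <div id="container">
    <a class="new" href="#1">direct link</a>
    <p><a href="#2">nested link</a></p>
  </div>
  <ul><li>first</li><li>second</li><li>third</li></ul>
</body></html>
"""

root = ET.fromstring(PAGE)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


# CSS "div a": any <a> nested at any depth inside a <div>.
descendants = texts(".//div//a")

# CSS "div > a": only <a> elements that are direct children of a <div>.
children = texts(".//div/a")

# CSS "div > a.new": approximated with an exact match on the class attribute,
# which would miss elements carrying several classes.
new_links = texts(".//div/a[@class='new']")

# CSS "ul > li:first-child" and "ul > li:last-child": approximated by position
# among the <li> children, which coincides here because <ul> holds only <li>.
first_item = texts(".//ul/li[1]")
last_item = texts(".//ul/li[last()]")

print(descendants, children, new_links, first_item, last_item)
```

Here the descendant path returns both links, while the child path returns only the link placed directly inside <div>.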
