Governments and jurisdictions around the world are increasingly embracing open data, which aims to increase citizen involvement, support informed decision-making, and make policies more open to public scrutiny. Some examples of open data initiatives around the world include https://www.data.gov/ (United States of America), https://data.gov.uk/ (United Kingdom), and https://data.gov.hk/en/ (Hong Kong).
These data portals often provide an API for programmatic access to data. However, some datasets do not offer an API, so we need to rely on good old web scraping techniques to extract information from websites.
Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/) is an incredibly useful package for scraping information from websites. Basically, anything marked up with an HTML tag can be scraped with this wonderful package. Scrapy is also a good package for web scraping, but it is more of a framework for writing powerful web crawlers; if you just need to fetch a table from a page, Beautiful Soup offers a simpler workflow.
We are going to use Beautiful Soup version 4.6 throughout this chapter. To install Beautiful Soup 4, we can once again rely on PyPI:
pip install beautifulsoup4
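Before we touch a live website, we can verify the installation with a tiny inline document. The HTML snippet below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A minimal, made-up HTML snippet for testing the installation
html = "<html><body><p class='greeting'>Hello, Beautiful Soup!</p></body></html>"

# Parse with Python's built-in parser so no extra dependency is needed
soup = BeautifulSoup(html, "html.parser")

# Locate the first <p> tag and read its text content
print(soup.find("p").text)  # Hello, Beautiful Soup!
```

If this prints the greeting, Beautiful Soup 4 is installed correctly.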
The US unemployment rates and earnings by educational attainment data (2017) is available from the following website: https://www.bls.gov/emp/ep_table_001.htm. Beautiful Soup does not handle HTTP requests itself, so we need to use the urllib.request or requests package to fetch a web page for us. Of the two options, the requests package is arguably easier to use due to its higher-level HTTP client interface. If requests is not available on your system, we can install it through PyPI:
pip install requests
Let's take a look at the web page before we write the web scraping code. If we use Google Chrome to visit the Bureau of Labor Statistics website, we can inspect the HTML code corresponding to the table we need:
Next, expand <div id="bodytext" class="verdana md"> until you can see <table class="regular" cellspacing="0" cellpadding="0" xborder="1">...</table>. When you put your mouse over the HTML code, the corresponding section on the page will be highlighted:
After further expanding the HTML code of the <table>, we can see that the column names are defined in the <thead>...</thead> section, while the table content is defined in the <tbody>...</tbody> section.
In order to instruct Beautiful Soup to scrape the information we need, we need to give clear directions to it. We can right-click on the relevant section in the code inspection window and copy the unique identifier in the format of a CSS selector:
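To see how such a selector is interpreted, here is a self-contained sketch on a made-up snippet that mimics the structure of the page (the id below mirrors the real page, but the content is invented):

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the nesting of the BLS page
html = """
<div id="bodytext">
  <table class="regular">
    <thead><tr><th>Educational attainment</th></tr></thead>
    <tbody><tr><td>Doctoral degree</td></tr></tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "#bodytext > table > thead" means: a <thead> that is a direct child
# of a <table> that is a direct child of the element with id="bodytext"
thead = soup.select("#bodytext > table > thead")[0]
print(thead.th.text)  # Educational attainment
```

The select() method always returns a list of matches, which is why we index with [0] to get the single table header.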
Let's try to get the CSS selectors for thead and tbody, and use the BeautifulSoup.select() method to scrape the respective HTML code:
import requests
from bs4 import BeautifulSoup
# Specify the url
url = "https://www.bls.gov/emp/ep_table_001.htm"
# Query the website and get the html response
response = requests.get(url)
# Parse the returned html using BeautifulSoup
# Specifying the parser explicitly avoids a GuessedAtParserWarning
bs = BeautifulSoup(response.text, 'html.parser')
# Select the table header by CSS selector
thead = bs.select("#bodytext > table > thead")[0]
# Select the table body by CSS selector
tbody = bs.select("#bodytext > table > tbody")[0]
# Make sure the code works
print(thead)
You will be greeted by the HTML code of the table headers:
<thead>
<tr>
<th scope="col"><p align="center" valign="top"><strong>Educational attainment</strong></p></th>
<th scope="col"><p align="center" valign="top">Unemployment rate (%)</p></th>
<th scope="col"><p align="center" valign="top">Median usual weekly earnings ($)</p></th>
</tr>
</thead>
Next, we are going to find all instances of <th></th>, which contain the column names. We will build a dictionary of lists with the headers as keys to hold the data:
# Get the column names
headers = []
# Find all header columns in <thead> as specified by <th> html tags
for col in thead.find_all('th'):
    headers.append(col.text.strip())
# Dictionary of lists for storing parsed data
data = {header:[] for header in headers}
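Run against the <thead> printed earlier, this loop yields the three column names. The following sketch replays it on that same header markup, reduced to the header cells, so the shape of the resulting dictionary is visible:

```python
from bs4 import BeautifulSoup

# The <thead> markup printed earlier, reduced to the header cells
thead_html = """
<thead><tr>
  <th scope="col">Educational attainment</th>
  <th scope="col">Unemployment rate (%)</th>
  <th scope="col">Median usual weekly earnings ($)</th>
</tr></thead>
"""

thead = BeautifulSoup(thead_html, "html.parser")

# Collect the stripped text of every <th> cell
headers = [th.text.strip() for th in thead.find_all("th")]

# One empty list per column, ready to receive the row values
data = {header: [] for header in headers}
print(data)
# {'Educational attainment': [], 'Unemployment rate (%)': [], 'Median usual weekly earnings ($)': []}
```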
Finally, we parse the remaining rows of the table and convert the data to a pandas DataFrame:
import pandas as pd
# Parse the rows in table body
for row in tbody.find_all('tr'):
    # Find all columns in a row as specified by <th> or <td> html tags
    cols = row.find_all(['th', 'td'])
    # enumerate() allows us to loop over an iterable,
    # and return each item preceded by a counter
    for i, col in enumerate(cols):
        # Strip white space around the text
        value = col.text.strip()
        # Convert the columns to float, except the first column
        if i > 0:
            value = float(value.replace(',', ''))  # Remove all commas in the string
        # Append the parsed value to the dict of lists
        data[headers[i]].append(value)
# Create a data frame from the parsed dictionary
df = pd.DataFrame(data)
# Show an excerpt of parsed data
df.head()
We should now be able to reproduce the first few rows of the main table:
  | Educational attainment | Median usual weekly earnings ($) | Unemployment rate (%)
0 | Doctoral degree        | 1743.0                           | 1.5
1 | Professional degree    | 1836.0                           | 1.5
2 | Master's degree        | 1401.0                           | 2.2
3 | Bachelor's degree      | 1173.0                           | 2.5
4 | Associate's degree     |  836.0                           | 3.4
The main HTML table has been formatted as a structured pandas DataFrame. We can now proceed to visualize the data.
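As an aside, pandas can also parse HTML tables directly: pandas.read_html() returns a list of DataFrames, one per <table> element it finds (it requires an HTML parser such as lxml or html5lib to be installed). The sketch below demonstrates this on a small, made-up inline table rather than the live BLS page:

```python
import io

import pandas as pd

# A small, made-up HTML table standing in for the real page
html = """
<table>
  <thead><tr><th>Educational attainment</th><th>Unemployment rate (%)</th></tr></thead>
  <tbody><tr><td>Doctoral degree</td><td>1.5</td></tr></tbody>
</table>
"""

# read_html() accepts a file-like object and returns one
# DataFrame per <table> found in the document
tables = pd.read_html(io.StringIO(html))
print(tables[0])
```

This one-liner is convenient for well-formed tables, but the Beautiful Soup approach above gives us finer control over how each cell is cleaned and converted.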