Extracting data from HTML using requests and BeautifulSoup

In this section, we will request and parse HTML source code. We will use the requests library to make Hypertext Transfer Protocol (HTTP) requests and retrieve the HTML source code, and BeautifulSoup to parse it and extract the text content.

We will, however, encounter a common obstacle: websites may use JavaScript to request certain information from the server only after the initial page load. As a result, a direct HTTP request returns only the initial HTML and misses any content that is loaded later. To work around this, we will use a headless browser that retrieves the website content as a browser would. We begin with a direct request:

from bs4 import BeautifulSoup
import requests

# set and request url; extract source code
url = "https://www.opentable.com/new-york-restaurant-listings"
html = requests.get(url)
html.text[:500]

' <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title> <meta name="robots" content="noindex" > </meta> <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-16.png" sizes="16x16"/><link rel='
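
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch rather than part of the original example: the helper name fetch_html, its client parameter, and the 10-second timeout are assumptions chosen for illustration. It relies only on the timeout argument and the raise_for_status() method that requests provides.

```python
import requests


def fetch_html(url, client=requests, timeout=10):
    """Return the HTML body at `url`, raising on HTTP error status codes.

    `client` is a hypothetical hook: anything with a requests-style
    .get(url, timeout=...) method, such as a requests.Session.
    """
    response = client.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html_text = fetch_html("https://www.opentable.com/new-york-restaurant-listings")
```

Passing the client as a parameter keeps the helper easy to exercise without network access, and a requests.Session can be supplied to reuse connections across several pages.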

Now we can use BeautifulSoup to parse the HTML content and then look for all span tags with the class rest-row-name-text. Inspecting the source code shows that this class is associated with the restaurant names (see the GitHub repo for linked instructions on how to examine website source code):

# parse raw html => soup object
soup = BeautifulSoup(html.text, 'html.parser')

# for each span tag, print out text => restaurant name
for entry in soup.find_all(name='span', attrs={'class': 'rest-row-name-text'}):
    print(entry.text)

Wade Coves
Alley
Dolorem Maggio
Islands
...
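
To see the lookup in isolation, without a network request, here is a minimal sketch that runs on a small hand-written HTML snippet. The markup and the restaurant names are invented for illustration; only the class name rest-row-name-text comes from the example above. It also shows the equivalent CSS-selector lookup with select():

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the listing page; the names are invented.
sample_html = """
<div class="rest-row">
  <span class="rest-row-name-text">Example Bistro</span>
</div>
<div class="rest-row">
  <span class="rest-row-name-text">Sample Trattoria</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all with a class filter (class_ avoids the reserved word `class`)
names = [tag.get_text(strip=True)
         for tag in soup.find_all("span", class_="rest-row-name-text")]

# The same lookup expressed as a CSS selector
names_css = [tag.get_text(strip=True)
             for tag in soup.select("span.rest-row-name-text")]

print(names)      # ['Example Bistro', 'Sample Trattoria']
print(names_css)  # ['Example Bistro', 'Sample Trattoria']
```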

Once you have identified the page elements of interest, BeautifulSoup makes it easy to retrieve the contained text. If you want to get the price category for each restaurant, you can use:

# get the price category (shown as dollar signs) for each restaurant
for entry in soup.find_all('div', {'class': 'rest-row-pricing'}):
    price = entry.find('i').text
    print(price)
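
If a numeric price category is more convenient than the raw text, the dollar signs can be counted. This is a minimal sketch under stated assumptions: the sample markup is hypothetical and may not match how the real page encodes its pricing, and the None fallback for a missing i tag is our own addition.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page may structure its pricing differently.
sample_html = """
<div class="rest-row-pricing"><i>$$$</i></div>
<div class="rest-row-pricing"><i>$$</i></div>
<div class="rest-row-pricing"></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

prices = []
for entry in soup.find_all("div", class_="rest-row-pricing"):
    tag = entry.find("i")
    # find() returns None when no <i> tag exists, so guard before reading text
    prices.append(tag.get_text().count("$") if tag is not None else None)

print(prices)  # [3, 2, None]
```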

When you try to get the number of bookings, however, you just get an empty list because the site uses JavaScript code to request this information after the initial loading is complete:

soup.find_all('div', {'class':'booking'})
[]
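
The same behavior can be reproduced offline. In the minimal sketch below, the HTML is a hand-written stand-in (the markup and the script comment are assumptions, not the site's actual source): the name is present in the static markup, while the booking element would only exist after a script had run, so the parser finds nothing.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the static HTML a direct request returns.
# On a live page, a script would insert div.booking after the page loads.
static_html = """
<div class="rest-row">
  <span class="rest-row-name-text">Example Bistro</span>
</div>
<script>/* hypothetical: fetches booking counts and injects div.booking */</script>
"""

soup = BeautifulSoup(static_html, "html.parser")

names = soup.find_all("span", class_="rest-row-name-text")
bookings = soup.find_all("div", class_="booking")

# BeautifulSoup only parses the markup it is given; it never runs scripts.
print(len(names), len(bookings))  # 1 0
```

An empty result alongside non-empty results for other elements is a useful hint that the missing content is loaded dynamically rather than mis-selected.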
