Retrieving fare data with advanced web scraping

In previous chapters, we've seen how to use the Requests library to retrieve web pages. As I've said before, it is a fantastic tool, but unfortunately, it won't work for us here. The page we want to scrape is entirely AJAX-based. Asynchronous JavaScript and XML (AJAX) is a method for retrieving data from a server without having to reload the page. Requests fetches the initial HTML but does not execute JavaScript, so the data we want never shows up in its response. What this means for us is that we'll need to use a browser to retrieve the data. While that might sound like it would require an enormous amount of overhead, there are two tools that, when used together, make it a lightweight task.
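To make the problem concrete, here is a minimal sketch of what a non-browser client sees on an AJAX-driven page. The markup, the `results` id, and the script path are all hypothetical stand-ins, not the actual page we will scrape, and the standard library's `html.parser` is used only so the sketch runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for what a plain HTTP client receives from
# an AJAX-driven page: the results container is empty until JavaScript runs.
INITIAL_HTML = """
<html>
  <body>
    <div id="results"></div>
    <script src="/static/load_fares.js"></script>
  </body>
</html>
"""

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ResultsText(HTMLParser):
    """Collect any text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside the results div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "results":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())


def results_text(html):
    """Return the visible text inside the results container, if any."""
    parser = ResultsText()
    parser.feed(html)
    return " ".join(chunk for chunk in parser.chunks if chunk)


# The container exists, but holds no data before the scripts have run.
print(repr(results_text(INITIAL_HTML)))
```

The container is present in the response, but it is empty, which is why a real browser that executes the page's scripts is needed before there is anything to parse.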

The two tools are Selenium and ChromeDriver. Selenium is a powerful tool for automating web browsers, and ChromeDriver is the separate executable that Selenium uses to control Chrome. ChromeDriver is not a browser itself; it drives an installed copy of Chrome, and Chrome can be run in headless mode. This means it has no user interface. This keeps it lean, making it ideal for what we're trying to do.

To install ChromeDriver, you can download the binaries or source from https://sites.google.com/a/chromium.org/chromedriver/downloads. Choose the release that matches the version of Chrome installed on your machine. As for Selenium, it can be installed with pip.

We'll also need another library called BeautifulSoup to parse the data from the page. If you don't have it installed, you should install it with pip now as well; the package name on PyPI is beautifulsoup4.
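Before moving on, it can help to confirm that both packages are importable from the environment your notebook uses. The following is a minimal sketch, not part of the original workflow; the `REQUIRED` mapping and the `missing_packages` helper are names made up for illustration. Note that BeautifulSoup is imported as `bs4` but installed as `beautifulsoup4`.

```python
import importlib.util

# Import name -> name to pass to pip (they differ for BeautifulSoup).
REQUIRED = {
    "selenium": "selenium",
    "bs4": "beautifulsoup4",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("Selenium and BeautifulSoup are both available.")
```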

With that done, let's get started. We'll start out in the Jupyter Notebook, which works best for exploratory analysis. Later, when we've completed our exploration, we'll move on to working in a text editor for the code we want to deploy. We'll proceed in the following steps:

  1. First, we import our routine libraries, as shown in the following code snippet:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline 
  2. Next, make sure you have installed BeautifulSoup and Selenium, and downloaded ChromeDriver, as mentioned previously. We'll import those now in a new cell:
from bs4 import BeautifulSoup 
from selenium import webdriver 
 
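# replace this with the path where you downloaded chromedriver
chromedriver_path = "/Users/alexcombs/Downloads/chromedriver"

# launch a Chrome instance controlled through ChromeDriver
# (passing the path positionally is the Selenium 3 form; newer Selenium 4
# releases expect a Service object instead)
browser = webdriver.Chrome(chromedriver_path)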

Notice that I have referenced the path on my machine where I downloaded ChromeDriver. You will need to replace that line with the path on your own machine.
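If you are on a recent Selenium 4 release, the driver path is passed through a `Service` object rather than as a positional argument. The sketch below shows one way this might look, with an option for running Chrome headless. The helper names `find_chromedriver` and `make_browser` are hypothetical, and falling back to a search of `PATH` is an assumption of this sketch rather than anything the setup above requires.

```python
import os
import shutil


def find_chromedriver(explicit_path=None):
    """Return a usable chromedriver path or raise FileNotFoundError.

    Tries the explicit path first, then searches PATH for "chromedriver".
    """
    if (
        explicit_path
        and os.path.isfile(explicit_path)
        and os.access(explicit_path, os.X_OK)
    ):
        return explicit_path
    on_path = shutil.which("chromedriver")
    if on_path:
        return on_path
    raise FileNotFoundError(
        "chromedriver not found; check the path or add it to PATH"
    )


def make_browser(chromedriver_path, headless=True):
    """Start Chrome through ChromeDriver using the Selenium 4 Service API."""
    # Imported here so the path helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        # Run Chrome without a visible window.
        options.add_argument("--headless")
    service = Service(executable_path=chromedriver_path)
    return webdriver.Chrome(service=service, options=options)


# Example usage (left commented out because it launches a real browser):
# path = find_chromedriver("/Users/alexcombs/Downloads/chromedriver")
# browser = make_browser(path)
# browser.quit()
```

Checking the path up front gives a clearer error than the one Selenium raises when the driver cannot be started, which is handy while you are still getting your environment set up.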
