Synchronous web scraping

The synchronous scraper uses only Python standard library modules such as urllib. It downloads the home pages of three popular sites, plus a fourth site whose loading time can be delayed to simulate a slow connection. It prints the respective page sizes and the total running time.

Here's the code for the synchronous scraper located at src/extras/sync.py:

"""Synchronously download a list of webpages and time it""" 
from urllib.request import Request, urlopen 
from time import time 
 
sites = [ 
    "http://news.ycombinator.com/", 
    "https://www.yahoo.com/", 
    "http://www.aliexpress.com/", 
    "http://deelay.me/5000/http://deelay.me/", 
] 
 
 
def find_size(url):
    """Return the size of the page at url, in characters."""
    req = Request(url)
    with urlopen(req) as response:
        page = response.read()
        return len(page)


def main():
    # Download each site one after another; each request blocks until it finishes
    for site in sites:
        size = find_size(site)
        print("Read {:8d} chars from {}".format(size, site))
 
 
if __name__ == '__main__': 
    start_time = time() 
    main() 
    print("Ran in {:6.3f} secs".format(time() - start_time)) 

On a test laptop, this code took 17.1 seconds to run. Since the requests are made one after another, this is the cumulative loading time of all the sites. Let's see how asynchronous code performs.
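
For comparison, here is a minimal sketch of what an asynchronous version could look like. It assumes Python 3.7+ and the third-party aiohttp package; the coroutine names mirror the synchronous version, but the exact structure is illustrative rather than the book's actual src/extras listing:

"""Asynchronously download a list of webpages and time it (sketch)"""
import asyncio
from time import time

import aiohttp  # third-party package: pip install aiohttp

sites = [
    "http://news.ycombinator.com/",
    "https://www.yahoo.com/",
    "http://www.aliexpress.com/",
    "http://deelay.me/5000/http://deelay.me/",
]


async def find_size(session, url):
    # Await the response without blocking the event loop
    async with session.get(url) as response:
        page = await response.read()
        return len(page)


async def show_size(session, url):
    size = await find_size(session, url)
    print("Read {:8d} chars from {}".format(size, url))


async def main():
    async with aiohttp.ClientSession() as session:
        # Start all downloads concurrently and wait for every one to finish
        await asyncio.gather(*(show_size(session, site) for site in sites))


if __name__ == '__main__':
    start_time = time()
    asyncio.run(main())
    print("Ran in {:6.3f} secs".format(time() - start_time))

Because all four requests are in flight at the same time, the total running time is roughly that of the slowest site rather than the sum of all of them, so the deliberately delayed deelay.me request no longer holds up the rest.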
