Asynchronous web scraping

This asyncio code requires installing a couple of asynchronous networking libraries, aiohttp and aiodns, as mentioned in the docstring.

Here's the code for the asynchronous scraper at src/extras/async.py; it is structured to be as close as possible to the synchronous version so that the two are easier to compare:

"""Asynchronously download a list of webpages and time it 
 
Dependencies: Make sure you install aiohttp 
 
pip install aiohttp aiodns 
 
""" 
import asyncio 
import aiohttp 
from time import time 
 
sites = [ 
    "http://news.ycombinator.com/", 
    "https://www.yahoo.com/", 
    "http://www.aliexpress.com/", 
    "http://deelay.me/5000/http://deelay.me/", 
] 
 
 
async def find_size(session, url): 
    async with session.get(url) as response: 
        page = await response.read() 
        return len(page) 
 
 
async def show_size(session, url): 
    size = await find_size(session, url) 
    print("Read {:8d} chars from {}".format(size, url)) 
 
 
async def main(loop): 
    async with aiohttp.ClientSession() as session: 
        tasks = [] 
        for site in sites: 
            tasks.append(loop.create_task(show_size(session, site))) 
        await asyncio.wait(tasks) 
 
 
if __name__ == '__main__': 
    start_time = time() 
    loop = asyncio.get_event_loop() 
    loop.run_until_complete(main(loop)) 
    print("Ran in {:6.3f} secs".format(time() - start_time)) 

The main function is a coroutine that creates a separate task for each website and then waits until all of those tasks have completed. As a best practice, a single web session object is passed to each coroutine so that a new session is not created for every page.
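As an aside, on Python 3.7 and later the explicit loop management can be avoided entirely. Here is a minimal sketch of an equivalent entry point, assuming the show_size coroutine and the sites list from the listing above; asyncio.gather replaces the create_task/wait pair:

import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        # gather schedules all the downloads concurrently and
        # waits for every one of them to finish
        await asyncio.gather(*(show_size(session, site) for site in sites))


if __name__ == '__main__':
    asyncio.run(main())  # creates and closes the event loop for us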

The total running time of this program on the same test laptop is 7.5 s, which is a speedup of 2.3x on a single core. This surprising result is easier to understand once we visualize how the time was spent, as shown in the following diagram:

A simplistic representation comparing tasks in the synchronous and asynchronous scrapers

The Synchronous scraper is easy to understand: each task waits for the previous task to complete. Each task needs very little CPU time, and the majority of the time is spent waiting for data to arrive from the network. As a result, the tasks cascade sequentially, like a waterfall.
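For reference, a synchronous scraper along these lines might look like the following. This is a hypothetical sketch using the requests library (pip install requests); the actual synchronous version measured here may differ:

"""Synchronously download a list of webpages and time it (sketch)."""
import requests
from time import time

# The same list of sites as in the asynchronous listing
sites = [
    "http://news.ycombinator.com/",
    "https://www.yahoo.com/",
    "http://www.aliexpress.com/",
    "http://deelay.me/5000/http://deelay.me/",
]


def show_size(url):
    # Blocks until the full response body has arrived
    size = len(requests.get(url).content)
    print("Read {:8d} bytes from {}".format(size, url))


if __name__ == '__main__':
    start_time = time()
    for site in sites:
        show_size(site)  # one download at a time, in order
    print("Ran in {:6.3f} secs".format(time() - start_time))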

On the other hand, the Asynchronous scraper starts the first task and, as soon as it begins waiting for I/O, switches to the next one. The CPU is hardly idle, because execution returns to the event loop the moment a wait starts. The I/O for each page eventually completes in the same amount of time, but because the waiting periods are multiplexed, the overall time taken is drastically reduced.
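This switching behavior can be observed in isolation with a toy example (a sketch that is independent of the scraper):

import asyncio
from time import time


async def waiter(name, delay):
    print(name, "starts waiting")
    # Control returns to the event loop here, letting the other task run
    await asyncio.sleep(delay)
    print(name, "done after", delay, "secs")


async def demo():
    # The two waits overlap: the total time is ~2 secs, not 2 + 1 = 3
    await asyncio.gather(waiter("A", 2), waiter("B", 1))


start_time = time()
asyncio.run(demo())
print("Ran in {:6.3f} secs".format(time() - start_time))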

In fact, the asynchronous code can be sped up further. The standard asyncio event loop is written in pure Python and provided as a reference implementation; faster drop-in replacements, such as uvloop, are available.
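A minimal sketch of plugging in uvloop, assuming it has been installed with pip install uvloop and reusing the main coroutine from the scraper above:

import asyncio

import uvloop

# Make every event loop created from this point on a uvloop one
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))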
