Beyond Beautiful Soup

In this example, we used the Beautiful Soup (BS4) library to parse static HTML. Beautiful Soup is invaluable for dealing with occasionally messy HTML, but when it comes to large scale and dynamic pages, it simply won't suffice. For production scraping in large quantities, perhaps on a regular basis, it is a good idea to use the Scrapy (https://scrapy.org/) package. Scrapy is an entire framework for downloading HTML, parsing it, extracting data, and storing the results. One of its killer features is that it runs asynchronously – for example, while it is waiting for one page to load, it can switch to processing another, automatically. Because of that, Scrapy scrapers are significantly faster on large lists of websites. At the same time, its interface is more expressive for the developer, as it is explicitly designed for scraping.

Depending on your goal, other alternatives may be available as well. For example, the marvelous newspaper package (https://newspaper.readthedocs.io/en/latest/) can collect articles from news websites – it has built-in configurations for the most popular ones, but also performs reasonably well on others with little guidance (for example, it can find the body of an article on the page).

Now, the elephant in the room is that in all those cases, we only deal with static content – the HTML the server returns when you request a URL. However, many modern sites actively add content after the initial page has loaded – for example, they implement infinite scroll by pulling new pieces of HTML as you scroll, a few steps ahead of you. If you need this information, none of those tools will suffice. In this case, you'll have to emulate the browser, rendering pages as if you were actually looking at them, running JavaScript, and so on. For this, people use Selenium – a browser automation system that can drive a full-blown browser.

Selenium can pull a page, render it (even capture a screenshot), perform actions (for example, click a button or scroll), and hand you the resulting HTML. There is only one caveat – the process will be incredibly slow. A more modern alternative is Splash, a lightweight JavaScript rendering service scriptable in Lua. It is somewhat faster than Selenium and integrates with Scrapy, but is not as mature.

Scraping is no easy task. Ever-changing websites, data with no single structure, and developers trying to prevent you from scraping can make your work hard. In this environment, it is especially important to pick the right tool and design your code to be as clean and fault-tolerant as possible.
