How it works

A scraping framework has two main, independent components: the spider, which crawls the site, and the pipeline, which processes the scraped data, as shown in the following diagram:

These two components are independent because the spider depends on the format and structure of the website, whereas the pipeline depends on the structure of the persisted data.

The spider works as follows: it takes a URL as its entry point (for example, the landing page), extracts all the links present on that page, and crawls to the next page. On each new page, the spider repeats this process until it reaches a predefined depth. On a forum, the required depth is usually between three and five: the standard structure (Topics | Conversations | Threads) means the spider has to travel that many levels down before it actually reaches the conversational data.
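To ground the description, here is a minimal sketch of such a spider. The spider name, start URL, CSS selectors, and depth value are placeholders for illustration rather than anything specific to a real site:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class ForumSpider(scrapy.Spider):
    # Hypothetical spider: the name, URL, and selectors are placeholders.
    name = "forum"
    start_urls = ["https://forum.example.com/"]  # entry point (landing page)
    custom_settings = {
        # Stop crawling after roughly Topics -> Conversations -> Threads.
        "DEPTH_LIMIT": 4,
    }

    def parse(self, response):
        # Yield the data points found on the current page.
        for post in response.css("div.post"):
            yield {
                "author": post.css("span.author::text").get(),
                "body": " ".join(post.css("div.body ::text").getall()),
                "url": response.url,
            }
        # Extract every link on the page and crawl to the next level.
        for link in LinkExtractor().extract_links(response):
            yield response.follow(link, callback=self.parse)
```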

The data pipeline handles the data after it has been crawled. It is in charge of validation, cleaning, normalization, and finally persisting the data for further analysis. The usual steps are to verify that the required fields are present in each data point and that the structure of the data is coherent (for example, that an email field actually looks like an email address); websites can change the structure of their information, so we must make sure that we are not storing flawed data.
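To make this concrete, here is a minimal sketch of such a validation step written as a Scrapy item pipeline; the field names and rules are assumptions for illustration only:

```python
import re

from scrapy.exceptions import DropItem


class ValidationPipeline:
    """Hypothetical pipeline: field names and rules are illustrative only."""

    REQUIRED_FIELDS = ("author", "body", "url")
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def process_item(self, item, spider):
        # Verify that the required fields are present in the data point.
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                raise DropItem(f"Missing required field: {field}")
        # Verify that the structure is coherent, e.g. an email looks like one.
        email = item.get("email")
        if email and not self.EMAIL_RE.match(email):
            raise DropItem(f"Malformed email: {email}")
        # Normalize before persisting.
        item["body"] = item["body"].strip()
        return item
```

A pipeline like this would be enabled through the project's ITEM_PIPELINES setting, with the actual persistence step (database insert or file export) added as a later pipeline in the chain.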

Another very handy feature of Scrapy is that it can record the history of all the URLs already scraped, which means we can restart a spider without recrawling the URLs that have already been processed.
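This behaviour comes from Scrapy's job persistence: when a crawl is given a job directory, the scheduler state and the fingerprints of requests already seen are written to disk, so a stopped spider can be resumed later. A minimal sketch, assuming a project-level settings.py; the directory name is a placeholder:

```python
# settings.py
# Persist the scheduler queue and the fingerprints of already-crawled
# requests to disk so a stopped spider can be resumed without recrawling.
JOBDIR = "crawls/forum-run-1"  # placeholder directory name
```

The same setting is commonly passed per run on the command line, for example `scrapy crawl forum -s JOBDIR=crawls/forum-run-1`.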
