Scrapy is a web crawling framework written in Python that lets you crawl websites effectively with minimal code. According to the official Scrapy website (https://scrapy.org/), it is "An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way."
Scrapy provides a complete framework, with built-in tools, for building and deploying a crawler. It was originally designed for web scraping; as its popularity and development have grown, it has also come to be used for extracting data from APIs. Scrapy-based web crawlers are also easy to manage and maintain because of their structure. In general, Scrapy offers a project-based structure for web scraping work.
The following are some of the distinguishing features that make Scrapy a favorite among developers:
- Scrapy provides built-in support for document parsing, traversing, and extracting data using XPath, CSS Selectors, and regular expressions.
- The crawler is scheduled and managed asynchronously, allowing multiple links to be crawled at the same time.
- It handles HTTP requests and responses with its own built-in components, so there's no need to import libraries such as requests or urllib manually in your code.
- There's built-in support for feed exports and pipelines (items, files, images, and media), that is, for exporting, downloading, and storing data in JSON, CSV, XML, and databases.
- Middleware and a large collection of built-in extensions handle cookies, sessions, authentication, robots.txt, logs, usage statistics, email handling, and so on.
- Scrapy-driven projects are composed of easy-to-use, distinguishable components and files, which can be handled with basic Python skills.
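Several of the features listed above are switched on or tuned through a project's settings module rather than through code. The following is a minimal sketch of what such a settings.py might contain, not a recommended configuration: the setting names are standard Scrapy settings, whereas the output file names, the image directory, and the chosen values are assumptions made purely for illustration.

```python
# settings.py (sketch): Scrapy reads these module-level names as project settings.

# Honor each site's robots.txt rules while crawling.
ROBOTSTXT_OBEY = True

# Upper bound on the requests the asynchronous downloader keeps in flight at once.
CONCURRENT_REQUESTS = 16

# Let the built-in cookies middleware track cookies across requests.
COOKIES_ENABLED = True

# Feed exports: write scraped items to one or more files (Scrapy 2.1 and later).
# The file names below are hypothetical.
FEEDS = {
    "quotes.json": {"format": "json", "encoding": "utf8"},
    "quotes.csv": {"format": "csv"},
}

# Item pipelines run in ascending order of their integer value.
# The built-in ImagesPipeline needs the Pillow package to be installed.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "downloaded_images"  # hypothetical output directory

# Verbosity of Scrapy's built-in logging.
LOG_LEVEL = "INFO"
```

With settings along these lines, running scrapy crawl followed by a spider name inside the project would export the scraped items to both files; a single export can also be requested for one run with the -o option, for example -o quotes.json.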
Please refer to the official Scrapy documentation at https://docs.scrapy.org/en/latest/intro/overview.html for an in-depth overview.
With this basic introduction to Scrapy complete, we now begin setting up a project and explore the framework in more detail in the next sections.