Introduction to Scrapy

Scrapy is a web crawling framework written in Python that lets you crawl websites effectively with minimal code. According to the official website of Scrapy (https://scrapy.org/), it is "An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way."

Scrapy provides a complete framework, with built-in tools, for building and deploying a crawler. It was originally designed for web scraping; as it has grown in popularity and matured, it has also come to be used for extracting data from APIs. Scrapy-based web crawlers are also easy to manage and maintain because of their structure. In general, Scrapy gives web scraping work a project-based structure.

The following are some of the features and distinguishing points that make Scrapy a favorite among developers:

  • Scrapy provides built-in support for document parsing, traversing, and extracting data using XPath, CSS Selectors, and regular expressions.
  • Requests are scheduled and processed asynchronously, allowing multiple links to be crawled at the same time.
  • It handles HTTP methods and actions for you; that is, there's no need to import libraries such as requests or urllib manually. Scrapy handles requests and responses using its built-in libraries.
  • There's built-in support for feed exports and pipelines (items, files, images, and media); that is, for exporting, downloading, and storing data in JSON, CSV, XML, and databases.
  • Middleware and a large collection of built-in extensions handle cookies, sessions, authentication, robots.txt, logs, usage statistics, email handling, and so on.
  • Scrapy-driven projects are composed of easy-to-use, clearly separated components and files, which can be handled with basic Python skills.

Please refer to the official documentation of Scrapy at https://docs.scrapy.org/en/latest/intro/overview.html for an in-depth and detailed overview.

With this basic introduction to Scrapy complete, we will set up a project and explore the framework in more detail in the next sections.
