Chapter 8. Scrapy

Scrapy is a popular web scraping framework that comes with many high-level functions to make scraping websites easier. In this chapter, we will get to know Scrapy by using it to scrape the example website, just as we did in Chapter 2, Scraping the Data. Then, we will cover Portia, an application based on Scrapy that allows you to scrape a website through a point-and-click interface.

Installation

Scrapy can be installed with the pip command, as follows:

pip install Scrapy

Scrapy relies on some external libraries, so if you have trouble installing it, additional information is available on the official website at http://doc.scrapy.org/en/latest/intro/install.html.

At the time of writing, Scrapy supports only Python 2.7, which is more restrictive than the other packages introduced in this book. Python 2.6 was previously supported, but this was dropped in Scrapy 0.20. Also, because of the dependency on Twisted, Python 3 is not yet supported, though the Scrapy team assures me they are working to solve this.

If Scrapy is installed correctly, a scrapy command will now be available in the terminal:

$ scrapy -h
Scrapy 0.24.4 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
...
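
The same check can be made from a script, which may be handy when setting up a new machine. The following is only a minimal sketch: the helper name scrapy_version_line is hypothetical, and it assumes the scrapy executable is on the PATH and that the version subcommand prints the version on its first line.

```python
import subprocess


def scrapy_version_line():
    """Return the first line printed by `scrapy version`,
    or None if the command cannot be run (hypothetical helper)."""
    try:
        output = subprocess.check_output(
            ["scrapy", "version"], stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        # OSError: no `scrapy` executable was found on the PATH.
        # CalledProcessError: the command ran but exited with an error.
        return None
    lines = output.decode("utf-8", "replace").splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = scrapy_version_line()
    if version is None:
        print("The scrapy command does not appear to be installed")
    else:
        print(version)
```

If the helper returns None, the installation notes linked earlier are the place to look.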

We will use the following commands in this chapter:

  • startproject: Creates a new project
  • genspider: Generates a new spider from a template
  • crawl: Runs a spider
  • shell: Starts the interactive scraping console

Note

For detailed information about these and the other commands available, refer to http://doc.scrapy.org/en/latest/topics/commands.html.
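
To show how these four commands fit together, here is a small sketch that builds the command lines without executing them. The project name example, the spider name country, and the domain example.com are placeholders chosen for illustration, and the helper scrapy_commands is hypothetical rather than part of Scrapy.

```python
def scrapy_commands(project, spider, domain):
    """Build argument lists for the four commands used in this chapter.

    All three parameters are illustrative placeholders; nothing is executed.
    """
    return [
        # Create the project skeleton in a new directory.
        ("startproject", ["scrapy", "startproject", project]),
        # Generate a spider from a template for the given domain.
        ("genspider", ["scrapy", "genspider", spider, domain]),
        # Run the named spider (needs to be run from inside a project).
        ("crawl", ["scrapy", "crawl", spider]),
        # Open the interactive console on a URL.
        ("shell", ["scrapy", "shell", "http://" + domain]),
    ]


if __name__ == "__main__":
    for name, args in scrapy_commands("example", "country", "example.com"):
        print(" ".join(args))
```

Each argument list could be passed to subprocess.call to run the command for real; building the lists first simply keeps the sketch free of side effects.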
