Scrapy is a popular web scraping framework that comes with many high-level functions to make scraping websites easier. In this chapter, we will get to know Scrapy by using it to scrape the example website, just as we did in Chapter 2, Scraping the Data. Then, we will cover Portia, which is an application based on Scrapy that allows you to scrape a website through a point-and-click interface.
Scrapy can be installed with the pip command, as follows:
pip install Scrapy
Scrapy relies on some external libraries, so if you have trouble installing it, additional information is available on the official website at http://doc.scrapy.org/en/latest/intro/install.html.
Currently, Scrapy only supports Python 2.7, which is more restrictive than other packages introduced in this book. Previously, Python 2.6 was also supported, but this was dropped in Scrapy 0.20. Also, due to the dependency on Twisted, support for Python 3 is not yet possible, though the Scrapy team assures me they are working to solve this.
If Scrapy is installed correctly, a scrapy command will now be available in the terminal:
$ scrapy -h
Scrapy 0.24.4 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  ...
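The installation can also be checked from within Python. The following is a minimal sketch rather than an official procedure: the helper name scrapy_version is hypothetical, and it assumes only that the installed package can be imported as scrapy and exposes a __version__ attribute.

```python
def scrapy_version():
    """Return the installed Scrapy version string, or None if
    Scrapy cannot be imported (hypothetical helper for illustration)."""
    try:
        import scrapy
    except ImportError:
        return None
    return scrapy.__version__


version = scrapy_version()
if version is None:
    print('Scrapy is not installed')
else:
    print('Scrapy %s is installed' % version)
```

If this reports that Scrapy is not installed even though pip completed successfully, pip may have installed the package for a different Python interpreter than the one running the script.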
We will use the following commands in this chapter:
For detailed information about these and the other commands available, refer to http://doc.scrapy.org/en/latest/topics/commands.html.