Setting up a project

Before proceeding with the project setup, we need Python with the scrapy library successfully installed on the system. For setup or installation, refer to the Setting things up section of Chapter 2, Python and the Web – Using urllib and Requests, or, for more details on Scrapy installation, refer to the official installation guide at https://docs.scrapy.org/en/latest/intro/install.html.

Upon successful installation, we can obtain the details shown in the following screenshot using a Python IDE:

Successful installation of Scrapy with details
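
The same check can also be scripted. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is illustrative and not part of Scrapy:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("scrapy")
if version is None:
    print("Scrapy is not installed in this environment")
else:
    print("Scrapy version:", version)
```

If the first message is printed, the installation steps referred to previously need to be completed before continuing.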

Installing the scrapy library also makes the scrapy command-line tool available. This tool provides a number of commands that are used at various stages of a project, from creating it through to having it fully up and running.

To begin creating a project, let's follow these steps:

  1. Open a Terminal or command-line interface
  2. Create a folder (ScrapyProjects) as shown in the following screenshot, or select a folder in which to place Scrapy projects
  3. Inside the selected folder, run the scrapy command
  4. A list of available commands and their brief details will appear, similar to the following screenshot:

List of available commands for Scrapy
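
For readers who prefer to drive these steps from Python rather than typing them, the following is a rough sketch under stated assumptions: the helper name list_scrapy_commands is hypothetical, and the function simply returns None when the scrapy executable cannot be found on the PATH:

```python
import shutil
import subprocess
from pathlib import Path


def list_scrapy_commands(folder):
    """Create the folder if needed and return the output of the bare scrapy command.

    Returns None when the scrapy executable is not found on PATH.
    """
    workdir = Path(folder)
    workdir.mkdir(parents=True, exist_ok=True)  # step 2: create or reuse the folder

    executable = shutil.which("scrapy")
    if executable is None:
        return None

    # Step 3: run scrapy with no arguments inside the selected folder.
    result = subprocess.run(
        [executable], cwd=workdir, capture_output=True, text=True, check=False
    )
    return result.stdout  # step 4: the text listing the available commands
```

Calling print(list_scrapy_commands("ScrapyProjects")) from the parent folder would then mirror steps 2 to 4.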

We will be creating a Quotes project to obtain author quotes from http://toscrape.com/, a site intended for web scraping practice, accessing information from the first five pages (or fewer, if fewer exist) at the URL http://quotes.toscrape.com/.

We are now going to start the Quotes project. From the Command Prompt, run or execute the scrapy startproject Quotes command as seen in the following screenshot:

Starting a project (using command: scrapy startproject Quotes)

If successful, the preceding command creates a new folder named Quotes (that is, the project root directory) with additional files and subfolders, as shown in the following screenshot:

Contents for project folder ScrapyProjects\Quotes
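
The resulting layout can also be checked from Python. The following is a minimal sketch that looks only for the entries named in this section; the helper name missing_entries is hypothetical, and a real project may contain further files:

```python
from pathlib import Path

# Entries named in this section, relative to the project root directory.
EXPECTED_ENTRIES = [
    "scrapy.cfg",
    "Quotes/items.py",
    "Quotes/pipelines.py",
    "Quotes/settings.py",
    "Quotes/spiders",
]


def missing_entries(project_root):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]
```

For a project created inside ScrapyProjects, missing_entries("ScrapyProjects/Quotes") should return an empty list.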

With the project successfully created, let's explore the individual components inside the project folder:

  • scrapy.cfg is a configuration file in which default project-related settings for deployment are found and can be added.
  • Quotes is a subfolder with the same name as the project directory; it is actually a Python module. We will find additional Python files and other resources in this module, as follows:

Contents for project folder ScrapyProjects\Quotes\Quotes

As seen in the preceding screenshot, the module contains the spiders folder and the items.py, pipelines.py, and settings.py Python files. Each of these inside the Quotes module has a specific role in the project, as explained in the following list:

  • spiders: This folder will contain Spider classes, that is, spiders written in Python. Spiders are classes that contain the code used for scraping. Each Spider class is dedicated to a specific scraping activity.
  • items.py: This Python file contains item containers, that is, Python classes inheriting scrapy.Item. Items are used to collect the scraped data and use it inside spiders. Items are generally declared to carry values and receive built-in support from other resources in the main project. An item is like a Python dictionary object, where the keys are fields, that is, objects of scrapy.item.Field, which will hold certain values.

Although the default project creates items.py for item-related tasks, it's not compulsory to use it inside the spider. We can use lists or any other means of collecting data values and process them in our own way, such as writing them into a file, appending them to a list, and so on.
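
As an illustration of collecting values without items.py, the following sketch gathers plain dictionaries in a list and writes them out as CSV. The records are placeholders rather than scraped data, and the field names author and text are assumptions made for this example:

```python
import csv
import io

# Placeholder records standing in for scraped values (not real data).
quotes = [
    {"author": "Author A", "text": "First placeholder quote"},
    {"author": "Author B", "text": "Second placeholder quote"},
]


def write_quotes(records, handle):
    """Write a list of dictionaries to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(records)


# An in-memory buffer keeps the sketch self-contained; to write a file,
# pass open("quotes.csv", "w", newline="") instead.
buffer = io.StringIO()
write_quotes(quotes, buffer)
print(buffer.getvalue())
```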

  • pipelines.py: This part is executed after the data is scraped. The scraped items are sent to the pipeline, which performs certain actions on them. The pipeline also decides whether to process the received items or drop them.
  • settings.py: This is the most important file, in which the settings for the project can be adjusted according to the project's needs. Please refer to the official documentation from Scrapy for settings at https://scrapy2.readthedocs.io/en/latest/topics/settings.html
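
To make the process-or-drop decision in pipelines.py concrete, here is a minimal sketch of a pipeline class. The class name QuotesPipeline and the text field are hypothetical; process_item and DropItem are the Scrapy names, and a stand-in exception is defined only so that the sketch also runs where Scrapy is not installed:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class QuotesPipeline:
    """Hypothetical pipeline: keep items that have quote text, drop the rest."""

    def process_item(self, item, spider):
        text = (item.get("text") or "").strip()
        if not text:
            # Raising DropItem tells Scrapy to discard this item.
            raise DropItem("missing quote text")
        item["text"] = text  # hand the cleaned item on to the next stage
        return item
```

A pipeline only takes effect once it is enabled through the ITEM_PIPELINES setting in settings.py.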

In this section, we have successfully created a project and the required files using Scrapy. These files will be used and updated as described in the following sections.
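
For instance, an updated settings.py might contain entries such as the following. This is only a sketch: the setting names are Scrapy's, but the values shown are illustrative assumptions and not recommendations from this section:

```python
# Quotes/settings.py (excerpt); values are illustrative, not recommendations.
BOT_NAME = "Quotes"

SPIDER_MODULES = ["Quotes.spiders"]   # where Scrapy looks for spiders
NEWSPIDER_MODULE = "Quotes.spiders"   # where newly generated spiders are placed

ROBOTSTXT_OBEY = True   # respect the target site's robots.txt rules
DOWNLOAD_DELAY = 1      # seconds to wait between requests (assumed value)

# Enable a pipeline class; the number sets the order in which pipelines run.
# "QuotesPipeline" is a hypothetical class name in Quotes/pipelines.py.
ITEM_PIPELINES = {
    "Quotes.pipelines.QuotesPipeline": 300,
}
```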
