Starting with Scrapy

First, we need to define what we want to accomplish. In this case, we want to create a crawler that will extract all the book titles from https://www.packtpub.com/. In order to do so, we need to analyze our target. If we go to the https://www.packtpub.com/ website and right-click on a book title and select Inspect, we will see the source code of that element. We can see, in this case, that the book title has this format:
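The exact markup on the live site may differ over time, but at the time of writing, a book title was wrapped in a fragment that looked roughly like this (the title text here is just a placeholder):

<div class="book-block-title">
    Some Book Title
</div>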

Creating a crawler for extracting all the book titles

Here, we can see a div with a class of book-block-title, followed by the title text. Keep this in mind, or better yet, note it down in a notebook, as we need it to define what we want to extract in our crawl process. Now, let's get coding:

  1. Let's go back to our virtual machine and open a Terminal. In order to create a crawler, we'll change to the /Examples/Section-3 directory:
cd Desktop/Examples/Section-3/
  2. Then, we need to create our project with the following Scrapy command:
scrapy startproject basic_crawler 
In our case, the name of the crawler is basic_crawler.
  3. When we create a project, Scrapy automatically generates a folder with the basic structure of the crawler.
  4. Inside the basic_crawler directory, you will see another folder called basic_crawler. We are interested in working with the items.py file and the content of the spiders folder:

These are the two files we'll work with.
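As a rough sketch, the generated layout looks like the following (the exact set of files may vary slightly between Scrapy versions):

basic_crawler/
    scrapy.cfg
    basic_crawler/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py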

  5. So, we open the Atom editor and add our project with Add Project Folder... under Examples | Section-3 | basic_crawler.
  6. Now, we need to open items.py in the Atom editor:
When working with Scrapy, we need to specify what things we are interested in getting while crawling a website. These things are called items in Scrapy; think of them as our data model.
  7. So, let's edit the items.py file and define our first item. When the project was generated, Scrapy already created the BasicCrawlerItem class in this file.
  8. We'll create a field called title, which will be an object of the Field class:
title = scrapy.Field()
  9. We can delete the rest of the code after title = scrapy.Field(), as it is not used; a minimal sketch of the resulting file is shown below.

This is all for now with this file.
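For reference, here is a minimal sketch of what items.py looks like after this step, assuming the default module layout that Scrapy generated:

# items.py -- the data model for our crawler
import scrapy


class BasicCrawlerItem(scrapy.Item):
    # The only piece of data we want to extract: the book title
    title = scrapy.Field()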

  10. Let's move on to our spider. For the spider, we'll work on the spiderman.py file, which has been created for this exercise in order to save time.
  11. We first need to copy it from Examples/Section-3/examples/spiders/spiderman-base.py to /Examples/Section-3/basic_crawler/basic_crawler/spiders/spiderman.py:
cp examples/spiders/spiderman-base.py basic_crawler/basic_crawler/spiders/spiderman.py
  12. Then, open the file in the editor. At the top of the file, we can see the imports needed for this to work. We have BaseSpider, which is the basic crawling class. Then, we have Selector, which will help us extract data using XPath. BasicCrawlerItem is the model we created in the items.py file. Finally, there is Request, which will perform the request to the website:

Then, we have the class MySpider, which has the following fields:

  • name: This is the name of our spider, which is needed to invoke it later. In our case, it is basic_crawler.
  • allowed_domains: This is a list of domains that the crawler is allowed to visit. Basically, this keeps the crawler within the bounds of the project; in this case, we're using packtpub.com.
  • start_urls: This is a list that contains the starting URLs where the crawler will start with the process. In this case, it is https://www.packtpub.com.
  • parse: As the name suggests, this is where the parsing of the results happens. We instantiate the Selector with the response of the request.

Then, we define the book_titles variable, which will contain the results of executing the following XPath query. The XPath query is based on the analysis we performed at the beginning of the chapter. This results in an array containing all of the book titles extracted from the response content with the defined XPath expression. Now, we need to loop over that array, create books of the BasicCrawlerItem type, and assign each extracted book title to the title of the book.
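Putting this description together, a minimal sketch of spiderman.py could look like the following. The XPath expression is an assumption based on the book-block-title class we found earlier, and the exact import paths depend on the Scrapy version you are using:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request  # used later when we crawl recursively

from basic_crawler.items import BasicCrawlerItem


class MySpider(BaseSpider):
    name = "basic_crawler"
    allowed_domains = ["packtpub.com"]
    start_urls = ["https://www.packtpub.com"]

    def parse(self, response):
        # Instantiate the Selector with the response of the request
        sel = Selector(response)
        # Extract all book titles using the XPath from our earlier analysis
        book_titles = sel.xpath('//div[@class="book-block-title"]/text()').extract()
        for title in book_titles:
            book = BasicCrawlerItem()
            book["title"] = title
            yield book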

That's all for our basic crawler. Let's go to the Terminal, change the directory to basic_crawler, and then run the crawler with scrapy crawl basic_crawler.
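Assuming we are still in the Desktop/Examples/Section-3 directory, the two commands are:

cd basic_crawler
scrapy crawl basic_crawler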

All the results are printed in the console and we can see the book titles being scraped correctly:

Now, let's save the output of the crawler to a file by adding -o books.json -t json, where -o specifies the output file and -t specifies the file type, which is json:

scrapy crawl basic_crawler -o books.json -t json

Now, run it. We'll open the books.json file with vi books.json.

We can see the book titles being extracted as expected.
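The contents of books.json are a JSON array with one object per item. The titles below are only placeholders; the real values will also contain the extra whitespace discussed next:

[
    {"title": "\n        Example Book Title One        "},
    {"title": "\n        Example Book Title Two        "}
]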

There are some extra tabs and spaces in the titles, but we got the names of the books. This is the minimal structure needed to create a crawler, but you might be wondering: we are just scraping the index page, so how do we make it recursively crawl a whole website? That is a great question, and we'll answer it in the next section.
