Starting a project

Now that Scrapy is installed, we can run the startproject command to generate the default structure for this project. To do this, open the terminal and navigate to the directory where you want to store your Scrapy project, and then run scrapy startproject <project name>. Here, we will use example for the project name:

$ scrapy startproject example
$ cd example

Here are the files generated by the scrapy command:

    scrapy.cfg
    example/
        __init__.py  
        items.py  
        pipelines.py  
        settings.py  
        spiders/
            __init__.py

The important files for this chapter are as follows:

  • items.py: This file defines a model of the fields that will be scraped
  • settings.py: This file defines settings, such as the user agent and crawl delay
  • spiders/: The actual scraping and crawling code is stored in this directory

Additionally, Scrapy uses scrapy.cfg for project configuration and pipelines.py to process the scraped items, but neither needs to be modified in this example.
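
For reference, an item pipeline is simply a class with a process_item() method that Scrapy calls for each scraped item. A minimal, purely illustrative sketch of what could live in example/pipelines.py is shown below; again, no pipeline is needed for this example, and a pipeline must also be enabled through the ITEM_PIPELINES setting before Scrapy will use it:

class ExamplePipeline(object):
    """Illustrative only: receive each scraped item and return it unchanged."""

    def process_item(self, item, spider):
        # Cleaning, validation, or storage logic would go here
        return item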

Defining a model

By default, example/items.py contains the following code:

import scrapy

class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

The ExampleItem class is a template that needs to be replaced with a definition of how we want to store the scraped country details when the spider is run. To keep the focus on what Scrapy does, we will scrape just the country name and population, rather than all the country details. Here is an updated model to support this:

class ExampleItem(scrapy.Item):
    name = scrapy.Field()
    population = scrapy.Field()
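
Once defined, an item behaves much like a Python dictionary whose keys are restricted to the declared fields, so scraped values can be set and read by key. Here is a quick illustration with hard-coded values, just to show the interface; the real values are filled in by the spider later in this chapter:

item = ExampleItem()
item['name'] = 'United Kingdom'       # keys must match the declared fields
item['population'] = '62,348,447'
print item['name'], item['population']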

Note

Full documentation about defining models is available at http://doc.scrapy.org/en/latest/topics/items.html.

Creating a spider

Now, we will build the actual crawling and scraping code, known as a spider in Scrapy. An initial template can be generated with the genspider command, which takes the name you want to call the spider, the domain, and optionally, a template:

$ scrapy genspider country example.webscraping.com --template=crawl

Here, the built-in crawl template was used to generate an initial version closer to what we need for crawling the country pages. After running the genspider command, the following code will have been generated in example/spiders/country.py:

import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from example.items import ExampleItem

class CountrySpider(CrawlSpider):
    name = 'country'
    start_urls = ['http://www.example.webscraping.com/']
    allowed_domains = ['example.webscraping.com']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = ExampleItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

The initial lines import the required Scrapy libraries, including the ExampleItem model defined in the Defining a model section. Then, a class is created for the spider, which contains a number of class attributes, as follows:

  • name: This attribute is a string that identifies the spider
  • start_urls: This attribute is a list of URLs to start the crawl from. However, the generated default is not what we want here, because genspider prepended www to the example.webscraping.com domain
  • allowed_domains: This attribute is a list of the domains that can be crawled; if this is not defined, any domain can be crawled
  • rules: This attribute is a tuple of Rule objects, each using a regular expression to tell the crawler which links to follow

Each rule can also define a callback function to parse the responses of the downloads it matches, and the generated parse_item() method shows an example of how to scrape data from a response.

Scrapy is a high-level framework, so there is a lot going on here in these few lines of code. The official documentation has further details about building spiders, and can be found at http://doc.scrapy.org/en/latest/topics/spiders.html.

Tuning settings

Before running the generated spider, the Scrapy settings should be updated to avoid the spider being blocked. By default, Scrapy allows up to eight concurrent downloads per domain with no delay between them, which is much faster than a real user would browse and so is straightforward for a server to detect. As mentioned in the Preface, the example website we are scraping is configured to temporarily block crawlers that consistently download faster than one request per second, so the default settings would cause our spider to be blocked. Unless you are running the example website locally, I recommend adding these lines to example/settings.py so that the crawler downloads only a single request at a time per domain, with a delay between downloads:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 5

Note that Scrapy will not use this exact delay between requests, because a fixed delay would also make a crawler easier to detect and block; instead, it adds a random offset to the delay. For details about these settings and the many others available, refer to http://doc.scrapy.org/en/latest/topics/settings.html.
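
For example, the randomization mentioned above is controlled by the RANDOMIZE_DOWNLOAD_DELAY setting, which is enabled by default and makes Scrapy wait between roughly 0.5 and 1.5 times DOWNLOAD_DELAY. Here is a sketch of the relevant lines in example/settings.py, where the first two are the settings we just added:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 5
# Enabled by default: wait a random interval between 0.5 * DOWNLOAD_DELAY
# and 1.5 * DOWNLOAD_DELAY so the request timing is less predictable
RANDOMIZE_DOWNLOAD_DELAY = True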

Testing the spider

To run a spider from the command line, the crawl command is used along with the name of the spider:

$ scrapy crawl country -s LOG_LEVEL=ERROR
[country] ERROR: Error downloading <GET http://www.example.webscraping.com/>: DNS lookup failed: address 'www.example.webscraping.com' not found: [Errno -5] No address associated with hostname.

As expected, the crawl failed on this default spider because http://www.example.webscraping.com does not exist. Take note of the -s LOG_LEVEL=ERROR flag—this is a Scrapy setting and is equivalent to defining LOG_LEVEL = 'ERROR' in the settings.py file. By default, Scrapy will output all log messages to the terminal, so here the log level was raised to isolate the error messages.
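
The same effect can be made permanent by adding a single line to example/settings.py instead of passing the setting on each run:

# Equivalent to passing -s LOG_LEVEL=ERROR on the command line
LOG_LEVEL = 'ERROR'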

Here is an updated version of the spider to correct the starting URL and set which web pages to crawl:

    start_urls = ['http://example.webscraping.com/']

    rules = (
        Rule(LinkExtractor(allow='/index/'), follow=True),
        Rule(LinkExtractor(allow='/view/'), callback='parse_item')
    )

The first rule will crawl the index pages and follow their links, and then the second rule will crawl the country pages and pass the downloaded response to the callback function for scraping. Let us see what happens when this spider is run with the log level set to DEBUG to show all messages:

$ scrapy crawl country -s LOG_LEVEL=DEBUG
...
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/> 
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/index/1> 
[country] DEBUG: Filtered duplicate request: <GET http://example.webscraping.com/index/1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Antigua-and-Barbuda-10> 
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/user/login?_next=%2Findex%2F1> 
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/user/register?_next=%2Findex%2F1> 
...

This log output shows that the index pages and country pages are being crawled and that duplicate links are filtered, which is handy. However, the spider wastes resources by also crawling the login and registration forms linked from each web page, because their URLs match the regular expressions in the rules. The login URL in the preceding output ends with _next=%2Findex%2F1, which is the URL-encoded form of _next=/index/1 and tells the server where to redirect after the user logs in (a quick way to verify this decoding is shown after the updated rules). To prevent crawling these URLs, we can use the deny parameter of LinkExtractor, which also takes a regular expression and excludes all matching URLs. Here is an updated version of the rules that avoids crawling the user login and registration forms by excluding URLs containing /user/:

    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item')
    )
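
As an aside, the URL decoding described earlier is easy to verify with Python's standard library (a quick sanity check in Python 2, not part of the spider):

import urllib

# %2F is the percent-encoded form of '/', so this prints /index/1
print urllib.unquote('%2Findex%2F1')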

Note

Further documentation about how to use this class is available at http://doc.scrapy.org/en/latest/topics/link-extractors.html.

Scraping with the shell command

Now that Scrapy can crawl the country pages, we need to define what data should be scraped. To help test how to extract data from a web page, Scrapy provides a handy shell command that downloads a URL and loads the resulting state into an interactive Python shell. Here are the results for a sample country:

$ scrapy shell http://example.webscraping.com/view/United-Kingdom-239
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1475da5390>
[s]   item       {}
[s]   request    <GET http://example.webscraping.com/view/United-Kingdom-239>
[s]   response   <200 http://example.webscraping.com/view/United-Kingdom-239>
[s]   settings   <scrapy.settings.Settings object at 0x7f147c1fb490>
[s]   spider     <Spider 'default' at 0x7f147552eb90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

We can now query these objects to check what data is available.

In [1]: response.url
'http://example.webscraping.com/view/United-Kingdom-239'
In [2]: response.status
200

Scrapy uses lxml to scrape data, so we can use the same CSS selectors as those in Chapter 2, Scraping the Data:

In [3]: response.css('tr#places_country__row td.w2p_fw::text')
[<Selector xpath=u"descendant-or-self::
    tr[@id = 'places_country__row']/descendant-or-self::
    */td[@class and contains(
    concat(' ', normalize-space(@class), ' '),
    ' w2p_fw ')]/text()" data=u'United Kingdom'>]

This query returns a list of Scrapy selectors (built on lxml); to extract the matched text, the extract() method needs to be called:

In [4]: name_css = 'tr#places_country__row td.w2p_fw::text'
In [5]: response.css(name_css).extract()
[u'United Kingdom']
In [6]: pop_css = 'tr#places_population__row td.w2p_fw::text'
In [7]: response.css(pop_css).extract()
[u'62,348,447']
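
Note that extract() returns a list of matching strings. The final spider in this chapter stores these lists as-is, but if single strings are preferred, one option is to take the first match (a small variation, not used in the code below):

name = response.css(name_css).extract()[0]        # u'United Kingdom'
population = response.css(pop_css).extract()[0]   # u'62,348,447'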

These CSS selectors can then be used in the parse_item() method generated earlier in example/spiders/country.py:

def parse_item(self, response):
    item = ExampleItem()
    name_css = 'tr#places_country__row td.w2p_fw::text'
    item['name'] = response.css(name_css).extract()
    pop_css = 'tr#places_population__row td.w2p_fw::text'
    item['population'] = response.css(pop_css).extract()
    return item

Checking results

Here is the completed version of our spider:

class CountrySpider(CrawlSpider):
    name = 'country'
    start_urls = ['http://example.webscraping.com/']
    allowed_domains = ['example.webscraping.com']
    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item')
    )

    def parse_item(self, response):
        item = ExampleItem()
        name_css = 'tr#places_country__row td.w2p_fw::text'
        item['name'] = response.css(name_css).extract()
        pop_css = 'tr#places_population__row td.w2p_fw::text'
        item['population'] = response.css(pop_css).extract()
        return item

To save the results, we could add extra code to the parse_item() method to write the scraped country data, or perhaps define a pipeline. However, this work is not necessary, because Scrapy provides a handy --output flag to save scraped items automatically in CSV, JSON, or XML format. Here are the results when the final version of the spider is run with the results saved to a CSV file and the log level set to INFO to filter out less important messages:

$ scrapy crawl country --output=countries.csv -s LOG_LEVEL=INFO 
[scrapy] INFO: Scrapy 0.24.4 started (bot: example)
[country] INFO: Spider opened
[country] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[country] INFO: Crawled 10 pages (at 10 pages/min), scraped 9 items (at 9 items/min)
...
[country] INFO: Crawled 264 pages (at 10 pages/min), scraped 238 items (at 9 items/min)
[country] INFO: Crawled 274 pages (at 10 pages/min), scraped 248 items (at 10 items/min)
[country] INFO: Closing spider (finished)
[country] INFO: Stored csv feed (252 items) in: countries.csv
[country] INFO: Dumping Scrapy stats:
  {'downloader/request_bytes': 155001,
   'downloader/request_count': 279,
   'downloader/request_method_count/GET': 279,
   'downloader/response_bytes': 943190,
   'downloader/response_count': 279,
   'downloader/response_status_count/200': 279,
   'dupefilter/filtered': 61,
   'finish_reason': 'finished',
   'item_scraped_count': 252,
   'log_count/INFO': 36,
   'request_depth_max': 26,
   'response_received_count': 279,
   'scheduler/dequeued': 279,
   'scheduler/dequeued/memory': 279,
   'scheduler/enqueued': 279,
   'scheduler/enqueued/memory': 279}
[country] INFO: Spider closed (finished)

At the end of the crawl, Scrapy outputs some statistics to give an indication of how the crawl performed. From these statistics, we know that 279 web pages were crawled and 252 items were scraped, which is the expected number of countries in the database, so we know that the crawler was able to find them all.

To verify that these countries were scraped correctly, we can check the contents of countries.csv:

name,population
Afghanistan,"29,121,286"
Antigua and Barbuda,"86,754"
Antarctica,0
Anguilla,"13,254"
Angola,"13,068,161"
Andorra,"84,000"
American Samoa,"57,881"
Algeria,"34,586,184"
Albania,"2,986,952"
Aland Islands,"26,711"
...

As expected, this CSV file contains the name and population of each country. Scraping this data required less code than the original crawler built in Chapter 2, Scraping the Data, because Scrapy provides a lot of high-level functionality. In the following section on Portia, we will re-implement this scraper with even less code.

Interrupting and resuming a crawl

Sometimes when scraping a website, it can be useful to pause the crawl and resume it later without needing to start over from the beginning. For example, you may need to interrupt the crawl to reset your computer after a software update, or perhaps the website you are crawling is returning errors and you want to continue the crawl later. Conveniently, Scrapy comes with built-in support for pausing and resuming crawls without needing to modify our example spider. To enable this feature, we just need to define the JOBDIR setting as the directory where the current state of a crawl is saved. Note that separate directories must be used to save the state of separate crawls. Here is an example using this feature with our spider:

$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=crawls/country
...
[country] DEBUG: Scraped from <200 http://example.webscraping.com/view/Afghanistan-1>
  {'name': [u'Afghanistan'], 'population': [u'29,121,286']}
^C  [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force 
[country] INFO: Closing spider (shutdown)
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Antigua-and-Barbuda-10> (referer: http://example.webscraping.com/)
[country] DEBUG: Scraped from <200 http://example.webscraping.com/view/Antigua-and-Barbuda-10>
  {'name': [u'Antigua and Barbuda'], 'population': [u'86,754']}
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Antarctica-9> (referer: http://example.webscraping.com/)
[country] DEBUG: Scraped from <200 http://example.webscraping.com/view/Antarctica-9>
  {'name': [u'Antarctica'], 'population': [u'0']}
...
[country] INFO: Spider closed (shutdown)

Here, we see that ^C (Ctrl + C) was used to send the termination signal, and that the spider finished processing a few items before terminating. For Scrapy to save the crawl state, you must wait for the crawl to shut down gracefully and resist the temptation to press Ctrl + C again to force immediate termination! The state of the crawl is saved in crawls/country, and the crawl can be resumed by running the same command:

$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=crawls/country
...
[country] INFO: Resuming crawl (12 requests scheduled)
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Anguilla-8> (referer: http://example.webscraping.com/)
[country] DEBUG: Scraped from <200 http://example.webscraping.com/view/Anguilla-8>
  {'name': [u'Anguilla'], 'population': [u'13,254']}
[country] DEBUG: Crawled (200) <GET http://example.webscraping.com/view/Angola-7> (referer: http://example.webscraping.com/)
[country] DEBUG: Scraped from <200 http://example.webscraping.com/view/Angola-7>
  {'name': [u'Angola'], 'population': [u'13,068,161']}
...

The crawl now resumes from where it was paused and continues as normal. This feature is not particularly useful for our example website because the number of pages to download is very small. However, for larger websites that take months to crawl, being able to pause and resume crawls is very convenient.
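
If you would rather not pass JOBDIR on the command line for every run, the same setting can be placed in example/settings.py instead (assuming a single saved crawl state for this project):

# Persist the crawl state so the spider can be paused and resumed
JOBDIR = 'crawls/country'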

Note that there are some edge cases not covered here that can cause problems when resuming a crawl, such as expiring cookies, which are mentioned in the Scrapy documentation available at http://doc.scrapy.org/en/latest/topics/jobs.html.
