Running and exporting

We now need to run the Spider and collect the data for the item fields from the provided URLs. We can start the Spider from the command line by issuing the scrapy crawl quotes command, as seen in the following screenshot:

Running a Spider (scrapy crawl quotes)
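
Alternatively, a Spider can be started from a Python script instead of the command line, using Scrapy's CrawlerProcess class. The following is a minimal sketch of that approach (it is not shown in the screenshot); it assumes the script is run from inside the project directory so that get_project_settings() can locate the project's scrapy.cfg:

```python
# run_quotes.py -- a minimal sketch; run it from inside the project
# directory so that get_project_settings() can find scrapy.cfg.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('quotes')  # same Spider name as used with "scrapy crawl"
process.start()          # blocks until the crawl is finished
```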

The crawl argument to the scrapy command is followed by the Spider name (quotes). A successful run of the command prints information about Scrapy, the bot, the Spider, crawling statistics, and HTTP methods, and lists the scraped item data as dictionaries.
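
For reference, a Spider producing the log output shown below might look like the following minimal sketch. The selectors, field names, and print statements here are assumptions inferred from the item fields and messages visible in the log, not the exact project code:

```python
# A minimal sketch of the quotes Spider; selectors and field names are
# assumptions based on the item fields seen in the log output below.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        print('Response Type >>> ', type(response))
        # Each quote block on the page yields one item dictionary.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'author_link': response.urljoin(
                    quote.css('span a::attr(href)').extract_first()),
                'tags': quote.css('a.tag::text').extract(),
            }
        # Follow pagination until no "next" link is found.
        next_page = response.css('li.next a::attr(href)').extract_first()
        print('Next Page URL: ', next_page)
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
```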

While executing a Spider, we receive various kinds of log information, such as INFO and DEBUG messages and Scrapy statistics, as found in the following output:

...[scrapy] INFO: Scrapy 1.0.3 started (bot: Quotes)
...[scrapy] INFO: Optional features available: ssl, http11, boto
...[scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Quotes.spiders', 'SPIDER_MODULES': ['Quotes.spiders'], 'BOT_NAME': 'Quotes'}
.......
...[scrapy] INFO: Enabled item pipelines:
...[scrapy] INFO: Spider opened
...[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
...[scrapy] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/> from <GET http://quotes.toscrape.com>

[scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
('Response Type >>> ', <class 'scrapy.http.response.html.HtmlResponse'>)
.......
.......
('Response Type >>> ', <class 'scrapy.http.response.html.HtmlResponse'>)
...[scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'J.K. Rowling',
.......
...[scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/5/>
{'author': u'James Baldwin',
'author_link': u'http://quotes.toscrape.com/author/James-Baldwin',
.....
('Next Page URL: ', u'/page/6/')
.......
.......
Completed
..
.[scrapy] INFO: Closing spider (finished)

The Scrapy statistics are as follows:

[scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3316,
'downloader/request_count': 13,
'downloader/request_method_count/GET': 13,
'downloader/response_bytes': 28699,
'downloader/response_count': 13,
'downloader/response_status_count/200': 11,
'downloader/response_status_count/301': 2,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(.....
'item_scraped_count': 110,
'log_count/DEBUG': 126,
'log_count/ERROR': 2,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'request_depth_max': 8,
'response_received_count': 11,
'scheduler/dequeued': 13,
'scheduler/dequeued/memory': 13,
'scheduler/enqueued': 13,
'scheduler/enqueued/memory': 13,
'start_time': datetime.datetime(....
..... [scrapy] INFO: Spider closed (finished)
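
These statistics come from Scrapy's stats collector and can also be read programmatically, for instance from a Spider's closed() method, which Scrapy calls when the crawl finishes. The following minimal sketch assumes this method is added to the quotes Spider; the keys it logs match those in the dump above:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # ... start_urls, parse(), and so on, as before ...

    def closed(self, reason):
        # Called automatically when the Spider finishes; the stats
        # collector holds the same values dumped in the log above.
        stats = self.crawler.stats.get_stats()
        self.logger.info('Items scraped: %s',
                         stats.get('item_scraped_count'))
        self.logger.info('Finish reason: %s', reason)
```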

We can also run the Spider and save the scraped item data to external files. Data is exported to files for ease of access, usage, sharing, and management.

With Scrapy, we can export scraped data to external files using the crawl command, as seen in the following list:

  • To extract data to a CSV file, we can use the C:\ScrapyProjects\Quotes> scrapy crawl quotes -o quotes.csv command, as seen in the following screenshot:

Contents from file quotes.csv

  • To extract data to a JSON file, we can use the C:\ScrapyProjects\Quotes> scrapy crawl quotes -o quotes.json command, as seen in the following screenshot:

Contents from file quotes.json

The file named after the -o parameter is generated inside the main project folder. Please refer to the official Scrapy documentation on feed exports at http://docs.scrapy.org/en/latest/topics/feed-exports.html for more detailed information and for the file types that can be used to export data.
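
Instead of passing -o on every run, the export can also be configured once in the project's settings.py, using the FEED_URI and FEED_FORMAT settings supported by this Scrapy release. The following is a minimal sketch with illustrative values:

```python
# Quotes/settings.py (excerpt) -- the values below are illustrative.
# With these in place, a plain "scrapy crawl quotes" writes the feed
# without needing the -o option on the command line.
FEED_URI = 'quotes.json'   # output file, relative to where crawl runs
FEED_FORMAT = 'json'       # for example: json, jsonlines, csv, xml
```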

In this section, we learned about Scrapy and used it to create a Spider to scrape data and export the scraped data to external files. In the next section, we will deploy the crawler on the web.
