Automated scraping with Scrapely

To scrape the annotated fields, Portia uses a library called Scrapely, an open-source tool developed independently of Portia and available at https://github.com/scrapy/scrapely. Scrapely uses training data to build a model of what to scrape from a web page, and this model can then be applied to other web pages with the same structure. Here is an example to show how it works:

(portia_example)$ python
>>> from scrapely import Scraper
>>> s = Scraper()
>>> train_url = 'http://example.webscraping.com/view/Afghanistan-1'
>>> s.train(train_url, {'name': 'Afghanistan', 'population': '29,121,286'})
>>> test_url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> s.scrape(test_url)
[{u'name': [u'United Kingdom'], u'population': [u'62,348,447']}]

First, Scrapely is given the data we want to scrape from the Afghanistan web page, namely the country name and population, to train the model. The model is then applied to a different country page, where Scrapely again returns the correct country name and population.
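To make the idea of training by example more concrete, here is a minimal sketch written with only the standard library. It is not Scrapely's actual algorithm: it simply remembers the characters surrounding each example value in the training page and looks for the same surroundings in a new page. The function names, the `context` parameter, and the `PAGE` markup are all made up for illustration.

```python
def train(html, examples, context=20):
    """Record the text surrounding each example value in the training page."""
    template = {}
    for field, value in examples.items():
        start = html.find(value)
        if start == -1:
            raise ValueError('%r not found in training page' % value)
        end = start + len(value)
        # Remember a few characters on either side of the value.
        template[field] = (html[max(0, start - context):start],
                           html[end:end + context])
    return template


def scrape(html, template):
    """Extract the text found between each learned prefix and suffix."""
    record = {}
    for field, (prefix, suffix) in template.items():
        start = html.find(prefix)
        if start == -1:
            continue  # prefix missing: layout differs from the training page
        start += len(prefix)
        end = html.find(suffix, start)
        if end != -1:
            record[field] = html[start:end]
    return record


# Hypothetical markup standing in for two pages that share one layout.
PAGE = ('<html><body><h1 class="name">%s</h1>'
        '<p class="population">%s</p></body></html>')

train_page = PAGE % ('Afghanistan', '29,121,286')
test_page = PAGE % ('United Kingdom', '62,348,447')

template = train(train_page, {'name': 'Afghanistan',
                              'population': '29,121,286'})
print(scrape(test_page, template))
```

Even this toy version shows why the approach depends on consistent page structure: if the markup around a value differs between pages, the learned prefix is not found and the field is silently skipped.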

This workflow makes it possible to scrape web pages without knowing their structure; only the content to extract from a training page is needed. The approach is particularly useful when the content of a web page is static but the layout changes. For example, on a news website, the text of a published article will most likely not change, though the layout may be updated. In this case, Scrapely can be retrained on the same data to generate a model for the new website structure.

The example web page used here to test Scrapely is well structured, with separate tags and attributes for each data type, so Scrapely was able to train a model correctly. However, for more complex web pages, Scrapely can fail to locate the content correctly, and so its documentation warns that you should "train with caution". Perhaps a more robust automated web scraping library will be released in the future but, for now, it is still necessary to know how to scrape a website directly using the techniques covered in Chapter 2, Scraping the Data.
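One way to act on this caution is to sanity-check the scraped records before trusting them. The following is a minimal sketch, assuming records shaped like the output shown earlier (a dictionary mapping each field to a list of strings). The `invalid_fields` helper and the patterns in `EXPECTED` are hypothetical and not part of Scrapely; they would need adjusting to the data you expect.

```python
import re

# Hypothetical per-field patterns describing what valid values look like.
EXPECTED = {
    'name': re.compile(r"^[A-Za-z][A-Za-z .'-]*$"),
    'population': re.compile(r'^\d{1,3}(,\d{3})*$'),
}


def invalid_fields(record, expected=EXPECTED):
    """Return the names of fields that are missing or fail their pattern."""
    bad = []
    for field, pattern in sorted(expected.items()):
        values = record.get(field) or []
        if not values or not all(pattern.match(v.strip()) for v in values):
            bad.append(field)
    return bad


# A plausible record, and one where the model grabbed a label by mistake.
good_record = {'name': ['United Kingdom'], 'population': ['62,348,447']}
bad_record = {'name': ['United Kingdom'], 'population': ['Population:']}

print(invalid_fields(good_record))
print(invalid_fields(bad_record))
```

A record that fails such a check suggests the trained model has located the wrong part of the page, which is a cue to retrain or to fall back to a hand-written scraper.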
