Gap

Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:

Sitemap: http://www.gap.com/products/sitemap_index.xml
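The Sitemap entry can also be read programmatically rather than by eye. The following is a minimal sketch, assuming Python 3.8 or later (where RobotFileParser.site_maps() is available) and a small made-up robots.txt in which only the Sitemap line is taken from the text above:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. Only the Sitemap line comes from the text;
# the User-agent and Disallow lines are invented for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.gap.com/products/sitemap_index.xml
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is needed here
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of Sitemap URLs, or None when there are none
sitemaps = parser.site_maps() or []
print(sitemaps)
```

Against a live site, the same parser could be pointed at the robots.txt URL with set_url() and read(), at the cost of a network request.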

Here are the contents of the linked Sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>

As shown here, this Sitemap link is just an index that contains links to other Sitemap files. These other Sitemap files in turn contain links to thousands of product categories, such as http://www.gap.com/products/blue-long-sleeve-shirts-for-men.jsp.
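To make the index structure concrete, here is a small sketch that pulls the loc entries out of the Sitemap index shown earlier. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; the function name sitemap_locs is a hypothetical helper, not part of the crawler from this book:

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemap protocol, as declared in the index file
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def sitemap_locs(xml_bytes):
    """Return the text of every <loc> element in a Sitemap or Sitemap index."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip()
            for loc in root.iterfind('.//sm:loc', SITEMAP_NS)
            if loc.text]

# The Sitemap index from the text, passed as bytes
index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>"""

locs = sitemap_locs(index_xml)
print(locs)
```

Looking up loc by its namespaced tag name works for both index files and ordinary Sitemap files, because both store each URL in a loc element.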

There is a lot of content to crawl here, so we will use the threaded crawler developed in Chapter 4, Concurrent Downloading. You may recall that this crawler supports an optional callback for defining how to parse the downloaded web page. Here is a callback to crawl the Gap Sitemap link:

from lxml import etree
from threaded_crawler import threaded_crawler

def scrape_callback(url, html):
    if url.endswith('.xml'):
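        # Parse the sitemap XML file; html must be bytes here, because lxml
        # rejects unicode strings that carry an XML encoding declaration
        tree = etree.fromstring(html)
        # Each child (<sitemap> or <url>) is assumed to have <loc> as its first element
        links = [e[0].text for e in tree]
        return links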
    else:
        # Add scraping code here
        pass

This callback first checks the downloaded URL extension. If the extension is .xml, the downloaded URL is for a Sitemap file, and the lxml etree module is used to parse the XML and extract the links from it. Otherwise, this is a category URL, although this example does not implement scraping the category. Now we can use this callback with the threaded crawler to crawl gap.com:

>>> from threaded_crawler import threaded_crawler
>>> sitemap = 'http://www.gap.com/products/sitemap_index.xml'
>>> threaded_crawler(sitemap, scrape_callback=scrape_callback)
Downloading: http://www.gap.com/products/sitemap_1.xml
Downloading: http://www.gap.com/products/sitemap_2.xml
Downloading: http://www.gap.com/products/cable-knit-beanie-P987537.jsp
Downloading: http://www.gap.com/products/2-in-1-stripe-tee-P987544.jsp
Downloading: http://www.gap.com/products/boyfriend-jeans-2.jsp
...

As expected, the Sitemap files were downloaded first, followed by the clothing categories.
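The download order follows from how the callback's return value feeds the crawl queue. The sketch below illustrates that loop with a simplified single-threaded stand-in, not the threaded crawler from Chapter 4. The names simple_crawl, demo_callback, and FAKE_PAGES are hypothetical, the example.com pages are invented, and the standard library parser is used in place of lxml:

```python
from collections import deque
import xml.etree.ElementTree as ET

# Namespace prefix for Sitemap elements, in ElementTree's {uri}tag form
NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# Hypothetical in-memory "site" standing in for real downloads
FAKE_PAGES = {
    'http://example.com/sitemap_index.xml': (
        b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b'<sitemap><loc>http://example.com/sitemap_1.xml</loc></sitemap>'
        b'</sitemapindex>'),
    'http://example.com/sitemap_1.xml': (
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b'<url><loc>http://example.com/a.jsp</loc></url>'
        b'<url><loc>http://example.com/b.jsp</loc></url>'
        b'</urlset>'),
    'http://example.com/a.jsp': b'<html>a</html>',
    'http://example.com/b.jsp': b'<html>b</html>',
}

def demo_callback(url, html):
    """Return links found in Sitemap files; other pages yield no links."""
    if url.endswith('.xml'):
        root = ET.fromstring(html)
        return [loc.text for loc in root.iter(NS + 'loc')]
    return []

def simple_crawl(seed_url, fetch, callback):
    """Breadth-first crawl: links returned by the callback join the queue."""
    queue = deque([seed_url])
    seen = {seed_url}
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in callback(url, fetch(url)) or []:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = simple_crawl('http://example.com/sitemap_index.xml',
                       FAKE_PAGES.__getitem__, demo_callback)
print(visited)
```

Because the queue is processed first in, first out, every Sitemap file is fetched before the pages it lists, which mirrors the ordering in the output above. A threaded crawler may interleave downloads, so the exact order there is not guaranteed.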
