Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:
Sitemap: http://www.gap.com/products/sitemap_index.xml
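As a minimal sketch of how such a link could be located programmatically, the following function scans the text of a robots.txt file for Sitemap lines. The function name find_sitemaps and the sample robots.txt body (including its Disallow rule) are assumptions made for illustration; they are not taken from the Gap website or from the earlier chapters.

```python
def find_sitemaps(robots_txt):
    """Return the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace
        line = line.split('#', 1)[0].strip()
        # Split only on the first colon so the 'http://' in the URL
        # is left intact; the field name is matched case-insensitively
        field, _, value = line.partition(':')
        if field.strip().lower() == 'sitemap' and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Hypothetical robots.txt body, for illustration only
sample = """User-agent: *
Disallow: /checkout/
Sitemap: http://www.gap.com/products/sitemap_index.xml
"""
print(find_sitemaps(sample))
```

A file may list several Sitemap lines, which is why the function returns a list rather than a single URL.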
Here are the contents of the linked Sitemap file:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.gap.com/products/sitemap_1.xml</loc>
    <lastmod>2015-03-03</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.gap.com/products/sitemap_2.xml</loc>
    <lastmod>2015-03-03</lastmod>
  </sitemap>
</sitemapindex>
As shown here, this Sitemap link is just an index and contains links to other Sitemap files. These other Sitemap files then contain links to thousands of product categories, such as http://www.gap.com/products/blue-long-sleeve-shirts-for-men.jsp.
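The index-then-Sitemap structure can be walked with a single helper, because both kinds of file store their links in loc elements. The sketch below uses the standard library xml.etree.ElementTree module rather than lxml so that it runs without extra dependencies; the helper name extract_locs is an assumption for illustration, and the sample data is the index file shown earlier.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the Sitemap index shown earlier
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a Sitemap or Sitemap index."""
    root = ET.fromstring(xml_bytes)
    # Match <loc> at any depth, qualified with the Sitemap namespace
    return [loc.text.strip() for loc in root.iterfind('.//sm:loc', NS) if loc.text]


index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.gap.com/products/sitemap_1.xml</loc>
    <lastmod>2015-03-03</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.gap.com/products/sitemap_2.xml</loc>
    <lastmod>2015-03-03</lastmod>
  </sitemap>
</sitemapindex>"""

for url in extract_locs(index_xml):
    print(url)
```

Searching by the qualified loc name avoids relying on the position of the element inside each entry, at the cost of having to spell out the namespace.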
There is a lot of content to crawl here, so we will use the threaded crawler developed in Chapter 4, Concurrent Downloading. You may recall that this crawler supports an optional callback for defining how to parse the downloaded web page. Here is a callback to crawl the Gap Sitemap link:
from lxml import etree
from threaded_crawler import threaded_crawler

def scrape_callback(url, html):
    if url.endswith('.xml'):
        # Parse the sitemap XML file
        tree = etree.fromstring(html)
        links = [e[0].text for e in tree]
        return links
    else:
        # Add scraping code here
        pass
This callback first checks the downloaded URL extension. If the extension is .xml, the downloaded URL is for a Sitemap file, and the lxml etree module is used to parse the XML and extract the links from it. Otherwise, this is a category URL, although this example does not implement scraping the category. Now we can use this callback with the threaded crawler to crawl gap.com:
>>> from threaded_crawler import threaded_crawler
>>> sitemap = 'http://www.gap.com/products/sitemap_index.xml'
>>> threaded_crawler(sitemap, scrape_callback=scrape_callback)
Downloading: http://www.gap.com/products/sitemap_1.xml
Downloading: http://www.gap.com/products/sitemap_2.xml
Downloading: http://www.gap.com/products/cable-knit-beanie-P987537.jsp
Downloading: http://www.gap.com/products/2-in-1-stripe-tee-P987544.jsp
Downloading: http://www.gap.com/products/boyfriend-jeans-2.jsp
...
As expected, the Sitemap files were downloaded first, followed by the clothing categories.
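This ordering follows from how links are discovered: a category URL only becomes known once the Sitemap file that lists it has been downloaded and parsed. The following single-threaded sketch illustrates the idea under stated assumptions. It is not the threaded crawler from Chapter 4; the crawl function, the FAKE_SITE dictionary, the make_sitemap helper, and the example.com URLs are all hypothetical, and the callback uses the standard library XML parser in place of lxml so that the example runs offline.

```python
from collections import deque
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def make_sitemap(root_tag, item_tag, urls):
    """Build a tiny Sitemap document (test data only)."""
    items = ''.join('<%s><loc>%s</loc></%s>' % (item_tag, u, item_tag) for u in urls)
    return '<%s xmlns="%s">%s</%s>' % (root_tag, NS['sm'], items, root_tag)


# Hypothetical in-memory "website" standing in for real downloads
FAKE_SITE = {
    'http://example.com/sitemap_index.xml': make_sitemap(
        'sitemapindex', 'sitemap', ['http://example.com/sitemap_1.xml']),
    'http://example.com/sitemap_1.xml': make_sitemap(
        'urlset', 'url', ['http://example.com/a.jsp', 'http://example.com/b.jsp']),
    'http://example.com/a.jsp': '<html></html>',
    'http://example.com/b.jsp': '<html></html>',
}


def scrape_callback(url, html):
    """Return the links found in Sitemap files; nothing for other pages."""
    if url.endswith('.xml'):
        root = ET.fromstring(html)
        return [loc.text for loc in root.iterfind('.//sm:loc', NS)]
    return None


def crawl(seed_url, download, scrape_callback):
    """Visit URLs in first-in, first-out order, queuing links from the callback."""
    queue = deque([seed_url])
    seen = {seed_url}
    order = []
    while queue:
        url = queue.popleft()
        html = download(url)
        order.append(url)
        # The callback may return None, so fall back to an empty list
        for link in scrape_callback(url, html) or []:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


order = crawl('http://example.com/sitemap_index.xml', FAKE_SITE.get, scrape_callback)
for url in order:
    print('Downloading:', url)
```

The seen set prevents a URL that is listed in more than one Sitemap file from being downloaded twice.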