Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 3. Starting to Crawl

So far, the examples in the book have covered single static pages, with somewhat artificial canned examples. In this chapter, we’ll start looking at some real-world problems, with scrapers traversing multiple pages and even multiple sites.

Web crawlers are called such because they crawl across the Web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.

Beware, however: just because you can crawl the Web doesn’t mean that you always should. The scrapers used in previous examples work great in situations where all the data you need is on a single page. With web crawlers, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine if there’s a way to make the target server’s load easier.

Traversing a Single Domain

Even if you haven’t heard of “Six Degrees of Wikipedia,” you’ve almost certainly heard of its namesake, “Six Degrees of Kevin Bacon.” In both games, the goal is to link two unlikely subjects (in the first case, Wikipedia articles that link to each other, in the second case, actors appearing in the same film) by a chain containing no more than six total (including the two original subjects).

For example, Eric Idle appeared in Dudley Do-Right with Brendan Fraser, who appeared in The Air I Breathe with Kevin Bacon.¹ In this case, the chain from Eric Idle to Kevin Bacon is only three subjects long.

In this section, we’ll begin a project that will become a “Six Degrees of Wikipedia" solution finder. That is, we’ll be able to take the Eric Idle page and find the fewest number of link clicks that will take us to the Kevin Bacon page.

You should already know how to write a Python script that retrieves an arbitrary Wikipedia page and produces a list of links on that page:

from urllib.request import urlopen
from bs4 import BeautifulSoup 

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

If you look at the list of links produced, you’ll notice that all the articles you’d expect are there: “Apollo 13,” “Philadelphia,” “Primetime Emmy Award,” and so on. However, there are some things that we don’t want as well:

//wikimediafoundation.org/wiki/Privacy_policy
//en.wikipedia.org/wiki/Wikipedia:Contact_us

In fact, Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to the category pages, talk pages, and other pages that do not contain different articles:

/wiki/Category:Articles_with_unsourced_statements_from_April_2014
/wiki/Talk:Kevin_Bacon

Recently a friend of mine, while working on a similar Wikipedia-scraping project, mentioned he had written a very large filtering function, with over 100 lines of code, in order to determine whether an internal Wikipedia link was an article page or not. Unfortunately, he had not spent much time up front trying to find patterns between “article links” and “other links,” or he might have discovered the trick. If you examine the links that point to article pages (as opposed to other internal pages), they all have three things in common:

They reside within the div with the id set to bodyContent
The URLs do not contain semicolons
The URLs begin with /wiki/

We can use these rules to revise the code slightly to retrieve only the desired article links:

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a", 
                      href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

If you run this, you should see a list of all article URLs that the Wikipedia article on Kevin Bacon links to.

Of course, having a script that finds all article links in one, hardcoded Wikipedia article, while interesting, is fairly useless in practice. We need to be able to take this code and transform it into something more like the following:

A single function, getLinks, that takes in a Wikipedia article URL of the form /wiki/<Article_Name> and returns a list of all linked article URLs in the same form.
A main function that calls getLinks with some starting article, chooses a random article link from the returned list, and calls getLinks again, until we stop the program or until there are no article links found on the new page.

Here is the complete code that accomplishes this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", 
                     href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

The first thing the program does, after importing the needed libraries, is set the random number generator seed with the current system time. This practically ensures a new and interesting random path through Wikipedia articles every time the program is run.

Pseudorandom Numbers and Random Seeds

In the previous example, I used Python’s random number generator to select an article at random on each page in order to continue a random traversal of Wikipedia. However, random numbers should be used with caution.

While computers are great at calculating correct answers, they’re terrible at just making things up. For this reason, random numbers can be a challenge. Most random number algorithms strive to produce an evenly distributed and hard-to-predict sequence of numbers, but a “seed” number is needed to give these algorithms something to work with initially. The exact same seed will produce the exact same sequence of “random” numbers every time, so for this reason I’ve used the system clock as a starter for producing new sequences of random numbers, and, thus, new sequences of random articles. This makes the program a little more exciting to run.

For the curious, the Python pseudorandom number generator is powered by the Mersenne Twister algorithm. While it produces random numbers that are difficult to predict and uniformly distributed, it is slightly processor intensive. Random numbers this good don’t come cheap!

Next, it defines the getLinks function, which takes in an article URL of the form /wiki/..., prepends the Wikipedia domain name, http://en.wikipedia.org, and retrieves the BeautifulSoup object for the HTML at that domain. It then extracts a list of article link tags, based on the parameters discussed previously, and returns them.

The main body of the program begins with setting a list of article link tags (the links variable) to the list of links in the initial page: http://bit.ly/1KwqjU7. It then goes into a loop, finding a random article link tag in the page, extracting the href attribute from it, printing the page, and getting a new list of links from the extracted URL.

Of course, there’s a bit more to solving a “Six Degrees of Wikipedia” problem than simply building a scraper that goes from page to page. We must also be able to store and analyze the resulting data. For a continuation of the solution to this problem, see Chapter 5.

Handle Your Exceptions!

Although we are omitting most exception handling in the code examples for the sake of brevity in these examples, be aware that there are many potential pitfalls that could arise: What if Wikipedia changed the name of the bodyContent tag, for example? (Hint: the code would crash.)

So although these scripts might be fine to run as closely watched examples, autonomous production code requires far more exception handling than we can fit into this book. Look back to Chapter 1 for more information about this.

Crawling an Entire Site

In the previous section, we took a random walk through a website, going from link to link. But what if you need to systematically catalog or search every page on a site? Crawling an entire site, especially a large one, is a memory-intensive process that is best suited to applications where a database to store crawling results is readily available. However, we can explore the behavior of these types of applications without actually running them full-scale. To learn more about running these applications using a database, see Chapter 5.

The Dark and Deep Webs

You’ve likely heard the terms deep Web, dark Web, or hidden Web being thrown around a lot, especially in the media lately. What do they mean?

The deep Web is simply any part of the Web that’s not part of the surface Web The surface is part of the Internet that is indexed by search engines. Estimates vary widely, but the deep Web almost certainly makes up about 90% of the Internet. Because Google can’t do things like submit forms, find pages that haven’t been linked to by a top-level domain, or investigate sites where the robots.txt prohibits it, the surface Web stays relatively small.

The dark Web, also known as the Darknet or dark Internet, is another beast entirely. It is run over the existing network infrastructure but uses a Tor client with an application protocol that runs on top of HTTP, providing a secure channel to exchange information. Although it is possible to scrape the Darknet, just like you’d scrape any other website, doing so is outside the scope of this book.

Unlike the dark Web, the deep Web is relatively easy to scrape. In fact, there are many tools in this book that will teach you how to crawl and scrape information from many places that Google bots can’t go.

So when might crawling an entire website be useful and when might it actually be harmful? Web scrapers that traverse an entire site are good for many things, including:

Generating a site map: A few years ago, I was faced with a problem: an important client wanted an estimate for a website redesign, but did not want to provide my company with access to the internals of their current content management system, and did not have a publicly available site map. I was able to use a crawler to cover their entire site, gather all internal links, and organize the pages into the actual folder structure they had on their site. This allowed me to quickly find sections of the site I wasn’t even aware existed, and accurately count how many page designs would be required, and how much content would need to be migrated.
Gathering data: Another client of mine wanted to gather articles (stories, blog posts, news articles, etc.) in order to create a working prototype of a specialized search platform. Although these website crawls didn’t need to be exhaustive, they did need to be fairly expansive (there were only a few sites we were interested in getting data from). I was able to create crawlers that recursively traversed each site and collected only data it found on article pages.

The general approach to an exhaustive site crawl is to start with a top-level page (such as the home page), and search for a list of all internal links on that page. Every one of those links is then crawled, and additional lists of links are found on each one of them, triggering another round of crawling.

Clearly this is a situation that can blow up very quickly. If every page has 10 internal links, and a website is five pages deep (a fairly typical depth for a medium-size website), then the number of pages you need to crawl is , or 100,000 pages, before you can be sure that you’ve exhaustively covered the website. Strangely enough, while “5 pages deep and 10 internal links per page” are fairly typical dimensions for a website, there are very few websites with 100,000 or more pages. The reason, of course, is that the vast majority of internal links are duplicates.

In order to avoid crawling the same page twice, it is extremely important that all internal links discovered are formatted consistently, and kept in a running list for easy lookups, while the program is running. Only links that are “new” should be crawled and searched for additional links:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html)
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks("")

In order to get the full effect of how this web-crawling business works, I’ve relaxed the standards of what constitutes an “internal link we are looking for,” from previous examples. Rather than limit the scraper to article pages, it looks for all links that begin with /wiki/ regardless of where they are on the page, and regardless of whether they contain colons. Remember: article pages do not contain colons, but file upload pages, talk pages, and the like do contain colons in the URL).

Initially, getLinks is called with an empty URL. This is translated as “the front page of Wikipedia” as soon as the empty URL is prepended with http://en.wikipedia.org inside the function. Then, each link on the first page is iterated through and a check is made to see if it is in the global set of pages (a set of pages that the script has encountered already). If not, it is added to the list, printed to the screen, and the getLinks function is called recursively on it.

A Warning Regarding Recursion

This is a warning rarely seen in software books, but I thought you should be aware: if left running long enough, the preceding program will almost certainly crash.

Python has a default recursion limit (how many times programs can recursively call themselves) of 1,000. Because Wikipedia’s network of links is extremely large, this program will eventually hit that recursion limit and stop, unless you put in a recursion counter or something to prevent that from happening.

For “flat” sites that are fewer than 1,000 links deep, this method usually works very well, with some unusual exceptions. For example, I once encountered a website that had a rule for generating internal links to blog posts. The rule was “take the URL for the current page we are on, and append /blog/title_of_blog.php to it.”

The problem was that they would append /blog/title_of_blog.php to URLs that were already on a page that had /blog/ in the URL. So the site would simply add another /blog/ on. Eventually, my crawler was going to URLs like: /blog/blog/blog/blog.../blog/title_of_blog.php.

Eventually, I had to add a check to make sure that URLs weren’t overly ridiculous, containing repeating segments that might indicate an infinite loop. However, if I had left this running unchecked overnight, it could have easily crashed.

Collecting Data Across an Entire Site

Of course, web crawlers would be fairly boring if all they did was hop from one page to the other. In order to make them useful, we need to be able to do something on the page while we’re there. Let’s look at how to build a scraper that collects the title, the first paragraph of content, and the link to edit the page (if available).

As always, the first step to determine how best to do this is to look at a few pages from the site and determine a pattern. By looking at a handful of Wikipedia pages both articles and non-article pages such as the privacy policy page, the following things should be clear:

All titles (on all pages, regardless of their status as an article page, an edit history page, or any other page) have titles under h1→span tags, and these are the only h1 tags on the page.
As mentioned before, all body text lives under the div#bodyContent tag. However, if we want to get more specific and access just the first paragraph of text, we might be better off using div#mw-content-text →p (selecting the first paragraph tag only). This is true for all content pages except file pages (for example: http://bit.ly/1KwqJtE), which do not have sections of content text.
Edit links occur only on article pages. If they occur, they will be found in the li#ca-edit tag, under li#ca-edit → span → a.

By modifying our basic crawling code, we can create a combination crawler/data-gathering (or, at least, data printing) program:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
   global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html)
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id ="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------
"+newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks("")

The for loop in this program is essentially the same as it was in the original crawling program (with the addition of some printed dashes for clarity, separating the printed content).

Because we can never be entirely sure that all the data is on each page, each print statement is arranged in the order that it is likeliest to appear on the site. That is, the <h1> title tag appears on every page (as far as I can tell, at any rate) so we attempt to get that data first. The text content appears on most pages (except for file pages), so that is the second piece of data retrieved. The “edit” button only appears on pages where both titles and text content already exist, but it does not appear on all of those pages.

Different Patterns for Different Needs

There are obviously some dangers involved with wrapping multiple lines in an exception handler. You cannot tell which line threw the exception, for one thing. Also, if for some reason a page contained an “edit” button but no title, the “edit” button would never be logged. However, it suffices for many instances in which there is an order of likeliness of items appearing on the site, and inadvertently missing a few data points or keeping detailed logs is not a problem.

You might notice that in this and all the previous examples, we haven’t been “collecting” data so much as “printing” it. Obviously, data in your terminal is hard to manipulate. We’ll look more at storing information and creating databases in Chapter 5.

Crawling Across the Internet

Whenever I give a talk on web scraping, someone inevitably asks: “How do you build Google?” My answer is always twofold: “First, you get many billions of dollars so that you can buy the world’s largest data warehouses and place them in hidden locations all around the world. Second, you build a web crawler.”

When Google started in 1994, it was just two Stanford graduate students with an old server and a Python web crawler. Now that you know this, you officially have the tools you need to become the next tech multi-billionaire!

In all seriousness, web crawlers are at the heart of what drives many modern web technologies, and you don’t necessarily need a large data warehouse to use them. In order to do any cross-domain data analysis, you do need to build crawlers that can interpret and store data across the myriad of different pages on the Internet.

Just like in the previous example, the web crawlers we are going to build will follow links from page to page, building out a map of the Web. But this time, they will not ignore external links; they will follow them. For an extra challenge, we’ll see if we can record some kind of information about each page as we move through it. This will be harder than working with a single domain as we did before—different websites have completely different layouts. This means we will have to be very flexible in the kind of information we’re looking for and how we’re looking for it.

Unknown Waters Ahead

Keep in mind that the code from the next section can go anywhere on the Internet. If we’ve learned anything from “Six Degrees of Wikipedia,” it’s that it’s entirely possible to go from a site like http://www.sesamestreet.org/ to something less savory in just a few hops.

Kids, ask your parents before running this code. For those with sensitive constitutions or with religious restrictions that might prohibit seeing text from a prurient site, follow along by reading the code examples but be careful when actually running them.

Before you start writing a crawler that simply follows all outbound links willy nilly, you should ask yourself a few questions:

What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?
When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?
Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?
How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across? (Check out Appendix C for more information on this subject.)

A flexible set of Python functions that can be combined to perform a variety of different types of web scraping can be easily written in fewer than 50 lines of code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks
            
#Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" or "www" that do
    #not contain the current URL
    for link in bsObj.findAll("a", 
                         href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html)
    externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
    if len(externalLinks) == 0:
        internalLinks = getInternalLinks(startingPage)
        return getNextExternalLink(internalLinks[random.randint(0, 
                                  len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink("http://oreilly.com")
    print("Random external link is: "+externalLink)
    followExternalOnly(externalLink)
            
followExternalOnly("http://oreilly.com")

The preceding program starts at http://oreilly.com and randomly hops from external link to external link. Here’s an example of the output it produces:

Random external link is: http://igniteshow.com/
Random external link is: http://feeds.feedburner.com/oreilly/news
Random external link is: http://hire.jobvite.com/CompanyJobs/Careers.aspx?c=q319
Random external link is: http://makerfaire.com/

External links are not always guaranteed to be found on the first page of a website. In order to find external links in this case, a method similar to the one used in the previous crawling example is employed to recursively drill down into a website until it finds an external link.

Figure 3-1 visualizes the operation as a flowchart:

Don’t Put Example Programs into Production

I keep bringing this up, but it’s important for space and readability, the example programs in this book do not always contain the necessary checks and exception handling required for production-ready code.

For example, if an external link is not found anywhere on a site that this crawler encounters (unlikely, but it’s bound to happen at some point if you run it for long enough), this program will keep running until it hits Python’s recursion limit.

Before running this code for any serious purpose, make sure that you are putting checks in place to handle potential pitfalls.

The nice thing about breaking up tasks into simple functions such as “find all external links on this page” is that the code can later be easily refactored to perform a different crawling task. For example, if our goal is to crawl an entire site for external links, and make a note of each one, we can add the following function:

#Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()
def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    bsObj = BeautifulSoup(html)
    internalLinks = getInternalLinks(bsObj,splitAddress(siteUrl)[0])
    externalLinks = getExternalLinks(bsObj,splitAddress(siteUrl)[0])
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            print("About to get link: "+link)
            allIntLinks.add(link)
            getAllExternalLinks(link)
            
getAllExternalLinks("http://oreilly.com")

This code can be thought of as two loops—one gathering internal links, one gathering external links—working in conjunction with each other. The flowchart looks something like Figure 3-2:

Jotting down or making diagrams of what the code should do before you write the code itself is a fantastic habit to get into, and one that can save you a lot of time and frustration as your crawlers get more complicated.

Handling Redirects

Redirects allow the same web page to be viewable under different domain names. Redirects come in two flavors:

Server-side redirects, where the URL is changed before the page is loaded
Client-side redirects, sometimes seen with a “You will be directed in 10 seconds...” type of message, where the page loads before redirecting to the new one.

This section will handle server-side redirects. For more information on client-side redirects, which are performed using JavaScript or HTML, see Chapter 10.

For server-side redirects, you usually don’t have to worry. If you’re using the urllib library with Python 3.x, it handles redirects automatically! Just be aware that, occasionally, the URL of the page you’re crawling might not be exactly the URL that you entered the page on.

Crawling with Scrapy

One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: find all links on a page, evaluate the difference between internal and external links, go to new pages. These basic patterns are useful to know about and to be able to write from scratch, but there are options if you want something else to handle the details for you.

Scrapy is a Python library that handles much of the complexity of finding and evaluating links on a website, crawling domains or lists of domains with ease. Unfortunately, Scrapy has not yet been released for Python 3.x, though it is compatible with Python 2.7.

The good news is that multiple versions of Python (e.g., Python 2.7 and 3.4) usually work well when installed on the same machine. If you want to use Scrapy for a project but also want to use various other Python 3.4 scripts, you shouldn’t have a problem doing both.

The Scrapy website offers the tool for download from its website, as well as instructions for installing Scrapy with third-party installation managers such as pip. Keep in mind that you will need to install Scrapy using Python 2.7 (it is not compatible with 2.6 or 3.x) and run all programs using Scrapy with Python 2.7 as well.

Although writing Scrapy crawlers is relatively easy, there is a small amount of setup that needs to be done for each crawler. To create a new Scrapy project in the current directory, run from the command line:

$scrapy startproject wikiSpider

wikiSpider is the name of our new project. This creates a new directory in the directory the project was created in, with the title wikiSpider. Inside this directory is the following file structure:

scrapy.cfg
- wikiSpider
  - __init.py__
  - items.py
  - pipelines.py
  - settings.py
  - spiders
    - __init.py__

In order to create a crawler, we will add a new file to wikiSpider/wikiSpider/spiders/articleSpider.py called items.py. In addition, we will define a new item called Article inside the items.py file.

Your items.py file should be edited to look like this (with Scrapy-generated comments left in place, although you can feel free to remove them):

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()

Each Scrapy Item object represents a single page on the website. Obviously, you can define as many fields as you’d like (url, content, header image, etc.), but I’m simply collecting the title field from each page, for now.

In your newly created articleSpider.py file, write the following:

from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article


class ArticleSpider(Spider):
    name="article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
              "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: "+title)
        item['title'] = title
        return item

The name of this object (ArticleSpider) is different from the name of the directory (WikiSpider), indicating that this class in particular is responsible for spidering only through article pages, under the broader category of WikiSpider. For large sites with many types of content, you might have separate Scrapy items for each type (blog posts, press releases, articles, etc.), each with different fields, but all running under the same Scrapy project.

You can run this ArticleSpider from inside the main WikiSpider directory by typing:

$ scrapy crawl article

This calls the scraper by the item name of article (not the class or file name, but the name defined on the line: name="article" in the ArticleSpider).

Along with some debugging information, this should print out the lines:

Title is: Main Page
Title is: Python (programming language)

The scraper goes to the two pages listed as the start_urls, gathers information, and then terminates. Not much of a crawler, but using Scrapy in this way can be useful if you have a list of URLs you need scrape. To turn it into a fully fledged crawler, you need to define a set of rules that Scrapy can use to seek out new URLs on each page it encounters:

from scrapy.contrib.spiders import CrawlSpider, Rule
from wikiSpider.items import Article
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ArticleSpider(CrawlSpider):
    name="article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_
                   %28programming_language%29"]
    rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$'),), 
                                       callback="parse_item", follow=True)]

    def parse_item(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: "+title)
        item['title'] = title
        return item

This crawler is run from the command line in the same way as the previous one, but it will not terminate (at least not for a very, very long time) until you halt execution using Ctrl+C or by closing the terminal.

Logging with Scrapy

The debug information generated by Scrapy can be useful, but it is often too verbose. You can easily adjust the level of logging by adding a line to the settings.py file in your Scrapy project:

LOG_LEVEL = 'ERROR'

There are five levels of logging in Scrapy, listed in order here:

CRITICAL
ERROR
WARNING
DEBUG
INFO

If logging is set to ERROR, only CRITICAL and ERROR logs will be displayed. If logging is set to INFO, all logs will be displayed, and so on.

To output logs to a separate logfile instead of the terminal, simply define a logfile when running from the command line:

$ scrapy crawl article -s LOG_FILE=wiki.log

This will create a new logfile, if one does not exist, in your current directory and output all logs and print statements to it.

Scrapy uses the Item objects to determine which pieces of information it should save from the pages it visits. This information can be saved by Scrapy in a variety of ways, such as a CSV, JSON, or XML files, using the following commands:

$ scrapy crawl article -o articles.csv -t csv
$ scrapy crawl article -o articles.json -t json
$ scrapy crawl article -o articles.xml -t xml

Of course, you can use the Item objects yourself and write them to a file or a database in whatever way you want, simply by adding the appropriate code to the parsing function in the crawler.

Scrapy is a powerful tool that handles many problems associated with crawling the Web. It automatically gathers all URLs and compares them against predefined rules; makes sure all URLs are unique; normalizes relative URLs where needed; and recurses to go more deeply into pages.

Although this section hardly scratches the surface of what Scrapy is capable of, I encourage you to check out the Scrapy documentation or many of the other resources available online. Scrapy is an extremely large and sprawling library with many features. If there’s something you’d like to do with Scrapy that has not been mentioned here, there is likely a way (or several) to do it!

¹ Thanks to The Oracle of Bacon for satisfying my curiosity about this particular chain.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
3. Starting to Crawl

Chapter 3. Starting to Crawl

Traversing a Single Domain

Handle Your Exceptions!

Crawling an Entire Site

A Warning Regarding Recursion

Collecting Data Across an Entire Site

Different Patterns for Different Needs

Crawling Across the Internet

Unknown Waters Ahead

Figure 3-1. Flowchart for script that crawls through different sites on the Internet

Don’t Put Example Programs into Production

Figure 3-2. Flow diagram for the website crawler that collects all external links

Crawling with Scrapy

Logging with Scrapy

Table of Contents for 3. Starting to Crawl

Create new playlist

Sign In

Sign Up

Chapter 3. Starting to Crawl

Traversing a Single Domain

Handle Your Exceptions!

Crawling an Entire Site

A Warning Regarding Recursion

Collecting Data Across an Entire Site

Different Patterns for Different Needs

Crawling Across the Internet

Unknown Waters Ahead

Figure 3-1. Flowchart for script that crawls through different sites on the Internet

Don’t Put Example Programs into Production

Figure 3-2. Flow diagram for the website crawler that collects all external links

Crawling with Scrapy

Logging with Scrapy

Table of Contents for
3. Starting to Crawl