
5. The Web Bot

As anyone who has spent any time at all online can tell you, there is a lot of information available on the Internet. According to Google’s indexes, as of 2013 there were 4.04 billion web pages in existence, so there are likely many, many more than that today. Sure, a lot of those pages are probably cat pictures and pornography, but there are also hundreds of millions of pages with information on them. Useful information. It has been said that every piece of information that has been digitized exists somewhere on the Internet. It just has to be found—not an easy task when the Internet looks something like Figure 5-1.
Figure 5-1. Map of the Internet (©2013 http://internet-map.net , Ruslan Enikeev)

Unfortunately, there’s no way any one person could download and read all of the information he or she found interesting. Human beings just aren’t that fast, and we have to eat and sleep and perform all sorts of inefficient, sometimes unpleasant, tasks like showering and working for a living.

Luckily, we can program computers to do some of the boring, repetitive tasks that we don’t need to perform ourselves. This is one of the functions of a web bot: we can program the bot to crawl web pages, following links and downloading files as it goes. It’s commonly just called a bot, and knowing how to program and use one can be an incredibly useful skill. Need the stock reports when you wake up in the morning? Have your bot crawl the international indexes and have a spreadsheet waiting for you. Need to research all of the passenger manifests for the White Star Line that have been posted online, looking for your ancestor? Have your bot start with “White Star” in Google and traverse all of the links from there. Or perhaps you want to locate all of Edgar Allan Poe’s manuscripts that are currently available in the public domain; a bot can help with that as well while you sleep.

Python is particularly well suited to doing the job of a web bot, also called—in this context—a spider . There are a few modules that need to be downloaded, and then you can program a fully functional bot to do your bidding, starting from whatever page you give it. Because traversing web pages and downloading information is not a terribly processor-intensive task, it is also a task well suited to the Raspberry Pi. While your normal desktop machine handles more difficult computing tasks, the Pi can handle the light lifting required to download web pages, parse their text, and follow links to download files.

Bot Etiquette

One factor you need to keep in mind, should you build a functioning web crawler, is bot etiquette. I don’t mean etiquette in the sense of making sure that the bot’s pinky is extended when drinking high tea. Rather, there are certain niceties you should observe when you program your bot to crawl sites.

One is to respect the robots.txt file. Most sites have this file in the root directory of the site. It’s a simple text file that contains instructions for visiting bots and spiders. If the owner of the site does not want certain pages crawled and indexed, he can list those pages and directories in the text file, and courteous bots will accede to his requests. The file format is simple. It looks like this:

User-agent: *
Disallow: /examples/
Disallow: /private.html

This robots.txt file specifies that no bots (User-agent: *) may visit (crawl) any pages in the /examples/ folder, nor may they visit the page private.html. The robots.txt file is a standard mechanism by which websites can restrict visits to certain pages. If you want your bot to be welcome at all sites, it’s a good idea to follow those rules. I’ll explain how to do that. If you choose to ignore those rules, you can often expect your bot (and all visits from your IP address) to be banned (and blocked) from the site in question.
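
Python's standard library can even check a robots.txt file for you. The following is a minimal sketch using the robotparser module (it lives at urllib.robotparser in Python 3); the site and page here are just placeholders:

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the robots.txt file

# can_fetch() returns True only if the rules allow this user agent in
if rp.can_fetch("*", "http://www.example.com/private.html"):
    print "OK to crawl this page"
else:
    print "The site asks bots to stay away from this page"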

Another piece of etiquette is controlling the speed of your bot’s information requests. Because bots are computers, they can visit and download pages and files hundreds and thousands of times faster than humans can. For this reason, it is entirely possible for a bot to make so many requests to a site in such a short time that it can incapacitate a poorly configured web server. Therefore, it is polite to keep your bot’s page requests to a manageable level; most site owners are fine with around ten page requests per second—far more than can be done by hand, but not enough to bring down a server. Again, in Python, this can be done with a simple sleep() function.
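
In practice, that throttling is just a pause inside your request loop. Here is a quick sketch, assuming a hypothetical list of URLs called pageList:

import time

pageList = ["http://www.example.com/page1.html",
            "http://www.example.com/page2.html"]  # placeholder URLs

for page in pageList:
    # ...request and process the page here...
    time.sleep(0.1)  # pause 0.1 seconds, keeping us near ten requests per second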

Finally, it can often be problematic to fake your user-agent identity. A user-agent identity identifies visitors to a site. Firefox browsers have a certain user agent, Internet Explorer has another, and bots have yet another. Because there are many sites that do not want bots to visit or crawl their pages at all, some bot-writers give their bots a fraudulent user agent to make it look like a normal web browser. This is not cool. You may never be discovered, but it’s a matter of common decency—if you had pages you wanted kept private, you’d want others to respect those wishes as well. Do the same for other site owners. It’s just part of being a good bot-writer and netizen (net citizen). You may simulate a browser’s user agent if you are emulating a browser for other purposes, such as site testing or to find and download files (PDFs, mp3s, and so on), but not to crawl those sites.
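
If you do have a legitimate reason to present a browser-style user agent (site testing, for example), the mechanize module we'll install later in this chapter lets you set the header explicitly. A sketch, with a made-up user-agent string:

import mechanize

br = mechanize.Browser()
# Set the User-agent header; only mimic a browser string when you are
# legitimately emulating a browser (for example, for site testing).
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MyTestBot/1.0)')]
response = br.open("http://www.example.com")  # placeholder URL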

The Connections of the Web

Before we get to the business of programming our spider, you need to understand a bit about how the Internet operates. Yes, it’s basically a giant computer network, but that network follows certain rules and uses certain protocols, and we need to utilize those protocols in order to do anything on the Web, including using a spider.

Web Communications Protocols

HyperText Transfer Protocol (HTTP) is the format in which most common web traffic is encapsulated. A protocol is simply an agreement between two communicating parties (in this case, computers) as to how that communication is to proceed. It includes information such as how data is addressed, how to determine whether errors have occurred during transmission (and how to handle those errors), how the information is to travel between the source and destination, and how that information is formatted. The “http” in front of most URLs (Uniform Resource Locators) defines the protocol used to request the page. Other common protocols used are TCP/IP (Transmission Control Protocol/Internet Protocol), UDP (User Datagram Protocol), SMTP (Simple Mail Transfer Protocol), and FTP (File Transfer Protocol). Which protocol is used depends on factors such as the traffic type, the speed of the requests, whether the data streams need to be served in order, and how forgiving of errors those streams can be.

When you request a web page with your browser, there's a good bit happening behind the scenes. Let's say you type http://www.irrelevantcheetah.com into your location bar. Your computer, knowing that it's using the HTTP protocol, first sends www.irrelevantcheetah.com to its local DNS (Domain Name System) server to find out which Internet address it belongs to. The DNS server responds with an IP address—let's say, 192.185.21.158. That is the address of the server that holds the web pages for that domain. The Domain Name System maps names to IP addresses, because it's much easier for you and me to remember "www.irrelevantcheetah.com" than it is to remember "192.185.21.158."
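
You can watch that name-to-address lookup happen from Python itself using the standard socket module; a quick sketch (the address you get back may well differ from the one above):

import socket

# Ask the resolver (and, ultimately, a DNS server) for the site's current IP address
ip = socket.gethostbyname("www.irrelevantcheetah.com")
print ip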

Now that your computer knows the IP address of the server, it initiates a TCP connection with that server using a three-way “handshake.” The server responds, and your computer asks for the page index.html. The server responds and then closes the TCP connection.

Your browser then reads the coding on the page and displays it. If there are other parts of the page it needs, such as PHP code or images, it then requests those parts or images from the server and displays them as well.
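
If you'd like to see that request-and-response exchange without a browser in the way, Python 2's httplib module (http.client in Python 3) lets you perform it by hand. This is just a sketch; the exact status and content you get back depend on the server:

import httplib  # http.client in Python 3

conn = httplib.HTTPConnection("www.irrelevantcheetah.com", 80)
conn.request("GET", "/index.html")       # ask for the page
response = conn.getresponse()            # read the server's reply
print response.status, response.reason   # for example: 200 OK
html = response.read()                   # the raw HTML of the page
conn.close()                             # close the TCP connection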

Web Page Formats

Most web pages are formatted in HTML —HyperText Markup Language . It's a markup language, closely related to XML (eXtensible Markup Language), that is pretty easy to read and parse, and it can be understood by most computers. Browsers are programmed to interpret the language of the pages and display those pages in a certain way. For instance, the tag pair <html> and </html> indicate that the page is in HTML. <i> and </i> indicate that the enclosed text is italic, while <a> and </a> indicate a hyperlink , which is normally displayed as blue and underlined. JavaScript is surrounded by <script type="text/javascript"></script> tags, and various other more-involved tags surround various languages and scripts.

All of these tags and formats make browsing and reading raw web pages easy for humans. However, they have the pleasant side effect of also making it easy for computers to parse those pages. After all, if your browser couldn’t decode the pages, the Internet wouldn’t exist in its current form. But you don’t need a browser to request and read web pages—only to display them once you’ve got them. You can write a script to request web pages, read them, and do pre-scripted tasks with the pages’ information—all without the interference of a human. Thus, you can automate the long, boring process of searching for particular links, pages, and formatted documents and pass it to your Pi. Therein lies the web bot.

A Request Example

For simplicity’s sake, let’s begin by saying we have requested the page http://www.carbon111.com/links.html . The page’s text is pretty simple—it’s a static page, after all, with no fancy web forms or dynamic content, and it looks pretty much like this:

<HTML>
<HEAD>
<TITLE>Links.html</TITLE>
</HEAD>
<BODY BACKGROUND="mainback.jpg" BGCOLOR="#000000"
 TEXT="#E2DBF5" LINK="#EE6000" VLINK="#BD7603" ALINK="#FFFAF0">
<br>
<H1 ALIGN="CENTER">My Favorite Sites and Resources</H1>
<br>
<H2>Comix, Art Gallerys and Points of Interest:</H2>
<DL>
<DT><A HREF="http://www.alessonislearned.com/index.html" TARGET="blank">
A Lesson Is Learned...</A>
<DD>Simply amazing! Great ideas, great execution. I love the depth of humanity
these two dig into. Not for the faint-of-heart ;)
.
.
.

And so on, until the final closing </HTML> tag.

If a spider were receiving this page over a TCP connection , it would first learn that the page is formatted in HTML. It would then learn the page title, and it could start looking for content it has been tasked to find (such as .mp3 or .pdf files) as well as links to other pages, which will be contained within <A></A> tags. A spider can also be programmed to follow links to a certain “depth”; in other words, you can specify whether or not the bot should follow links from linked pages or whether it should stop following links after the second layer. This is an important question, because it is possible that if you program too many layers, your spider could end up trawling (and downloading) the entire Internet—a critical problem if your bandwidth and storage are limited!

Our Web Bot Concept

The concept behind our web bot is as follows: we’ll start with a certain page based on user input. Then, we’ll determine what files we are looking for—for example, are we looking for .pdf files of works in the public domain? How about freely available .mp3s by our favorite bands? That choice will be programmed into our bot as well.

The bot will then start at the beginning page and parse all of the text on the page. It will look for text contained within <a href></a> tags (hyperlinks). If that hyperlink ends in a “.pdf” or “.mp3” or another chosen file type, we’ll make a call to wget (a command-line downloading tool) to download the file to our local directory. If we can’t find any links to our chosen file type, we’ll start following the links that we do find, repeating the process for each of those links as recursively as we determine beforehand. When we’ve gone as far as we want to, we should have a directory full of files to be perused at our leisure. That is what a web bot is for—letting the computer do the busy work, while you sip a margarita and wait to enjoy the fruits of its labor.
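
As you'll see, the bot we build later in this chapter ends up handing its downloads to Python's urllib library rather than shelling out, but if you did want to pass a link to wget as just described, the call could be as simple as this sketch (the link is a placeholder):

import subprocess

pdfLink = "http://www.example.com/files/sample.pdf"  # placeholder link
# Hand the URL to the wget command-line tool; it saves the file
# into the current working directory
subprocess.call(["wget", pdfLink])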

Parsing Web Pages

Parsing refers to the process a computer goes through when it “reads” a web page. At its most basic, a web page is nothing more than a data stream consisting of bits and bytes (a byte is eight bits) that, when decoded, form numbers, letters, and symbols. A good parsing program not only can re-form that data stream into the correct symbols, but it can also read the re-formed stream and “understand” what it reads. A web bot needs to be able to parse the pages it loads because those pages may/should contain links to the information it’s programmed to retrieve. Python has several different text-parser modules available, and I encourage you to experiment, but the one I have found the most useful is Beautiful Soup.

Note

Beautiful Soup is named after the Mock Turtle’s song in Lewis Carroll’s Alice’s Adventures in Wonderland (1865):

Beautiful soup, so rich and green

Waiting in a hot tureen!

Who for such dainties would not stoop?

Soup of the evening, beautiful soup!

Soup of the evening, beautiful soup!

Beautiful Soup (the Python library) has gone through several versions; as of this writing, it is on version 4.6.0 and works in both Python 2.x and 3.x.

Beautiful Soup’s syntax is pretty basic. Once you’ve installed it by typing

sudo apt-get install python-bs4

in your terminal, you can start using it in your scripts. Open a Python prompt by typing python and try typing the following:

import BeautifulSoup

If you get an error message that says No module named BeautifulSoup, it most likely means you have Beautiful Soup 4 installed, which goes by the module name bs4 rather than BeautifulSoup. In that case, type

from bs4 import BeautifulSoup

Then, continuing in your Python prompt:

import re
doc = ['<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
    '</html>']
soup = BeautifulSoup(''.join(doc)) #That's two apostrophes, one after another, not a double quote

This loads the list named doc with what a web-page stream would look like—a long, single stream of characters. Beautiful Soup then parses that joined-together string into an object that the library can work with. If you were to type print soup at this point, the output would look much like the result of typing print doc.

However, if you type

print soup.prettify()

you’ll be rewarded with the page, redone in a more readable fashion. This is just an example of what Beautiful Soup can do; I’ll go over it more when we get to programming the bot.

As an aside: the re module you imported earlier is used to evaluate regular expressions in text. Regular expressions, if you’re not familiar with them, are an extremely versatile way to search through text and pick out strings and sequences of characters in ways that may not be immediately obvious to a human reader. A regular expression term can look like complete gibberish; a good example of a regular expression is the sequence (?<=-)\w+, which matches a run of word characters that follows a hyphen. To try it out, open a Python prompt by typing python and then type

import re
m = re.search('(?<=-)\w+', 'free-bird')
m.group(0)

and you’ll be rewarded with

bird

While regular expressions are very helpful in terms of finding sequences of characters in text and strings, they’re also not very intuitive and are far beyond the scope of this book. We won’t be spending much time on them here. It’s enough that you know they exist, and you can spend some time learning about them if they interest you.

Coding with Python Modules

When it comes to using different Python modules to code your web spider, you have quite a few options. Many open-source spiders already exist, and you could borrow from those, but it’s a good learning experience to code the spider from the ground up.

Our spider will need to do several things in order to do what we need it to. It will need to
  • initiate TCP connections and request pages;

  • parse the received pages;

  • download important files that it finds; and

  • follow links that it comes across.

Luckily, most of these are pretty simple tasks, so programming our spider should be relatively straightforward.

Using the Mechanize Module

Probably the most-used module when it comes to automated web browsing, mechanize is both incredibly simple and incredibly complex. It is simple to use and can be set up with a few paltry lines of code, yet it is also packed with features that many users don’t fully utilize. It’s a great tool for automating tasks such as website testing: if you need to log into a site 50 times with 50 different username/password combinations and then fill out an address form afterward, mechanize is your tool of choice. Another nice thing about it is that it does much of the work, such as initiating TCP connections and negotiating with the web server, behind the scenes so that you can concentrate on the downloading part.

To use mechanize in your script, you must first download and install it. If you’ve been following along, you still have a Python prompt open, but you’ll need a regular command-line interface for this download and installation process. Here, you have two options: you can exit from the Python entry mode, or you can open another terminal session. If you prefer to have only one terminal session open, exit from the Python prompt in your current window by typing Ctrl+d, which will return you to the normal terminal prompt. On the other hand, if you choose to open another terminal session, you can leave the Python session running, and everything you’ve typed so far will still be in memory.

Whichever option you choose, from a command-line prompt, enter

wget https://pypi.python.org/packages/source/m/mechanize/mechanize-0.3.6.tar.gz

When that’s finished downloading, untar the file with

tar -xzvf mechanize-0.3.6.tar.gz

and navigate into the resulting folder by typing

cd mechanize-0.3.6

Then, enter

sudo python setup.py install

Follow any onscreen instructions, and mechanize will be installed and ready to use.

Parsing with Beautiful Soup

I mentioned parsing earlier; Beautiful Soup is still the best way to go. If you haven’t done so already, enter

sudo apt-get install python-bs4

into a terminal and let the package manager do its work. It’s ready to use immediately afterward. As I stated before, once you download the page, Beautiful Soup is responsible for finding links and passing them to the function we’ll use for downloading, as well as for setting aside those links that will be followed later.

As a result, however, the job of finding links and deciding what to download becomes mainly a string-handling problem. In other words, links (and the text contained within them) are nothing but strings, and in our quest to unravel those links and follow them or download them, we’ll be doing a lot of work with strings—work ranging from lstrip (stripping characters, such as a leading slash, from the left end of a string) to append to split and various other methods from the string library. Perhaps the most interesting part of a web bot, after all, isn’t the files it downloads; rather, it’s the manipulations you have to do to get there.
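
To give you a taste of that string work, here is a quick sketch using a made-up link; the same calls show up in the bot later on:

import string

link = "/files/report.pdf"        # a hypothetical link pulled from a page
print string.lstrip(link, '/')    # files/report.pdf  (leading slash removed)
print link.split('/')             # ['', 'files', 'report.pdf']
print "pdf" in link               # True, the substring test we'll rely on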

Downloading with the urllib Library

The last part of the puzzle here is the urllib library—specifically, its URLopener.retrieve() function. This function is used to download files, smoothly and without fuss. We’ll pass it the name of our file and let it do its work.

To use urllib, you must first import it. Switch to the terminal with your Python prompt, if it’s still open, or start another session by typing python. Then, type

import urllib

to make it available for use.

The urllib library uses the following syntax:

image = urllib.URLopener()
image.retrieve ("http://www.website.com/imageFile.jpg", "imageFile.jpg")

where the first parameter sent to the URLopener.retrieve() function is the URL of the file, and the second parameter is the local file name that the file will be saved as. The second, file-name parameter obeys Linux file and directory conventions; if you give it the parameter “../../imageFile.jpg”, imageFile.jpg will be saved two folders up in the directory tree. Likewise, passing it the parameter “pics/imageFile.jpg” will save it in the pics folder inside of the current directory (from which the script is running). However, the folder must already exist; retrieve() will not create the directory. This is an important thing to remember, as it will also fail silently; that is, if the directory doesn’t exist, your script will execute as if everything was dandy, and then you’ll learn the next morning that none of those two thousand records you downloaded were ever saved to disk.
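
One easy way to guard against that is to make sure the directory exists before you call retrieve(). Here is a sketch, with placeholder names, using the os module (which we'll lean on again later in this chapter):

import os
import urllib

saveDir = "pics"                     # hypothetical target folder
if not os.path.exists(saveDir):      # create the folder if it isn't there yet
    os.makedirs(saveDir)

image = urllib.URLopener()
image.retrieve("http://www.website.com/imageFile.jpg",
               os.path.join(saveDir, "imageFile.jpg"))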

Deciding What to Download

This can get kind of sticky because there is so much out there; whatever you’re looking for probably exists online somewhere. Unfortunately (or fortunately, depending on your point of view), a good deal of it is copyrighted, so even if you find it for free, it’s really not cool to just download it.

That, however, is a topic for an entirely different book. For the time being, let’s assume you’re going to be looking for freely-available information, such as all works by Mark Twain that are in the public domain. That means you’re probably going to be looking for .pdf, .txt, and possibly even .doc or .docx files. You might even want to widen your search parameters to include .mobi (Kindle) and .epub files, as well as .chm. (chm stands for Compiled HtMl, which is used by Microsoft in their HTML-formatted help programs, and it is also often used in web-based versions of textbooks.) These are all legitimate file formats that may contain the text of books you’re looking for.
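
When you're hunting several formats at once, it can be handy to keep the extensions in a tuple and test links with endswith(), which accepts a tuple of suffixes. A small sketch, with a made-up link:

# Hypothetical extensions for a public-domain book hunt
bookTypes = ('.pdf', '.txt', '.doc', '.docx', '.mobi', '.epub', '.chm')

link = "http://www.example.com/twain/tom_sawyer.pdf"  # placeholder link
if link.lower().endswith(bookTypes):
    print "Worth downloading:", link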

Choosing a Starting Point

The next thing you’re going to need is a starting point. You may be inclined to just say “Google!,” but with tens of millions of search results from a simple search for “Mark Twain,” you would probably be better off staying a bit more focused. Do a little groundwork beforehand and save yourself (and your bot) hours of work later. If you can find an online archive of Twain’s works, for example, that would be an excellent starting point. If you’re looking for free music downloads, you may want to get a list together of blogs that feature new music files from up-and-coming bands, because many new artists offer song downloads free on those blogs to promote themselves and their music. Likewise, technical documents dealing with IEEE network specifications can probably be found on a technical site, or even a government one, with much more success (and more focused results) than a wide Google search.

Storing Your Files

You may also need a place to store your files, depending on the size of your Pi’s SD card. That card holds both the operating system and your saved files, so if you’re using a 32GB card, you’ll have lots of room for .pdf files. However, an 8GB card may fill up rather quickly if you’re downloading free documentary movie files. So, you’ll need an external USB hard drive—either a full-blown hard drive or a smaller flash drive.

Again, this is where some experimentation may come in handy, because some external drives won’t work well with the Raspberry Pi. Because they’re not particularly expensive these days, I would buy one or two medium-sized ones and give them a try. I’m currently using an 8GB flash drive by DANE-ELEC (shown in Figure 5-2) without any problems.
Figure 5-2. Common flash drive to store files

A note on accessing your jump drive via the command line: a connected drive such as a flash drive is accessible in the /media directory; that is,

cd /media

will get you to the directory where you should see your drive listed. You can then navigate into it and access its contents. You’ll want to set up your Python script to save files to that directory—/media/PENDRIVE, for example, or /media/EnglebertHumperdinckLoveSongs. Probably the easiest way to do it is to save your webbot.py script in a directory on your external drive and then run it from there.
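
If you'd rather keep the script somewhere else and simply point its downloads at the drive, changing the working directory at the top of the script works too. A sketch, with the mount name as a placeholder you'd swap for your own drive's:

import os

driveDir = "/media/PENDRIVE"     # substitute your drive's actual mount point
if os.path.exists(driveDir):
    os.chdir(driveDir)           # downloaded files now land on the external drive
else:
    print "External drive not found; saving to the current directory instead"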

Writing the Python Bot

Let’s start writing some Python. The following code imports the necessary modules and uses Python’s version of input (raw_input) to get a starting point (to which I’ve prepended the http:// found in every web address). It then initiates a “browser” (with air quotes) with mechanize.Browser() . This code, in its final completed form, is listed at the end of this chapter. It’s also available for download as webbot.py from the apress.com website.

To start the process of writing your bot, use your text editor to begin a new file, called webbot.py. Enter the following:

from bs4 import BeautifulSoup
import mechanize
import time
import urllib
import string
start = "http://" + raw_input ("Where would you like to start searching? ")
br = mechanize.Browser()
r = br.open(start)
html = r.read()

Later, we may need to fake a user agent, depending on the sites we visit, but this code will work for now.

Reading a String and Extracting All the Links

Once you’ve got a browser object, which is called br in the preceding code, you can do all sorts of tasks with it. We opened the start page requested from the user with br.open() and read it into one long string, html. Now, we can use Beautiful Soup to read that string and extract all of the links from it by adding the following lines:

soup = BeautifulSoup(html)
for link in soup.find_all('a'):
    print (link.get('href'))

Now, run the script to try it out. Save it and close it. Open a terminal session and navigate to the same directory in which you created webbot.py. Then type

python webbot.py

to start the program and type example.com when it asks where to start. It should return the following and then quit:

http://www.iana.org/domains/example

You’ve successfully read the contents of http://example.com , extracted the links (there’s only one), and printed that link to the screen. This is an awesome start.

The next logical step is to instantiate a list of links and add to that list whenever Beautiful Soup finds another link. You can then iterate over the list, opening each link with another browser object and repeating the process.

Looking For and Downloading Files

Before we instantiate that list of links, however, there’s one more function we need to create—the one that actually looks for and downloads files! So, let’s search the code on the page for a file type. We should probably go back and ask what sort of file we’re looking for by adding the following code line at the beginning of the script, after the start line:

filetype = raw_input("What file type are you looking for?\n")

Note

In case you’re wondering, the at the end of the raw_input string in both of these cases is a carriage return. It doesn’t get printed when the line is displayed. Rather, it sends the cursor to the beginning of the next line to wait for your input. It’s not necessary—it just makes the output look a little prettier.

Now that we know what we’re looking for, as we add each link to the list we can check to see if it’s a link to a file that we want. If we’re looking for .pdf files, for example, we can parse the link to see if it ends in pdf. If it does, we’ll call URLopener.retrieve() and download the file. So, open your copy of webbot.py again and replace the for block of code with the following:

for link in soup.find_all('a'):
    linkText = str(link)
    if filetype in linkText:
        # Download file code here

You’ll notice two elements in this little snippet of code. First, the str(link) bit has been added. Beautiful Soup finds each link on the page for us, but it returns it as a link object, which is sort of meaningless to non-Soup code. We need to convert it to a string in order to work with it and do all of our crafty manipulations. That’s what calling the str() method does. In fact, Beautiful Soup provides a method to do this for us, but learning to parse a string with the str() function is important. As a matter of fact, that’s why we used the line import string at the beginning of our code—so we can interact with string objects.

Second, once the link is a string, you can see how we use Python’s in operator. Similar to C#’s String.Contains() method, Python’s in operator simply searches the string to see if it contains the requested substring. So, in our case, if we’re looking for .pdf files, we can search the link text for that substring, “pdf.” If it contains it, it’s a link we’re interested in.
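
You can see the in test at work right at a Python prompt; a couple of throwaway examples:

print "pdf" in '<a href="/files/report.pdf">Report</a>'    # True
print "pdf" in '<a href="/pages/about.html">About</a>'     # False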

Testing the Bot

To make testing our bot easier, I set up a page at http://www.irrelevantcheetah.com/browserimages.html to use for testing. It contains images, files, links, and various other HTML goodies. Using this page, we can start with something simple, like images. So, let’s modify our webbot.py code and make it look like this:

import mechanize
import time
from bs4 import BeautifulSoup
import string
import urllib
start = "http://www.irrelevantcheetah.com/browserimages.html"
filetype = raw_input ("What file type are you looking for? ")
br = mechanize.Browser()
r = br.open(start)
html = r.read()
soup = BeautifulSoup(html)
for link in soup.find_all('a'):
    linkText = str(link)
    fileName = str(link.get('href'))
    if filetype in fileName:
        image = urllib.URLopener()
        linkGet = "http://www.irrelevantcheetah.com" + fileName
        filesave = string.lstrip(fileName, '/')
        image.retrieve(linkGet, filesave)

This last section of code, starting with the for loop, requires some explanation, methinks. The for loop iterates through all of the links that Beautiful Soup found for us. Then, linkText converts those links to strings so that we can manipulate them. We then convert the body of the link (the actual file or page to which the link points) to a string as well and check to see if it contains the file type we’re looking for. If it does, we append it to the site’s base URL, giving us linkGet.

The last two lines have to happen because of the retrieve() function. As you recall, that function takes two parameters: the URL of the file we’re downloading and the local name we’d like to save that file to. filesave takes the fileName we found earlier and removes the leading “/” from the name so that we can save it. If we didn’t do this, the fileName we would try to save under would be—for example—/images/flower1.jpg. If we tried to save an image with that name, Linux would attempt to save flower1.jpg to the /images folder at the root of the filesystem and then give us an error because that folder doesn’t exist. By stripping the leading “/”, the fileName becomes images/flower1.jpg, and as long as there’s an images folder in our current directory (remember what I said about creating the directory first), the file will save without incident. Finally, the last line of code does the actual downloading with the two parameters I already mentioned: linkGet and filesave.

If you create an images directory in your current directory and then run this script, answering “jpg” to the file-type question, the images directory should fill up with 12 different images of pretty flowers, hand-selected by yours truly. Simple, eh? If, instead, you create a files directory and answer “pdf,” you’ll get 12 different (boring) PDFs in your files folder.
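
For example, assuming you've saved the script as webbot.py, a first test run from the terminal might look something like this:

mkdir images
python webbot.py

Answer jpg at the file-type prompt, and when the script finishes, an ls images should show the downloaded pictures.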

Creating Directories and Instantiating a List

There are two more features we need to add to finish this bot. First, we aren’t always going to know what directories we need to create ahead of time, so we need to find a way to parse the folder name from the link text and create the directory on the fly. Second, we need to create a list of links that link to other pages so that we can then visit those pages and repeat the download process. If we do this several times, we’ve got ourselves a real web bot, following links and downloading the files we want.

Let’s do the second task first—instantiating the list of links we mentioned earlier. We can create a list at the beginning of the script, after the import statements, and add to it as we go. To create a list we simply use

linkList = []

To add to it, we add an elif block to our script:

if filetype in fileName:
    image = urllib.URLopener()
    linkGet = "http://www.irrelevantcheetah.com" + fileName
    filesave = string.lstrip(fileName, '/')
    image.retrieve(linkGet, filesave)
elif "htm" in fileName: # This covers both ".htm" and ".html" filenames
    linkList.append(link)

That’s it! If the fileName contains the type of link we’re looking for, it gets retrieved. If it doesn’t, but there’s an htm in it, it gets appended to linkList—a list that we can then iterate through, one by one, opening each page and repeating the download process.

The fact that we’re going to be repeating the download process many times should make you think of one element of coding: a function—also called a method. Remember, a function is used in code if there’s a process you’re going to be repeating over and over again. It makes for cleaner, simpler code, and it’s also easier to write. Programmers, you’ll find, are very efficient people (some would say lazy). If we can code it once and reuse it, that’s ever so much better than typing it over and over and over again. It’s also a massive time-saver.

So, let’s start our downloading function by adding the following lines to our webbot.py script, after the linkList = [] line we added just a bit ago:

def downloadProcess(html, base, filetype, linkList):
    soup = BeautifulSoup(html)
    for link in soup.find_all('a'):
        linkText = str(link.get('href'))
        if filetype in linkText:
            image = urllib.URLopener()
            linkGet = base + linkText
            filesave = string.lstrip(linkText, '/')
            image.retrieve(linkGet, filesave)
        elif "htm" in linkText:  # Covers both "html" and "htm"
            linkList.append(link)

Now that we have our downloadProcess function, all we have to do is parse our linkText to get the name of the directory we’ll need to create.

Again, it’s simple string manipulation, along with using the os module. The os module allows us to manipulate directories and files, regardless of what operating system we’re running. First, we can add

import os

to our script, and then we can create a directory (if needed) by adding

os.makedirs()

You may remember that in order to simplify file saving we need to have a local directory on our machine that matches the web directory in which our target files are stored. In order to see if we need a local directory, we need to first determine that directory’s name. In most (if not all) cases, that directory will be the first part of our linkText; for example, the directory name in /images/picture1.html is images. So, the first step is to iterate through the linkText again, looking for slashes the same way we did to get the base of our website name, like this:

slashList = [i for i, ind in enumerate(linkText) if ind == '/']
directoryName = linkText[(slashList[0] + 1) : slashList[1]]

The preceding code creates a list of indices at which slashes are found in the linkText string. Then, directoryName slices linkText to just the part between the first two slashes (/images/picture1.html gets cut to images).

The first line of that snippet bears some explanation because it’s an important line of code. linkText is a string, and as such is enumerable; that is, the characters within it can be iterated over, one by one. slashList is a list of the positions (indices) in linkText where a slash is located. After the first line populates slashList, directoryName simply grabs the text contained between the first and second slashes.
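
You can convince yourself of what those two lines do at a Python prompt; for example:

linkText = "/images/picture1.html"
slashList = [i for i, ind in enumerate(linkText) if ind == '/']
print slashList           # [0, 7]
directoryName = linkText[(slashList[0] + 1) : slashList[1]]
print directoryName       # images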

The next two lines simply check to see if a directory exists that matches directoryName; if it doesn’t, we create it:

if not os.path.exists(directoryName):
    os.makedirs(directoryName)

This completes our downloadProcess function, and with it our simple web bot. Give it a try by pointing it at http://www.irrelevantcheetah.com/browserimages.html and asking for either jpg, pdf, or txt file types, then watch it create folders and download files—all without your help.

Now that you get the idea, you can go crazy with it! Create directories, surf three (and more) levels deep, and see what your bot downloads for you while you’re not looking! Half the fun is sometimes in seeing what gets downloaded when you least expect it!

The Final Code

Here, you can see the final, lengthy code you’ve been typing in, bit by bit, if you’ve been following along as we progressed through the chapter. Again, if you don’t want to type it all, it’s available on Apress.com as webbot.py. However, I highly recommend you type it in, because learning code can be much more effective if you type it rather than simply copying and pasting it. One of my professors used to say that by typing it in, you make the code your own.

import mechanize
import time
from bs4 import BeautifulSoup
import re
import urllib
import string
import os
def downloadProcess(html, base, filetype, linkList):
    "This does the actual file downloading"
    soup = BeautifulSoup(html)
    for link in soup.find_all('a'):
        linkText = str(link.get('href'))
        if filetype in linkText:
            slashList = [i for i, ind in enumerate(linkText) if ind == '/']
            directoryName = linkText[(slashList[0] + 1) : slashList[1]]
            if not os.path.exists(directoryName):
                os.makedirs(directoryName)
            image = urllib.URLopener()
            linkGet = base + linkText
            filesave = string.lstrip(linkText, "/")
            image.retrieve(linkGet, filesave)
        elif "htm" in linkText:  # Covers both "html" and "htm"
            linkList.append(link)
start = "http://" + raw_input ("Where would you like to start searching? ")
filetype = raw_input ("What file type are you looking for? ")
numSlash = start.count('/') #number of slashes in start—need to remove everything after third slash
slashList = [i for i, ind in enumerate(start) if ind == '/'] #list of indices of slashes
if (len(slashList) >= 3): #if there are 3 or more slashes, cut after 3
    third = slashList[2]
    base = start[:third] #base is everything up to third slash
else:
    base = start
br = mechanize.Browser()
r = br.open(start)
html = r.read()
linkList = [] #empty list of links
print "Parsing" + start
downloadProcess(html, base, filetype, linkList)
for leftover in linkList:
    time.sleep(0.1) #wait 0.1 seconds to avoid overloading server
    linkText = str(leftover.get('href'))
    print "Parsing" + base + linkText
    br = mechanize.Browser()
    r = br.open(base + linkText)
    html = r.read()
    linkList = []
    downloadProcess(html, base, filetype, linkList)

Summary

In this chapter, you got a nice introduction to Python by writing a web bot, or spider, that can traverse the Internet for you and download files you find interesting, perhaps even while you sleep. You used a function or two, constructed and added to a list object, and even did some simple string manipulation.

In the next chapter, we’ll transition away from the digital world and interact with a very physical phenomenon—the weather.
