With great power...

As an HTTP client developer, you may have different priorities from the webmasters who run the websites you access. A webmaster typically provides a site for human users, possibly offers a service designed to generate revenue, and most likely has to do all of this with very limited resources. They will be interested in analyzing how humans use their site, and they may have areas of the site that they would prefer automated clients didn't explore.

HTTP clients that automatically parse and download pages from websites are known by various names, such as bots, web crawlers, and spiders. Bots have many legitimate uses. All of the search engine providers make extensive use of bots to crawl the web and build their huge page indexes. Bots can be used to check for dead links, and to archive sites for repositories, such as the Wayback Machine. However, there are also uses that might be considered illegitimate: automatically traversing an information service to extract the data from its pages and then repackaging it for presentation elsewhere without the site owners' permission, downloading large batches of media files in one go when the spirit of the service is online viewing, and so on. Some sites have terms of service that explicitly bar automated downloads. Although some actions, such as copying and republishing copyrighted material, are clearly illegitimate, others are subject to interpretation. This gray area is the subject of ongoing debate, and it is unlikely that it will ever be resolved to everyone's satisfaction.

However, even when they serve a legitimate purpose, bots generally make webmasters' lives somewhat more difficult. They pollute the web server logs that webmasters use to calculate statistics on how their site is being used by its human audience, and they consume bandwidth and other server resources.

Using the methods that we are looking at in this chapter, it is quite straightforward to write a bot that performs many of the aforementioned functions. Webmasters provide the services that we will be using, so in return we should respect their concerns and design our bots so that they impact websites as little as possible.

Choosing a User Agent

There are a few things that we can do to help webmasters out. We should always pick an appropriate user agent for our client. The principal way in which webmasters filter bot traffic out of their logfiles is by performing user agent analysis.

There are lists of the user agents of known bots; for example, one such list can be found at http://www.useragentstring.com/pages/Crawlerlist/.

Webmasters can use these in their filters. Many webmasters also simply filter out any user agent that contains the words bot, spider, or crawler. So, if we are writing an automated bot rather than a browser, then it will make webmasters' lives a little easier if we use a user agent that contains one of these words. Many of the bots used by the search engine providers follow this convention; some examples are listed here:

  • Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  • Baiduspider (+http://www.baidu.com/search/spider.htm)
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

There are also some guidelines on the User-Agent header in section 5.5.3 of RFC 7231, the HTTP/1.1 semantics specification.
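As a brief illustration, the following is a minimal sketch of how a custom user agent can be supplied using urllib.request from the standard library. The bot name, information URL, and target URL are invented for the example.

    import urllib.request

    # A made-up user agent string; it follows the convention of including the
    # word "bot" plus a URL where the webmaster can learn more about our client.
    USER_AGENT = 'examplebot/1.0 (+http://www.example.com/bot-info.html)'

    # Build the request with our User-Agent header and fetch the page.
    request = urllib.request.Request(
        'http://www.example.com/',
        headers={'User-Agent': USER_AGENT})

    with urllib.request.urlopen(request) as response:
        content = response.read()

Whatever string you choose, keep it stable and descriptive so that it is easy for webmasters to identify and, if necessary, contact you.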

The Robots.txt file

There is an unofficial but widely followed mechanism for telling bots whether there are any parts of a website that they should not crawl. This mechanism is called robots.txt, and it takes the form of a text file called, unsurprisingly, robots.txt. This file always lives at the root of a website, so that bots can always find it. It contains rules that describe which parts of the website are accessible. The file format is described at http://www.robotstxt.org.

The Python standard library provides the urllib.robotparser module for parsing and working with robots.txt files. You create a parser object, feed it a robots.txt file, and then simply query it to see whether a given URL is permitted for a given user agent. A good example can be found in the standard library documentation. If you check every URL that your client wants to access before you access it, and honor the webmaster's wishes, then you'll be helping them out.
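A quick sketch of how this might look follows; the site URLs and the user agent string are invented for the example.

    import urllib.robotparser

    # Download and parse the site's robots.txt file.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url('http://www.example.com/robots.txt')
    parser.read()

    # Ask whether our (made-up) user agent may fetch a given URL
    # before we actually request it.
    url = 'http://www.example.com/some/page.html'
    if parser.can_fetch('examplebot', url):
        print('robots.txt allows us to fetch', url)
    else:
        print('robots.txt asks us not to fetch', url)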

Finally, since we may be making quite a lot of requests while we test our fledgling clients, it's a good idea to make local copies of the web pages or files that we want our client to parse, and to test against those. This way, we save bandwidth for ourselves and for the websites.
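One simple way of doing this, sketched below with invented file and URL names, is to download the page just once and then run the parsing code that we are developing against the saved copy:

    import os
    import urllib.request

    URL = 'http://www.example.com/'    # the page we want to parse (placeholder)
    LOCAL_COPY = 'example_page.html'   # made-up name for the cached local copy

    # Download the page only if we don't already have a local copy of it.
    if not os.path.exists(LOCAL_COPY):
        request = urllib.request.Request(
            URL, headers={'User-Agent': 'examplebot/1.0'})
        with urllib.request.urlopen(request) as response:
            with open(LOCAL_COPY, 'wb') as f:
                f.write(response.read())

    # Develop and test the parsing code against the local file instead of
    # repeatedly hitting the live website.
    with open(LOCAL_COPY, 'rb') as f:
        html = f.read()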
