Chapter 9. Robots Exclusion Protocol

The story of the Robots Exclusion Protocol (REP) begins with the introduction of the robots.txt protocol in 1993. Its creation was prompted in part by a Perl web crawler hogging the network bandwidth of a site whose owner would become the eventual creator of robots.txt (http://bit.ly/bRB3H).

In 1994, REP was formalized by the consensus of a “majority of robot authors” (http://robotstxt.org/orig.html). Originally, REP was only meant to allow for resource exclusion. This has changed over time to include directives for inclusion.

When we talk about REP today, we are talking about several things: robots.txt, XML Sitemaps, robots meta tags, the X-Robots-Tag HTTP header, and the nofollow link attribute. Understanding REP is important, as it is used for various SEO tasks. Content duplication, hiding unwanted documents from search results, strategic distribution of link juice, and removal of documents from a search engine index are just some of the things REP can assist with.

Compliance with REP is voluntary, and not all search engines adopt it. However, the big three search engines (Yahoo!, Google, and Bing) have adopted a strategy of working together to support REP in an almost uniform way, while also jointly introducing new REP standards. The goal of these efforts is to provide consistent crawler behavior for the benefit of all webmasters.

This chapter covers REP in detail. Topics include robots.txt and its associated directives, HTML meta directives, the .htaccess file for simple access control, and the X-Robots-Tag HTTP header. We will also discuss ways of dealing with rogue spiders.

Understanding REP

Before we dig deeper into REP, it is important to reiterate the key differences between indexing and crawling (also known as spidering). There is no guarantee that a crawled document will also be indexed.

Crawling Versus Indexing

Crawling is the automated, systematic process of web document retrieval initiated by a web spider. The precursor to indexing, crawling has no say in how pages are ranked. Ideally, all crawling activities should be governed by the agreed-upon standards of REP.

Indexing is the algorithmic process of analyzing and storing the crawled information in an index. Each search engine index is created with a set of rules, ranking factors, or weights governing how pages rank for a particular keyword.

Why Prohibit Crawling or Indexing?

You may want to prohibit crawling or indexing for many reasons. Sometimes this is done on just a few pages or documents within certain portions of a site, and other times it is done across the entire site. Here are some typical scenarios.

New sites

Say you’ve just purchased your domain name. Unless you have already changed the default DNS server assignments, chances are that when you type in your domain name, you will land on a domain parking page served by your domain registrar. It can be somewhat annoying to see the registrar’s advertisements plastered all over your domain while it passes (at least temporarily) any link juice your domain has to the registrar’s sites.

Most people in this situation will put up an “Under Construction” page or something similar. If that is the case, you really do not want search engines to index this page. So, in your index.html (or equivalent) file, add the following robots meta tag:

<meta name="robots" content="noindex">

The suggested practice is to have a “Coming Soon” page outlining what your site will be all about. This will at least give your visitors some idea of what to expect from your site in the near future. If for some reason you want to block crawling of your entire site, you can simply create a robots.txt file in the root web folder:

User-agent: *
Disallow: /

The star character (*) matches all web spiders. The trailing slash character (/) signifies everything after the base URL or domain name, including the default document (such as index.html).

Content duplication

Content duplication is a common issue in SEO. When your site is serving the same content via different link combinations, it can end up splitting your link juice across the different link permutations for the page.

The robots.txt file can be helpful in blocking the crawling of duplicated content, as we will discuss in some examples later in this chapter. We will discuss content duplication in detail in Chapter 14.
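
For example, a robots.txt fragment along the following lines (the URL parameters shown are hypothetical) could keep crawlers away from printer-friendly and session-tracked duplicates of the same content; more detailed examples appear later in this chapter:

User-agent: *
Disallow: /*print=1$
Disallow: /*sessionid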

REP and document security

Hiding specific pages or files from the SERPs can be helpful, but when it comes to documents that should be accessed only by authenticated clients, REP falls short. Some sort of authentication system is required. One such system is the .htaccess (HTTP Basic authentication) method.

Protecting directories with .htaccess

Implementing .htaccess protection is relatively straightforward—if you are running your site on an Apache web server on a Linux-flavored OS. Let’s look at an example. Suppose we have a directory structure as follows:

./public_html
./public_html/images
./public_html/log
./public_html/private

We are interested in using Basic HTTP authentication for the private subdirectory. Using Basic HTTP authentication requires the creation of two text files, namely .htaccess and .htpasswd. Here is how the .htaccess file might look:

AuthType Basic
AuthName "Restricted Area"
AuthUserFile "/home/nonpublicfolder/.htpasswd"
require valid-user

AuthType corresponds to the authentication type, which in our case is Basic HTTP authentication. AuthName corresponds to a string used in the authentication screen after the “The site says:” part, as shown in Figure 9-1. AuthUserFile corresponds to the location of the password file.

Basic HTTP authentication
Figure 9-1. Basic HTTP authentication

The .htpasswd file might look as follows:

guest:uQbLtt/C1yQXY
john:l1Ji97iyY2Zyc
jane:daV/8w4ZiSEf.
mike:DkSWPG/1SuYa6
tom:yruheluOelUrg

The .htpasswd file is a colon-separated combination of username and password strings. Each user account is described on a separate line, and the passwords are stored in encrypted form. You can obtain the encrypted values in several ways. Typically, you would use the following shell command:

htpasswd -c .htpasswd guest

Upon executing this command, you are asked to enter the password for user guest. Finally, the .htpasswd text file is created. On subsequent additions of users, simply omit the -c command-line argument to append another user to the existing .htpasswd file. You can also create encrypted passwords by using the online service available at the Dynamic Drive website (http://tools.dynamicdrive.com/password/).
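
If you prefer to generate the entries programmatically, the following PHP sketch produces a compatible line, assuming the traditional DES-based crypt() format that htpasswd uses by default on many systems (the username and password shown are placeholders):

<?php
// Sketch: build an .htpasswd-style entry using PHP's crypt() with a
// two-character salt (traditional DES format used by htpasswd on many systems).
$user = 'guest';
$password = 'secret';
$saltChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789./';
$salt = substr(str_shuffle($saltChars), 0, 2);
echo $user . ':' . crypt($password, $salt) . "\n";  // e.g., guest:uQbLtt/C1yQXY
?>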

Website maintenance

All sites need to perform maintenance at some point. During that time, Googlebot and others might try to crawl your site. This raises the question: should spiders be given a meaningful signal rather than being served a generic maintenance page? The last thing we want is for search engines to index our maintenance page.

The most appropriate response is to issue the HTTP 503 header. The HTTP 503 response code signifies service unavailability. According to the W3C, the HTTP 503 code is defined as follows (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html):

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

Handling this scenario is easy with some PHP code. Using PHP’s header() function, we can write the appropriate headers. Here is how the code might look:

<?php
ob_start();
header('HTTP/1.1 503 Service Temporarily Unavailable');
header('Status: 503 Service Temporarily Unavailable');
header('Retry-After: 7200');
?><html>
<head>
<title>503 Service Temporarily Unavailable</title>
</head>

<body>
<strong>Service Temporarily Unavailable</strong><br><br>
We are currently performing system upgrades. The website will not be
available until 5 a.m. <br>Please try again later; we apologize for any
inconvenience.
<!--file:maintenance.php-->
</body>
</html>

The sample code accomplishes two things: it communicates the service outage to the user, and it lets spiders know they should not crawl the site during this time. Note that for this solution to work fully, you also need a rewrite rule that forwards all domain requests to this page. You can do this with a simple .htaccess file placed in the root folder of your website. Here is how the file might look:

Options +FollowSymlinks

RewriteEngine on
RewriteCond %{REQUEST_URI} !/maintenance.php$
RewriteCond %{REMOTE_HOST} !^57.168.228.28

RewriteRule .* /maintenance.php [R=302,L]

The .htaccess file also allows you as the administrator to browse the site from your hypothetical IP address (57.168.228.28) during the upgrade. Once the upgrade is complete, the maintenance .htaccess file should be renamed (or removed) while restoring the original .htaccess file (if any).

Anyone hitting the site during the maintenance period would see the following HTTP headers, as produced by running the maintenance.php script:

GET /private/test.php HTTP/1.1
Host: www.seowarrior.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7)
Gecko/2009021910 Firefox/3.0.7
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=308252d57a74d86d36a7fde552ff2a7f

HTTP/1.x 503 Service Temporarily Unavailable
Date: Sun, 22 Mar 2009 02:49:30 GMT
Server: Apache/1.3.41 (Unix) Sun-ONE-ASP/4.0.3 Resin/3.1.6
mod_fastcgi/2.4.6 mod_log_bytes/1.2 mod_bwlimited/1.4
mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.8.31
OpenSSL/0.9.7a
Retry-After: 7200
X-Powered-By: PHP/5.2.6
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

The 503 status line and the Retry-After header are the values we specified in the PHP file. The first represents the HTTP status code (503). The second tells the web crawler to try crawling the site again in two hours (7,200 seconds).

Saving website bandwidth

Although website bandwidth is relatively inexpensive today compared to the early days of the mainstream Internet, it still adds up in terms of cost, especially for large websites. If you are hosting lots of media content, including images, audio, and video, you may want to prohibit crawlers from accessing this content. In many cases, you may also want to prohibit crawlers from fetching your CSS, JavaScript, and other supporting files.

Before implementing any crawler blocking, first check your web logs for spider activity to see where your site is taking hits. Keep in mind that anything you block from crawling will potentially take away some traffic. For example, all the big search engines provide specialized image search, and sometimes people discover your site solely through image searches.
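
As a quick, informal check (assuming a typical Apache access log location), you can count requests from the major crawlers directly from the shell:

grep -ci "googlebot" /var/log/apache2/access.log
grep -ci "slurp" /var/log/apache2/access.log
grep -ci "msnbot" /var/log/apache2/access.log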

Preventing website performance hits

Certain web pages may be hogging your website’s CPU cycles. Taking additional hits on these pages during peak utilization hours can further degrade your site’s performance, which your visitors will surely notice.

Sometimes you have no choice but to keep these pages visible to crawlers. In that case, you can apply several performance optimizations, including web page caching, web server compression, content compression (images, video, etc.), and many other techniques. Other times, your only option is to limit crawling activity by preventing crawl access to CPU-intensive pages.
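
For reference, here is a minimal .htaccess sketch of the caching and compression ideas just mentioned, assuming an Apache 2.x server with the mod_deflate and mod_expires modules enabled:

<IfModule mod_deflate.c>
# Compress text-based responses before sending them to clients and crawlers
AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>

<IfModule mod_expires.c>
# Let browsers cache images for a month to reduce repeat hits
ExpiresActive On
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
</IfModule>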

More on robots.txt

Using robots.txt is the original way to tell crawlers what not to crawl. This method is particularly helpful when you do not want search engines to crawl certain portions or all portions of your website. Maybe your website is not ready to be browsed by the general public, or you simply have materials that are not appropriate for inclusion in the SERPs.

When you think of robots.txt, it needs to be in the context of crawling, never indexing; think of it as a set of rules for document access on your website. The robots.txt standard is almost always applied at the sitewide level, whereas the robots HTML meta tag is applied at the page level or lower. It is possible to use robots.txt for individual files, but you should avoid this practice due to the additional maintenance overhead.

Note

Not all web spiders interpret or support the robots.txt file in the same way. Although the big three search engines have started to collaborate on the robots.txt standard, they still deviate in how they support it.

Is robots.txt an absolute requirement for every website? In short, no; but the use of robots.txt is highly encouraged, as it can play a vital role in SEO issues such as content duplication.

Creation of robots.txt

Creating robots.txt is straightforward and can be done in any simple text editor. Once you’ve created the file, you should give it read permissions so that it is visible to the outside world. On a Linux or Unix server, you can do this by executing the following command:

chmod 644 robots.txt

Validation of robots.txt

All robots.txt files need to be verified for syntax and functional validity. This is very important for large or complex robots.txt files, because a single small mistake can affect an entire site.

Many free validators are available on the Web. One such tool is available in the Google Webmaster Tools platform. Microsoft also provides a robots.txt validator service. For best results, verify your robots.txt files on both platforms.

Google’s robots.txt analysis tool is particularly useful, as you can verify specific URLs against your current robots.txt file to see whether they would be crawled. This can be helpful when troubleshooting current problems or when identifying potential problems.

Placement of robots.txt

The robots.txt file must reside in the root folder of your website. This is the agreed standard and there are no exceptions to this rule. A site can have only one robots.txt file.
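
For example (using a hypothetical domain), crawlers will request the file only at the web root; a robots.txt placed anywhere else is ignored:

http://www.example.com/robots.txt          correct location
http://www.example.com/files/robots.txt    ignored by crawlers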

Important Crawlers

Not all crawlers are created equal. Some crawlers crawl your web pages, whereas others crawl your images, news feeds, sound files, video files, and so forth. Table 9-1 summarizes the most popular web crawlers.

Table 9-1. Important crawlers

Google:

  Googlebot              Crawls web pages (it’s the most important of the bunch)
  Googlebot-Mobile       Crawls pages specifically designed for mobile devices
  Googlebot-Image        Crawls images for inclusion in image search results
  Mediapartners-Google   Crawls AdSense content
  AdsBot-Google          Crawls AdWords landing pages to measure their quality

Yahoo!:

  Slurp                  Crawls web pages
  Yahoo-MMAudVid         Crawls video files
  Yahoo-MMCrawler        Crawls images

Bing:

  MSNBot                 Crawls web pages
  MSNBot-Media           Crawls media files
  MSNBot-News            Crawls news feeds

Thousands of crawlers are operating on the Internet. It would make no sense to pay attention to all of them. Depending on your site’s content and the region you are targeting, you may need to pay more attention to other crawlers. Refer to the robots database located at http://www.robotstxt.org/db.html for more information on many other web crawlers.

Understanding the robots.txt Format

The robots.txt file is composed of a set of directives preceded by specific user-agent heading lines signifying the start of directives for a particular crawler. The following is an example robots.txt file that instructs three different crawlers:

1 User-agent: *
2 Disallow: /
3 Allow: /blog/
4 Allow: /news/
5 Allow: /private
6
7 User-agent: msnbot
8 Disallow:
9
10 User-agent: googlebot
11 Disallow: /cgi-bin/

The first five lines of this example are the instructions for the catchall web crawler. There is a single Disallow directive and three Allow directives. The remaining lines are specific instructions for Google and Bing crawlers.

If Yahoo!’s Slurp paid a visit to this site, it would honor the first five lines of robots.txt—as it does not have its own custom entry within this robots.txt file. This example has several interesting scenarios. For instance, what would Slurp do if it had the following URLs to process?

  • http://mydomain.com/blog

  • http://mydomain.com/blog/

Which URL would it crawl? The answer would be the second URL, as it fully matches the Allow directive on line 3 of the preceding code. The trailing slash signifies a directory, whereas the absence of the trailing slash signifies a file. In this example, it is quite possible that the web server would return the same page.

Similarly, what would happen if Slurp had the following URL references?

  • http://mydomain.com/privatejet.html

  • http://mydomain.com/private/abcd.html

Slurp would crawl both of these URLs, as they both match the Allow pattern on line 5. Line 5 is also the longest matching directive, and the longest matching directive always takes precedence!

Robots.txt Directives

The following subsections talk about the various supported robots.txt directives.

The Allow directive

The Allow directive tells web crawlers that the specified resources can be crawled. When multiple directives are applicable to the same URL, the longest expression takes precedence. Suppose we had a robots.txt file as follows:

1 User-agent: *
2 Disallow: /private/
3 Allow: /private/abc/

We have one Disallow directive for the private directory. In addition, we have one Allow directive for the abc subdirectory. Googlebot comes along and has to decide which of the following URL references it needs to crawl (a representative set for illustration):

  • http://mydomain.com/private/abc/

  • http://mydomain.com/private/index.html

  • http://mydomain.com/private/abc/index.html

  • http://mydomain.com/about.html

The answer is that Googlebot needs to crawl all but the second URL, since the longest directive it matches is the Disallow directive on line 2 of the preceding code. The first URL is a perfect match of the Allow directive. The third URL also matches the Allow directive. Finally, the last URL is allowed, since it does not match the Disallow directive.

When in doubt, always ask the following question: what is the longest directive that can be applied to a given URL? As a rule of thumb, if no directive can be applied to a given URL, the URL would be allowed.

Let’s examine another example. What happens if you have an empty Allow directive? Here is the code fragment:

User-agent: *
Allow:

The simple answer is that nothing happens. It makes no sense to place Allow directives when there is not a single Disallow directive. In this case, all documents would be allowed for crawling.

The Disallow directive

The Disallow directive was the original REP directive. It signifies the webmaster’s desire to prohibit web crawlers from crawling specified directories or files. Many sites choose to use this directive only. Its basic format is:

Disallow: /directory/
Disallow: /file.ext

For the sake of completeness, let’s see what happens if we have an empty Disallow directive. Consider the following code fragment:

User-agent: *
Disallow:

In this case, all documents would be allowed for crawling. This seems counterintuitive. But let’s think for a moment. Yes, the Disallow directive is used, but it does not have any arguments. Since we did not specify any file (or directory) to disallow, everything would be allowed. Let’s look at another example:

User-agent: *
Disallow: /

In this case, everything would be disallowed for crawling. The forward slash (/) argument to Disallow signifies the web root of your site. What a difference a single character makes! Note that both Allow and Disallow can employ wildcard characters.

The wildcard directives

Two wildcards are used in robots.txt: $ and *. Both have functionality similar to how they’re used in Perl regular expression matching. You use the dollar sign wildcard ($) when you need to match everything from the end of the URL. You use the star wildcard character (*) to match zero or more characters in a sequence. Let’s look at some examples in the following fragment:

1 User-agent: *
2 Disallow: *.gif$
3
4 User-Agent: Googlebot-Image
5 Disallow: *.jpg$
6
7 User-Agent: googlebot
8 Disallow: /*sessionid

The catchall block (the first two lines) prohibits all obeying spiders from crawling all GIF images. The second block tells Google’s image crawler not to crawl any JPG files. The last block speaks to Googlebot. In this case, Googlebot should not crawl any URLs that contain the sessionid string. As an example, suppose we have Googlebot and Googlebot-Image crawlers deciding what to do with the following link references:

  • http://mydomain.com/images/header.jpg

  • http://mydomain.com/product?id=1234&sessionid=h7h1g29k83xh&

  • http://mydomain.com/images/header.gif

The net result would be that the Googlebot-Image crawler would not crawl header.jpg but would crawl header.gif. The GIF image would be crawled because Googlebot-Image ignores the catchall crawler segment once it finds its own segment. Furthermore, Googlebot would not crawl the second URL, which contains a match for the Disallow directive on line 8.

The Sitemap location directive

The big three search engines collaborated on the introduction of the Sitemap directive. Here is an excerpt from Yahoo! Search Blog:

All search crawlers recognize robots.txt, so it seemed like a good idea to use that mechanism to allow webmasters to share their Sitemaps. You agreed and encouraged us to allow robots.txt discovery of Sitemaps on our suggestion board. We took the idea to Google and Microsoft and are happy to announce today that you can now find your sitemaps in a uniform way across all participating engines.

The Sitemap location directive simply tells the crawler where your Sitemap can be found. Here is an example of the robots.txt file utilizing the Sitemap directive:

Sitemap: http://mydomain.com/sitemap.xml

User-Agent: *
Disallow:

The location of the Sitemap directive is not mandated. It can be anywhere within the robots.txt file. For full coverage of the Sitemaps protocol, see Chapter 10.
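
Multiple Sitemap directives are also permitted in a single robots.txt file, which is handy if you split your Sitemaps by content type (the filenames below are hypothetical):

Sitemap: http://mydomain.com/sitemap-pages.xml
Sitemap: http://mydomain.com/sitemap-images.xml

User-Agent: *
Disallow: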

The Crawl-delay directive

Only Yahoo! and Bing use the Crawl-delay directive. They use it to tell the Slurp and MSNBot crawlers how long to wait between successive requests to your site. Here is an example:

Sitemap: http://mydomain.com/sitemap.xml

User-agent: *
Disallow: /private/

User-agent: Slurp
Disallow:
Crawl-delay: 0.5

User-agent: MSNBot
Disallow:
Crawl-delay: 15

In this example, we have three crawler blocks in addition to the Sitemap directive placed at the beginning of robots.txt. The first crawler block is of no interest to us, as it is the catchall crawler. The second block tells the Slurp crawler to use a Crawl-delay value of 0.5. The last block instructs the MSNBot crawler to use a Crawl-delay of 15.

The values of the Crawl-delay directive have different meanings to Slurp and MSNBot. For MSNBot, it represents the number of seconds between each crawl. The current allowed range is between one and 600 seconds.

Slurp interprets the Crawl-delay directive a bit differently. Here is what Yahoo! says:

Setting the “delay value” in robots.txt to a high value, for example 5 or 10, sets a greater delay for Yahoo! web crawlers accessing your server. Yahoo! suggests that you start with small values (0.5–1), and increase the “delay value” only as needed for an acceptable and comfortable crawling rate for your server. Larger “delay values” add more latency between successive crawling and results in decreasing the optimum Yahoo! web search results of your web server.

Although Google does not support the Crawl-delay directive via robots.txt, it does have its own alternative. Under the Settings section of Google Webmaster Tools, you can set up the specific crawl rate for your site, as shown in Figure 9-2.

Setting the site’s crawl rate in Google Webmaster Tools
Figure 9-2. Setting the site’s crawl rate in Google Webmaster Tools

You do not have to set up Google’s crawl rate unless you are experiencing performance degradations attributed to crawler activities. The same is true of the Crawl-delay directive.

Case Sensitivity

Apache and IIS web servers handle filenames and URLs in different ways. Apache is case-sensitive, which can be attributed to its Unix roots. IIS is case-insensitive. This could have implications when it comes to content duplication issues and link canonicalization.

The robots.txt file is case-sensitive. If you are running your website on an IIS platform, you will need to pay close attention to your file and directory naming conventions. This is especially the case if you have various (but really the same) link references to the same content. Let’s consider an example. The following fragment shows an arbitrary robots.txt file:

User-Agent: *
Disallow: /Shop*/*sessionid

In this example, we are dealing with a catchall crawler block. There is a single Disallow directive containing two wildcards. All URLs that match this pattern should be blocked from crawling.

Now, suppose that crawlers have the following link references to process:

  • http://mydomain.com/Shop-computers/item-1234/sessionid-3487563847/

  • http://mydomain.com/shop-printers/item-8254/sessionid-3487563847/

Based on the Disallow directive and the fact that robots.txt files are case-sensitive, only the first link would be blocked (not crawled).

Common robots.txt Configurations

You can view many examples of common robots.txt configurations on the Web just by issuing the filetype:txt robots.txt query in Google. The following subsections discuss some typical uses of robots.txt: blocking image crawling, allowing crawls by Google and Yahoo! only, blocking crawls of Microsoft Office documents, and blocking crawls by the Internet Archive.

Disallowing image crawling

You can disallow image crawling in a few ways. These include prohibiting specific image crawlers, prohibiting specific image folders, and prohibiting specific file extensions.

Let’s suppose we want to prohibit all image crawling for the most popular search engines. We also want to accomplish this on the sitewide level. The following fragment shows how we can do this:

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: msnbot-media
Disallow: /

User-Agent: Googlebot-Image
Disallow: /

In this example, all other crawlers would be able to crawl your images. In some situations you might not care about other crawlers. You can extend this example by explicitly using folder names.

You should always store your images in centralized directories to allow for easy maintenance, and to make it easier to block crawlers. The following fragment speaks to the major crawlers in addition to any other crawlers obeying REP:

User-agent: Yahoo-MMCrawler
Disallow: /images/
Allow: /images/public/

User-agent: msnbot-media
Disallow: /images/
Allow: /images/public/

User-Agent: Googlebot-Image
Disallow: /images/
Allow: /images/public/

User-Agent: *
Disallow: /images/

In this example, we are blocking all crawlers from crawling all images in the images folder, with one exception. Google, Yahoo!, and Bing are allowed to crawl any images that contain /images/public/ in their URL path.

The easiest way to block all images is by using robots.txt wildcards. Suppose you are using only three image types: GIF, PNG, and JPG. To block crawling of these types of images, you would simply use the following code:

User-agent: *
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.png$

Allowing Google and Yahoo!, but rejecting all others

If your target audience is in North America or Europe, you may want to target only Google and Yahoo!. The following fragment shows how to implement this sort of setup:

User-agent: slurp
Disallow:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /

Blocking Office documents

You already saw examples of blocking specific image file types. Nothing is restricting you from blocking the crawling of many other file types. For example, to restrict access to Microsoft Office files, simply use the following fragment:

User-agent: *
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.ppt$
Disallow: /*.mpp$
Disallow: /*.mdb$

Blocking the Internet Archive

Did you know that the Internet Archive website contains billions of web pages from the mid-1990s onward? For archiving purposes, the Internet Archive uses its own crawler, called ia_archiver. Most website owners are not even aware of this, and it may be something you do not wish to allow. To block your site from being archived, use the following fragment:

User-agent: ia_archiver
Disallow: /

Summary of the robots.txt Directives

It took many years before major search engines started to endorse a common approach in supporting REP consistently. Webmasters and content publishers do appreciate this “coming together” of the big three search engines, as they can now develop sites more easily by having a nearly identical set of rules when it comes to REP. Table 9-2 provides a summary of all of the robots.txt directives.

Table 9-2. REP directives summary

Allow
  Instructs crawlers to crawl a specific page (or resource).
  Example:
    Allow: /cgi-bin/report.cgi
  This code instructs crawlers to crawl the report.cgi file.

Disallow
  Instructs crawlers not to crawl all or parts of your site. The only
  exception to the rule is the robots.txt file itself.
  Example:
    Disallow: /cgi-bin/
  This code prohibits crawlers from crawling your cgi-bin folder.

Sitemap
  Instructs crawlers where to find your Sitemap file.
  Example:
    Sitemap: http://domain.com/sitemap.xml
  Hint: use absolute URLs for cross-search engine compatibility. Multiple
  Sitemap directives are allowed.

$ wildcard
  Anchors the match to the end of the URL.
  Example:
    Disallow: /*.pdf$
  This code prohibits crawlers from crawling PDF files.

* wildcard
  Instructs crawlers to match zero or more characters.
  Example:
    Disallow: /search?*
  All URLs beginning with /search? will be blocked from crawling.

Crawl-delay
  Directive specific to MSNBot and Slurp specifying a search engine-specific
  crawl delay. Google does not support this directive.
  Example:
    Crawl-delay: 5

Robots Meta Directives

The robots meta directives were introduced a few years after robots.txt. Operating at the page (or document) level only, they provide indexing instructions to obeying search engines. There are two types of meta directives: those that are part of the HTML page, and those that the web server sends as HTTP headers.

HTML Meta Directives

HTML meta directives are found in the actual HTML page. According to the W3C:

The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.

You place these directives within the HTML <head> tag. The big three search engines support several common directives, as listed in Table 9-3.

Table 9-3. HTML meta directives

Noindex
  Instructs search engines not to index this page.
  Example:
    <meta name="robots" content="noindex" />

Nofollow
  Instructs search engines not to follow (or pass any link juice through)
  this page’s outbound links.
  Example (for all spiders):
    <meta name="robots" content="nofollow" />

Nosnippet
  Instructs search engines not to show a snippet (description) for this page
  in the search results.
  Example:
    <meta name="robots" content="nosnippet" />

Noarchive
  Instructs search engines not to show a cache link for this page.
  Example:
    <meta name="robots" content="noarchive" />

Noodp
  Instructs search engines not to use Open Directory Project descriptions in
  the SERPs.
  Example:
    <meta name="robots" content="noodp" />

Follow
  Default implied directive that says “follow” all outbound links.
  Example:
    <meta name="robots" content="follow" />

Index
  Default implied directive that says “index” this page.
  Example:
    <meta name="robots" content="index" />

Mixing HTML meta directives

It is perfectly acceptable to combine HTML meta directives into a single meta tag. For example, you may want to noindex and nofollow a particular page. You would simply list both directives, separated by a comma, as in the following example:

<meta name="robots" content="noindex,nofollow" />

Let’s say you have a page that is not your canonical (preferred) page, but is being linked to by some highly trusted site. In this case, you may want to do the following:

<meta name="robots" content="noindex,follow" />

In this example, crawlers can follow any outbound links away from this page, since some of those links may point to your canonical pages (such as your home page).

Targeting HTML meta tags

You can define HTML meta tags to target specific spiders. For example, to instruct Googlebot and Slurp not to index a particular page, you could write:

<meta name="googlebot" content="noindex" />
<meta name="slurp" content="noindex" />

Yahoo!-specific directives

Yahoo! uses a couple of extra directives, namely noydir and robots-nocontent. The first directive is used to instruct Yahoo! not to use Yahoo! directory descriptions for its search engine results pertaining to this page. Here is the basic format:

<meta name="robots" content="noydir" />

The second directive, robots-nocontent, operates on other HTML tags. It is not to be used with meta tags. Utilizing this tag allows you to prevent certain portions of your page, such as navigational menus, from being considered by Yahoo!’s indexing algorithms. The net effect of this method is increased page copy relevance (higher keyword density). Here are some examples:

<div class="robots-nocontent"><!--menu html--></div>

<span class="robots-nocontent"><!--footer html--></span>

<p class="robots-nocontent"><!-- ads html--></p>

These examples should be self-explanatory. It is advantageous to mark up text that is unrelated to the page’s basic theme or topic. Note that there are other ways of achieving the same effect for all search engines.

One such way is by loading ads in iframes. The actual ads would be stored in external files loaded by iframes at page load time. Crawlers typically ignore iframes. You could also use Ajax to achieve the same effect.
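
Here is a minimal sketch of the iframe approach; the ad markup lives in a separate file (the filename is hypothetical), so most crawlers will not treat it as part of the page copy:

<!-- ads are kept in an external file and pulled in at page load time -->
<iframe src="/ads/banner.html" width="468" height="60"
        frameborder="0" scrolling="no"></iframe>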

Google-specific directives

There are three Google-specific directives: unavailable_after, noimageindex, and notranslate. The first directive, unavailable_after, instructs Google to remove the page from its index after the specified date and time. Here is an example:

<meta name="googlebot" content="unavailable_after: 12-Sep-2009
12:00:00 PST">

In this example, Googlebot is instructed to remove the page from its index after September 12, 2009, at 12:00:00 Pacific Standard Time. The second directive, noimageindex, instructs Google not to show the page as the referring page for any images that appear in Google’s image SERPs. Here is the format:

<meta name="googlebot" content="noimageindex">

The third directive, notranslate, instructs Google not to translate the page or specific page elements. It has two formats. The following fragment illustrates the meta format:

<meta name="google" value="notranslate">

Sometimes it is useful to translate only certain parts of a page. The way to accomplish that is by identifying page elements that are not to be translated. The following fragment illustrates how you can do this:

<span class="notranslate">Company Name, Location</span>
<p class="notranslate">Brand X Slogan</p>

HTTP Header Directives

We need HTTP header directives because not all web documents are HTML pages. Search engines index a wide variety of documents, including PDF files and Microsoft Office files. Each page-level directive has an HTTP header equivalent.

For example, let’s say your site has Microsoft Word files that you do not want indexed, cached, or used for search result descriptions. If you are using an Apache web server, you could add the following lines to your .htaccess file:

<FilesMatch "\.doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

In this example, the three directives will be sent in an X-Robots-Tag HTTP header by the Apache web server (the Header directive requires the mod_headers module). Webmasters who prefer to do this in code can use built-in PHP functions.
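
For example, a short PHP sketch along the following lines would attach the same directives to any document the script serves (this assumes the header is sent before any output is written):

<?php
// Sketch: send the X-Robots-Tag header from PHP before any output is written.
header('X-Robots-Tag: noindex, noarchive, nosnippet');
// ...generate and output the non-HTML document (for example, a PDF) here...
?>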

A discussion of REP would not be complete without the inclusion of the nofollow link attribute. This attribute was introduced to discourage comment spammers from adding their links. The basic idea is that links marked with the nofollow attribute will not pass any link juice to the spammer sites. The format of these links is as follows:

<a href="http://www.spamsite.com/" rel="nofollow">some nonesensical
text</a>

In this case, the hypothetical website http://www.spamsite.com would not receive any link juice from the referring page. Spammers will continue to attack sites. In most cases they are easily detected. Here are some example posts:

Please visit my great <a href="http://www.spamsite.com/"
rel="nofollow">Viagra Site</a> for great discounts.

<a href="http://www.spamsite2.com/" rel="nofollow">Free Porn Site</a>

<a href="http://www.spamsite3.com/" rel="nofollow">I was reading
through your site and I must say it is great! Good Job.</a>

We discuss nofollow link attributes in several other parts of this book.

Dealing with Rogue Spiders

Not all crawlers will obey REP. Some rogue spiders will go to great lengths to pose as one of the big spiders. To deal with this sort of situation, we can utilize the fact that major search engines support reverse DNS crawler authentication.

Reverse DNS Crawler Authentication

Setup of reverse DNS crawler authentication is straightforward. Yahoo! describes the procedure on its search blog:

  1. For each page view request, check the user-agent and IP address. All requests from Yahoo! Search utilize a user-agent starting with ‘Yahoo! Slurp.’

  2. For each request from ‘Yahoo! Slurp’ user-agent, you can start with the IP address (i.e. 74.6.67.218) and use reverse DNS lookup to find out the registered name of the machine.

  3. Once you have the host name (in this case, lj612134.crawl.yahoo.net), you can then check if it really is coming from Yahoo! Search. The name of all Yahoo! Search crawlers will end with ‘crawl.yahoo.net,’ so if the name doesn’t end with this, you know it’s not really our crawler.

  4. Finally, you need to verify the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2. If it doesn’t, it means the name was fake.

As you can see, it is relatively easy to check for rogue spiders by using the reverse DNS approach. Here is the Yahoo! approach translated to PHP code:

<?php
$ua = $_SERVER['HTTP_USER_AGENT'];
$httpRC403 = "HTTP/1.0 403 Forbidden";
$slurp = 'slurp';
if(stristr($ua, $slurp)){
   $ip = $_SERVER['REMOTE_ADDR'];
   $host = gethostbyaddr($ip);
   $slurpDomain = '.crawl.yahoo.net';
   if(!preg_match("/" . preg_quote($slurpDomain, "/") . "$/", $host) ) {
      header("$httpRC403");
      exit;
   } else {
      $realIP = gethostbyname($host);
      if($realIP != $ip){
         header("$httpRC403");
         exit;
      }
   }
}
?>

You could extend this script to include other bots. Including this file in all of your PHP files is straightforward with the PHP include command:

<?php include("checkcrawler.php"); ?>

Most CMSs, blogs, and forums are based on modern application frameworks. It is likely that you would have to include this file in only one template file that would be rendered in every page.

Summary

A full understanding of Robots Exclusion Protocol is crucial. Note that REP is not fully supported in the same way by every search engine. The good news is that the most popular search engines are now working together to offer more uniform REP support. The benefits of this are obvious in terms of the work required to address the needs of different search engines.

Using robots.txt to block crawlers from specific site areas is an important tactic in SEO. In most cases, you should use robots.txt at the directory (site) level. With the introduction of wildcards, you can handle common SEO problems such as content duplication with relative ease. Although the use of the Sitemap directive is a welcome addition to robots.txt, Google is still encouraging webmasters to add their Sitemaps manually by using the Google Webmaster Tools platform.

Using HTML meta tags and their HTTP header equivalents is a way to specify indexing directives on the page level for HTML and non-HTML file resources. These types of directives are harder to maintain and you should use them only where required.

Not all web spiders honor REP. At times, it might be necessary to block their attempts to crawl your site. You can do this in many different ways, including via coding, server-side configuration, firewalls, and intrusion detection devices.
