The story of the Robots Exclusion Protocol (REP) begins with the introduction of robots.txt in 1993. This is thanks in part to a Perl web crawler that hogged the network bandwidth of a site whose owner would become the eventual robots.txt creator (http://bit.ly/bRB3H).
In 1994, REP was formalized by the consensus of a “majority of robot authors” (http://robotstxt.org/orig.html). Originally, REP was only meant to allow for resource exclusion. This has changed over time to include directives for inclusion.
When we talk about REP today, we are talking about several things: robots.txt, XML Sitemaps, robots meta tags, the X-Robots-Tag HTTP header, and the nofollow link attribute. Understanding REP is important, as it is used for various SEO tasks. Content duplication, hiding unwanted documents from search results, strategic distribution of link juice, and document (search engine) index removal are just some of the things REP can assist with.
Adoption of REP is nonbinding, and it is not necessarily adopted by all search engines. However, the big three search engines (Yahoo!, Google, and Bing) have adopted a strategy of working together in supporting REP in almost a uniform way, while also working together to introduce new REP standards. The goal of these efforts is to provide consistent crawler behavior for the benefit of all webmasters.
This chapter covers REP in detail. Topics include robots.txt and its associated directives, HTML meta directives, the .htaccess file for simple access control, and the X-Robots-Tag HTTP header. We will also discuss ways of dealing with rogue spiders.
Before we dig deeper into REP, it is important to reiterate the key differences between indexing and crawling (also known as spidering). There is no guarantee that a crawled document will also be indexed.
Crawling is the automated, systematic process of web document retrieval initiated by a web spider. The precursor to indexing, crawling has no say in how pages are ranked. Ideally, all crawling activities should be governed by the agreed-upon standards of REP.
Indexing is the algorithmic process of analyzing and storing the crawled information in an index. Each search engine index is created with a set of rules, ranking factors, or weights governing the page rank for a particular keyword.
You may want to prohibit crawling or indexing for many reasons. Sometimes this is done on just a few pages or documents within certain portions of a site, and other times it is done across the entire site. Here are some typical scenarios.
Say you’ve just purchased your domain name. Unless you already changed the default DNS server assignments, chances are that when you type in your domain name, you get to a domain parking page served by your domain registrar. It can be somewhat annoying to see the domain registrar’s advertisements plastered all over your domain while passing (at least temporarily) your domain’s link juice (if any) to its sites.
Most people in this situation will put up an “Under Construction” page or something similar. If that is the case, you really do not want search engines to index this page. So, in your index.html (or equivalent) file, add the following robots meta tag:
<meta name="robots" content="noindex">
The suggested practice is to have a “Coming Soon” page outlining what your site will be all about. This will at least give your visitors some ideas about what to expect from your site in the near future. If for some reason you want to block crawling of your entire site, you can simply create a robots.txt file in the root web folder:
User-agent: *
Disallow: /
The star character (*) implies all web spiders. The trailing slash character (/) signifies everything after the base URL or domain name, including the default document (such as index.html).
Content duplication is a common issue in SEO. When your site is serving the same content via different link combinations, it can end up splitting your link juice across the different link permutations for the page.
The robots.txt file can be helpful in blocking the crawling of duplicated content, as we will discuss in some examples later in this chapter. We will discuss content duplication in detail in Chapter 14.
Hiding specific pages or files from the SERPs can be helpful. When it comes to documents that should be accessed by only authenticated clients, REP falls short in providing that sort of functionality. Using some sort of authentication system is required. One such system is the .htaccess (Basic authentication) method.
Implementing .htaccess protection is relatively straightforward—if you are running your site on an Apache web server on a Linux-flavored OS. Let’s look at an example. Suppose we have a directory structure as follows:
./public_html
./public_html/images
./public_html/log
./public_html/private
We are interested in using Basic HTTP authentication for the private subdirectory. Using Basic HTTP authentication requires the creation of two text files, namely .htaccess and .htpasswd. Here is how the .htaccess file might look:
AuthType Basic
AuthName "Restricted Area"
AuthUserFile "/home/nonpublicfolder/.passwd"
require valid-user
AuthType corresponds to the authentication type, which in our case is Basic HTTP authentication. AuthName corresponds to a string used in the authentication screen after the “The site says:” part, as shown in Figure 9-1. AuthUserFile corresponds to the location of the password file.
The .passwd file might look as follows:
guest:uQbLtt/C1yQXY
john:l1Ji97iyY2Zyc
jane:daV/8w4ZiSEf.
mike:DkSWPG/1SuYa6
tom:yruheluOelUrg
The .passwd file is a colon-separated combination of username and password strings. Each user account is described in a separate line. The passwords are encrypted. You can obtain the encrypted values in several ways. Typically, you would use the following shell command:
htpasswd -c .htpasswd guest
Upon executing this command, you are asked to enter the password for user guest. Finally, the .htpasswd text file is created. On subsequent additions of users, simply omit the -c command-line argument to append another user to the existing .htpasswd file. You can also create encrypted passwords by using the online service available at the Dynamic Drive website (http://tools.dynamicdrive.com/password/).
All sites need to perform maintenance at some point. During that time, Googlebot and others might try to crawl your site. This begs the question: should something be done to present spiders with a meaningful message as opposed to a generic maintenance page? The last thing we want is for search engines to index our maintenance page.
The most appropriate response is to issue the HTTP 503 header. The HTTP 503 response code signifies service unavailability. According to the W3C, the HTTP 503 code is defined as follows (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html):
The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
Handling this scenario is easy with some PHP code. Using the PHP header method, we can write the appropriate headers. Here is how the code might look:
<?php
ob_start();
header('HTTP/1.1 503 Service Temporarily Unavailable');
header('Status: 503 Service Temporarily Unavailable');
header('Retry-After: 7200');
?>
<html>
<head>
<title>503 Service Temporarily Unavailable</title>
</head>
<body>
<strong>Service Temporarily Unavailable</strong><br><br>
We are currently performing system upgrades. Website is not available until 5am.
<br>Please try again later and we apologize for any inconvenience.
<!--file:maintenance.php-->
</body>
</html>
The sample code accomplishes two things: it communicates the service outage to the user while also letting spiders know they should not crawl the site during this time. Please note that for this solution to fully work you would also need to write a rewrite rule that would be used to forward all domain requests to this page. You can do this with a simple .htaccess file, which you should place in the root folder of your website. Here is how the file might look:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{REQUEST_URI} !/maintenance.php$
RewriteCond %{REMOTE_HOST} !^57.168.228.28
RewriteRule $ /maintenance.php [R=302,L]
The .htaccess file also allows you as the administrator to browse the site from your hypothetical IP address (57.168.228.28) during the upgrade. Once the upgrade is complete, the maintenance .htaccess file should be renamed (or removed) while restoring the original .htaccess file (if any).
Anyone hitting the site during the maintenance period would see the following HTTP headers, as produced by running the maintenance.php script:
GET /private/test.php HTTP/1.1
Host: www.seowarrior.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=308252d57a74d86d36a7fde552ff2a7f

HTTP/1.x 503 Service Temporarily Unavailable
Date: Sun, 22 Mar 2009 02:49:30 GMT
Server: Apache/1.3.41 (Unix) Sun-ONE-ASP/4.0.3 Resin/3.1.6 mod_fastcgi/2.4.6 mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.7a
Retry-After: 7200
X-Powered-By: PHP/5.2.6
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
The 503 status line and the Retry-After header are the values we specified in the PHP file. The first represents the HTTP status code (503). The second is a note to the web crawler to try crawling the site again in two hours (7,200 seconds).
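A well-behaved crawler has to parse the Retry-After value, which per the HTTP specification may carry either delta-seconds or an HTTP-date. As an illustration (not part of the chapter's PHP example), here is a small Python sketch that normalizes both forms to a wait time in seconds:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, now=None):
    """Normalize a Retry-After header value to seconds.

    The header may carry delta-seconds ("7200") or an HTTP-date
    ("Sun, 22 Mar 2009 04:49:30 GMT"); a polite client should wait
    at least this long before retrying the request.
    """
    if value.strip().isdigit():
        return int(value)
    when = parsedate_to_datetime(value)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))

print(retry_after_seconds("7200"))  # 7200
```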
Although website bandwidth is relatively inexpensive today when compared to the early days of the mainstream Internet, it still adds up in terms of cost. This is especially the case with large websites. If you are hosting lots of media content, including images, audio, and video, you may want to prohibit crawlers from accessing this content. In many cases, you may also want to prohibit crawlers from indexing your CSS, JavaScript, and other types of files.
Before implementing any crawler blocking, first check with your web logs for web spider activities to see where your site is taking hits. Keep in mind that anything you block from crawling will potentially take away some traffic. For example, all the big search engines provide specialized search for images. Sometimes people can learn about your site when only searching for images.
Certain web pages may be hogging your website’s CPU cycles. Taking additional hits on these pages during peak utilization hours can further degrade your site’s performance, which your visitors will surely notice.
Usually you will still want to ensure that crawlers can see these pages. You can utilize several different performance optimizations, including web page caching, web server compression, content compression (images, video, etc.), and many other techniques. Sometimes, though, you have no choice but to limit crawling activities by preventing crawl access to CPU-intensive pages.
Using robots.txt is the original way to tell crawlers what not to crawl. This method is particularly helpful when you do not want search engines to crawl certain portions or all portions of your website. Maybe your website is not ready to be browsed by the general public, or you simply have materials that are not appropriate for inclusion in the SERPs.
When you think of robots.txt, think of it in the context of crawling and never in terms of indexing. Think of robots.txt as a set of rules for document access on your website. The robots.txt standard is almost always applied at the sitewide level, whereas the robots HTML meta tag is limited to the page level or lower. It is possible to use robots.txt for individual files, but you should avoid this practice due to the additional maintenance overhead it entails.
Not all web spiders interpret or support the robots.txt file in entirely the same way. Although the big three search engines have started to collaborate on the robots.txt standard, they still deviate in how they support robots.txt.
Is robots.txt an absolute requirement for every website? In short, no; but the use of robots.txt is highly encouraged, as it can play a vital role in SEO issues such as content duplication.
Creating robots.txt is straightforward and can be done in any simple text editor. Once you’ve created the file, you should give it read permissions so that it is visible to the outside world. On an Apache web server, you can do this by executing the following command:
chmod 644 robots.txt
All robots.txt files need to be verified for syntax and functional validity. This is especially important for large or complex robots.txt files, because it takes only a small mistake to affect an entire site.
Many free validators are available on the Web. One such tool is available in the Google Webmaster Tools platform. Microsoft also provides a robots.txt validator service. For best results, verify your robots.txt files on both platforms.
Google’s robots.txt analysis tool is particularly useful, as you can verify specific URLs against your current robots.txt file to see whether they would be crawled. This can be helpful when troubleshooting current problems or when identifying potential problems.
The robots.txt file must reside in the root folder of your website. This is the agreed standard and there are no exceptions to this rule. A site can have only one robots.txt file.
Not all crawlers are created equal. Some crawlers crawl your web pages, whereas others crawl your images, news feeds, sound files, video files, and so forth. Table 9-1 summarizes the most popular web crawlers.
Google:

Crawler | Description
Googlebot | Google's main web search crawler
Googlebot-Mobile | Crawls pages for Google's mobile search index
Googlebot-Image | Crawls images for Google Image Search
Mediapartners-Google | Crawls pages to determine AdSense ad relevance
AdsBot-Google | Crawls pages to assess AdWords landing page quality

Yahoo!:

Crawler | Description
Slurp | Yahoo!'s main web search crawler
Yahoo-MMAudVid | Crawls audio and video files
Yahoo-MMCrawler | Crawls images for Yahoo! multimedia search

Bing:

Crawler | Description
MSNBot | Bing's main web search crawler
MSNBot-Media | Crawls images and other media files
MSNBot-News | Crawls news feeds
Thousands of crawlers are operating on the Internet. It would make no sense to pay attention to all of them. Depending on your site’s content and the region you are targeting, you may need to pay more attention to other crawlers. Refer to the robots database located at http://www.robotstxt.org/db.html for more information on many other web crawlers.
The robots.txt file is composed of a set of directives preceded by specific user-agent heading lines signifying the start of directives for a particular crawler. The following is an example robots.txt file that instructs three different crawlers:
1 User-agent: *
2 Disallow: /
3 Allow: /blog/
4 Allow: /news/
5 Allow: /private
6
7 User-agent: msnbot
8 Disallow:
9
10 User-agent: googlebot
11 Disallow: /cgi-bin/
The first five lines of this example are the instructions for the catchall web crawler. There is a single Disallow directive and three Allow directives. The remaining lines are specific instructions for the Google and Bing crawlers.
If Yahoo!’s Slurp paid a visit to this site, it would honor the first five lines of robots.txt—as it does not have its own custom entry within this robots.txt file. This example has several interesting scenarios. For instance, what would Slurp do if it had the following URLs to process?
http://mydomain.com/blog
http://mydomain.com/blog/
Which URL would it crawl? The answer is the second URL, as it fully matches the Allow directive on line 3 of the preceding code. The trailing slash signifies a directory, whereas the absence of the trailing slash signifies a file. In this example, it is quite possible that the web server would return the same page in both cases.
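You can experiment with entries like these using Python's built-in urllib.robotparser module. One caveat: the standard library applies rules in file order rather than by longest match, so its answers for the catchall block above can differ from Google-style longest-match behavior; the checks below use only the unambiguous msnbot and googlebot entries.

```python
import urllib.robotparser

# The example robots.txt from this section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
Allow: /blog/

User-agent: msnbot
Disallow:

User-agent: googlebot
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# msnbot has its own entry with an empty Disallow: everything is crawlable.
print(rp.can_fetch("msnbot", "http://mydomain.com/private"))       # True
# googlebot is blocked only from /cgi-bin/.
print(rp.can_fetch("googlebot", "http://mydomain.com/cgi-bin/x"))  # False
print(rp.can_fetch("googlebot", "http://mydomain.com/blog/"))      # True
```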
Similarly, what would happen if Slurp had the following URL references?
http://mydomain.com/privatejet.html
http://mydomain.com/private/abcd.html
Slurp would crawl both of these URLs, as they both match the Allow pattern on line 5. Line 5 is also the longest matching directive, and the longest directive always takes precedence!
The following subsections talk about the various supported robots.txt directives.
The Allow directive tells web crawlers that the specified resources can be crawled. When multiple directives are applicable to the same URL, the longest expression takes precedence. Suppose we had a robots.txt file as follows:
1 User-agent: *
2 Disallow: /private/
3 Allow: /private/abc/
We have one Disallow directive for the private directory. In addition, we have one Allow directive for the abc subdirectory. Googlebot comes along and has to decide which URL references it needs to crawl:
The answer is that Googlebot needs to crawl all but the second URL, since the longest directive it matches is the Disallow directive on line 2 of the preceding code. The first URL is a perfect match of the Allow directive. The third URL also matches the Allow directive. Finally, the last URL is allowed, since it does not match the Disallow directive.
When in doubt, always ask the following question: what is the longest directive that can be applied to a given URL? As a rule of thumb, if no directive can be applied to a given URL, the URL would be allowed.
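The longest-match rule is easy to express in code. Here is a simplified Python sketch (an illustration only: it handles plain path prefixes within a single user-agent block, ignoring wildcards and user-agent selection):

```python
def crawl_allowed(path, rules):
    """Apply Allow/Disallow prefix rules to a URL path.

    rules is a list of (directive, prefix) pairs taken from one
    user-agent block. The longest matching prefix wins; if nothing
    matches, the path is allowed by default.
    """
    best = None  # (matched prefix length, allowed?)
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            match = (len(prefix), directive.lower() == "allow")
            if best is None or match[0] > best[0]:
                best = match
    return True if best is None else best[1]

# The example from this section: Disallow /private/, Allow /private/abc/.
rules = [("Disallow", "/private/"), ("Allow", "/private/abc/")]
print(crawl_allowed("/private/abc/page.html", rules))  # True
print(crawl_allowed("/private/xyz.html", rules))       # False
print(crawl_allowed("/blog/post.html", rules))         # True
```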
Let’s examine another example. What happens if you have an empty Allow directive? Here is the code fragment:
User-agent: *
Allow:
The simple answer is that nothing happens. It makes no sense to place Allow directives when there is not a single Disallow directive. In this case, all documents would be allowed for crawling.
The Disallow directive was the original REP directive. It signifies the webmaster’s desire to prohibit web crawler(s) from crawling specified directories or files. Many sites choose to use this directive only. Its basic format is:
Disallow: /directory/
Disallow: /file.ext
For the sake of completeness, let’s see what happens if we have an empty Disallow directive. Consider the following code fragment:
User-agent: *
Disallow:
In this case, all documents would be allowed for crawling. This seems counterintuitive, but let’s think for a moment. Yes, the Disallow directive is used, but it does not have any arguments. Since we did not specify any file (or directory) to disallow, everything would be allowed. Let’s look at another example:
User-agent: *
Disallow: /
In this case, everything would be disallowed for crawling. The forward slash (/) argument signifies the web root of your site. What a difference a single character makes! Note that both Allow and Disallow can employ wildcard characters.
Two wildcards are used in robots.txt: $ and *. Both have functionality similar to how they’re used in Perl regular expression matching. You use the dollar sign wildcard ($) when you need to anchor a match at the end of the URL. You use the star wildcard character (*) to match zero or more characters in a sequence. Let’s look at some examples in the following fragment:
1 User-agent: *
2 Disallow: *.gif$
3
4 User-Agent: Googlebot-Image
5 Disallow: *.jpg$
6
7 User-Agent: googlebot
8 Disallow: /*sessionid
The catchall block (the first two lines) prohibits all obeying spiders from crawling all GIF images. The second block tells Google’s image crawler not to crawl any JPG files. The last block speaks to Googlebot: it should not crawl any URLs that contain the sessionid string. As an example, suppose we have the Googlebot and Googlebot-Image crawlers deciding what to do with the following link references:
http://mydomain.com/images/header.jpg
http://mydomain.com/product?id=1234&sessionid=h7h1g29k83xh&
http://mydomain.com/images/header.gif
The net result would be that the Googlebot-Image crawler would not crawl header.jpg but would crawl header.gif. The GIF image would be crawled because the Googlebot-Image crawler ignores the catchall crawler segment when it finds its own segment. Furthermore, Googlebot would not crawl the second URL, which contains a match for the Disallow directive on line 8.
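The wildcard semantics just described map naturally onto regular expressions. The following Python sketch (an illustration, not any engine's actual implementation) translates a robots.txt pattern into a regex anchored at the start of the URL path:

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt pattern into a compiled regex.

    '*' matches zero or more characters; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def blocked(path, pattern):
    # Rules match from the beginning of the path, hence re.match.
    return robots_pattern_to_regex(pattern).match(path) is not None

# The examples from this section:
print(blocked("/images/header.jpg", "*.jpg$"))                             # True
print(blocked("/images/header.gif", "*.jpg$"))                             # False
print(blocked("/product?id=1234&sessionid=h7h1g29k83xh&", "/*sessionid"))  # True
```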
The big three search engines collaborated on the introduction of the Sitemap directive. Here is an excerpt from the Yahoo! Search Blog:
All search crawlers recognize robots.txt, so it seemed like a good idea to use that mechanism to allow webmasters to share their Sitemaps. You agreed and encouraged us to allow robots.txt discovery of Sitemaps on our suggestion board. We took the idea to Google and Microsoft and are happy to announce today that you can now find your sitemaps in a uniform way across all participating engines.
The Sitemap directive simply tells the crawler where your Sitemap can be found. Here is an example of a robots.txt file utilizing the Sitemap directive:
Sitemap: http://mydomain.com/sitemap.xml

User-Agent: *
Disallow:
The location of the Sitemap directive is not mandated; it can be anywhere within the robots.txt file. For full coverage of the Sitemaps protocol, see Chapter 10.
Only Yahoo! and Bing use the Crawl-delay directive. They use it to tell the Slurp and MSNBot crawlers how frequently they should check for new content. Here is an example:
Sitemap: /sitemap.xml

User-agent: *
Disallow: /private/

User-agent: Slurp
Disallow:
Crawl-delay: 0.5

User-agent: MSNBot
Disallow:
Crawl-delay: 15
In this example, we have three crawler blocks in addition to the Sitemap directive placed at the beginning of robots.txt. The first crawler block is of no interest to us, as it is the catchall crawler. The second block tells the Slurp crawler to use a Crawl-delay value of 0.5. The last block instructs the MSNBot crawler to use a Crawl-delay of 15.
The values of the Crawl-delay directive have different meanings to Slurp and MSNBot. For MSNBot, the value represents the number of seconds between each crawl. The currently allowed range is between one and 600 seconds.
Slurp interprets the Crawl-delay directive a bit differently. Here is what Yahoo! says:
Setting the “delay value” in robots.txt to a high value, for example 5 or 10, sets a greater delay for Yahoo! web crawlers accessing your server. Yahoo! suggests that you start with small values (0.5–1), and increase the “delay value” only as needed for an acceptable and comfortable crawling rate for your server. Larger “delay values” add more latency between successive crawling and results in decreasing the optimum Yahoo! web search results of your web server.
Although Google does not support the Crawl-delay directive via robots.txt, it does have its own alternative. Under the Settings section of Google Webmaster Tools, you can set up the specific crawl rate for your site, as shown in Figure 9-2.
You do not have to set up Google’s crawl rate unless you are experiencing performance degradation attributed to crawler activities. The same is true of the Crawl-delay directive.
Apache and IIS web servers handle filenames and URLs in different ways. Apache is case-sensitive, which can be attributed to its Unix roots. IIS is case-insensitive. This could have implications when it comes to content duplication issues and link canonicalization.
The robots.txt file is case-sensitive. If you are running your website on an IIS platform, you will need to pay close attention to your file and directory naming conventions. This is especially the case if you have various (but really the same) link references to the same content. Let’s consider an example. The following fragment shows an arbitrary robots.txt file:
User-Agent: *
Disallow: /Shop*/*sessionid
In this example, we are dealing with a catchall crawler block. There is a single Disallow directive with two wildcards. All URLs that match the given pattern should be blocked from crawling.
Now, suppose that crawlers have the following link references to process:
http://mydomain.com/Shop-computers/item-1234/sessionid-3487563847/
http://mydomain.com/shop-printers/item-8254/sessionid-3487563847/
Based on the Disallow directive and the fact that robots.txt files are case-sensitive, only the first link would be blocked (not crawled).
You can view many examples of common robots.txt configurations on the Web just by issuing the filetype:txt robots.txt command in Google. The following subsections discuss some typical uses of robots.txt: blocking the crawling of images, allowing crawls by Google and Yahoo! only, blocking crawls of Microsoft Office documents, and blocking crawls by the Internet Archive.
You can disallow image crawling in a few ways. These include prohibiting specific image crawlers, prohibiting specific image folders, and prohibiting specific file extensions.
Let’s suppose we want to prohibit all image crawling for the most popular search engines. We also want to accomplish this on the sitewide level. The following fragment shows how we can do this:
User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: msnbot-media
Disallow: /

User-Agent: Googlebot-Image
Disallow: /
In this example, all other crawlers would be able to crawl your images. In some situations you might not care about other crawlers. You can extend this example by explicitly using folder names.
You should always store your images in centralized directories to allow for easy maintenance, and to make it easier to block crawlers. The following fragment speaks to the major crawlers in addition to any other crawlers obeying REP:
User-agent: Yahoo-MMCrawler
Disallow: /images/
Allow: /images/public/

User-agent: msnbot-media
Disallow: /images/
Allow: /images/public/

User-Agent: Googlebot-Image
Disallow: /images/
Allow: /images/public/

User-Agent: *
Disallow: /images/
In this example, we are blocking all crawlers from crawling all images in the images folder, with one exception. Google, Yahoo!, and Bing are allowed to crawl any images that contain /images/public/ in their URL path.
The easiest way to block all images is by using robots.txt wildcards. Suppose you are using only three image types: GIF, PNG, and JPG. To block crawling of these types of images, you would simply use the following code:
User-agent: *
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.png$
If your target audience is in North America or Europe, you may want to target only Google and Yahoo!. The following fragment shows how to implement this sort of setup:
User-agent: slurp
Disallow:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
You already saw examples of blocking specific image file types. Nothing is restricting you from blocking the crawling of many other file types. For example, to restrict access to Microsoft Office files, simply use the following fragment:
User-agent: *
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.ppt$
Disallow: /*.mpp$
Disallow: /*.mdb$
Did you know that the Internet Archive website contains billions of web pages from the mid-1990s onward? For archiving purposes, the Internet Archive makes use of its own crawler, called ia_archiver. Most website owners are not even aware of this, and sometimes it is something you may not wish to allow. To block your site from being archived, use the following fragment:
User-agent: ia_archiver
Disallow: /
It took many years before major search engines started to endorse a common approach in supporting REP consistently. Webmasters and content publishers do appreciate this “coming together” of the big three search engines, as they can now develop sites more easily by having a nearly identical set of rules when it comes to REP. Table 9-2 provides a summary of all of the robots.txt directives.
The robots meta directives were introduced a few years after robots.txt. Operating on a page (or document) level only, they provide indexing instructions to the obeying search engines. There are two types of meta directives: those that are part of the HTML page, and those that the web server sends as HTTP headers.
HTML meta directives are found in the actual HTML page. According to the W3C:
The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.
You place these directives within the HTML <head> tag. The big three search engines support several common directives, as listed in Table 9-3.
It is perfectly acceptable to combine HTML meta directives into a single meta tag. For example, you may want to noindex and nofollow a particular page. You would simply list both directives, separated by a comma, as in the following example:
<meta name="robots" content="noindex,nofollow" />
Let’s say you have a page that is not your canonical (preferred) page, but is being linked to by some highly trusted site. In this case, you may want to do the following:
<meta name="robots" content="noindex,follow" />
In this example, crawlers can follow any outbound links away from this page, as you might feel that some of these links are your canonical links (such as your home page).
You can define HTML meta tags to target specific spiders. For example, to instruct Googlebot and Slurp not to index a particular page, you could write:
<meta name="googlebot" content="noindex" />
<meta name="slurp" content="noindex" />
Yahoo! uses a couple of extra directives, namely noydir and robots-nocontent. The first directive is used to instruct Yahoo! not to use Yahoo! directory descriptions in its search engine results pertaining to this page. Here is the basic format:
<meta name="robots" content="noydir" />
The second directive, robots-nocontent, operates on other HTML tags; it is not to be used with meta tags. Utilizing this attribute allows you to prevent certain portions of your page, such as navigational menus, from being considered by Yahoo!’s indexing algorithms. The net effect of this method is increased page copy relevance (higher keyword density). Here are some examples:
<div class="robots-nocontent"><!--menu html--></div>
<span class="robots-nocontent"><!--footer html--></span>
<p class="robots-nocontent"><!-- ads html--></p>
These examples should be self-explanatory. Marking text that is unrelated to the basic theme or topic of this page would be advantageous. Note that there are other ways of doing the same thing for all search engines.
One such way is by loading ads in iframes. The actual ads would be stored in external files loaded by iframes at page load time. Crawlers typically ignore iframes. You could also use Ajax to achieve the same effect.
There are three Google-specific directives: unavailable_after, noimageindex, and notranslate. The first directive, unavailable_after, instructs Google to remove the page from its index after the specified date and time. Here is an example:
<meta name="googlebot" content="unavailable_after: 12-Sep-2009 12:00:00 PST">
In this example, Googlebot is instructed to remove the page from its index after September 12, 2009, 12:00:00 Pacific Standard Time. The second directive, noimageindex, instructs Google not to show the page as a referring page for any images that show up on Google Image SERPs. Here is the format:
<meta name="googlebot" content="noimageindex">
The third directive, notranslate, instructs Google not to translate the page or specific page elements. It has two formats. The following fragment illustrates the meta format:
<meta name="google" value="notranslate">
Sometimes it is useful to translate only certain parts of a page. The way to accomplish that is by identifying page elements that are not to be translated. The following fragment illustrates how you can do this:
<span class="notranslate">Company Name, Location</span>
<p class="notranslate">Brand X Slogan</p>
We need HTTP header directives because not all web documents are HTML pages. Search engines index a wide variety of our documents, including PDF files and Microsoft Office files. Each page directive has its own HTTP header equivalent.
For example, let’s say your site has Microsoft Word files that you do not wish to have indexed, cached, or used for search result descriptions. If you are using an Apache web server, you could add the following directive to your .htaccess file:
<FilesMatch "\.doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
In this example, the three directives are added to the HTTP response header that the Apache web server generates for matching files. Webmasters who prefer to do this in code can use built-in PHP functions.
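For illustration, here is a minimal Python sketch of the same idea as a hypothetical WSGI application (the file check and content type are simplified assumptions, not a production handler):

```python
def application(environ, start_response):
    """Minimal WSGI sketch: attach X-Robots-Tag to .doc responses.

    Content type and body are hardcoded for brevity; a real
    application would dispatch per resource.
    """
    headers = [("Content-Type", "application/msword")]
    if environ.get("PATH_INFO", "").endswith(".doc"):
        # Same directives as the .htaccess example above.
        headers.append(("X-Robots-Tag", "noindex, noarchive, nosnippet"))
    start_response("200 OK", headers)
    return [b"...document bytes..."]
```

Setting the header in application code is useful when you cannot modify the server configuration, such as on shared hosting.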
A discussion of REP would not be complete without the nofollow link attribute. This attribute was introduced to discourage comment spammers from adding their links. The basic idea is that links marked with the nofollow attribute will not pass any link juice to the spammer sites. The format of these links is as follows:
<a href="http://www.spamsite.com/" rel="nofollow">some nonsensical text</a>
In this case, the hypothetical website http://www.spamsite.com would not receive any link juice from the referring page. Spammers will continue to target sites, but in most cases their posts are easy to detect. Here are some example posts:
Please visit my great <a href="http://www.spamsite.com/" rel="nofollow">Viagra Site</a> for great discounts.
<a href="http://www.spamsite2.com/" rel="nofollow">Free Porn Site</a>
<a href="http://www.spamsite3.com/" rel="nofollow">I was reading through your site and I must say it is great! Good Job.</a>
We discuss the nofollow link attribute in several other parts of this book.
Not all crawlers will obey REP. Some rogue spiders will go to great lengths to pose as one of the big spiders. To deal with this sort of situation, we can utilize the fact that major search engines support reverse DNS crawler authentication.
Setup of reverse DNS crawler authentication is straightforward. Yahoo! discusses how to do it on its blogging site:
1. For each page view request, check the user-agent and IP address. All requests from Yahoo! Search utilize a user-agent starting with ‘Yahoo! Slurp.’
2. For each request from ‘Yahoo! Slurp’ user-agent, you can start with the IP address (i.e. 74.6.67.218) and use reverse DNS lookup to find out the registered name of the machine.
3. Once you have the host name (in this case, lj612134.crawl.yahoo.net), you can then check if it really is coming from Yahoo! Search. The name of all Yahoo! Search crawlers will end with ‘crawl.yahoo.net,’ so if the name doesn’t end with this, you know it’s not really our crawler.
4. Finally, you need to verify the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2. If it doesn’t, it means the name was fake.
As you can see, it is relatively easy to check for rogue spiders by using the reverse DNS approach. Here is the Yahoo! approach translated to PHP code:
<?php
$ua = $_SERVER['HTTP_USER_AGENT'];
$httpRC403 = "HTTP/1.0 403 Forbidden";

// Only requests claiming to be Yahoo! Slurp need verification.
if (stristr($ua, 'slurp')) {
    $ip = $_SERVER['REMOTE_ADDR'];
    // Reverse DNS lookup of the requesting IP address.
    $host = gethostbyaddr($ip);
    // The registered host name must end in .crawl.yahoo.net.
    if (!preg_match('/\.crawl\.yahoo\.net$/i', $host)) {
        header($httpRC403);
        exit;
    }
    // Forward DNS must map the name back to the same IP address.
    $realIP = gethostbyname($host);
    if ($realIP != $ip) {
        header($httpRC403);
        exit;
    }
}
?>
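For non-PHP stacks, the same two-step verification can be sketched in Python with the standard socket module (the function name is our own, and error handling is kept minimal):

```python
import socket

def is_real_slurp(user_agent: str, ip: str) -> bool:
    """Verify a claimed Yahoo! Slurp request via reverse/forward DNS.

    Returns True when the request does not claim to be Slurp (nothing
    to verify), or when the reverse-DNS name ends in .crawl.yahoo.net
    AND forward DNS maps that name back to the same IP address.
    """
    if "slurp" not in user_agent.lower():
        return True  # not claiming to be Slurp; allow through
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.lower().endswith(".crawl.yahoo.net"):
        return False
    try:
        return socket.gethostbyname(host) == ip
    except socket.gaierror:
        return False
```

Because DNS lookups are slow, production code would typically cache verification results per IP rather than resolving on every request.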
You could extend this script to include other bots. Including this file in all of your PHP files is straightforward with the PHP include statement:
<?php include("checkcrawler.php"); ?>
Most CMSs, blogs, and forums are based on modern application frameworks. It is likely that you would have to include this file in only one template file that would be rendered in every page.
A full understanding of Robots Exclusion Protocol is crucial. Note that REP is not fully supported in the same way by every search engine. The good news is that the most popular search engines are now working together to offer more uniform REP support. The benefits of this are obvious in terms of the work required to address the needs of different search engines.
Using robots.txt to block crawlers from specific site areas is an important tactic in SEO. In most cases, you should use robots.txt at the directory (site) level. With the introduction of wildcards, you can handle common SEO problems such as content duplication with relative ease. Although the Sitemap directive is a welcome addition to robots.txt, Google still encourages webmasters to add their Sitemaps manually by using the Google Webmaster Tools platform.
Using HTML meta tags and their HTTP header equivalents is a way to specify indexing directives at the page level for HTML and non-HTML resources. These directives are harder to maintain, so you should use them only where required.
Not all web spiders honor REP. At times, it might be necessary to block their attempts to crawl your site. You can do this in many different ways, including via coding, server-side configuration, firewalls, and intrusion detection devices.