Chapter 8. Search Engine Traps

Although search engine technology is getting better, it is still not good enough to index every bit of dynamic content. This is why it is always better to stick to basic text or HTML on your web pages—in theory. In practice, that is not always feasible. If you have to use dynamic content, yet you need to ensure proper search engine crawling, you must be careful how you do it. This chapter goes over many search engine traps and how to deal with them in your code.

JavaScript Traps

JavaScript is a client-side scripting language that runs on web browsers. For the most part, you cannot see this code if you are just looking at a web page. You do get to see some of its effects when you see pop-up/pop-under windows, animations, and so forth.

JavaScript-Generated Content

You will want to put any content you want to index outside the JavaScript code. Here is some example code that shows dynamic text within JavaScript:

<HTML>
<head>

<title>Example of Dynamic Text</title>

<script type="text/javascript">
// Clear the shared info area when the mouse leaves an image.
function defaultInfo1(){
    var infoText1 = "";
    document.getElementById('infoblock').innerHTML = infoText1;
}
// Write Product A's description into the info area on mouseover.
function onMouseOverInfo1(){
    var infoText1 = "Product A description goes here.";
    document.getElementById('infoblock').innerHTML = infoText1;
}
function defaultInfo2(){
    var infoText2 = "";
    document.getElementById('infoblock').innerHTML = infoText2;
}
// Write Product B's description into the info area on mouseover.
function onMouseOverInfo2(){
    var infoText2 = "Product B description goes here.";
    document.getElementById('infoblock').innerHTML = infoText2;
}
</script>
</head>

<body>
<img src="productA.jpg" onmouseover="onMouseOverInfo1()"
onmouseout="defaultInfo1();"/>

<img src="productB.jpg" onmouseover="onMouseOverInfo2()"
onmouseout="defaultInfo2();"/>

<br>
<br>

<div id="infoblock"></div>
</body>

</HTML>

If you examine the code, you will see that the text used for image mouseovers is buried in the actual JavaScript code. Most search engines will ignore this. If you point your mouse cursor over either image (productA.jpg or productB.jpg in the code), you can see the dynamically generated text immediately below the image, as shown in Figure 8-1. In this example, the mouse was moved over the “Product A” image.

Figure 8-1. Dynamic text: OnMouseOver output

This technique is frequently used on sites that want to show more information within the same screen real estate. Here is code that achieves the same effect, but in an SEO-friendly way:

<HTML>
<head>
<title>Example of Dynamic Text</title>
<style>
div.infoblock1 {
 display:none;
}
div.infoblock2 {
 display:none;
}
</style>
<script type="text/javascript">
// Clear the visible info area when the mouse leaves an image.
function defaultInfo1(){
    document.getElementById('infoblock').innerHTML = "";
}
// Copy Product A's description from its hidden DIV into the visible
// info area; the text itself lives in plain HTML further down the page.
function onMouseOverInfo1(){
    document.getElementById('infoblock').innerHTML =
        document.getElementById('infoblock1').innerHTML;
}
function defaultInfo2(){
    document.getElementById('infoblock').innerHTML = "";
}
// Same idea for Product B.
function onMouseOverInfo2(){
    document.getElementById('infoblock').innerHTML =
        document.getElementById('infoblock2').innerHTML;
}
</script>
</head>

<body>

<img src="productA.jpg" onmouseover="onMouseOverInfo1()"
 onmouseout="defaultInfo1();"/>
<img src="productB.jpg" onmouseover="onMouseOverInfo2()"
 onmouseout="defaultInfo2();"/>

<br>
<br>

<div id="infoblock"></div>

<div id="infoblock1" class="infoblock1">Product A description goes
here.</div>

<div id="infoblock2" class="infoblock2">Product B description goes
here.</div>

</body>
</HTML>

The output of this code is identical to that of the previous code. The only difference is that this version places all of the text within the HTML instead of the JavaScript. Another option is to put your CSS into separate files and prohibit search engines from accessing the CSS files within your robots.txt file.
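For example, here is a minimal robots.txt sketch that keeps compliant spiders away from a stylesheet directory (the /css/ path is just a placeholder for wherever your stylesheets actually live):

User-agent: *
Disallow: /css/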

At the time of this writing, some rumors are circulating that Google does not index hidden (HTML) DIV tags. The premise is that search engine spammers are using these techniques to fool search engines. Although this may be true, many times this sort of functionality is necessary to present more information in the same screen real estate. When in doubt, simply ask yourself whether the method is deceptive or designed only for web crawlers. If the answer is no in both cases, you should be fine.

Many sites use JavaScript to create links to other website pages. Here is some example code with different link types that you may want to avoid:

<HTML>
<head>
<title>Link Examples ~ Things to stay away from</title>

<script type="text/javascript">
function gotoLocationX(){
    window.location.href='http://www.cnn.com';
}
function gotoLocationY(){
    window.location.href='http://www.yahoo.com';
}
function gotoLocationZ(){
    window.location.href='http://www.google.com';
}
</script>

</head>
<body>

Example 1:
<a href="#" onClick="javascript:window.location.href=
'http://www.cnn.com'">News on CNN</a>
<br><br>

Example 2:
<a href="#" onClick="javascript:gotoLocationY()">Yahoo Portal</a>
<br><br>

Example 3:
<form>
<input name="mybtn" value="Google Search Engine" type=button
onClick="window.location.href='http://www.google.com'">
</form>
<br><br>
</body>
</html>

When you open this code in your browser, you will see a screen similar to Figure 8-2.

Figure 8-2. Bad link examples output

This is not to say that you can never use dynamic links. You obviously can, but you need to think about tweaking your code to help web spiders see what they need to see. Looking back at the preceding example, instead of this:

<a href="#" onClick="javascript:gotoLocationY()">Yahoo Portal</a>

use this:

<a href="http://www.yahoo.com" onClick="javascript:gotoLocationY()">
Yahoo Portal</a>

Plenty of sites are using dynamic JavaScript menus. The following code fragment is one such variant:

<html>
<head>
<script type="text/javascript">
function goTo( form ) {
    var optionIndex = form.menuOption.selectedIndex;
    if ( optionIndex == 0 ) {
        // the first entry is just the prompt; do nothing
    } else {
        // navigate to the URL stored in the selected option's value
        var selectedURL = form.menuOption.options[ optionIndex ].value;
        window.location.assign( selectedURL );
    }
}
</script>
</head>
<body>
<h1>Menu Example</h1>
<form name="myform">
<select name="menuOption" size="1" onChange="goTo( this.form )">
  <option>Menu Options (choose below)</option>
  <option value="http://www.abcde.com/keyword1.html">Link 1</option>
  <option value="http://www.abcde.com/keyword2.html">Link 2</option>
  <option value="http://www.abcde.com/keyword3.html">Link 3</option>
  <option value="http://www.abcde.com/keyword4.html">Link 4</option>
  <option value="http://www.abcde.com/keyword5.html">Link 5</option>
</select>
</form>
</body>
</html>

This HTML code renders in your browser as shown in Figure 8-3. Note that the figure represents the state of the drop-down box upon clicking the down arrow button.

Figure 8-3. Example of bad menus, part 2

If you click on any of the choices shown in Figure 8-3, your browser opens the corresponding link. The basic problem with this approach is that we are using a nonstandard link to go to a particular page. This type of linking would present problems to web spiders, and hence would leave some of your links unspidered. There are even worse examples in which the actual links are stored in external JavaScript or HTML files. Stay away from these designs.

Many free menu scripts on the Internet are SEO-friendly. You can find several of them at http://www.dynamicdrive.com and similar websites. The basic idea behind SEO-friendly menus is that all links are placed with the proper link tags within DIVs in plain HTML. The “cool” effects are achieved with the clever use of CSS and JavaScript.
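Here is a minimal sketch of that idea (the URLs are placeholders). The links are ordinary anchor tags inside plain HTML lists, and the drop-down effect comes from CSS alone, so spiders see nothing but regular links:

<style>
ul.menu li ul { display: none; }        /* submenu hidden by default */
ul.menu li:hover ul { display: block; } /* revealed on hover, CSS only */
</style>

<ul class="menu">
  <li><a href="http://www.abcde.com/products.html">Products</a>
    <ul>
      <li><a href="http://www.abcde.com/keyword1.html">Link 1</a></li>
      <li><a href="http://www.abcde.com/keyword2.html">Link 2</a></li>
    </ul>
  </li>
</ul>

Some older browsers need a small JavaScript shim for the :hover behavior on list items; the key point is that the links themselves remain plain HTML.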

There are other ways to improve the readability of your links. If for some reason you cannot specify links within the link tags, you can use these methods:

  • Create a Sitemap page listing your links. A Sitemap is a simple HTML page that contains links. If your Sitemap is too large, you can break it up into several pages. You may also create search engine–specific Sitemaps (such as XML Sitemaps).

  • List all your dynamic links within the <noscript> tags, as shown in the sketch after this list. This is legitimate as long as the <noscript> links are identical to the ones your JavaScript generates; you are helping the search engine, not deceiving it.
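Here is a hedged sketch of the <noscript> approach, reusing the menu URLs from the earlier example; the plain links mirror exactly what the JavaScript menu offers:

<script type="text/javascript">
// dynamic menu code goes here
</script>
<noscript>
<a href="http://www.abcde.com/keyword1.html">Link 1</a> |
<a href="http://www.abcde.com/keyword2.html">Link 2</a> |
<a href="http://www.abcde.com/keyword3.html">Link 3</a>
</noscript>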

Ajax

Ajax stands for Asynchronous JavaScript and XML. It is used to change certain parts of a web page by reloading only the relevant fragments instead of the full page.

The basic problem with Ajax is that web crawlers will not execute any JavaScript when they read an HTML file containing Ajax. Let’s look at a typical Ajax implementation loading external file content:

<HTML>

<head>
<script type="text/javascript">
function externalFileImport(resourceOrFile, pageElementId) {
  // Use the ActiveX object on older IE; XMLHttpRequest everywhere else.
  var objInstance = (window.ActiveXObject) ? new ActiveXObject(
"Microsoft.XMLHTTP") : new XMLHttpRequest();
  if (objInstance) {
    objInstance.onreadystatechange = function() {
      // readyState 4 = request complete; status 200 = OK
      if (objInstance.readyState == 4 && objInstance.status == 200) {
        var pageElement = document.getElementById(pageElementId);
        pageElement.innerHTML = objInstance.responseText;
      }
    }
    objInstance.open("GET", resourceOrFile, true);
    objInstance.send(null);
  }
}
</script>
</head>

<body>

<div id="mycontent"></div>

<script type="text/javascript">
externalFileImport('header.html','mycontent');
</script>

</body>
</HTML>

If you examine the source code, you will notice that this HTML file does not have any content in itself. It tries to load header.html, an external HTML file, with a call to the externalFileImport JavaScript/Ajax function. The output of this HTML file looks like Figure 8-4.

Figure 8-4. Output of Ajax code

Now, this should give you a hint of the content within the header.html file. The content of header.html is plain text:

Internet search engine censorship is practiced in a few countries
around the world for reasons of filtering pornography and
controlling government criticism, criminal activity, and free trade.
The reasons all boil down to the categories of cultural values, money,
and/or power. Internet access is provided by local internet service
providers (ISP) in the host country and the ISP must comply with local
laws in order to remain in business. It is usually at the point of the
ISP that search engines as well as individual web sites are placed
under censorship and blocked.

Because web spiders will not execute the JavaScript in the HTML code, the search engines will never index any of the content found in header.html as part of this page.

We advise you to stay away from Ajax for any content you want search engines to index. Ajax is nonetheless a very attractive way to improve the web user experience, so be selective in terms of where you employ it.
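If you do use Ajax near content that matters, one common compromise is progressive enhancement: serve the important text in plain HTML and let Ajax merely refresh it afterward. Here is a minimal sketch reusing the externalFileImport() function from the preceding example; the static text is an assumption standing in for your real content:

<div id="mycontent">
Internet search engine censorship is practiced in a few countries
around the world... <!-- the full indexable text lives right here -->
</div>

<script type="text/javascript">
// Spiders index the static text above; JavaScript-enabled browsers
// can still replace it asynchronously.
externalFileImport('header.html','mycontent');
</script>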

Dynamic Widget Traps

Dynamic widget traps are Flash files, Java applets, and ActiveX controls. All of these technologies are similar in the sense that they provide features that are not usually available within JavaScript, CSS, or DHTML. Be careful when using widgets based on these platforms.

Using Flash

Flash specifically refers to advanced graphics or animations programmed in ActionScript for the Adobe Flash platform. Although Flash is almost always used for animation, entire sites can be built in it; doing so is a bad idea from an SEO point of view.

Whenever you see online video games with advanced graphics/animation, chances are they were done in Flash. Many web designers use Flash to create their sites. Flash is not a problem in general, but the same rules apply as with JavaScript. If the content you are presenting in Flash needs to be indexed, take it out of Flash. Here is a typical example of HTML source code with an embedded Flash application:

<HTML>
<HEAD>
    <TITLE>Flash Page Example</TITLE>
</HEAD>
<BODY>
<OBJECT classid='clsid:D27CDB6E-AE6D-11cf-96B8-444553540000' codebase
='http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab
#version=5,0,0,0'
 WIDTH=100% HEIGHT=100% id=ShockwaveFlash1>
<PARAM NAME=movie VALUE='flashappABC.swf'>
<PARAM NAME=menu VALUE=false>
<PARAM NAME=quality VALUE=high>
<PARAM NAME=scale VALUE=exactfit>
<PARAM NAME=bgcolor VALUE="#FFFFFF">

<EMBED src='flashappABC.swf'
       menu=false
       quality=high
       scale=exactfit
       bgcolor="#FFFFFF"
       WIDTH=100%
       HEIGHT=100%
       TYPE='application/x-shockwave-flash'
 PLUGINSPAGE='http://www.macromedia.com/shockwave/download/index.cgi?
P1_Prod_Version=ShockwaveFlash'>
</EMBED>
</OBJECT>
</BODY>
</HTML>

The key item to note in the source code is the filename of the Flash application: flashappABC.swf. In most cases, everything this application needs, including graphics, text, and animations, is stored in this file or is loaded from other external files that search engine spiders usually cannot read.
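One common mitigation, sketched here under the assumption that your product information can be expressed as plain text, is to place alternative HTML content inside the <OBJECT> element. Clients and spiders that do not render the Flash movie fall through to the HTML:

<OBJECT classid='clsid:D27CDB6E-AE6D-11cf-96B8-444553540000'
 WIDTH=100% HEIGHT=100%>
<PARAM NAME=movie VALUE='flashappABC.swf'>
<!-- Alternative content: what non-Flash clients and spiders see -->
<h1>Product Catalog</h1>
<p>Plain HTML descriptions of the products shown in the movie,
with regular links to the corresponding product pages.</p>
</OBJECT>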

Google’s support of Adobe Flash

In June 2008, Google announced an improved ability to index Flash files. You can confirm this on Adobe’s website, which states, “Google has already begun to roll out Adobe Flash Player technology incorporated into its search engine.” Other search engines are likely to follow; see the announcements on Google’s and Adobe’s websites for more information.

If you are curious what Googlebot will index in your Flash .swf files, download the swf2html utility from the Adobe website.

Using Java Applets

Java applets are similar to Adobe Flash technology. They have many practical purposes, including financial applications, games, and chat applications. Java applets are based on the Java platform created by Sun Microsystems.

The same rules apply for Java applets as for Flash content. Any content you wish to be visible to search engines should stay out of Java applets. Here is how Java applets are embedded in HTML code:

<HTML>
<head>
<title>Sample Applet</title>
</head>
<body>
<H1> Sample Applet </H1><br>
You need a Java enabled browser to view these. <br>
<hr>
<applet code="ExampleApplication.class" width=700 height=400>Phoenix
 Applet</applet>
<hr>
</body>
</HTML>

As you can see, the HTML source code contains hardly any content. Everything the applet needs is embedded in the compiled .class file.

Using ActiveX Controls

ActiveX controls are similar to Java applets and Flash files. They typically run on the Internet Explorer browser and are web components based on Microsoft technology:

<HTML>

<head>
<title>Sample Active X Control</title>
</head>

<body>
<H1> Sample Active X Control </H1>
<br>
<hr>
<OBJECT ID="colorCtl" WIDTH=248 HEIGHT=192
    CLASSID="CLSID:FC25B790-85BE-12CF-2B21-665553560606">
    <PARAM NAME="paramA" VALUE="121">
    <PARAM NAME="paramB" VALUE="133">
    <PARAM NAME="paramC" VALUE="144>
    <PARAM NAME="paramD" VALUE="red">
    <PARAM NAME="paramE" VALUE="blue">
    <PARAM NAME="paramF" VALUE="green">
</OBJECT>
<hr>
</body>
</HTML>

The same rules apply to ActiveX controls as to Java applets and Flash files. If you need web spiders to index anything inside an ActiveX control, remove that content from the ActiveX control.

HTML Traps

There are several HTML-related traps. These include the use of frames, iframes, external DIVs, graphical text, large HTML files, and complex (or erroneous) HTML.

Using Frames

For better or for worse, HTML frames are still used on many websites today. The idea of HTML frames was to help in website navigation by separating distinct screen areas into more easily manageable units. Figure 8-5 shows a typical frame layout.

Figure 8-5. Sample frame structure

The following HTML code created the layout in Figure 8-5:

<HTML>
<head>
<title>Frame Example</title>
</head>
<noframes>
This website was designed with frames. Please use a browser that
supports frames.
</noframes>

<frameset rows="15%,70%,15%">
   <frame src="header.html">
   <frameset cols="20%,80%">
     <frame src="navigation.html">
     <frame src="mainbody.html">
   </frameset>
   <frame src="footer.html">
</frameset>
</HTML>

So, where is the problem? The biggest problem is that some search engines may not be able to crawl all of your pages. Some of these search engines might even choose to ignore anything within the <frameset> and </frameset> tags. To help the situation, you can add links to your main content between the <noframes> and </noframes> tags.
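For example, a hedged expansion of the <noframes> block from the preceding code might carry real links to the framed documents (the sitemap.html page is hypothetical):

<noframes>
This website was designed with frames. You can still reach the main
sections directly:
<a href="mainbody.html">Main content</a> |
<a href="navigation.html">Navigation</a> |
<a href="sitemap.html">Sitemap</a>
</noframes>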

Now, let’s say the search engine can index all your pages within your framed structure. There is still a problem: search engines will more than likely return your frame fragments instead of the frameset definition page. Do you want to show navigation.html or footer.html or even mainbody.html by itself or as part of the whole framed structure? In most cases, you probably want to avoid these scenarios. Although there are some clever JavaScript ways to address this problem, these workarounds are hardly a good solution.
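For reference, the classic workaround is a snippet placed in each frame fragment that reloads the full frameset whenever the fragment is opened on its own. It is only a sketch (index.html stands in for your frameset page), and it inherits all the usual drawbacks of JavaScript-dependent navigation:

<script type="text/javascript">
// If this fragment was loaded outside its frameset, load the frameset.
if (top == self) {
    top.location.href = "index.html";
}
</script>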

The consensus these days is to stay away from frames if you want proper search engine indexing, though some people would disagree with that statement. Strictly from a design standpoint, frames can look clunky and outdated, but the bottom line is that all new web browsers and most search engines still support them.

If you do use HTML frames, chances are Google will index your pages, but again, the users who get to your site from Google might arrive in a way that you did not intend (e.g., by seeing pages individually, even though you intended that they be shown in combination with other pages). I stay away from frames because most of the websites I design are dynamic, and this framing effect can be done easily in JSP, PHP, ASP, CGI, Python, or ColdFusion, thereby defeating the purpose of using HTML frames.

Using Iframes

Iframe stands for inline frame. Some people call iframes a poor man’s frames. They have fewer features and are harder to set up, but are still used. Google and others can read iframe tags with no problems. They will also follow an iframe’s page link (if present). The problem is that the iframe source URL is indexed separately. Here is an example page with an iframe using content from an outside file:

<HTML>
<head>
<title>Loading Content with IFRAME Example</title>
</head>

<body>
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 | Lorem Ipsum 1 |
Lorem Ipsum 1 | Lorem Ipsum 1 <br><br>

<iframe src="externalIframeContent.html" scrolling="no"
id="externalContent"
name="externalContent" height="400" width="100%" frameborder="0" >
If you are seeing this text, your browser does not support iframes.
</iframe>
</body>
</HTML>

Here is the content of the externalIframeContent.html file:

Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 | Lorem Ipsum 2 |
Lorem Ipsum 2 | Lorem Ipsum 2 <br><br>

So, the overall output of the main HTML file that contains the iframe definition looks similar to Figure 8-6.

Figure 8-6. Iframe example

If the external content loaded by the iframe is important to your parent page, you may want to reconsider your design, as it will not be indexed as part of the parent page.

Iframes present an additional SEO problem if you are trying to use JavaScript to change the page’s content dynamically. We already saw something similar with the use of Ajax. For completeness, here is sample HTML that shows this scenario with iframes:

<HTML>

<head>
<title>Dynamic External File Loading with an IFRAME</title>
</head>

<body>

Click on the links to see different files!
<br><br>

<a href="#" onclick="javascript:externalContent.document.location
 ='file1.html'">Link 1</a> |

<a href="#" onclick="javascript:externalContent.document.location
 ='file2.html'">Link 2</a>

<br><br>

<iframe name="externalContent" scrolling="No" width="500"
height="300" frameborder="Yes" >
</iframe>


</body>
</HTML>

The file1.html file contains a bunch of “Lorem Ipsum 1” text strings. Similarly, file2.html contains a bunch of “Lorem Ipsum 2” text strings. So, when you click on the “Link 1” link, the output of this source code would look similar to Figure 8-7.

Figure 8-7. Iframe: Dynamic content example

The frameborder property was set to yes to give you a visual hint that whatever is within this border would not be considered part of this page in an SEO sense. It is also possible that search spiders will fail to pick up the two dynamic JavaScript links.
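A more crawler-friendly variant of this page puts the real URLs in the href attributes and uses the target attribute to direct them into the named iframe. Browsers get the same in-place loading effect, and spiders get ordinary links to follow:

<a href="file1.html" target="externalContent">Link 1</a> |
<a href="file2.html" target="externalContent">Link 2</a>

<br><br>

<iframe name="externalContent" scrolling="no" width="500"
height="300" frameborder="yes"></iframe>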

Using External DIVs

Similar types of scenarios can occur if you’re loading external files into DIV tags. For the most part, using DIVs is not a problem. It is the more elegant approach when compared to HTML tables. The major advantage of DIVs is that you can show them in a completely different order compared to their source code order.

The problem comes when you are trying to change their content dynamically with the use of JavaScript. We already saw one example with Ajax loading the content of an external file into a DIV.

The same rule applies here as with JavaScript. Content that search engines need to index should not be generated by DHTML/JavaScript.

Using Graphical Text

Although there are legitimate uses for graphical text, some people make the mistake of unnecessarily using large amounts of text as images. Figure 8-8 shows an example.

Figure 8-8. Text-in-image example

You could represent this advertisement in HTML as follows:

<HTML>
<head>
<title>Special Basketball Book</title>
</head>
<body>
<!-- some unrelated text here -->
<a href="somelink.html"><img src="imagesookimage.jpg"></a>
<!-- some unrelated text here -->
</body>
</HTML>

If you are a search engine, what would you think this page or image is about? You would have no clue! Try to think like a search engine; in other words, help the search engine do its job. Here is how the revised HTML code would look:

<HTML>
<head>
<title>Special Basketball Book</title>
</head>
<body>
<!-- some related text here -->
<a href="greatbookonbasketball.html"><img alt="Great book on
Basketball;Learn from basketball professionals ball handling, 1-on-1
defense, zone defense, triple threat position, proper shooting, nba
rules, international rules and much more"
src="imagesooksGreatBookOnBasketBallByJohnSmith.jpg"></a>
<!-- some related text here -->
</body>
</HTML>

In this example, we optimized the code in three different ways:

  • We optimized the actual relative link.

  • We gave the image file a descriptive, keyword-rich name.

  • We added an ALT description of the image.

At first, this may seem like too much to do, but remember that once most pages are designed, they are rarely modified.

Extremely Large Pages

Sometimes if you’re using lots of different third-party JavaScript code, or if you’re using older versions of web design software such as Microsoft FrontPage, you may run into trouble with your page size. Here are some things you can do if you are using large HTML files:

  • Externalize all or most of your JavaScript code and stylesheets (e.g., .css files) to separate files.

  • Place most of the relevant content in the top portion of the HTML.

  • Place most of the important links in the top portion of the HTML.

The following HTML code fragment shows these three guidelines in action:

<!-- first 100k start -->
<HTML>
<head>
<link rel="stylesheet" type="text/css" href="mystylesheet.css" />
<script src="myjavascript.js" type="text/javascript">
</script>
</head>
<body>
<!--
In this section place:
           > important keywords
           > important links
-->
<!-- first 100k end -->
<!-- rest of document -->
</HTML>

Complex HTML and Formatting Problems

You may experience complex HTML and formatting problems when JavaScript and HTML are tightly coupled and JavaScript is all over the HTML code. It is possible that some web spiders might get confused, especially if your HTML code is not well formatted. Here are some examples of how this situation can occur:

  • No opening tag

  • No closing tag

  • Misspelled opening or closing tag

  • JavaScript closing tag not found

  • Incorrect tag nesting

You should always validate your pages before going live. Many free HTML validators are available on the Internet. One such validator is available at http://validator.w3.org/. When you go to this website, you are presented with a simple form in which you are prompted to enter the URL of the web page you want to validate. A few seconds after you submit your page, you’ll get the full report.

Warning

A word of caution regarding HTML validations: you do not have to fix every error or warning you see in a validation report. Focus on the most important points I identified in this section. I use HTML validators in parallel with web spider viewers. Web spider viewers are SEO tools that transform your HTML pages into output resembling what most web spiders will see. This output is essentially clean text with no HTML tags or elements.

If you are new to HTML, you may want to use a user-friendly web page design tool to help you prototype your web pages. In most cases, these tools take care of HTML formatting so that you don’t have to worry about HTML validation. One such tool is Adobe Dreamweaver.

Website Performance Traps

Website performance is important from two perspectives: the web spider’s and the web user’s. If your site has many thousands of pages, you will want to make sure your site response times are reasonable.

Very Slow Pages

Web spiders are busy creatures. If any of your dynamic pages are computationally intensive, the web spiders might give up waiting on your page to finish loading. In technical terms, this is called timing out on a request.

Dynamic pages aren’t the only issue that will cause a web spider to give up. If your website is running on a server that is hosting many other sites, it may be slow to respond because of the overwhelming load caused by one of the other sites. As a result, your website might take many seconds to respond. Another problem could be with your web host if it experiences network latency due to limited bandwidth.

You can do some things to remedy these situations. The basic idea is to speed up page transitions to any web client, not just the web spider. Consider using the following:

  • Web server compression

  • Web page caching

Web server compression

The best way to understand web server compression is to think of sending ZIP files instead of uncompressed files from your web server to your web user. Sending less data over the network will minimize network latency and your web users will get the file faster.

The same thing applies to web spiders, as the major ones support HTTP 1.1. In fact, search engines benefit directly, since they need a lot less network bandwidth to do the same work.

Web server compression is a technology used on the web server where you are hosting your pages. If you have full control of the web server, you can set up this compression to occur automatically for all websites or pages this server is hosting.
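For instance, on an Apache server with the mod_deflate module available, a hedged .htaccess sketch might look like the following; confirm with your host that the module is actually enabled:

<IfModule mod_deflate.c>
AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>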

If you do not have this luxury, you can set this up in your code. To set up web server compression in PHP, you can use the following PHP code:

<?php
// Buffer all page output and gzip it for clients that accept compression.
ob_start("ob_gzhandler");
?>
<HTML>
<body>
<p>This is the content of the compressed page.</p>
</body>
</HTML>

You can enable web server compression in your code in another way that is even easier than the approach we just discussed. You can use the php.ini file that usually sits in your root web folder. If it does not exist, you can create it. You can also place this file in your subfolders to override the root php.ini settings. Here is the required fragment:

zlib.output_compression=on

Note that web server compression may not be available from your web hosting provider. If it is, you should consider using it, as it has a lot of benefits, including the following:

  • It speeds up the loading of your pages on clients’ PCs.

  • It reduces the overall web traffic from your hosting provider to your web visitors.

  • It is supported by virtually all newer web browsers.

Note

The only caveat for the use of web server compression is that some older browsers, including specific versions of Internet Explorer 6, have a bug in handling web server compression (which is a standard based on HTTP 1.1). The chances of someone coming to your site with these older browsers are pretty slim, though.

Web page caching

Many commercial and noncommercial products come with caching options. Inspect your applications and evaluate whether you need to use caching. If you are developing a new application or are using an existing application that does not have a caching capability, you can implement one yourself with very little programming.

There are many ways to accomplish web caching. In this subsection, I will explain how to implement one of the simplest web caching techniques. It involves obtaining a particular dynamic web page snapshot in time and then serving this static version to end users.

Furthermore, to ensure content freshness, there is usually a scheduled, automated task that repeats the process at defined intervals. The most popular use of web application caching is for news websites.

Making this work well requires knowing and using the so-called welcome files in the right order. You can think of welcome files as the default pages a web server returns when a user requests a URL that specifies only a directory. The web server looks through its welcome file list, in order, to determine what to serve. Usually, on PHP hosts the first file to be served is index.html or index.htm, followed by index.php.

Here is the general procedure for adding a caching capability to any of your news websites. Let’s suppose your home page is computationally intensive, with many page elements that take significant time to generate their HTML fragments:

  1. Issue a wget --output-document=index.html http://www.abcde.com/index.php command to grab a snapshot of your current page, storing the snapshot as index.html.

  2. FTP the index.html file to your root web folder as index.html.

  3. Do this over and over again by creating a scheduled task on your PC (see the sample crontab entry after this list).
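If you can run the scheduled task on the server itself, the FTP step disappears and the whole procedure collapses into a single cron entry. Here is a sample crontab line; the 15-minute interval and the web root path are placeholders:

# Refresh the cached snapshot of the home page every 15 minutes
*/15 * * * * wget --output-document=/path/to/webroot/index.html http://www.abcde.com/index.php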

Figure 8-9 illustrates the information flow.

Figure 8-9. Caching content for performance

Use the setup I am describing here if you are using a shared host. You can also perform the scheduled task on the server end; some shared hosting providers offer a so-called cron job interface through their cPanel software.

Error Pages

If your website is using server-side scripting languages, chances are you have seen some error pages. In most cases, these error messages are very cryptic or technical and do not provide any help whatsoever. Imagine a web spider hitting a web page that is producing a cryptic error message. This is precisely what search engines will index: the actual error message instead of your desired content.

Different types of errors can occur. One of the most common is the 404 message, which says the page cannot be found. The next most common is the 500 message, which usually signifies a problem in your code. Another frequent message type is the 403 message, which occurs when someone tries to access a web page for which she does not have permission.

To help this situation, you can set up custom web pages to be shown when these errors occur. Although there are different ways to do this, I will mention the simplest one. This method uses the good old .htaccess file that usually resides in your web root folder.

With .htaccess, you can tell the server to display a special page to the user in case of an error. Ideally, this page should tell the user that something is wrong, downplay the fact that someone messed up (probably you as the webmaster!), and provide a set of links to the major sections of the site so that the user can at least look for what is missing.

Another use is to include a search form on the 404 page. In addition, you can tell the server to run a CGI script instead of simply displaying a static page. If you do this, you can tell the script to log the error for you, or the script can send you an email about the error.

If the file is not there, simply create one with a text editor and add the following lines, substituting the actual paths on your server:

ErrorDocument 403 /some-relative-path/403.html
ErrorDocument 404 /some-relative-path/404.html
ErrorDocument 500 /some-relative-path/500.html

After your .htaccess file has been created or updated and placed in the web root of your server, carefully create the corresponding HTML files. In these files, you may want to provide ways for web users—or in our case, web spiders—to gracefully continue on their path, as opposed to seeing nothing to go to in the unhandled error page scenario. If you are familiar with server-side web programming languages, you can handle any sort of runtime errors with even finer-grained control.

Session IDs and URL Variables

Session IDs are a legitimate way to keep track of user sessions. Many web applications use them: they are unique identifiers that map each user’s web session to the current application state. Because lots of web users do not allow cookies on their web browsers, web applications are often forced to store the session IDs as part of the URLs. Even Google admits that session IDs in URLs are not a great idea.

Here are some examples of URLs with session IDs and/or other variables:

  • http://www.abcde.com/product.do?id=123&page=201&items=10&

  • http://www.abcde.com/product.php?id=fji5t9io49fk3et3h4to489

  • http://www.abcde.com/product.cgi?sid=485fjh4toi49f4t9iok3et3

  • http://www.abcde.com/product.asp?sessionid=h49fk5et3489fji4t9io4to

The basic problem with session IDs stems from the fact that the ID will always be different the next time the web spiders hit your website, thereby giving the impression of a different page (which could be considered as duplicate content).

To help search engine spiders, you can attempt to detect web spider signatures and disable the usage of session IDs in your URLs. Doing so may also get you into trouble, though, as you are now handling web spiders in a different fashion when compared to regular human visitors. This could be considered search engine cloaking.

The situation is a bit easier to handle when it comes to other URL parameters, as long as there are not too many of them. Many website owners deal with URL parameters by way of obfuscation. They use techniques of URL rewriting in tandem with the .htaccess file. The net effect is the seeming disappearance of the URL variables. The URL essentially looks static after applying the URL rewriting filter.

Consider the following link:

http://www.abcde.com/product.php?id=30

What if you wanted to rewrite the preceding link to the following?

http://www.abcde.com/basketball-30.html

You could accomplish that with the following snippet of code in the .htaccess file:

RewriteEngine on
RewriteRule ^basketball-([0-9]+)\.html$ product.php?id=$1

Keep in mind that the .htaccess file is a very powerful tool for all kinds of things. Be careful anytime you edit or create this file. If you edit this file, be sure to create a backup before making any changes. Do lots of testing before making a move!

Splash or Doorway Pages

Splash/doorway pages usually do not contain any indexable content. To keep this in perspective, we will address only the legitimate uses of splash or doorway pages. Figure 8-10 illustrates the concept.

Figure 8-10. Sample Flash doorway or splash page

If all you had was a Flash file as shown in Figure 8-10, the web spider would be stuck and would not have anywhere to go. Ensure that the Skip Intro button is outside the Flash application and is part of your HTML text content. It is imperative that you provide additional links or information around your splash Flash application, such as a link to your Sitemap.

Robots.txt

Yes, that’s right! If you are not very careful with the robots.txt file, you could be blocking web spiders from crawling content you do want to be indexed. Most reputable web spiders will obey your instructions within robots.txt. So, be very careful when you use and edit the robots.txt file. Also note that Google Webmaster Tools provides a robots.txt checker, so you can enter a URL and confirm whether a given URL will be crawled.
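The classic trap is a single character. Compare the following two hedged examples (the /private/ directory is a placeholder): the first blocks all compliant spiders from the entire site, while the second blocks only one directory:

# Blocks ALL compliant spiders from the ENTIRE site:
User-agent: *
Disallow: /

# Blocks only one directory; everything else stays crawlable:
User-agent: *
Disallow: /private/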

Summary

In this chapter, you saw lots of ways to confuse search engines. As a rule of thumb, anything that you want to be indexed for a particular keyword should be in HTML. It is as simple as that. Although Google and others are continuously improving their search engine technology, stick to the basics to be safe. You’ll be surprised at what you can accomplish.
