Appendix B. Joomla! robots.txt and .htaccess

The robots.txt and .htaccess files are important to help you gain more traffic from search engines. The robots.txt file opens up or restricts access to files on your server for Search Engine Robots. The .htaccess file takes care of creating great looking, search engine friendly, and easy to remember URLs for your web site.

However, they can also create havoc and dismay if used the wrong way. A faulty robots.txt file can leave Search Engine Robots locked outside your web site, and a faulty .htaccess file can display those nice looking 404 pages under every link you touch on your web site. So, how do you know if the files are okay? Testing is the keyword here!

Making sense of robots.txt

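The Googlebot and other Search Engine Robots will crawl your web site based on the rules you provide in your robots.txt file. This file needs to be in the root of your domain, because robots request it only at that address. If Joomla! is installed in a subdirectory, place the file in the domain root and prefix the paths in it with that subdirectory.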
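
Robots ask for the file at one fixed address per host, so it can help to work out that address from any page URL before you test. The following is a minimal Python sketch using only the standard library; the function name robots_url and the www.example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/joomla/index.php?option=com_content"))
```

Opening the printed address in a browser is a quick way to confirm that the file is where the robots expect it.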

Setting your rules for robots

There are just a few rules that robots take into account when they visit your web site. Some of these rules are in the robots.txt file; you can add another set of rules either on a page-by-page basis or on individual links in your web site.

In the robots.txt file you will see commands such as:

Allow: /folder1/myfile.html
Disallow: /folder1/
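
One way to test such rules before uploading them is Python's standard urllib.robotparser module. The sketch below makes a few assumptions: a User-agent: * line is added because the parser ignores rules that are not part of a group, www.example.com stands in for your own domain, and other.html is a made-up file name. This parser applies the first rule that matches, which is why the Allow line comes before the Disallow line; real search engines may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the text, placed in a group that applies to all robots.
RULES = """\
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for path in ("/folder1/myfile.html", "/folder1/other.html", "/index.php"):
    allowed = parser.can_fetch("*", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

With these rules, only the second path is reported as blocked.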

You can also have a link to the sitemap of your web site:

Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml

This gives the robots the link to your XML sitemap, or to your .html sitemap if you don't have an XML file.

The following two rule sets look almost identical, but they do opposite things. Small difference, large effect! Here is the first one:

User-agent: *
Disallow: /

The "/" in the second line tells the robots not to visit your site's pages. In the following example, the robots are allowed to visit all pages.

User-agent: *
Disallow:

These two examples differ by a single character, which shows that you really need to make sure to use the right syntax in your robots.txt file.
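
The difference between the two rule sets can be checked with a short Python sketch built on the standard urllib.robotparser module. The helper name can_crawl and the www.example.com address are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_text, url, agent="Googlebot"):
    """Parse robots_text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # slash: nothing may be visited
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # empty value: nothing is blocked

url = "http://www.example.com/index.php"
print("Disallow: /  ->", can_crawl(BLOCK_ALL, url))
print("Disallow:    ->", can_crawl(ALLOW_ALL, url))
```

The first check prints False and the second prints True.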

Standard Joomla! robots.txt

Joomla! comes with a standard robots.txt file:

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/

As you can see, most special directories are blocked from the Search Engine Robots. There is no need to let them visit and index these special pages that hold the core of the system.
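
To see what the standard file does in practice, the sketch below rebuilds it from the directory list above and asks Python's urllib.robotparser about a few addresses. The domain www.example.com and the file name garden.jpg are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Directories blocked by the standard Joomla! robots.txt file shown above.
BLOCKED_DIRS = [
    "administrator", "cache", "components", "images", "includes",
    "installation", "language", "libraries", "media", "modules",
    "plugins", "templates", "tmp", "xmlrpc",
]

robots_txt = "User-agent: *\n" + "".join(
    f"Disallow: /{name}/\n" for name in BLOCKED_DIRS
)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_allowed(path):
    """Report whether any robot may fetch path on the example domain."""
    return parser.can_fetch("*", "http://www.example.com" + path)


for path in ("/administrator/index.php", "/images/stories/garden.jpg", "/index.php"):
    print(path, is_allowed(path))
```

Only /index.php is reported as allowed; the administrator page and the image are both blocked.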

Improving the standard for image searchers

In the standard Joomla! robots.txt file, the directory images is blocked by the following line:

Disallow: /images/

However, this is one line that you need to remove. The images directory holds all the images that you so carefully named so that they would be included in the image search pages of the major search engines.

Make sure that the robots get access to this directory by removing that line from your robots.txt file. This can bring in extra visitors through image search. If you installed the SEF patch from the JoomlAtWork.com site, this is already done for you.
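
If you prefer to script the change, the following Python sketch removes the line from the text of a robots.txt file and then confirms that an image address is reachable. The function name open_images, the shortened rule set, the www.example.com domain, and the garden.jpg file name are all assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser


def open_images(robots_text):
    """Return robots_text without any 'Disallow: /images/' rule."""
    kept = [
        line for line in robots_text.splitlines()
        # Compare without whitespace and case so small variations still match.
        if "".join(line.split()).lower() != "disallow:/images/"
    ]
    return "\n".join(kept) + "\n"


before = (
    "User-agent: *\n"
    "Disallow: /administrator/\n"
    "Disallow: /images/\n"
    "Disallow: /tmp/\n"
)
after = open_images(before)

parser = RobotFileParser()
parser.parse(after.splitlines())
image_url = "http://www.example.com/images/stories/garden.jpg"
print(after)
print(parser.can_fetch("Googlebot-Image", image_url))
```

The other Disallow lines are left untouched, so the administrator and tmp directories stay blocked.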

A complete example

The following is the complete robots.txt file of the site www.cblandscapegardening.com. Notice the long sitemap: line; it must be on one line in your robots.txt file.

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
sitemap: http://www.cblandscapegardening.com/component/option,com_xmap/lang,en/no_html,1/sitemap,1/view,xml/

The robots now have full access to the images directory, including its stories subdirectory, and a sitemap link is provided for all Search Engine Robots. The way in which pages and links are handled by the robots is a part of your content creation, and that explanation is covered in Chapter 4, How to write keyword-rich articles.
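
The complete file can be checked in the same way as a single rule. The sketch below feeds it to Python's urllib.robotparser and reads back the sitemap link with site_maps(), which is available from Python 3.8 onwards. The file name garden.jpg is made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The complete robots.txt file from the text; the sitemap entry is one line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
sitemap: http://www.cblandscapegardening.com/component/option,com_xmap/lang,en/no_html,1/sitemap,1/view,xml/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.cblandscapegardening.com"
print(parser.site_maps())
print(parser.can_fetch("*", site + "/images/stories/garden.jpg"))
print(parser.can_fetch("*", site + "/administrator/"))
```

If the sitemap entry had been broken across two lines, the list returned by site_maps() would not contain the full address, which is an easy way to spot that mistake.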
