Truth 51. Sometimes you don’t want to be found

Search engine optimization is the art and the science of making websites—and the content on those sites—visible and accessible to search engines and to searchers. That’s not always a good thing. Sometimes, you don’t want to be found. What then?

Meet robots.txt.

Think of robots.txt as a spider barrier. The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention for asking web spiders and other robots to stay out of all or part of a website. Most well-behaved species comply.

Adding a robots.txt file to a website means requesting that cooperative robots ignore specific files or directories. There are lots of good reasons for doing so. Privacy is among the top reasons you might not want information to appear in search engine results. Some sites have sections containing content that’s irrelevant to the primary function of the site, and indexing that content would skew the site’s relevance in search indices. Many publisher sites duplicate their content through “print this page” functionality. Asking spiders to skip the print versions is an easy way to eliminate this duplicate content issue, which could otherwise hurt the site’s rankings.
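To make the idea concrete, here is a minimal sketch of how a cooperative robot might read such a request. It uses Python’s standard `urllib.robotparser` module. The rules, the `ExampleBot` name, and the example.com URLs are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for an imaginary publisher site: ask every robot
# ("User-agent: *") to skip the print versions and the shopping cart.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /cart/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The regular article is fair game; its print version is not.
print(parser.can_fetch("ExampleBot", "https://www.example.com/articles/seo-basics"))
print(parser.can_fetch("ExampleBot", "https://www.example.com/print/seo-basics"))
```

Each `Disallow` line names a path prefix, so one line covers everything inside that directory.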

If you want to see a robots.txt file in action, go to SiteOfYourChoice.com/robots.txt in your browser. You’ll see a list of the directories the site owner is asking search engines to ignore.

Websites with multiple subdomains require each subdomain to have its own robots.txt file. If YourSite.com has a robots.txt file but blog.YourSite.com doesn’t, the robots.txt file won’t apply to the blog.
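The reason is that robots look for the file at the root of whatever host they are visiting. A small sketch of that lookup, assuming a hypothetical helper named `robots_url` and placeholder URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs the given page.

    The file always sits at the root of the page's own host, so the
    path, query string, and fragment of the page are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Two pages on two hosts are governed by two different files.
print(robots_url("https://yoursite.com/products/widget?id=7"))
print(robots_url("https://blog.yoursite.com/2009/05/post"))
```

The main site and the blog resolve to different files, so rules written for one say nothing about the other.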

Generally, site directories such as /cgi-bin/, /wp-admin/, /cart/, /scripts/, and others that might include sensitive data, such as e-mail addresses and phone numbers, are good candidates for robots.txt. But for heaven’s sake, be careful. An improperly implemented robots.txt file can stop search engines from indexing the main content of your website. If you’re reading this book, that’s unlikely to be your goal.
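One common way to go wrong is a stray `Disallow: /`, which asks robots to skip the entire site. The sketch below compares an intended rule set with that mistake, again using Python’s standard `urllib.robotparser`. The helper `is_allowed`, the bot name, and the URLs are illustrative assumptions. A check like this can be run before a new robots.txt goes live.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Report whether a cooperative robot would consider the URL crawlable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# What the site owner meant: keep robots out of the admin area only.
INTENDED = "User-agent: *\nDisallow: /wp-admin/\n"

# The mistake: a bare slash matches every path on the site.
MISTAKE = "User-agent: *\nDisallow: /\n"

HOME = "https://www.example.com/"
ADMIN = "https://www.example.com/wp-admin/options.php"

for label, rules in (("intended", INTENDED), ("mistake", MISTAKE)):
    print(label, "home:", is_allowed(rules, HOME), "admin:", is_allowed(rules, ADMIN))
```

With the intended rules, the home page stays crawlable and only the admin area is off limits. With the mistake, both are blocked.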

Be careful not to individually list every page you don’t want indexed in the robots.txt file. Stick to directories. Anyone can read a robots.txt file, so a page-by-page list of the files you don’t want found makes it easy for technically savvy users to zero in on them.

And bear in mind that robots.txt is never a guarantee of privacy. Some site administrators have applied the protocol in the blithe belief that they’re blocking access to specific content from the world and Web at large. Robots.txt is a request to spiders and web robots to ignore information. It’s not a shield of darkness and should never be considered—or used—as such.
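A short sketch shows why. Honoring robots.txt is a step the robot itself chooses to perform, and nothing on the server enforces it. The function name `allowed_urls`, the bot name, and the URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs a cooperative robot would agree to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


RULES = "User-agent: *\nDisallow: /members/\n"
URLS = [
    "https://www.example.com/",
    "https://www.example.com/members/list.html",
]

# A polite robot filters its own to-do list before fetching anything.
polite = allowed_urls(RULES, "PoliteBot", URLS)

# A robot that never consults robots.txt keeps the full list.
# Only real access control on the server (a login, for example) stops it.
rude = list(URLS)

print(len(polite), "URL(s) for the polite robot;", len(rude), "for the rude one")
```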
