
Robocops
By: Philip Nicosia
The robots.txt protocol, also called the "robots exclusion standard," is
designed to tell web spiders which parts of a website they should not
access. It is often treated as a security or privacy measure, but it is
closer to hanging a "Keep Out" sign on your door than to locking it.
This protocol is used by website administrators when there are sections or
files that they would rather not have picked up by search engines. This
could include employee lists, or files that they are circulating internally.
For example, the White House website has used robots.txt to keep crawlers
away from speeches by the Vice President, a photo essay of the First Lady,
and profiles of the 9/11 victims.
How does the protocol work? The administrator lists the paths that shouldn't
be crawled in a plain-text file named robots.txt and places it in the
top-level directory of the website. The robots.txt protocol was created by
consensus in June 1994 by members of the robots mailing list
(robots-request@nexor.co.uk). For most of its history there was no official
standards body or RFC behind it (it was only written up as RFC 9309 in
2022), so it's difficult to legislate or mandate that the protocol be
followed. In fact, the file is treated as strictly advisory, and there is no
absolute guarantee that the listed contents won't be read.
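As a rough illustration of the mechanism described above, the sketch below
feeds a small robots.txt file to Python's standard urllib.robotparser module
and asks whether a crawler may fetch a few URLs. The file contents, the
"ExampleBot" user agent, the www.example.com host, and the is_allowed helper
are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, as they might appear at
# https://www.example.com/robots.txt (paths are invented for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
Disallow: /staff/directory.html
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/index.html", "/internal/memo.txt", "/staff/directory.html"):
    url = "https://www.example.com" + path
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Note that nothing here enforces the answer. A crawler that never calls a
check like this can still request every one of those URLs.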
In effect, robots.txt requires cooperation by the web spider and even the
reader, since anything that is uploaded to a public website becomes publicly
available. You aren't locking them out of those pages; you are only asking
them to stay away, and it takes very little for them to ignore these
instructions. Computer hackers can also simply read the robots.txt file,
which is itself public, and go straight to the files it lists. So the rule
of thumb is: if it's that sensitive, it shouldn't be on your website to
begin with.
Care, however, should be taken to ensure that robots.txt doesn't block
search engine robots from other areas of the website. Doing so can
dramatically hurt your search engine ranking, as the search engines rely on
their robots to count the keywords, review metatags, titles and crossheads,
and even register the hyperlinks.
One misplaced slash can have catastrophic effects. For example, robots.txt
patterns are matched by simple prefix comparisons, so care should be taken
to make sure that patterns matching directories have the final '/' character
appended: otherwise all files with names starting with that prefix will
match, rather than just those in the directory intended.
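The trailing-slash pitfall can be seen in a short sketch, again using
Python's standard urllib.robotparser, which matches rules by path prefix.
The paths, the "ExampleBot" user agent, and the blocked_paths helper are
hypothetical and exist only for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site paths; only the first one lives in the /private/ directory.
PATHS = [
    "/private/report.html",
    "/private-sale.html",
    "/privateer/index.html",
    "/public/index.html",
]


def blocked_paths(disallow_pattern, paths, user_agent="ExampleBot"):
    """Return the paths that a single Disallow rule would block."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_pattern])
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Without the final '/', everything starting with "/private" is blocked.
print(blocked_paths("/private", PATHS))

# With the final '/', only the contents of the directory are blocked.
print(blocked_paths("/private/", PATHS))
```

In this sketch the first rule also blocks /private-sale.html and
/privateer/index.html, which is probably not what the administrator meant.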
To avoid these problems, consider running your site through a search engine
spider simulator, also called a search engine robot simulator. These
simulators, which can be bought or downloaded from the internet, use the
same processes and strategies as different search engines and give you a
"dry run" of how they will read your site. They will tell you which pages
are skipped, which links are ignored, and which errors are encountered.
Since the simulators will also reenact how the bots will follow your
hyperlinks, you'll see if your robots.txt file is interfering with the
search engine's ability to read through all the necessary pages.
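A real simulator does far more than this, but the "dry run" idea can be
sketched in a few lines: given a robots.txt file and a list of links, report
which ones a cooperative robot would crawl and which it would skip. The
dry_run helper, the bot names, and the links below are assumptions made for
the example, not the behavior of any particular search engine or tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one rule set for a specific bot
# and a default rule set for everyone else.
ROBOTS_TXT = """\
User-agent: ImageBot
Disallow: /photos/

User-agent: *
Disallow: /drafts/
"""

# Hypothetical links found on the site.
LINKS = ["/", "/photos/team.jpg", "/drafts/pricing.html", "/products.html"]


def dry_run(robots_txt, user_agent, links):
    """Sort links into those a cooperative robot would crawl or skip."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report = {"crawled": [], "skipped": []}
    for link in links:
        key = "crawled" if parser.can_fetch(user_agent, link) else "skipped"
        report[key].append(link)
    return report


for agent in ("ImageBot", "ExampleBot"):
    print(agent, dry_run(ROBOTS_TXT, agent, LINKS))
```

If a page you want indexed shows up under "skipped" for a robot you care
about, that is a sign the robots.txt rules are broader than intended.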
It's also important to review your robots.txt file itself, which will enable
you to spot any problems and correct them before you submit your site to
real search engines.
Author Bio
XML-Sitemaps.com provides free online tools for webmasters including a
search engine spider simulator and a Google sitemaps XML validator.
Article Source:
http://www.ArticleGeek.com - Free Website Content