Robocops

The robots.txt protocol, also called the "robots exclusion standard", is designed to lock out
web spiders from accessing part of a website. It is a security or privacy measure, the
equivalent of hanging a "Keep Out" sign on your door.

This protocol is used by website administrators when there are sections or files that they
would rather not be accessed by the rest of the world. This could include employee lists, or
files that they are circulating internally. For example, the White House website uses
robots.txt to block any inquiries on speeches by the Vice President, a photo essay of the
First Lady, and profiles of the 9/11 victims.

How does the protocol work? The administrator lists the files that shouldn't be scanned in a
file named robots.txt and places it in the top-level directory of the website. The robots.txt
protocol was created by consensus in June 1994 by members of the robots mailing list
(robots-request@nexor.co.uk). For most of its history there was no official standards body or
RFC for the protocol (it was only published as RFC 9309 in 2022), and it remains difficult to
legislate or mandate that the protocol be followed. In fact, the file is treated as strictly
advisory, and there is no absolute guarantee that those contents won't be read.

In effect, robots.txt requires cooperation by the web spider and even the reader, since
anything that is uploaded to the internet becomes publicly available. You aren't locking them
out of those pages; you are just making it harder for them to get in. But it takes very little
for them to ignore these instructions. Computer hackers can also easily penetrate the files
and retrieve information. So the rule of thumb is: if it's that sensitive, it shouldn't be on
your website to begin with.

Care, however, should be taken to ensure that the robots.txt file doesn't block the website
robots from other areas of the website. This will dramatically affect your search engine
ranking, as the crawlers rely on the robots to count the keywords, review metatags, titles and
crossheads, and even register the hyperlinks.

One misplaced hyphen or dash can have catastrophic effects. For example, robots.txt patterns
are matched by simple substring comparison against the start of each path, so care should be
taken to make sure that patterns matching directories have the final '/' character appended;
otherwise all files with names starting with that substring will match, rather than just those
in the directory intended.

To avoid these problems, consider submitting your site to a search engine spider simulator,
also called a search engine robot simulator. These simulators, which can be bought or
downloaded from the internet, use the same processes and strategies as different search
engines and give you a "dry run" of how they will read your site. They will tell you which
pages are skipped, which links are ignored, and which errors are encountered. Since the
simulators will also reenact how the bots will follow your hyperlinks, you'll see if your
robots.txt file is interfering with the search engine's ability to read through all the
necessary pages.

It's also important to review your robots.txt files, which will enable you to spot any
problems and correct them before you submit them to real search engines.

[...] provides free online tools for webmasters including a search engine spider simulator and
a Google Sitemaps XML validator.
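
The point about the final '/' character can be tried out with a short Python sketch. It is only an illustration under stated assumptions: the paths, the www.example.com host and the helper names is_blocked and parse_rules are all hypothetical and do not come from any real site. The first part imitates the simple prefix comparison by hand; the second part feeds a robots.txt rule to Python's standard urllib.robotparser module as a very small "dry run" of how a cooperating robot would read it.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(path, disallow_patterns):
    """Return True if the path starts with any non-empty Disallow pattern."""
    return any(p and path.startswith(p) for p in disallow_patterns)


# Hypothetical paths on an imaginary site, for illustration only.
paths = ["/private/staff.html", "/private-offers.html", "/privateer-history.html"]

# Without the trailing slash, every path that merely starts with "/private" matches.
without_slash = [p for p in paths if is_blocked(p, ["/private"])]

# With the trailing slash, only files inside the /private/ directory match.
with_slash = [p for p in paths if is_blocked(p, ["/private/"])]


def parse_rules(text):
    """Load robots.txt content from a string into the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# A small dry run: ask the parser what a cooperating robot would be allowed to fetch.
rules = parse_rules("User-agent: *\nDisallow: /private/\n")

if __name__ == "__main__":
    print("Blocked by '/private': ", without_slash)
    print("Blocked by '/private/':", with_slash)
    for path in paths:
        allowed = rules.can_fetch("*", "http://www.example.com" + path)
        print(path, "allowed" if allowed else "blocked")
```

With the pattern "/private" all three hypothetical pages are blocked, while "/private/" blocks only the page inside the directory. This checks only how a well-behaved robot interprets the rules; it says nothing about robots that ignore the file.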