The robots.txt protocol, also called the "robots exclusion standard", is designed to keep web spiders out of part of a website. It is intended as a security or privacy measure, the equivalent of hanging a "Keep Out" sign on your door.

This protocol is used by website administrators when there are sections or files that they would rather not have accessed by the rest of the world. These could include employee lists, or files that are being circulated internally. For example, the White House website uses robots.txt to block any inquiries on speeches by the Vice President, a photo essay of the First Lady, and profiles of the 9/11 victims.

How does the protocol work? The administrator lists the files that shouldn't be scanned in a file named robots.txt and places it in the top-level directory of the website. The robots.txt protocol was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). For most of its history there was no official standards body or RFC behind the protocol (it was only written up as RFC 9309 in 2022), and compliance remains voluntary, so it's difficult to legislate or mandate that the protocol be followed. In fact, the file is treated as strictly advisory, and offers no absolute guarantee that those contents won't be read.

In effect, robots.txt requires cooperation by the web spider and even the reader, since anything that is uploaded to the internet becomes publicly available. You aren't locking them out of those pages; you are just making it harder for them to get in. But it takes very little for them to ignore these instructions. Computer hackers can also easily penetrate the files and retrieve information. So the rule of thumb is: if it's that sensitive, it shouldn't be on your website to begin with.

Care, however, should be taken to ensure that the robots.txt file doesn't block search engine robots from other areas of the website. Doing so will dramatically affect your search engine ranking, as search engines rely on their robots to count keywords, review meta tags, titles and crossheads, and even register hyperlinks.

One misplaced hyphen or slash can have catastrophic effects. For example, robots.txt patterns are matched by simple prefix comparison against the start of each path, so care should be taken to make sure that patterns matching directories have the final '/' character appended. Otherwise all files with names starting with that string will match, rather than just those in the directory intended.

To avoid these problems, consider submitting your site to a search engine spider simulator, also called a search engine robot simulator. These simulators (which can be bought or downloaded from the internet) use the same processes and strategies as different search engines and give you a "dry run" of how they will read your site. They will tell you which pages are skipped, which links are ignored, and which errors are encountered. Since the simulators also reenact how the bots will follow your hyperlinks, you'll see whether your robots.txt file is interfering with the search engine's ability to read through all the necessary pages.

It's also important to review your robots.txt files, which will enable you to spot any problems and correct them before you submit your site to real search engines.

provides free online tools for webmasters including a search engine spider simulator and a Google sitemaps XML validator.
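
The trailing '/' pitfall described above can be tried out locally before a site goes anywhere near a real search engine. What follows is only a minimal sketch, assuming Python's standard `urllib.robotparser` module behaves like a well-mannered robot; the helper name `blocked_paths`, the page names and the two sample rule sets are hypothetical and chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, agent="*"):
    """Return the paths that a compliant robot would skip under these rules."""
    parser = RobotFileParser()
    # parse() takes the lines of a robots.txt file, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# Hypothetical site layout: one page inside the private directory,
# one unrelated page whose name merely starts with the same characters.
PATHS = ["/private/report.html", "/private-events.html", "/index.html"]

# The same rule, written without and with the final '/' character.
WITHOUT_SLASH = "User-agent: *\nDisallow: /private\n"
WITH_SLASH = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print("Without trailing slash:", blocked_paths(WITHOUT_SLASH, PATHS))
    print("With trailing slash:   ", blocked_paths(WITH_SLASH, PATHS))
```

Under these assumptions, the rule without the slash also hides /private-events.html, while the rule with the slash hides only the page inside the directory. Keep in mind that this models a cooperative robot only; nothing here stops a visitor or a badly behaved spider from requesting the pages directly.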