The robots.txt protocol, also called the "robots exclusion standard", is designed to keep web spiders out of part of a website. It is intended as a security or privacy measure, the equivalent of hanging a "Keep Out" sign on your door.

This protocol is used by website administrators when there are sections or files that they would rather not have accessed by the rest of the world. These could include employee lists or files that are being circulated internally. For example, the White House website has used robots.txt to block any inquiries on speeches by the Vice President, a photo essay of the First Lady, and profiles of the 9/11 victims.

How does the protocol work? The administrator lists the files that shouldn't be scanned in a plain-text file named robots.txt and places it in the top-level directory of the website. The robots.txt protocol was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). For most of its history there was no official standards body or RFC behind the protocol (it was only written up as RFC 9309 in 2022), and compliance remains voluntary, so it's difficult to legislate or mandate that the protocol be followed. In fact, the file is treated as strictly advisory, and it offers no absolute guarantee that those contents won't be read.

In effect, robots.txt requires cooperation from the web spider and even the reader, since anything that is uploaded to the internet becomes publicly available. You aren't locking them out of those pages; you are just making it harder for them to get in. But it takes very little for them to ignore these instructions. Attackers can just as easily reach the files and retrieve the information. So the rule of thumb is: if it's that sensitive, it shouldn't be on your website to begin with.

Care, however, should be taken to ensure that robots.txt doesn't block the robots from other areas of the website. Doing so will dramatically affect your search engine ranking, as search engines rely on their robots to count the keywords, review meta tags, titles and crossheads, and even register the hyperlinks.

A single misplaced character, such as a hyphen or a slash, can have catastrophic effects. For example, robots.txt patterns are matched by simple substring comparison against the start of each path, so care should be taken to make sure that patterns matching directories have the final '/' character appended: otherwise all files with names starting with that substring will match, rather than just those in the directory intended.

To avoid these problems, consider submitting your site to a search engine spider simulator, also called a search engine robot simulator. These simulators, which can be bought or downloaded from the internet, use the same processes and strategies as different search engines and give you a "dry run" of how they will read your site. They will tell you which pages are skipped, which links are ignored, and which errors are encountered. Since the simulators will also reenact how the bots follow your hyperlinks, you'll see whether your robots.txt file is interfering with the search engine's ability to read through all the necessary pages.

It's also important to review your robots.txt files yourself, which will enable you to spot any problems and correct them before you submit your site to real search engines.

provides free online tools for webmasters including a search engine spider simulator and a Google sitemaps XML validator.
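
The trailing-slash pitfall described above is easy to check for yourself before any real crawler visits. What follows is only a minimal sketch of such a "dry run", not a full spider simulator: it uses Python's standard urllib.robotparser module, and the helper name blocked_urls, the example.com URLs, and the /private rules are all hypothetical, chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that a rule-abiding robot would skip."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file; nothing is fetched.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical site layout, for illustration only.
URLS = [
    "http://example.com/index.html",
    "http://example.com/private/staff-list.html",
    "http://example.com/private-offers.html",
]

# Same rule, with and without the final '/' on the directory name.
WITHOUT_SLASH = "User-agent: *\nDisallow: /private\n"
WITH_SLASH = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print("Disallow: /private  blocks:", blocked_urls(WITHOUT_SLASH, URLS))
    print("Disallow: /private/ blocks:", blocked_urls(WITH_SLASH, URLS))
```

Under these assumptions, the rule without the final slash reports both the staff list and private-offers.html as blocked, while the rule with the slash blocks only the page inside the /private/ directory. Keep in mind that this only models a cooperative robot; as noted earlier, nothing forces a spider to honor the file.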