THE ROBOTS.TXT FILE

Search engines were created to help people find information quickly on the Internet, and they acquire much of that information through robots (also known as spiders or crawlers) that look for web pages on their behalf.

These robots explore the web, looking for and recording all kinds of information. They usually start from URLs submitted by users, from links they find on web sites, from sitemap files, or from the top level of a site.

Once a robot accesses the home page, it recursively accesses all pages linked from that page. A robot can also check every page it can find on a particular server.

After the robot finds a web page, it indexes the title, the keywords, the text, and so on. Sometimes, however, you might want to prevent search engines from indexing some of your content, such as news postings and specially marked web pages (for example, affiliate pages). Conventions exist for this, but whether individual robots comply with them is purely voluntary.

ROBOTS EXCLUSION PROTOCOL

If you want robots to stay away from some of your web pages, you can ask them to ignore the pages you don't want indexed. To do that, place a robots.txt file in the root directory of your web site.

For example, if you have a directory called e-books and you want to ask robots to keep out of it, your robots.txt file should read:

    User-agent: *
    Disallow: /e-books/

When you don't have enough control over your server to set up a robots.txt file, you can try adding a META tag to the head section of any HTML document.

For example, a tag like the following tells robots not to index a particular page and not to follow its links:

    <meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

Support for the META tag among robots is not as widespread as support for the Robots Exclusion Protocol, but most of the major web indexes currently support it.

NEWS POSTINGS

If you want to keep the search engines out of your news postings, you can add an "X-no-archive" line to the headers of your postings:

    X-no-archive: yes

Most common news clients allow you to add an X-no-archive line to the headers of your news postings, but some of them don't.

The problem is that most search engines assume that all information they find is public unless marked otherwise.

So be careful: although the robot and archive exclusion standards may help keep your material out of the major search engines, there are others that respect no such rules.

If you're highly concerned about the privacy of your e-mail and Usenet postings, you should use anonymous remailers and PGP. You can read about them here:

    www dot well dot com/user/abacard/remail.html
    www dot io dot com/~combs/htmls/crypto.html
    world dot std dot com/~franl/pgp/

Even if you are not particularly concerned about privacy, remember that anything you write will be indexed and archived somewhere for eternity, so use the robots.txt file as much as you need it.

Written by Dr. Roberto A.
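
As a rough illustration of how a well-behaved robot might interpret the e-books rule from the Robots Exclusion Protocol section, here is a minimal Python sketch using the standard library's urllib.robotparser module. The robot name "ExampleBot", the example.com URLs, and the helper function is_allowed are assumptions made for this example; they do not come from the article. The rule is written with a leading slash (/e-books/), since rule paths are matched against the start of the URL path.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content for the e-books example, with one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /e-books/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the text directly
    instead of downloading it from a server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URLs below are made-up values for the demo.
    for url in (
        "http://example.com/e-books/guide.html",
        "http://example.com/index.html",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
```

A check like this only shows what a compliant robot would do. As the article notes, compliance is voluntary, so a robots.txt file is a request and not an access control.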