In this article we will discuss search engine spiders and what they do. You will also learn how to create a robots.txt file and why you might need one.

Search engine spiders are automated software programs that crawl the Web looking for pages to feed to search engines. They are also called crawlers, robots and bots. Spiders are among the most useful programs on the Internet and are a key part of how the search engines operate. Spiders allow your site to be found by the millions of people who use search engines. Feed the spiders right and they will tell the search engines about your site.

How Spiders Work

A search engine is an index to the Internet: it points to relevant web sites depending on your search. Search engines need a tool that is able to visit websites, navigate them, decide what each website is about and add that data to the search engine.

Spiders are essentially programs that "crawl" sites and report their findings back to their boss. Their purpose in life is to make it easy for your site to get listed in search engines.

Spiders work by finding links to web sites, visiting those web sites, going through the content of each site and then reporting that content back to the database of the search engine they work for. From there, the information is added to the search engine, and the site then shows up in search results.

The robots.txt file

By defining a few rules, you can tell robots not to crawl certain directories or files within your site. Web sites do not absolutely have to have a robots.txt file; they can get along just fine without one. Most spiders look for a robots.txt file as soon as they arrive on your site. Take a look at your site statistics. If your statistics have a "files not found" section, you may see many entries where spiders failed to find the robots.txt file on your site. The default behavior is to allow everything unless you have a Disallow for that resource. If you wish to exclude some of your pages from search engine indexing, this is the tool approved by the search engines. Creating a robots.txt file that guides spiders is simple.

If you want to allow the spiders to crawl your site but exclude directories of your choice, copy and paste the following into a blank txt file:

    User-agent: *
    Disallow: /directory1/
    Disallow: /directory2/
    Disallow: /directory3/

To exclude files of your choice, type in the path to the files you want to exclude:

    User-agent: *
    Disallow: /directory1/page1.html
    Disallow: /directory2/page2.html
    Disallow: /directory3/page3.html

To exclude all the search engine spiders from your entire web site, copy and paste the following into the txt file:

    User-agent: *
    Disallow: /

This will keep a specific search engine spider from crawling your site:

    User-agent: Name_of_Robot
    Disallow: /

To allow a single robot and exclude all other robots, use two records separated by a blank line:

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /

There can only be one robots.txt on a site. You may not have blank lines within a record; blank lines are used only to separate one record from the next. Once you have it the way you want, save the file as "robots.txt", a plain text file. Upload the file to the root directory of your site, that is, the directory where your home page or index page is. Put the robots.txt file right alongside the index file.
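Before uploading a robots.txt file, it can help to check how a well-behaved spider would read it. The following is a minimal sketch using the urllib.robotparser module from Python's standard library. The helper name is_allowed, the host www.example.com and the robot name SomeOtherBot are made up for illustration and do not come from any particular search engine; real spiders may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules that allow Googlebot everywhere and exclude all other robots.
# The blank line separates the two records.
ALLOW_ONE_ROBOT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Rules that exclude a single directory for every robot.
EXCLUDE_DIRECTORY = """\
User-agent: *
Disallow: /directory1/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules.

    Hypothetical helper: parses the robots.txt text held in memory
    instead of downloading it from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/directory1/page1.html"  # assumed URL
    home = "http://www.example.com/index.html"             # assumed URL
    print(is_allowed(ALLOW_ONE_ROBOT, "Googlebot", page))
    print(is_allowed(ALLOW_ONE_ROBOT, "SomeOtherBot", page))
    print(is_allowed(EXCLUDE_DIRECTORY, "SomeOtherBot", page))
    print(is_allowed(EXCLUDE_DIRECTORY, "SomeOtherBot", home))
```

Parsing the text directly keeps the check offline, so you can try different rule sets before the file ever reaches the root directory of your site.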