In this article we will discuss search engine spiders and what they do. You will also learn how to create a robots.txt file and why you might need one.

Search engine spiders are automated software programs that crawl the Web looking for pages to feed to search engines. They are also called crawlers, robots and bots. Spiders are among the most useful programs on the Internet, and they are a key part of how search engines operate. Spiders allow your site to be found by the millions of people who use search engines. Feed the spiders right and they will tell the search engines about your site.

How Spiders Work

A search engine is an index to the Internet. It points you to relevant web sites depending on your search. To build that index, search engines need a tool that can visit web sites, navigate them, decide what each site is about and add that data to the search engine.

Spiders are essentially programs that "crawl" sites and report their findings back to their boss. Their purpose in life is to make it easy for your site to get listed in search engines.

Spiders work by finding links to web sites, visiting those web sites, going through their content and then reporting that content back to the database of the search engine they work for. From there, the information is added to the search engine, and the site then shows up in search results.

The robots.txt file

By defining a few rules, you can tell robots not to crawl certain directories or files within your site. Web sites do not have to have a robots.txt file; they can get along just fine without one. Most spiders, however, look for a robots.txt file as soon as they arrive on your site. Take a look at your site statistics. If they have a "files not found" section, you may see many entries where spiders looked for robots.txt and failed to find it.

The default behavior is to allow everything unless you have a Disallow rule for that resource. If you wish to exclude some of your pages from search engine indexing, robots.txt is the tool approved by the search engines. Creating a robots.txt file that guides spiders is simple.

If you want to allow the spiders to crawl your site but exclude directories of your choice, copy and paste the following into a blank txt file, replacing the directory names with your own:

User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/

To exclude files of your choice, type in the path to each file you want to exclude:

User-agent: *
Disallow: /directory1/page1.html
Disallow: /directory2/page2.html
Disallow: /directory3/page3.html

To exclude all the search engine spiders from your entire web site, copy and paste the following into the txt file:

User-agent: *
Disallow: /

This will keep a specific search engine spider from indexing your site (replace Name_of_Robot with the name of that spider):

User-agent: Name_of_Robot
Disallow: /

To allow a single robot and exclude all other robots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

There can only be one robots.txt on a site, and you may not have blank lines within a record; a blank line is used only to separate one record from the next. Once you have the file the way you want it, save it as plain text under the name robots.txt. Upload the file to the root directory of your site, that is, the directory where your home page or index page is. Put the robots.txt file right alongside the index file.
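
One way to sanity-check rules like the ones above before uploading them is to run them through the robots.txt parser in Python's standard library (urllib.robotparser). The following is only a minimal sketch, not a definitive test: the helper name is_allowed, the www.example.com URLs and the bot name SomeOtherBot are made up for illustration, and real spiders may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules from the article: allow Googlebot, exclude every other robot.
ALLOW_ONLY_GOOGLEBOT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Rules from the article: let all robots in, but keep them out of
# two directories.
EXCLUDE_DIRECTORIES = """\
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
"""

# A record with a blank line in the middle, which the article warns against.
BLANK_LINE_IN_RECORD = """\
User-agent: *

Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules.

    Hypothetical helper for illustration: it parses the text of a
    robots.txt file directly instead of downloading it from a site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/page.html"  # placeholder URL
    print(is_allowed(ALLOW_ONLY_GOOGLEBOT, "Googlebot", page))
    print(is_allowed(ALLOW_ONLY_GOOGLEBOT, "SomeOtherBot", page))
    print(is_allowed(EXCLUDE_DIRECTORIES, "SomeOtherBot",
                     "https://www.example.com/directory1/page1.html"))
    print(is_allowed(BLANK_LINE_IN_RECORD, "SomeOtherBot", page))
```

With this parser, the last check comes back as allowed: the blank line ends the record before the Disallow line is read, so the rule is silently dropped. That is a practical reason to keep each record free of blank lines.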