Search engines look at millions of Web pages to come up with search results. They do this with what we call "search engine spiders." This makes sense - spiders crawling around on the Web. But another word for them is "robots," because they are simply unmanned programs gathering data automatically.

In the beginning, these robots spidered every page and every file attached to the Web. This caused problems for both the search engines and the people using them. Pages that really weren't worth looking at, such as header files meant to be included in all pages on a site, were being spidered and showed up in search results. Have you ever searched on Google and gotten a partial page as a result?

The solution was for Google and other search engines to begin looking for a robots.txt file in the root folder of each site (http://www.mydomain.com/robots.txt) to determine what should and shouldn't be searched. This is called "The Robots Exclusion Standard." This simple text file, created with Notepad or another simple text editor, lets you tell well-behaved robots not to spider certain folders or pages in your site. The result is happier visitors who come to your site from search engines and get only the full pages you want them to see, not the partial, test or script pages you don't want them to see.

Let's look at some examples to get started:

This allows all spiders to spider all pages on your site. The * is a wildcard that means "all spiders."

User-agent: *
Disallow:

This is the opposite of the above example. This one tells all spiders to NOT spider any of your site. You might want this if you have a test site, for example, that is not live yet.

User-agent: *
Disallow: /

This example tells all robots to stay out of the cgi-bin and images folders.

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

This example tells only the WebFerret robot to not spider the page ferret.htm. It's only an example. I have nothing against WebFerret. The user agent code for Google is googlebot.

User-agent: WebFerret
Disallow: /ferret.htm

It is important that the file is a simple text file - do not use Microsoft Word to create it. And be careful how you type - it must look exactly like the above examples, with caps only for the first letter, just the right spacing, and a leading slash on every folder or page name. A poorly done robots.txt file could harm your site more than help it.
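
If you would like to check a robots.txt file before you upload it, one option is the robots.txt parser in Python's standard library (urllib.robotparser). The sketch below is only an illustration, not a required step: it feeds the four examples above to the parser and asks whether a given robot may fetch a given page. The helper name allowed and the sample page names (index.htm, form.pl, logo.gif) are made up for this example, and the ferret.htm rule is written with a leading slash, which is the form the parser matches against.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    `allowed` is a hypothetical helper for this sketch, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse the text directly, no download needed
    return parser.can_fetch(agent, url)


# The four examples from the article, as plain text.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_FOLDERS = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n"
BLOCK_FERRET = "User-agent: WebFerret\nDisallow: /ferret.htm\n"

SITE = "http://www.mydomain.com"  # placeholder domain used in the article

if __name__ == "__main__":
    # An empty Disallow line blocks nothing.
    print(allowed(ALLOW_ALL, "googlebot", SITE + "/index.htm"))
    # "Disallow: /" blocks the whole site for every robot.
    print(allowed(BLOCK_ALL, "googlebot", SITE + "/index.htm"))
    # Only the listed folders are off limits.
    print(allowed(BLOCK_FOLDERS, "googlebot", SITE + "/cgi-bin/form.pl"))
    print(allowed(BLOCK_FOLDERS, "googlebot", SITE + "/index.htm"))
    # The rule names WebFerret, so other robots are unaffected.
    print(allowed(BLOCK_FERRET, "WebFerret", SITE + "/ferret.htm"))
    print(allowed(BLOCK_FERRET, "googlebot", SITE + "/ferret.htm"))
```

A check like this only tells you how one parser reads your file. Individual search engine robots may interpret edge cases differently, so it is a sanity check rather than a guarantee.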