Learn how to control the search engine spiders and prevent them from dismissing parts of your site as duplicate or irrelevant content. Duplicate content is one of the problems that we regularly come across as part of the search engine optimization services we offer. If the search engines determine that your site contains similar content, this may result in penalties and even exclusion from the search engines. Fortunately, it's a problem that is easily rectified.

Your primary weapon of choice against duplicate content can be found within "The Robot Exclusion Protocol", which has now been adopted by all the major search engines. There are two ways to control how the search engine spiders index your site:

1. The Robot Exclusion File or "robots.txt" and
2. The Robots Tag

The Robots Exclusion File (Robots.txt)

This is a simple text file that can be created in Notepad. Once created, you must upload the file into the root directory of your website. Before a search engine spider indexes your website, it looks for this file, which tells it exactly how to index your site's content. The use of the robots.txt file is most suited to static html sites or for excluding certain files in dynamic sites. If the majority of your site is dynamically created, then consider using the Robots Tag.

Creating your robots.txt file

Example 1 Scenario

If you wanted to make the .txt file applicable to all search engine spiders and make the entire site available for indexing, the robots.txt file would look like this:

User-agent: *
Disallow:

Explanation

The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. By leaving the "Disallow" blank, all parts of the site are suitable for indexing.

Example 2 Scenario

If you wanted to make the .txt file applicable to all search engine spiders and to stop the spiders from indexing the faq, cgi-bin and images directories and a specific page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation

The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. Preventing access to the directories is achieved by naming them, and the specific page is referenced directly. The named files and directories will now not be indexed by any search engine spiders.

Example 3 Scenario

If you wanted to make the .txt file applicable to the Google spider, googlebot, and stop it from indexing the faq, cgi-bin and images directories and a specific html page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation

By naming the particular search spider in the "User-agent", you prevent it from indexing the content you specify. Preventing access to the directories is achieved by simply naming them, and the specific page is referenced directly. The named files and directories will not be indexed by Google.

That's all there is to it! As mentioned earlier, the robots.txt file can be difficult to implement in the case of dynamic sites, and in this case it's probably necessary to use a combination of the robots.txt and the robots tag.
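Before uploading a robots.txt file, it can help to check that the rules behave the way the explanations above describe. The following is a minimal sketch, assuming Python's standard urllib.robotparser module as the checker. The helper name allowed, the "bingbot" user agent and the sample paths are hypothetical and used only for illustration. Real search engine spiders may interpret rules slightly differently from this parser, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The three robots.txt files from Examples 1 to 3, held as plain strings.
EXAMPLE_1 = """\
User-agent: *
Disallow:
"""

EXAMPLE_2 = """\
User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""

# Example 3 is the same rule set, but addressed to googlebot only.
EXAMPLE_3 = EXAMPLE_2.replace("User-agent: *", "User-agent: googlebot")


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules.

    'allowed' is a hypothetical helper for this sketch, not part of any
    search engine's tooling.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Sample paths are illustrative; substitute paths from your own site.
    for path in ("/index.html", "/faq/index.html", "/faqs.html"):
        for agent in ("googlebot", "bingbot"):
            print(path, agent, allowed(EXAMPLE_3, agent, path))
```

With the Example 3 rules, the sketch should report that googlebot is kept out of the named directories and page, while a spider that is not named in the "User-agent" line is unaffected.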