Learn how to control the search engine spiders and prevent them from dismissing parts of your site as duplicate or irrelevant content.

Duplicate content is one of the problems that we regularly come across as part of the search engine optimization services we offer. If the search engines determine your site contains similar content, this may result in penalties and even exclusion from the search engines. Fortunately, it's a problem that is easily rectified.

Your primary weapon of choice against duplicate content can be found within "The Robot Exclusion Protocol", which has now been adopted by all the major search engines. There are two ways to control how the search engine spiders index your site:

1. The Robot Exclusion File or "robots.txt", and
2. The Robots Tag

The Robots Exclusion File (Robots.txt)

This is a simple text file that can be created in Notepad. Once created, you must upload the file into the root directory of your website. Before a search engine spider indexes your website, it looks for this file, which tells it which parts of your site's content it may index. The robots.txt file is most suited to static html sites or to excluding certain files in dynamic sites. If the majority of your site is dynamically created, consider using the Robots Tag.

Creating your robots.txt file

Example 1 Scenario

If you wanted to make the robots.txt file applicable to all search engine spiders and make the entire site available for indexing, the robots.txt file would look like this:

User-agent: *
Disallow:

Explanation

The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. By leaving the "Disallow" blank, all parts of the site are available for indexing.

Example 2 Scenario

If you wanted to make the robots.txt file applicable to all search engine spiders and to stop the spiders from indexing the faq, cgi-bin and images directories and a specific page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation

The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. Preventing access to the directories is achieved by naming them, and the specific page is referenced directly. The named files and directories will now not be indexed by any search engine spider that follows the protocol.

Example 3 Scenario

If you wanted to make the robots.txt file applicable only to the Google spider, googlebot, and stop it from indexing the faq, cgi-bin and images directories and a specific html page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation

By naming the particular search spider in the "User-agent", you prevent it from indexing the content you specify. Preventing access to the directories is achieved by simply naming them, and the specific page is referenced directly. The named files and directories will not be indexed by Google.

That's all there is to it!

As mentioned earlier, the robots.txt file can be difficult to implement in the case of dynamic sites, and in this case it's probably necessary to use a combination of the robots.txt file and the Robots Tag.
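Before uploading a robots.txt file, it can help to check locally that the rules behave the way the explanations describe. The following is a minimal sketch, assuming Python and its standard-library urllib.robotparser module (neither is part of the original article). It loads the Example 3 rules and asks whether a given spider may fetch a given path. The names ROBOTS_TXT and is_allowed are hypothetical helpers introduced only for illustration, and real search engine spiders may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules from Example 3: only the spider named "googlebot" is restricted.
# Note there are no blank lines between User-agent and its Disallow lines.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules.

    Hypothetical helper for illustration; it wraps the standard-library
    parser and does not contact any website.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("googlebot", "/faq/index.html"),
        ("googlebot", "/faqs.html"),
        ("googlebot", "/index.html"),
        ("otherbot", "/faq/index.html"),  # "otherbot" is a made-up spider name
    ]:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, path) else "blocked"
        print(f"{agent:10s} {path:20s} {verdict}")
```

Swapping "User-agent: googlebot" for "User-agent: *" in ROBOTS_TXT gives the Example 2 behaviour, where the same paths are blocked for every spider name passed in.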