How to prevent duplicate content with effective use of the robots.txt and robots tag.

Learn how to control the search engine spiders and prevent them from dismissing parts of your site as duplicate or irrelevant content. Duplicate content is one of the problems that we regularly come across as part of the search engine optimization services we offer. If the search engines determine that your site contains similar content, this may result in penalties and even exclusion from the search engines. Fortunately, it's a problem that is easily rectified.

Your primary weapon of choice against duplicate content can be found within "The Robots Exclusion Protocol", which has now been adopted by all the major search engines. There are two ways to control how the search engine spiders index your site:

1. The Robots Exclusion File or "robots.txt" and
2. The Robots Tag

The Robots Exclusion File (Robots.txt)

This is a simple text file that can be created in Notepad. Once created, you must upload the file into the root directory of your website. Before a search engine spider indexes your website, it looks for this file, which tells it exactly how to index your site's content. The robots.txt file is best suited to static html sites or to excluding certain files in dynamic sites. If the majority of your site is dynamically created, then consider using the Robots Tag.

Creating your robots.txt file

Example 1

Scenario
If you wanted to make the robots.txt file applicable to all search engine spiders and make the entire site available for indexing, the robots.txt file would look like this:

User-agent: *
Disallow:

Explanation
The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. By leaving the "Disallow" blank, all parts of the site are made available for indexing.

Example 2

Scenario
If you wanted to make the robots.txt file applicable to all search engine spiders and to stop the spiders from indexing the faq, cgi-bin and images directories and a specific page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation
The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. Preventing access to the directories is achieved by naming them, and the specific page is referenced directly. The named files and directories will now not be indexed by any search engine spider that honours the file.

Example 3

Scenario
If you wanted to make the robots.txt file applicable only to the Google spider, googlebot, and stop it from indexing the faq, cgi-bin and images directories and a specific html page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation
By naming the particular search spider in the "User-agent" you prevent it from indexing the content you specify. Preventing access to the directories is achieved by simply naming them, and the specific page is referenced directly. The named files and directories will not be indexed by Google.

That's all there is to it!

As mentioned earlier, the robots.txt file can be difficult to implement in the case of dynamic sites, and in this case it's probably necessary to use a combination of the robots.txt file and the Robots Tag.
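
Before uploading a robots.txt file, it can help to check that its rules do what you expect. The following is a minimal sketch, not a definitive tool, that feeds the three example files from this article to the parser in Python's standard library (urllib.robotparser) and asks whether a given spider may fetch a given path. The helper name is_allowed, the spider names "anybot" and "otherbot", and the test paths such as /faq/index.html are illustrative assumptions and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# The three robots.txt files from the examples in this article.
ALLOW_ALL = """User-agent: *
Disallow:
"""

BLOCK_ALL_SPIDERS = """User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""

BLOCK_GOOGLEBOT = """User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch path.

    is_allowed is a hypothetical helper written for this sketch.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "anybot" and "otherbot" are made-up spider names for illustration.
    checks = [
        ("Example 1", ALLOW_ALL, "anybot", "/faq/index.html"),
        ("Example 2", BLOCK_ALL_SPIDERS, "anybot", "/faq/index.html"),
        ("Example 2", BLOCK_ALL_SPIDERS, "anybot", "/index.html"),
        ("Example 3", BLOCK_GOOGLEBOT, "googlebot", "/faqs.html"),
        ("Example 3", BLOCK_GOOGLEBOT, "otherbot", "/faqs.html"),
    ]
    for label, rules, agent, path in checks:
        verdict = "allowed" if is_allowed(rules, agent, path) else "blocked"
        print(f"{label}: {agent} -> {path}: {verdict}")
```

Note that this only tells you how Python's parser reads the file. Individual search engine spiders may interpret edge cases differently, so treat the result as a sanity check and not as a guarantee.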