How robots work


How to prevent duplicate content with effective use of the robots.txt and robots tag.

Learn how to control the search engine spiders and prevent them from dismissing parts of your site as duplicate or irrelevant content.

Duplicate content is one of the problems that we regularly come across as part of the search engine optimization services we offer. If the search engines determine your site contains similar content, this may result in penalties and even exclusion from the search engines. Fortunately it's a problem that is easily rectified.

Your primary weapon of choice against duplicate content can be found within "The Robot Exclusion Protocol", which has now been adopted by all the major search engines.

There are two ways to control how the search engine spiders index your site:

1. The Robot Exclusion File or "robots.txt"

and

2. The Robots Tag

The Robots Exclusion File (Robots.txt)

This is a simple text file that can be created in Notepad. Once created you must upload the file into the root directory of your website. Before a search engine spider indexes your website it looks for this file, which tells it exactly how to index your site's content. The use of the robots.txt file is most suited to static html sites or for excluding certain files in dynamic sites. If the majority of your site is dynamically created then consider using the Robots Tag.

Creating your robots.txt file

Example 1 Scenario

If you wanted to make the robots.txt file applicable to all search engine spiders and make the entire site available for indexing, the robots.txt file would look like this:

User-agent: *
Disallow:

Explanation

The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. By leaving the "Disallow" blank all parts of the site are suitable for indexing.

Example 2 Scenario

If you wanted to make the robots.txt file applicable to all search engine spiders and to stop the spiders from indexing the faq, cgi-bin and images directories and a specific page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation

The use of the asterisk with the "User-agent" means this robots.txt file applies to all search engine spiders. Preventing access to the directories is achieved by naming them, and the specific page is referenced directly. The named files & directories will now not be indexed by any search engine spiders.

Example 3 Scenario

If you wanted to make the robots.txt file applicable to the Google spider, googlebot, and stop it from indexing the faq, cgi-bin and images directories and a specific html page called faqs.html contained within the root directory, the robots.txt file would look like this:

User-agent: googlebot
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html

Explanation

By naming the particular search spider in the "User-agent" you prevent it from indexing the content you specify. Preventing access to the directories is achieved by simply naming them, and the specific page is referenced directly. The named files & directories will not be indexed by Google.

That's all there is to it!

As mentioned earlier, the robots.txt file can be difficult to implement in the case of dynamic sites, and in this case it's probably necessary to use a combination of the robots.txt and the robots tag.
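
Before uploading a robots.txt file it can help to check that the rules behave as intended. The following is a minimal sketch, not part of the original article, that feeds the three example files to the robots.txt parser in Python's standard library (urllib.robotparser). The host www.example.com, the spider name "otherbot", the sample page paths and the helper is_allowed are all hypothetical and only serve the illustration. Real search engine spiders may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Example 1: every spider may access the whole site.
ALLOW_ALL_RULES = """\
User-agent: *
Disallow:
"""

# Example 2: every spider is kept out of three directories and one page.
ALL_SPIDERS_RULES = """\
User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /faqs.html
"""

# Example 3: the same exclusions, but addressed to googlebot only.
GOOGLEBOT_RULES = ALL_SPIDERS_RULES.replace(
    "User-agent: *", "User-agent: googlebot"
)

# Hypothetical site used only to build test URLs.
SITE = "https://www.example.com"


def is_allowed(rules: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, SITE + path)


if __name__ == "__main__":
    checks = [
        ("Example 1", ALLOW_ALL_RULES, "googlebot", "/faq/index.html"),
        ("Example 2", ALL_SPIDERS_RULES, "otherbot", "/faqs.html"),
        ("Example 2", ALL_SPIDERS_RULES, "otherbot", "/index.html"),
        ("Example 3", GOOGLEBOT_RULES, "googlebot", "/images/logo.gif"),
        ("Example 3", GOOGLEBOT_RULES, "otherbot", "/images/logo.gif"),
    ]
    for label, rules, agent, path in checks:
        verdict = "allowed" if is_allowed(rules, agent, path) else "blocked"
        print(f"{label}: {agent} {path} -> {verdict}")
```

Note that the Example 3 rules only bind googlebot: a spider with any other name finds no matching "User-agent" entry, so this parser treats the whole site as open to it.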


