A Beginner's Guide to Robots.txt

Search engines use robots to crawl, or spider, pages on the web. These robots or crawlers are simply special programs written to read web page information, including text, links, graphics, headings, etc. These crawlers follow a special specification file known as the robots.txt file. For example, when a search robot visits a site, it first looks for the robots.txt file. If found, the robot follows the instructions in that file about how to index the site: which pages to read and which not to read. The robots.txt file thus tells the search robot which parts of a website to index and which not to index. The robots specification was developed in 1994, came to be known as 'The Robots Exclusion Standard', and still remains the standard for directing robots, with almost all search engines following it. You can learn how to define and place a robots file further in this article.

Basically, robots.txt, as the file extension implies, is just a simple text file without any scripting or programming code in it. It can be created with a simple text editor such as Notepad and consists of simple text directives. Complex word processors should never be used, because their formatting can create problems and lead to removal of the site. Almost every website has certain privileged pages containing sensitive and confidential information that is not intended for general users; those pages can be disallowed for reading by search engines with a robots file. The robots.txt file can be customized to allow only specific search robots to spider the site, and to disallow reading of specific directories or files. Let us create a simple robots.txt file here. Open a simple text editor such as Notepad, write the following lines, and save the file as robots.txt:

    # this is a typical example of a robots file
    # comments are placed after the hash.
    User-agent: *
    Disallow: /cgi-bin/

This is a typical example of a robots.txt file. The User-agent directive specifies the name of the robot or spider that is visiting the website; for example, "User-agent: googlebot" specifies Google's robot, and the instructions that follow apply to that robot. A "User-agent: *" value means all robots on the web. Next comes the "Disallow" directive. The Disallow line specifies the file name or folder name that the specified robot is not allowed to read. The Disallow field can also be left blank, which specifies that all pages may be spidered. One point needs care in the Disallow field: each file to be disallowed should be declared on a new line. In other words, multiple files should not be written against a single Disallow directive. For example, to disallow multiple files we define robots.txt as:

    User-agent: Googlebot
    Disallow: /information.html
    Disallow: /private.html
    Disallow: /shipping.html

    User-agent: Architext
    Disallow: /

In this example Googlebot is disallowed from crawling three pages, and Architext, the spider of Excite, is disallowed from all pages of the site. Similarly, any spider can be instructed if you know its name; otherwise use ' * '. Note that each path is written starting from the root folder ( / ), so if the file to be protected resides in a folder other than the root folder, the complete path of the file must be specified.

Now the question arises: where should robots.txt be placed on a website? The answer is the root directory ( / ), where the index file is placed. Remember that there should always be just one robots.txt file on a website. Website addresses (URLs) are case-sensitive, and the "robots.txt" string must be all in lower-case and exactly the same in name. Blank lines are not permitted within a single record in the "robots.txt" file, and there must be exactly one "User-agent" field per record. If the robots file is placed in the wrong folder, it loses its functionality and spiders ignore it, making it useless.

Advantages of having a Robots.txt

It helps to keep sensitive and confidential information out of search engines by disallowing spiders from indexing it.
It helps in search engine specific optimization of a website (making web pages for particular search engines).

This file should be written very carefully, according to the format specified, before it is uploaded to a website, because a simple mistake can result in removal of a complete website from search engine indexes. Don't indulge in making too many copies of web pages to be optimized for every search engine present; instead, be reasonable with the number and target the major five or seven engines. So now you know what a robots.txt file is, how to define it, how to use it, and where to place it.

Enjoy!
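
To check how a robot would read a robots.txt file before uploading it, the rules can be tested locally. The following is only a minimal sketch, assuming Python and its standard urllib.robotparser module; the rules are adapted from the multi-file example in this article, with each path written starting from the root ( / ) and a blank line between the two records. The name "OtherBot" and the page /index.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules adapted from the article's multi-file example.
# Paths start at the root ( / ); a blank line separates the two records.
RULES = """\
User-agent: Googlebot
Disallow: /information.html
Disallow: /private.html
Disallow: /shipping.html

User-agent: Architext
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# Googlebot is blocked from the three listed pages only.
print(parser.can_fetch("Googlebot", "/private.html"))  # False
print(parser.can_fetch("Googlebot", "/index.html"))    # True

# Architext is blocked from the whole site.
print(parser.can_fetch("Architext", "/index.html"))    # False

# "OtherBot" (hypothetical) matches no record and there is no '*' record,
# so this parser treats it as allowed.
print(parser.can_fetch("OtherBot", "/index.html"))     # True
```

Real crawlers may interpret edge cases differently from this parser, so treat the result as a quick sanity check rather than a guarantee.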