Search engines use robots to crawl, or spider, pages on the web. These robots or crawlers are simply special programs written to read web page information, including text, links, graphics, headings and so on. They tend to follow a special specification file known as the robots.txt file. For example, when a search robot visits a site, it first looks for the robots.txt file. If the file is found, the robot follows its instructions about how to index the site: which pages to read and which not to read. In other words, robots.txt tells the search robot which parts of a website to index and which to leave out. The robots specification was developed in 1994, came to be known as "The Robots Exclusion Standard", and still remains the standard for directing robots, with almost all search engines following it. You can learn how to define and place a robots file later in this article.

As its file extension implies, robots.txt is just a simple text file without any scripting or programming code in it. It can be created using a simple text editor such as Notepad and consists of simple text directives. Complex word processors should never be used, because the formatting they add can make the file unreadable to robots and lead to removal of the site from the index. Almost every website has certain privileged pages containing sensitive and confidential information that is not intended for general users; those pages can be disallowed for reading by search engines with a robots file. The robots.txt file can be customized to allow only specific search robots to spider the site, and to disallow reading of specific directories or files. Let us create a simple robots.txt file here. Open a simple text editor such as Notepad, write the following lines and save the file as robots.txt:

    # this is a typical example of a robots file
    # comments are placed after the hash
    User-agent: *
    Disallow: /cgi-bin/

In this typical robots.txt file, the User-agent directive specifies the name of the robot or spider that is visiting the website. For example, "User-agent: googlebot" specifies Google's robot, and the instructions that follow apply to that robot. A "User-agent: *" value means all robots on the web. Next comes the Disallow directive, which specifies the file name or folder name that the named robot is not allowed to read. The Disallow field can also be left blank, which specifies that all pages may be spidered. One point needs care in the Disallow field: each file to be disallowed must be declared on a new line. In other words, multiple files should not be written against a single Disallow directive. For example, to disallow multiple files we define robots.txt as:

    User-agent: Googlebot
    Disallow: /information.html
    Disallow: /private.html
    Disallow: /shipping.html

    User-agent: Architext
    Disallow: /

In this example Googlebot is disallowed from crawling three pages, and Architext, the spider of Excite, is disallowed from all pages of the site. All other spiders can be instructed in the same way if you know their names; otherwise use "*". Paths are written starting from the root folder ( / ), so if the file to be protected resides in a folder other than the root, specify its complete path. Now the question arises of where robots.txt should be placed on a website. The answer is the root directory ( / ), where the index file is placed. Remember that there should always be just one robots.txt file on a website. Website addresses (URLs) are case-sensitive, and the "robots.txt" file name must be all in lower case and spelled exactly that way. Blank lines are not permitted within a single record in the robots.txt file, and there must be at least one "User-agent" field per record. If the robots file is placed in the wrong folder, it loses its functionality: spiders ignore it, making it useless.

Advantages of having a Robots.txt

It helps keep sensitive and confidential information out of search results by asking spiders not to index it.
It helps in search-engine-specific optimization of a website (making web pages for particular search engines).

This file should be written very carefully according to the specified format before it is uploaded to a website, because a simple mistake can result in a complete website being removed from search engine indexes. Don't make too many copies of web pages optimized for every search engine in existence; instead be reasonable with the number and target the five to seven major engines. So now you know what a robots.txt file is, how to define it, how to use it and where to place it. Enjoy!
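Because a simple mistake in robots.txt can be costly, it may help to test the rules before uploading the file. The following is a minimal sketch, assuming Python and its standard-library urllib.robotparser module; the helper name can_crawl and the site address www.example.com are hypothetical and used only for illustration. The Disallow paths are written with a leading slash, and the two records are separated by a blank line.

```python
from urllib.robotparser import RobotFileParser

# The multi-record example from the article. Paths start with "/" and
# a blank line separates the Googlebot record from the Architext record.
EXAMPLE_RULES = """\
User-agent: Googlebot
Disallow: /information.html
Disallow: /private.html
Disallow: /shipping.html

User-agent: Architext
Disallow: /
"""

# Hypothetical site address, used only to build full URLs for the check.
SITE = "http://www.example.com"


def can_crawl(robots_text, agent, path):
    """Return True if the robot named `agent` may read `path` under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/private.html"),   # listed under Googlebot
        ("Googlebot", "/index.html"),     # not listed under Googlebot
        ("Architext", "/index.html"),     # Architext is disallowed everywhere
    ]:
        verdict = "allowed" if can_crawl(EXAMPLE_RULES, agent, path) else "disallowed"
        print(agent, path, verdict)
```

The same helper can be pointed at any other rule set, for instance one with a blank Disallow field, to confirm that it behaves as intended for each robot name you care about.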