
Robots.txt and the Robots Meta tag

There are two methods of controlling search engine spiders. One is the robots.txt file, which controls where robots can go on a website. The other is the robots meta tag, which controls how robots treat an individual page.

NOTE : Robots and spiders are in essence the same thing. Both are automated programs that need no human assistance to complete a task (which is where "robots" came from), and the ones used by search engines were designed to "crawl" the "web" (which is where "spiders" came from). The two terms are used interchangeably on this page.

The robots.txt file is a plain text file that you store in the root directory of your website domain. For example, if Yahoo were to have a robots.txt file on their site, it would be located at:

http://www.yahoo.com/robots.txt
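Because the file always sits at the root of the domain, its address can be worked out from the address of any page on the site. The following is a minimal Python sketch of that idea; the function name robots_txt_url and the sample page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only as an example.
print(robots_txt_url("http://www.yahoo.com/some/deep/page.html?x=1"))
# prints http://www.yahoo.com/robots.txt
```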

Robots Meta Tag

Another method of stopping robots from going to certain places on your website is to use the robots meta tag which is usually in the format:

<meta name="ROBOTS" content="INDEX, NOFOLLOW">

This tag changes the way robots index your page and follow your links. In the above tag, the content section is set to INDEX, NOFOLLOW. This means that the spider can "INDEX" the page it is on, but it will "NOFOLLOW" the links on that page. By default, search engine spiders behave as if "INDEX, FOLLOW" were set, even if there is no tag there at all. So usually, a search engine spider will try to index your page and follow the links on it. If you don't want the spider to index anything on your page, use the following tag:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

In theory, this will stop robots from indexing your page (some search engine spiders have been known to use browser user agents to index pages), and also stop them from following any links on your page.
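To make the meaning of the content values concrete, here is a rough Python sketch of how a well-behaved spider might read the robots meta tag. It is only an illustration built on the standard html.parser module, not how any particular search engine works; the names RobotsMetaParser and robots_directives are invented for this example, and it only understands the four values discussed above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects INDEX/NOINDEX and FOLLOW/NOFOLLOW from a robots meta tag."""

    def __init__(self):
        super().__init__()
        # Defaults match the behaviour described above: INDEX, FOLLOW.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The tag name is not case sensitive: ROBOTS and robots both count.
        if (attr.get("name") or "").lower() != "robots":
            return
        for directive in (attr.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive == "noindex":
                self.index = False
            elif directive == "nofollow":
                self.follow = False


def robots_directives(html):
    """Return (may_index, may_follow) for a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<head><meta name="ROBOTS" content="INDEX, NOFOLLOW"></head>'
print(robots_directives(page))  # prints (True, False)
```

A page with no robots meta tag at all comes back as (True, True), which is the default described above.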

These methods are usually used if you don't want people on the internet finding things on your site, often for security purposes. For example, you wouldn't want the general public to be able to find the directory holding all your customer information through a search engine. This could be bad for your company, and maybe worse for your customers.

NOTE : You shouldn't rely on just the robots.txt file to stop a spider from entering a database or other source of important information. Robots follow the file voluntarily, and anyone can read it, so it is a good idea to password protect any confidential information that can be accessed from your website.

Meta revisit tag

There is also another tag to do with robots and page crawling, called the meta revisit-after tag. It is meant to tell the search engines when the spider should come back to your site (aimed at webmasters who constantly change the content of their pages). The meta revisit-after tag is usually in the form:

<meta name="revisit-after" content="X days">

Put a number in place of the "X" in the above tag to set the number of days that should pass before the search engine spider recrawls your site.

This tag sits between your <head> tags and is meant to help the search engines determine when to "revisit" your site. Again, this is more for the webmaster who is constantly updating their pages. The search engine spiders will generally revisit your page regardless of this tag if your site is already in their database, i.e. if your site is listed in Google it will be recrawled or "revisited" roughly every 60 days (2 months).

The Robots.txt File

You usually won't see the robots.txt file on websites, as the only visitors really interested in seeing it are the search engine spiders.

The robots.txt file is a bit more complex than the robots meta tag. The robots.txt file needs to be in a specific format for it to work efficiently and correctly. This format is as follows:

# go away
User-agent: *
Disallow: /

If the section above is in a robots.txt file, it will not allow robots to go any further than the domain name. In other words, if this were at http://www.yahoo.com/, the robots could not index that page or go any further within the site (they cannot follow links). The first line:

# go away

is a comment. Anything after a # is a comment in the robots.txt file. Comments are used to help identify what parts of the file do. So in the following example:

# robots.txt for an example page

User-agent: *
Disallow: /cyberworld/map/ # This is a map of our office space
Disallow: /tmp/ # These are temporary customer entries for our database
Disallow: /members/foo.html # This is part of our members section

The parts after each # above are the comments of the robots.txt file. These parts are not looked at by the robots. They are only used as labels to help people understand parts of the robots.txt file. Now let's look at the parts within the robots.txt file that the spiders actually look at:
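As a small illustration of how comments are skipped, the Python sketch below throws away everything from a # to the end of each line and keeps only what a robot would read. The function name strip_comments is made up for this example, and real robots may handle unusual cases differently.

```python
def strip_comments(robots_txt):
    """Return the lines of a robots.txt file with comments removed."""
    rules = []
    for line in robots_txt.splitlines():
        # Everything from the first # to the end of the line is a comment.
        line = line.split("#", 1)[0].strip()
        if line:  # skip lines that were blank or comment-only
            rules.append(line)
    return rules


example = """\
# robots.txt for an example page

User-agent: *
Disallow: /cyberworld/map/ # This is a map of our office space
Disallow: /tmp/ # These are temporary customer entries for our database
Disallow: /members/foo.html # This is part of our members section
"""

for rule in strip_comments(example):
    print(rule)
```

Run against the example file above, this prints only the User-agent line and the three Disallow lines.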

User-agent: * <--- This part specifies the user agent

User-agent - this is the string a client sends to identify what software it is using to get to the site. With Internet Explorer browsers the agent string looks like this:

Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)

You can be as specific in this section as you want. You can name a specific spider's agent here to stop just that one spider, from one specific engine, from seeing a page. So, for example, you can stop the Google spider from going to a page optimized for AltaVista. Using a star or asterisk (*) instead applies the rule to all user agents, so no spider can follow through to that section. (This is strictly for "spiders" only. Browsers are usually human operated, so the links can simply be clicked on, whereas search engine spiders have to abide by a set of rules and standards.)
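The difference between naming one spider and using the asterisk can be tried out with Python's standard urllib.robotparser module. The sketch below is only an illustration: the rules, the www.example.com addresses and the page name altavista-page.html are all invented for this example. Googlebot is the name Google's spider uses, and Scooter stands in here for any other spider.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Google's spider away from one page,
# and let every other spider go anywhere (an empty Disallow allows all).
rules = """\
User-agent: Googlebot
Disallow: /altavista-page.html

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "http://www.example.com/altavista-page.html"
print(parser.can_fetch("Googlebot", page))  # prints False
print(parser.can_fetch("Scooter", page))    # prints True
# Googlebot is only blocked from the one page that was listed.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))  # prints True
```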

Disallow: /cyberworld/map/ <--- This part specifies the subdirectory

The above section will stop the spider from going to the /cyberworld/map/ section of the website. So if your website was http://www.yahoo.com/, then the above section of the robots.txt file will stop robots from visiting http://www.yahoo.com/cyberworld/map/ on the website. Again, you can be as specific or as general as you want to be here. You can make it so the robots don't go past your domain name, which would then look like:

Disallow: /

This would then stop spiders from indexing anything at your domain, including the main page. So, for example, spiders will come to http://www.yahoo.com and stop there. They have been told by the robots.txt file not to go any further.
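One way to see the effect of Disallow lines before putting a file on your site is to feed the rules to Python's standard urllib.robotparser module and ask it about a few addresses. The sketch below does this for the two example files from this page. The helper name allowed, the spider name ExampleBot and the page paths being checked are made up for illustration, and a real spider may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Return True if the given rules let this agent fetch this url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


EXAMPLE = """\
# robots.txt for an example page
User-agent: *
Disallow: /cyberworld/map/ # This is a map of our office space
Disallow: /tmp/ # These are temporary customer entries for our database
Disallow: /members/foo.html # This is part of our members section
"""

GO_AWAY = """\
# go away
User-agent: *
Disallow: /
"""

site = "http://www.yahoo.com"
# Anything under a disallowed directory is off limits.
print(allowed(EXAMPLE, "ExampleBot", site + "/cyberworld/map/index.html"))  # prints False
# Pages that are not listed can still be visited.
print(allowed(EXAMPLE, "ExampleBot", site + "/index.html"))                 # prints True
# Disallow: / shuts out the whole site, main page included.
print(allowed(GO_AWAY, "ExampleBot", site + "/"))                           # prints False
```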

If you need to check the syntax of your robots.txt file, check out this link:

robots.txt syntax checker

That's about all I have to say about this topic right now, so if you have any comments and/or suggestions, feel free to get in touch.

NOTE : References to Yahoo are fictitious. These addresses are merely used as an example. Any actual relevance to this information is purely circumstantial.

© Dion Foster 2002