robots.txt in the real world

August 22nd, 2003 by Hen

In the course of my playing with robots.txt, I spent a short while wgetting robots.txt files from popular sites.

The most amusing one is IBM’s, with the historic comment of:

# $l1- 19950130 epc finally understood what the file was for!

Sun’s was also amusing, with a rebuttal to an article of Bertrand Meyer’s.

Whitehouse.gov’s file effectively doubles as a nice sitemap. And to show that it’s not just big companies, my workplace’s robots.txt is going to be a good test, as it has a comment on a rule line, something very few people seem to do.

Some people seem quite hopeful with their entries. For example Sun have:

Disallow: /*_print.html$

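which I believe is not legal: wildcards are not a part of the robots.txt spec, except for * as a User-agent value. A Disallow value is just a path prefix, so a robot following the spec would treat the * and $ here as literal characters.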
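To make the difference concrete, here is a minimal sketch comparing the two readings of that rule. The function names are hypothetical, and the wildcard reading (with * matching any run of characters and a trailing $ anchoring the end of the path) is an assumption about how a lenient crawler might interpret the line, not something the original spec defines.

```python
import re


def prefix_match(rule_path: str, url_path: str) -> bool:
    """Original-spec reading: the Disallow value is a literal path prefix."""
    return url_path.startswith(rule_path)


def wildcard_match(rule_path: str, url_path: str) -> bool:
    """Lenient reading (an assumption, not the original spec):
    '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # Escape everything except '*', which becomes '.*'
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        pattern += r"\Z"
    return re.match(pattern, url_path) is not None


rule = "/*_print.html$"
path = "/products/index_print.html"  # hypothetical URL path

print("prefix reading blocks it:  ", prefix_match(rule, path))
print("wildcard reading blocks it:", wildcard_match(rule, path))
```

Under the prefix reading the rule blocks almost nothing, since few real paths begin with the literal characters /*_print.html$.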

Regardless of other uses, a simple program to print out the rules a robots.txt file actually states would be useful for a lot of companies. One must exist out there…
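As a rough idea of what such a program might look like, here is a small sketch that groups the rules by User-agent and prints them in plain English. The names parse_robots and describe and the sample file are made up for illustration. It assumes records start at a User-agent line, and it strips comments anywhere on a line, including after a rule.

```python
def parse_robots(text):
    """Return a list of (user_agents, rules) groups from robots.txt text."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        # Drop comments, including ones that trail a rule on the same line
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent after rules starts a new record
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        else:
            rules.append((field, value))
    if agents or rules:
        groups.append((agents, rules))
    return groups


def describe(groups):
    """Turn parsed groups into one human-readable line per rule."""
    lines = []
    for agents, rules in groups:
        who = ", ".join(agents) or "(no User-agent given)"
        for field, value in rules:
            if field == "disallow" and value:
                what = f"may not fetch paths starting with {value!r}"
            elif field == "disallow":
                what = "may fetch everything"  # empty Disallow allows all
            else:
                what = f"{field}: {value}"  # unknown field, shown as-is
            lines.append(f"{who}: {what}")
    return lines


# Hypothetical sample file, not taken from any real site
SAMPLE = """\
# example robots.txt
User-agent: *
Disallow: /cgi-bin/   # comment on a rule line
Disallow: /tmp/

User-agent: ExampleBot
Disallow:
"""

for line in describe(parse_robots(SAMPLE)):
    print(line)
```

Fields the sketch does not recognise are echoed back unchanged rather than guessed at, which also makes hopeful non-standard entries easy to spot.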

One Response to “robots.txt in the real world”

  1. joe peer Says:

    he he he, Sun’s robots.txt file is really cute. I found this part particularly funny:

    # The purpose of the “robots.txt” file is to keep these directories
    # from being indexed so that the average user doesn’t stumble across them
    # while performing searches, and those that should be accessing these
    # directories will do so through the URL that requires them to register.
    # Of course, having the contents of this file advertised in “comp.risks”
    # diminishes its purpose. Thanks Bertrand. ;-)

    :-))
    joe