robots.txt in the real world
August 22nd, 2003 by Hen
In the course of my playing with robots.txt, I spent a short while wgetting robots.txt files from popular sites.
The most amusing one is IBM’s, with the historic comment of:
# $l1- 19950130 epc finally understood what the file was for!
Sun’s was also amusing, with a rebuttal to an article of Bertrand Meyer’s.
Whitehouse.gov effectively gives a nice sitemap with theirs. And to show that it’s not just big companies: my workplace’s robots.txt is going to be a good test, as it has a comment on a rule line, something very few people seem to do.
Some people seem quite hopeful with their entries. For example Sun have:
Disallow: /*_print.html$
which I believe is not legal [wildcards are not a part of the robots.txt spec, except in a User-agent line].
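To make the point concrete, here is a minimal sketch of how a robot following the original robots.txt convention would treat that line, assuming the Disallow value is compared as a literal path prefix. The function name is_disallowed and the sample path are made up for illustration; they come from neither Sun nor any particular crawler.

```python
def is_disallowed(path, rule):
    """Prefix matching: the rule is taken literally, with no wildcard handling."""
    # An empty Disallow value means nothing is disallowed.
    return bool(rule) and path.startswith(rule)


rule = "/*_print.html$"

# The page Sun presumably wanted to hide is not matched at all...
print(is_disallowed("/products/page_print.html", rule))  # False

# ...only a URL that literally starts with "/*_print.html$" would be.
print(is_disallowed("/*_print.html$", rule))  # True
```

A robot that chooses to understand * and $ as an extension would behave differently, which is why the entry reads as hopeful rather than simply wrong.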
Regardless of other uses, a simple program that prints out the rules a robots.txt file states would be useful for a lot of companies. One must exist out there…
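As a rough idea of what such a program might look like, here is a small sketch, not an existing tool. It assumes that a # starts a comment anywhere on a line (so a comment on a rule line is dropped), that field names are case-insensitive, and that a User-agent line following some rules starts a new group. The names parse_robots and format_rules and the sample text are invented for the example.

```python
def parse_robots(text):
    """Return a list of (user_agents, rules) groups from robots.txt text."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        # Drop comments, including ones trailing a rule line.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow"):
            rules.append((field, value))
    if agents or rules:
        groups.append((agents, rules))
    return groups


def format_rules(groups):
    """Render the parsed groups as an indented, comment-free summary."""
    lines = []
    for agents, rules in groups:
        lines.append("User-agent: " + ", ".join(agents))
        for field, value in rules:
            lines.append("  %s: %s" % (field, value or "(nothing)"))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "# sample file, made up for this example\n"
        "User-agent: *\n"
        "Disallow: /tmp/ # scratch space\n"
        "\n"
        "User-agent: wget\n"
        "Disallow: /\n"
    )
    print(format_rules(parse_robots(sample)))
```

Run against a file fetched with wget, this prints only the rules that are actually in force, which makes oddities such as wildcard entries easy to spot.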

August 22nd, 2003 at 5:54 am
he he he, Sun’s robots.txt file is really cute. I found this part particularly funny:
# The purpose of the “robots.txt” file is to keep these directories
# from being indexed so that the average user doesn’t stumble across them
# while performing searches, and those that should be accessing these
# directories will do so through the URL that requires them to register.
# Of course, having the contents of this file advertised in “comp.risks”
# diminishes its purpose. Thanks Bertrand.
:-))
joe