robots.txt
August 22nd, 2003 by Hen
While the wife is away, the husband will….
…implement a robots.txt handler in Java. Apparently.
It’s pretty fun to do. The RFC is simple enough that any moron [meaning me] can grok it, with a clear set of examples, yet it has enough subtle surprises that it’s easy to screw up. It’s also a perfect task for a class, I think.
At first I assumed that just having:
User-agent: *
Disallow:
meant disallow all user agents [as has my wife’s company, I notice]. Instead it means allow all user agents: an empty Disallow value excludes nothing.
I also realised I’ve had one of my robots.txt files wrong all this time, not having leading slashes on the entries.
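Both gotchas above are easy to see in a few lines of Java. This is only a minimal sketch, assuming plain prefix matching against the Disallow values of a single record; it ignores User-agent grouping and URL encoding. The class and method names are made up for illustration and are not the handler described in this post.

```java
import java.util.Arrays;
import java.util.List;

class DisallowSketch {
    /** Returns true if the path is permitted under the given Disallow values. */
    static boolean isAllowed(List<String> disallowValues, String path) {
        for (String value : disallowValues) {
            String rule = value.trim();
            if (rule.isEmpty()) {
                continue; // an empty Disallow excludes nothing
            }
            if (path.startsWith(rule)) {
                return false; // the path begins with a disallowed prefix
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String path = "/private/a.html";
        System.out.println(isAllowed(Arrays.asList(""), path));          // true: "Disallow:" allows everything
        System.out.println(isAllowed(Arrays.asList("/"), path));         // false: "Disallow: /" blocks everything
        System.out.println(isAllowed(Arrays.asList("private/"), path));  // true: no leading slash, so it never matches
        System.out.println(isAllowed(Arrays.asList("/private/"), path)); // false
    }
}
```

Request paths always start with a slash, so an entry written without one can never be a prefix of anything a robot asks for, and the rule silently does nothing.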
What’s coolest about implementing a robots.txt handler is that it’s perfectly suited to test-first coding. I admit that my initial rollout of code was ‘unit tested’ by a main method. However, once that single method was working, I upgraded to a unit test and added in every entry in the RFC. Lots of failures, lots of digging, and things slowly got developed/fixed.
I still think there are two potential problems with unit tests:
1) They force you to open up your API sometimes. If an API only allows ‘http://…’ URLs, and you want to test locally without a web server, you have to have some way of getting a file:/// URL under the radar. Assuming you put tests in the same package as the code, it’s not a major security problem, but I do still feel hesitant about the fact that unit tests are forcing me to create units, and I’m not sure if a unit is always the best choice. I don’t like the test forcing me to be less defensive.
2) You can code without thinking. By creating a large number of unit tests, then slowly fixing one at a time, I would be able to evolve the code without understanding it. I worry that some developers will try to solve bugs by random modification until the unit test passes. The problem there is that there is a 50/50 chance that the unit test itself is wrong. In fact, I had this occur once when the data for my unit test turned out to be illegal.
Still, a fun evening of hackery. Anyone want a robots.txt handler? Was written for my site-scraping engine, Scabies.

August 21st, 2003 at 11:52 pm
Since starting down the path of unit testing, your #1 has been the biggest non-obvious win for me. I’ve grown to like having the fetching code separated from the parsing, with an InputStream or Reader as the parameter passed to the parser. That way it’s easy to test with a StringReader hard coded into the unit tests or with a FileReader pointed at a test file.
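A rough sketch of that separation, assuming a parser that only collects Disallow values and ignores User-agent grouping. The class and method names here are hypothetical, not taken from any real handler:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class RobotsParserSketch {
    /** Collects every Disallow value from the source, in order of appearance. */
    static List<String> parseDisallows(Reader source) throws IOException {
        List<String> values = new ArrayList<>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            if (field.equalsIgnoreCase("Disallow")) {
                values.add(line.substring(colon + 1).trim());
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // No web server and no file needed: the test data lives in the test itself.
        String text = "User-agent: *\nDisallow: /cgi-bin/ # scripts\nDisallow: /tmp/\n";
        System.out.println(parseDisallows(new StringReader(text))); // [/cgi-bin/, /tmp/]
    }
}
```

The fetching code can hand the same method an InputStreamReader wrapped around an HTTP response, so the parser never needs to know about URL schemes at all.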
August 22nd, 2003 at 10:25 pm
Ended up responding to this comment as a full entry as it was getting big:
http://weblogs.flamefew.net/bayard/archives/000728.html#000728