While the wife is away, the husband will….
…implement a robots.txt handler in Java. Apparently.
It’s pretty fun to do. The RFC is simple enough that any moron [meaning me] can grok it, with a clear set of examples and enough subtle surprises that it’s easy to screw up. It’s also a perfect task for a class, I think.
At first I assumed that just having:
User-agent: *
Disallow:
meant disallow all user agents [as has my wife’s company, I notice]. Instead it means allow all user agents: an empty Disallow value blocks nothing.
I also realised I’ve had one of my robots.txt files wrong all this time, not having leading slashes on the entries.
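Both surprises are easy to see in a few lines of Java. This is only a minimal sketch, not the handler I actually wrote: the class name RobotsRules and its methods are made up for illustration, and it assumes a single "User-agent: *" record with plain prefix matching, ignoring per-agent grouping and everything else in the spec.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, deliberately tiny matcher for the Disallow lines of one record. */
final class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects the non-empty Disallow values; an empty value blocks nothing. */
    RobotsRules(List<String> lines) {
        for (String line : lines) {
            int hash = line.indexOf('#');            // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            // Field names are matched case-insensitively ("Disallow:" is 9 chars).
            if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) {
                    disallowed.add(path);
                }
            }
        }
    }

    /** A path is allowed unless it starts with one of the Disallow values. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsRules allowAll = new RobotsRules(Arrays.asList("User-agent: *", "Disallow:"));
        System.out.println(allowAll.isAllowed("/anything"));      // true

        RobotsRules withSlash = new RobotsRules(Arrays.asList("User-agent: *", "Disallow: /private/"));
        System.out.println(withSlash.isAllowed("/private/x"));    // false

        // Missing leading slash: URL paths start with "/", so this rule never matches.
        RobotsRules noSlash = new RobotsRules(Arrays.asList("User-agent: *", "Disallow: private/"));
        System.out.println(noSlash.isAllowed("/private/x"));      // true
    }
}
```

The last case is the mistake in my own file: without the leading slash the entry quietly matches nothing, so the path I meant to block stays open.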
What’s coolest about implementing a robots.txt handler is that it’s perfectly suited to test-first coding. I admit that my initial rollout of code was being ‘unit tested’ by a main method. However, once that single method was working, I upgraded to a unit test and added in every entry in the RFC. Lots of failures, lots of digging, and things slowly got developed/fixed.
I still think there are two potential problems with unit tests:
1) They force you to open up your API sometimes. If an API only allows ‘http://…’ urls, and you want to test locally without a web server, you have to have some way of getting a file:/// url under the radar. Assuming you put tests in the same package as the code, it’s not a major security problem, but I do still feel hesitant about the fact that unit tests are forcing me to create units, and I’m not sure if a unit is always the best choice. I don’t like the test forcing me to be less defensive.
2) You can code without thinking. By creating a large number of unit tests, then slowly fixing one at a time, I would be able to evolve the code without understanding it. I worry that some developers will try to solve bugs by random modification until the unit test passes. The problem there is that there is a 50/50 chance that the unit test itself is wrong. In fact, I had this occur once when the data for my unit test turned out to be illegal.
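For the first problem, one possible compromise (a sketch only, with a made-up class RobotsFetcher and made-up method names, not what my code actually does) is to keep the public method strict about http and put the parsing behind a package-private method that takes a Reader. A test in the same package can then hand it a StringReader, and nobody outside the package can sneak a file:/// url in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fetcher: a strict public API plus a package-private seam for tests. */
final class RobotsFetcher {

    /** Public entry point: stays defensive and rejects anything that is not http(s). */
    public static List<String> fetch(URL url) throws IOException {
        String protocol = url.getProtocol();
        if (!protocol.equals("http") && !protocol.equals("https")) {
            throw new IllegalArgumentException("Only http(s) urls are allowed: " + url);
        }
        try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
            return readLines(in);
        }
    }

    /** Package-private seam: same-package tests can pass a StringReader here. */
    static List<String> readLines(Reader in) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }
}
```

A test would call RobotsFetcher.readLines(new StringReader("User-agent: *\nDisallow: /tmp/\n")) and never touch the network. It is still a unit I created for the test’s sake, so it only softens the complaint rather than answering it.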
Still, a fun evening of hackery. Anyone want a robots.txt handler? Was written for my site-scraping engine, Scabies.