Test bed with stable results
Hi, all. Which test beds do you use to test your robots? I can't find a test bed that gives stable results (+/- 0.05 APS). Right now I use every 5th bot from the RoboRumble with 5 seasons against each, and the results vary within +/- 0.3 APS.
I run 30 seasons of 35 rounds against 40 bots (taking approx. 18 hours). No idea if it is stable though; it is my first testbed. I do know my testbed does not reflect the rumble correctly, because 0.3.2 and 0.3.5 score on par, while 0.3.7 scores approx. 1.5 APS lower.
GrubbmGait, you're a hero :) For me, 4 hours per test is the limit, and I want a test that gives me results in 2 hours. Maybe you could try Distributed_Robocode - I can share my home netbook (i3 1.3 GHz x 2) with you, and this week I plan to set up an old Duron 1.6 as a dedicated Robocode server. So I guess your test would take at most 6 hours (though that strongly depends on which robots are in your test bed).
One quick thought: theoretically, it should be possible to use PCA to find the most significant axes of the RoboRumble and rank robots by how well they correlate with each axis. You would also rank robots by their standard deviation, then pick robots that simultaneously have a low standard deviation and a high correlation with the axes the PCA determined. Finally, you could use linear regression to determine the weight to give each of the selected robots.
That, I think, would probably be a good way to find a testbed that both represents the rumble well and has low noise... Hmm... maybe I should make a patch to Voidious' testbed maker that uses the algorithm described above...
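To make the idea concrete, here is a rough numpy sketch of that selection scheme. The names (`pick_testbed`, `scores`, `overall_aps`, `noise`, `n_pick`) are made up for illustration, and you would still have to pull a bot-vs-opponent score matrix and per-opponent noise estimates out of real rumble data:

<pre>
import numpy as np

def pick_testbed(scores, overall_aps, noise, n_components=3, n_pick=10):
    """scores: (n_bots, n_opponents) APS matrix from the rumble,
       overall_aps: (n_bots,) full-rumble APS of those bots,
       noise: (n_opponents,) battle-to-battle std-dev of each opponent."""
    # PCA via SVD on the centered score matrix.
    centered = scores - scores.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]            # principal axes over the opponents

    # Loading: how strongly each opponent contributes to the main axes.
    loading = np.abs(axes).max(axis=0)

    # Favour opponents with high loading and low battle-to-battle noise.
    rank = loading / (noise + 1e-9)
    chosen = np.argsort(rank)[::-1][:n_pick]

    # Linear regression: weight the chosen opponents so their scores
    # best predict the full-rumble APS.
    X = np.column_stack([scores[:, chosen], np.ones(len(scores))])
    weights, *_ = np.linalg.lstsq(X, overall_aps, rcond=None)
    return chosen, weights
</pre>

The regression step is what would let a small, quiet testbed still track the rumble: instead of a plain average over the chosen opponents, each one gets a weight fitted against known full-rumble APS values.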
I found a good test bed - 7 seasons against every second bot from the RoboRumble (~2900 battles). It produces repeatable results within +/- 0.05 APS, and its error against the real RR is within +/- 0.1 APS.
I'd think 1 battle against each bot from the rumble would probably give better results... maybe not as reproducible, but more closely correlated with the rumble.
No Skilgannon, even 3 battles against every bot is worse in terms of both stability and correlation with the RR.
I totally agree with using very large test beds (lately ~100 bots for me), but I think you're wasting some CPU cycles testing against bots that are unlikely to be affected by your changes. Bots you get 98+ against probably won't be affected unless you're testing changes to your core surfing algorithm or something. Most changes to surf stats are not going to affect HOT bots at all, and flattener tweaks won't affect any bots that have no chance of triggering your flattener.
My last release showed that you never know what will be affected :) So it's better if a test takes 7 hours instead of 6.5, but gives confidence in the results :)