calculating confidence of an APS score
Hey resident brainiacs - I'm displaying confidence using standard error calculations on a per-bot basis in RoboRunner now. What I'm not sure of is how to calculate the confidence of the overall score, which is the average of the per-bot scores.
If I had the same number of battles for each bot, then the average of all battles would equal the average of all per-bot scores. So I think I could just calculate the overall average and standard error across all battles, ignoring per-bot averages, and get the confidence interval of the overall score that way. But what I actually want is the average of the individual bot scores, each of which has a different number of battles.
Something like (average standard error / sqrt(num bots)) makes intuitive sense, but I have no idea if it's right. Or maybe sqrt(average(variance relative to per-bot average)) / sqrt(num battles)?
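For what it's worth, if each per-bot mean is treated as an independent, roughly normal estimate with standard error SE_i = stdev_i / sqrt(n_i), the usual error propagation for an unweighted average over N bots would be:

\[
\mathrm{SE}_{\mathrm{overall}} \;=\; \frac{1}{N}\sqrt{\sum_{i=1}^{N}\mathrm{SE}_i^{2}},
\qquad
\mathrm{SE}_i \;=\; \frac{\sigma_i}{\sqrt{n_i}}
\]

That collapses to (average standard error) / sqrt(num bots) only when all the per-bot standard errors happen to be equal, so the first guess above is the equal-error special case - and it leans on the per-bot scores being independent and roughly normal.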
This would also allow me to measure the benefits of the smart battle selection.
I don't actually think this can be correctly modelled by a unimodal distribution - you would be adding thin Gaussians to fat Gaussians, making horrible bumps that aren't well approximated by a single Gaussian mean+stdev. I almost wonder if some sort of Monte-Carlo solution wouldn't be most accurate in this instance - at least the math would be easy to understand.
Good call! That was super easy. I don't recall this Monte-Carlo stuff, but the name rings a bell so maybe I learned about it at some point.
So I calculate 100 random versions of the overall score. For each battle that goes into it, instead of the real score, I generate a random score, assuming a normal distribution with the mean and standard deviation I have for that bot. Then I take the standard deviation of those randomized overall scores and multiply by 1.96 for the confidence interval. Seems like a lot of calculations, but it only takes a few hundredths of a second even with 250 bots / 3000 battles, so I can afford to do it even when I print the overall score after every battle. Nice!
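For anyone who wants it spelled out, here's a minimal sketch of that calculation in plain Java, under the assumptions above - BotStats and all the names here are made-up placeholders, not RoboRunner's actual code:

```java
import java.util.Random;

public class MonteCarloConfidence {
  private static final Random RAND = new Random();

  // One entry per reference bot: battle-level mean, battle-level standard
  // deviation, and number of battles actually run against that bot.
  static class BotStats {
    final double mean;
    final double stDev;
    final int numBattles;

    BotStats(double mean, double stDev, int numBattles) {
      this.mean = mean;
      this.stDev = stDev;
      this.numBattles = numBattles;
    }
  }

  // Returns the 95% confidence half-width (the "+-" value) of the overall
  // score, i.e. the average of the per-bot averages.
  static double confidence95(BotStats[] bots, int iterations) {
    double[] overallScores = new double[iterations];
    for (int i = 0; i < iterations; i++) {
      double sumOfBotAverages = 0;
      for (BotStats bot : bots) {
        // Resample every battle against this bot from N(mean, stDev).
        double battleSum = 0;
        for (int b = 0; b < bot.numBattles; b++) {
          battleSum += bot.mean + RAND.nextGaussian() * bot.stDev;
        }
        sumOfBotAverages += battleSum / bot.numBattles;
      }
      overallScores[i] = sumOfBotAverages / bots.length;
    }
    // Spread of the randomized overall scores, scaled to a 95% interval.
    return 1.96 * standardDeviation(overallScores);
  }

  static double standardDeviation(double[] values) {
    double mean = 0;
    for (double v : values) {
      mean += v;
    }
    mean /= values.length;
    double variance = 0;
    for (double v : values) {
      variance += (v - mean) * (v - mean);
    }
    return Math.sqrt(variance / values.length);
  }
}
```

With 100 iterations that's on the order of 100 * (total battles) Gaussian draws per call, which is consistent with it only taking a few hundredths of a second for ~3000 battles.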
I'm curious - did you use the Monte-Carlo method for calculating the non-smart-battles deviations?
Also, how long did it take to get the 3000 battles compared to the non-smart-battles?
I'm using the same Monte-Carlo method for confidence either way. I hadn't run too many side-by-sides yet, but I'll do some more soon. Overnight, I ran a test of 25 seasons of TCRM in regular vs smart battles mode on my laptop. They took about the same amount of time, and both ended up showing +- 0.363. But the smart battles came out to 89.32, very close to the 89.31 I got when I ran 100 (non-smart) seasons before, while the normal battles ended at 88.76.
So I'm a little disappointed it wasn't faster and didn't show a better confidence, but it was a lot closer to the true average. And I guess my confidence calculation sucks or something weird happened, since 88.76 is much farther than 0.363 from the true average. (And yes, my TCRM score has tanked that much since its glory days!)
Are you sure that you're first averaging all the scores for each bot before averaging those per-bot averages together for the section? It wouldn't make a difference in the old method, since they all had the same number of battles, but it would affect things in the new one.
I guess the other possibility is that Diamond is so much slower than the bots it is facing that it doesn't make much difference which one you face. What was the spread of battles like on the TCRM? Were they spread fairly evenly, or were certain battles highly prioritised?
Yeah, that's a good point, especially with the TC bots that are just simple random movement and no gun. If the variation in confidence is higher than the variation in speed, it could take longer for the same number of battles. I guess the puzzling thing is the overall confidence calculation showing the same value both ways. With a limited amount of sample data, I guess it can only be so accurate, but I'm thinking I may have a bug there. The spread was:
- apv.AspidMovement 1.0: 95.6 +- 0.83 (16 battles)
- dummy.micro.Sparrow 2.5TC: 98.43 +- 0.64 (13 battles)
- kawigi.mini.Fhqwhgads 1.1TC: 96.95 +- 1.11 (21 battles)
- emp.Yngwie 1.0: 98.15 +- 0.77 (14 battles)
- kawigi.sbf.FloodMini 1.4TC: 94.91 +- 1.25 (24 battles)
- abc.Tron 2.01: 88.15 +- 1.42 (26 battles)
- wiki.etc.HTTC 1.0: 88.83 +- 1.45 (28 battles)
- wiki.etc.RandomMovementBot 1.0: 92.23 +- 1.04 (22 battles)
- davidalves.micro.DuelistMicro 2.0TC: 86.22 +- 1.61 (31 battles)
- gh.GrubbmGrb 1.2.4TC: 81.29 +- 1.87 (33 battles)
- pe.SandboxDT 1.91: 85.48 +- 1.8 (31 battles)
- cx.mini.Cigaret 1.31TC: 86.82 +- 1.62 (31 battles)
- kc.Fortune 1.0: 80.6 +- 1.77 (29 battles)
- simonton.micro.WeeklongObsession 1.5TC: 87.02 +- 1.48 (26 battles)
- jam.micro.RaikoMicro 1.44TC: 79.16 +- 1.8 (30 battles)
Going to leave some tests with Diamond 1.8.16 in real battles running today and see how that compares.
Those +- values - are they the standard error or the stddev?
The only thing I can think of testing is whether you are calculating the right number of random battles for each bot in the Monte-Carlo method. If you were only doing one battle for each, then the numbers you are getting would be the same for the standard battles as for the smart battles. It looks like the prioritisation is working well, though - Sparrow and Yngwie both have a low number of battles as well as a low error/stddev.
The per-bot +- is the 95% confidence interval (1.96 is the 97.5th percentile of the normal distribution, which gives a two-sided 95% interval) = 1.96 * standard error = 1.96 * standard deviation / sqrt(num battles).
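For concreteness, a tiny sketch of that per-bot figure in plain Java (illustrative only, not RoboRunner's code):

```java
// Hypothetical helper for the per-bot "+-" value: 1.96 * stdev / sqrt(n).
public class PerBotConfidence {
  static double confidence95(double[] battleScores) {
    int n = battleScores.length;
    double mean = 0;
    for (double s : battleScores) {
      mean += s;
    }
    mean /= n;
    double variance = 0;
    for (double s : battleScores) {
      variance += (s - mean) * (s - mean);
    }
    variance /= n;  // population variance; dividing by (n - 1) is a reasonable alternative
    double standardError = Math.sqrt(variance) / Math.sqrt(n);
    return 1.96 * standardError;
  }
}
```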
It probably is something silly like the one battle per bot you mentioned, but at a glance it seems like the overall confidence calculation isn't doing anything stupid. I'll have a longer look this evening. I do think the smart battles are working well, though; I'd just like to have some numbers to back me up. =)
The spread is a bit more interesting in real battles. HOT bots with 99.9% scores will get 2-3 battles in 12 seasons. RamBots get lots of battles because they have fairly high variance and run super fast.
Some results with real battles: Diamond 1.8.16 vs 50 random bots for 10 seasons.
- Dumb battles: took 6338.8s, 89.87 +- 0.188
- Smart battles: took 6010.6s, 89.94 +- 0.148
Looks like it hit ~0.18 by 5 seasons with smart battles. Right now I'm using a much rougher calculation for printing overall confidence between battles, for speed. I will be improving this with some caching of the random samples for the overall scores. I do a much more thorough calculation for the final score.
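One way that caching could look, purely as a sketch under my own assumptions rather than RoboRunner's actual implementation: keep each bot's resampled per-iteration averages, and after a battle regenerate only the samples for the bot whose stats changed, so the between-battle confidence stays cheap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class CachedMonteCarlo {
  private final int iterations;
  private final Random rand = new Random();
  // bot name -> one resampled per-bot average per Monte Carlo iteration
  private final Map<String, double[]> sampleCache = new HashMap<String, double[]>();

  public CachedMonteCarlo(int iterations) {
    this.iterations = iterations;
  }

  // After a battle against botName, regenerate only that bot's samples.
  public void updateBot(String botName, double mean, double stDev, int numBattles) {
    double[] samples = new double[iterations];
    for (int i = 0; i < iterations; i++) {
      double battleSum = 0;
      for (int b = 0; b < numBattles; b++) {
        battleSum += mean + rand.nextGaussian() * stDev;
      }
      samples[i] = battleSum / numBattles;
    }
    sampleCache.put(botName, samples);
  }

  // 95% confidence half-width of the average of per-bot averages,
  // reusing the cached samples for every bot that didn't just change.
  public double confidence95() {
    int numBots = sampleCache.size();
    double[] overall = new double[iterations];
    for (double[] botSamples : sampleCache.values()) {
      for (int i = 0; i < iterations; i++) {
        overall[i] += botSamples[i] / numBots;
      }
    }
    double mean = 0;
    for (double v : overall) {
      mean += v;
    }
    mean /= iterations;
    double variance = 0;
    for (double v : overall) {
      variance += (v - mean) * (v - mean);
    }
    return 1.96 * Math.sqrt(variance / iterations);
  }
}
```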
It's a slightly different calculation with the scoring groups, so maybe I only have a bug there. Or maybe there just wasn't much difference in the TCRM. Or maybe TC scores are so far from normally distributed that it throws it off. Or maybe it was just a fluke - the same confidence down to 3 digits seems pretty unlikely even with the same battle selection.
Well, the verdict is in. Looks like a combination of a fluke and the TCRM battles just not being particularly optimizable. I ran another 25 seasons each way and got:
- Dumb battles: Took 2690.4s, 89.13 +- 0.362
- Smart battles: Took 2858.8s, 89.4 +- 0.338
So this time the smart battles actually took longer, but they had a better confidence and were again much closer to the true average. I also tested that the groups and non-groups versions of the overall confidence calculation give the same result for TCRM (because the groups are of equal size). I'm going to skip any fancy attempts to optimize for a more accurate overall confidence between battles, round the final confidence to 2 digits instead of 3, and get this posted.