Outlier-resistant APS system
Bad uploads are becoming a recurrent issue.
But there is a way to shield the ranking from these uploads: use the median instead of the mean when calculating pairing APS, as long as bad uploads don't outnumber the good ones. The drawback would be a CPU/database performance hit during uploads.
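In Java the idea might look something like this (a minimal sketch; the class and method names and the assumption that all battle scores for a pairing are available as an array are mine, not the actual server code):

```java
import java.util.Arrays;

public class PairingAps {

    /**
     * Hypothetical outlier-resistant pairing APS: the median of the
     * uploaded battle scores instead of their mean. A few bad uploads
     * cannot drag the result far, as long as they remain a minority.
     */
    public static double medianAps(double[] scores) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

The performance hit comes from the sort: a mean can be maintained incrementally from a running sum and count, while a median needs all samples of the pairing to be stored and re-examined on each upload.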
I think an outlier-resistant APS system may be good, but I have two concerns about using the median: 1) it may distort scores when valid data produces a skewed distribution, and 2) in pairings with no outliers to fix anyway, it would generate noisier values (the median generally fluctuates more than the mean as samples are added).
It may be worth considering statistical methods that estimate the probability of a data point being an outlier and ignore it beyond a threshold. Such methods might leave the mean of legitimately skewed data unaltered, while producing smaller score fluctuations.
Another thought: regardless of whether we change the APS system, it may make sense for the rumble to have a page that lists recent outlier results, to make it easier to spot them.
One way to look at skewed distributions is that the median takes the skew into account, while the mean assumes all distributions are symmetric. So it is not "distortion", but it may change APS from what we are used to.
But yes, the mean needs fewer battles than the median to converge when the true average is near 50% (symmetric distributions) and there are no outliers.
There are other, more sophisticated statistical methods for dealing with outliers, like percentile-based estimators, which sit somewhere between the mean and the median. But for me, the median is good enough and fully automated.
(I would never even have imagined these things existed if it were not for Robocode and the quest for the ultimate statistical gun)
About the skewed distributions, fair enough. I'm still concerned about the greater noise of the median, though.
The more sophisticated method I had in mind was calculating the z-score of each sample per pairing, tossing out results with too extreme a z-score, and using the mean of the remaining samples. The reason this appeals to me is that it changes the existing scoring system as little as possible.
Most bad results we see are near-zero scores, which a z-score test should detect quite distinctly, so reliably tossing them out without changing the overall scoring system seems quite doable to me.
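A sketch of what that filter could look like; the class name, the fallbacks, and the suggested threshold are my assumptions, not a tuned or official implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class ZScoreFilter {

    /**
     * Drop samples farther than `threshold` standard deviations from
     * the pairing mean, then average the survivors. Near-zero scores
     * in a pairing that normally averages around, say, 70% show up
     * with very extreme z-scores, so a threshold of 2 or 3 should
     * catch them.
     */
    public static double filteredMean(double[] scores, double threshold) {
        double mean = mean(scores);
        double sd = standardDeviation(scores, mean);
        if (sd == 0) {
            return mean; // all samples identical, nothing to reject
        }
        List<Double> kept = new ArrayList<>();
        for (double s : scores) {
            if (Math.abs(s - mean) / sd <= threshold) {
                kept.add(s);
            }
        }
        if (kept.isEmpty()) {
            return mean; // threshold too aggressive, fall back to plain mean
        }
        double sum = 0;
        for (double s : kept) {
            sum += s;
        }
        return sum / kept.size();
    }

    private static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    private static double standardDeviation(double[] xs, double mean) {
        double ss = 0;
        for (double x : xs) {
            ss += (x - mean) * (x - mean);
        }
        return Math.sqrt(ss / xs.length);
    }
}
```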
Using a z-score threshold will need tuning to work properly, because its robustness is unpredictable. Remember that sampled standard deviations are themselves affected by outliers.
Choosing a very low percentile and a very high percentile as boundaries and averaging everything in between may have the effect you want without relying on noisy sampled deviations, e.g. 25%/75% (1st/3rd quartiles).
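For concreteness, a sketch of that interquartile mean (the quartile cutoffs here use simple integer division; real implementations vary in how they interpolate percentiles):

```java
import java.util.Arrays;

public class InterquartileMean {

    /**
     * Average only the samples between the 1st and 3rd quartiles.
     * No sampled standard deviation is involved, so a handful of
     * extreme uploads cannot widen the acceptance window.
     */
    public static double interquartileMean(double[] scores) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int lo = n / 4;      // drop the lowest quarter
        int hi = n - n / 4;  // drop the highest quarter
        double sum = 0;
        for (int i = lo; i < hi; i++) {
            sum += sorted[i];
        }
        return sum / (hi - lo);
    }
}
```

Note that widening the boundaries toward 0%/100% recovers the plain mean, while narrowing them toward 50%/50% gives the median, which is the sense in which a percentile-based estimator sits between the two.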
Taking that a step further, you could take the probability of a data point being wrong, track how many such points are coming from each user, and above a certain threshold ignore all their uploads (or all uploads within X hours of this happening). (Or fire off an email to Darkcanuck to look into it. =)) I like the idea of a public page listing outlier results, too.
The median idea also seems good though, and very simple. I'd be curious to see if/how much the Rumble scores change using median instead of mean. My guess is by no noticeable amount.
The deviations of all data points from each uploader could be combined into a statistic measuring how "suspicious" each uploader is, and then listed on a public page. I thought about that too.
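For illustration, a rough sketch of that kind of statistic (the class, the method names, and the use of a mean absolute z-score are all assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class UploaderSuspicion {

    // uploader -> { running sum of |z|, count of uploaded results }
    private final Map<String, double[]> stats = new HashMap<>();

    /** Record how far one uploaded score sat from its pairing's mean. */
    public void record(String uploader, double score,
                       double pairingMean, double pairingSd) {
        if (pairingSd == 0) {
            return; // not enough spread in the pairing to judge
        }
        double z = Math.abs(score - pairingMean) / pairingSd;
        double[] s = stats.computeIfAbsent(uploader, k -> new double[2]);
        s[0] += z;
        s[1] += 1;
    }

    /** Mean absolute z-score of everything this uploader has submitted. */
    public double suspicion(String uploader) {
        double[] s = stats.get(uploader);
        return (s == null || s[1] == 0) ? 0 : s[0] / s[1];
    }
}
```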
But the median is simple, fully automated, and backed by robust-statistics theory. No need to bother Darkcanuck unless the server is attacked with tons of bad uploads.