Talk:Darkcanuck/RRServer/Ratings

From Robowiki
Revision as of 17:25, 26 September 2008 by Simonton (talk | contribs) (Battles per Pairing)

Battles per Pairing

I just wanted to comment on the statement, "It's uncertain how well it works with fewer battles or incomplete pairings." My experiment with the MC2K7 shows that separate runs of 75 battles can still show more than 1% variation for a given pairing. This affects any scoring system, and is a fact we have to live with. The reliability of the output can only be as good as that of the input, no matter how fancy the interpolation for incomplete pairings is. The hope is that the variance becomes a wash when seen over 600+ pairings. --Simonton 15:25, 26 September 2008 (UTC)
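The arithmetic behind that 1% figure can be sketched quickly. Assuming, purely for illustration, a per-battle spread of score share of about 5 percentage points (not a measured value), the standard error of a 75-battle average works out as:

```python
import math

# Assumed per-battle spread of score share, in percentage points (illustrative).
battle_std = 5.0
n_battles = 75

# Standard error of the mean over one 75-battle run.
se_run = battle_std / math.sqrt(n_battles)

# Two independent runs of the same pairing differ by sqrt(2) times that.
se_diff = se_run * math.sqrt(2)

print(round(se_run, 2))   # 0.58 percentage points per run
print(round(se_diff, 2))  # 0.82 points between two runs
```

So under that assumption, run-to-run differences above 1% for a single pairing are entirely unsurprising.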


Contents

Thread title (replies; last modified):

Outlier resistant APS system (6 replies; 21:37, 16 February 2012)
ISDA compliant tie-breaking for PL (1 reply; 20:50, 16 February 2012)
Mad idea about rankings (1 reply; 21:00, 7 October 2011)
APS/Schulze demonstration (0 replies; 06:39, 7 October 2011)
Mixed Rating Systems (4 replies; 06:53, 7 October 2011)

Outlier resistant APS system

Bad uploads are becoming a recurrent issue.

But there is a way to shield the ranking from these uploads: use the median instead of the mean when calculating pairing APS. As long as bad uploads don't outnumber the good ones, the median ignores them. The drawback would be a performance hit on CPU/database during uploads.

MN15:18, 16 February 2012
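A minimal sketch of the idea, with made-up pairing scores: a single bad upload drags the mean well away from the true pairing score, while the median barely moves.

```python
from statistics import mean, median

# Nine plausible results for one pairing plus one bad upload (made-up numbers).
scores = [61.2, 59.8, 60.5, 62.0, 58.9, 61.7, 60.1, 59.4, 60.8, 0.3]

print(round(mean(scores), 2))    # 54.47, dragged down by the bad upload
print(round(median(scores), 2))  # 60.3, essentially unaffected
```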

I think an outlier-resistant APS system may be good, but I have two concerns about using the median: 1) it may distort scores when valid data produce a skewed distribution, and 2) in the cases where there are no outliers to fix anyway, it would generate noisier values (the median generally fluctuates more than the mean as samples are added).

I think it may be worth considering statistical methods that calculate the probability of a data point being an outlier, and ignore it if it's beyond a threshold. Such methods might leave the mean of a validly skewed distribution unaltered, while still producing smaller score fluctuations.

Another thought: regardless of whether we change the APS system or not, it may make sense for the rumble to have a page that lists recent outlier results, to make them easier to spot.

Rednaxela18:46, 16 February 2012

Another way to see it: the median takes skewed distributions into account, while the mean assumes all distributions are symmetric. So it is not "distortion", though it may shift APS away from what we are used to.

But yes, the mean needs fewer battles than the median when the true average is near 50% (symmetric distributions) and there are no outliers.

There are other, more sophisticated statistical methods for dealing with outliers, like percentile-based (trimmed) means, which sit somewhere between the mean and the median. But for me, the median is good enough and fully automated.

(I would never have imagined these things existed if it were not for Robocode and the quest for the ultimate statistical gun.)

MN20:05, 16 February 2012

About the skewed distributions, fair enough. I am still concerned about the greater noise of medians, though.

The more sophisticated method I had in mind was calculating the z-score of each sample per pairing, tossing out results with too extreme a z-score, and using the mean of the remaining samples. This appeals to me because it changes the existing scoring system as little as possible.

Most bad results we see are near-zero scores, which a z-score test should detect quite distinctly, so reliably tossing them out without changing the overall scoring system seems quite doable to me.

Rednaxela20:33, 16 February 2012
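A sketch of that z-score filter, with made-up scores and an illustrative threshold (not a rumble-tuned value):

```python
from statistics import mean, stdev

def zscore_filtered_mean(scores, threshold=2.5):
    """Mean after dropping samples whose z-score exceeds `threshold`.
    The threshold is an illustrative guess, not a tuned value."""
    if len(scores) < 3:
        return mean(scores)
    m, s = mean(scores), stdev(scores)
    if s == 0:
        return m
    kept = [x for x in scores if abs(x - m) / s <= threshold]
    return mean(kept) if kept else m

# Nine plausible results and one bad upload for a pairing (made-up numbers).
scores = [61.2, 59.8, 60.5, 62.0, 58.9, 61.7, 60.1, 59.4, 60.8, 0.3]
print(round(zscore_filtered_mean(scores), 2))  # 60.49, bad upload rejected
# Note: the outlier inflates the standard deviation itself, so a looser
# threshold (e.g. 3.0) would let this particular one survive.
```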

Using the z-score as a threshold will need tuning to work properly, because its robustness is unpredictable. Remember that sampled standard deviations are themselves inflated by the very outliers you are trying to reject.

Choosing a very low percentile and a very high percentile as boundaries and averaging everything in between may have the effect you want without relying on noisy sampled deviations. Like 25%/75% (the 1st/3rd quartiles).

MN21:37, 16 February 2012
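A sketch of that percentile idea as an interquartile mean, using a simple positional quartile cut and the same style of made-up scores:

```python
from statistics import mean

def interquartile_mean(scores):
    """Average of the values between the 1st and 3rd quartiles.
    The quartile cut is a simple positional one, for illustration only."""
    s = sorted(scores)
    n = len(s)
    lo = n // 4                 # drop the lowest quarter...
    hi = n - lo                 # ...and the highest quarter
    return mean(s[lo:hi])

# Nine plausible results for one pairing plus one bad upload (made-up numbers).
scores = [61.2, 59.8, 60.5, 62.0, 58.9, 61.7, 60.1, 59.4, 60.8, 0.3]
print(round(interquartile_mean(scores), 2))  # 60.3, bad upload trimmed away
```

No standard deviation is estimated anywhere, which is what makes this robust to the contamination problem above.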
 
 
 

Taking that a step further, you could take the probability of a data point being wrong, look at how many likely-wrong points are coming from each user, and beyond a certain threshold ignore all their uploads (or all within X hours of it happening). (Or fire off an email to Darkcanuck to look into it. =)) I like the idea of a public page listing outlier results, too.

The median idea also seems good, though, and very simple. I'd be curious to see if/how much the rumble scores would change using the median instead of the mean. My guess is by no noticeable amount.

Voidious18:52, 16 February 2012

The deviations of all data points from each uploader could be combined into a statistic of how "suspicious" each uploader is, listed on a public page. I thought about that too.

But the median is simple, backed by robust-statistics theory, and fully automated. No need to bother Darkcanuck unless the server is attacked with tons of bad uploads.

MN20:26, 16 February 2012
 
 

ISDA compliant tie-breaking for PL

I was thinking about another tie-breaker for the PL ranking system (Copeland). Currently it uses full-field APS for tie-breaking, but that sacrifices ISDA (independence of Smith-dominated alternatives), a key property of Copeland that prevents low-ranked competitors from interfering with the top ones (king-making).

Tie-breaking could instead use only the APS between the tied competitors, preserving ISDA.

Example 1: DrussGT, Diamond and Tomcat are all tied in first place. Tie-breaking score for DrussGT would use only Diamond and Tomcat. Score for Diamond would use DrussGT and Tomcat. Score for Tomcat would use DrussGT and Diamond. As if those 3 were the only competitors in the rumble.

Example 2: DrussGT and Tomcat are tied in first place. Tie-breaking score for DrussGT would use only Tomcat. Score for Tomcat would use DrussGT. It would be a head-to-head comparison between the 2.

Competitors not tied to anyone else can have any tie-breaking score, as it would not change the ranking.

MN02:20, 3 January 2012
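A sketch of the restricted tie-break using a hypothetical pairing-APS table (the numbers are invented; only the bot names come from the examples above). "Worst" is in the table but outside the tied group, so it cannot influence the result:

```python
# Hypothetical pairing APS table: aps[a][b] = a's average score share vs b.
aps = {
    "DrussGT": {"Diamond": 52.1, "Tomcat": 55.0, "Worst": 99.9},
    "Diamond": {"DrussGT": 47.9, "Tomcat": 53.2, "Worst": 99.8},
    "Tomcat":  {"DrussGT": 45.0, "Diamond": 46.8, "Worst": 99.7},
}

def restricted_tiebreak(tied, aps):
    """APS restricted to the tied group, as if it were the whole rumble."""
    return {bot: round(sum(aps[bot][other] for other in tied if other != bot)
                       / (len(tied) - 1), 2)
            for bot in tied}

print(restricted_tiebreak(["DrussGT", "Diamond", "Tomcat"], aps))
# DrussGT: 53.55, Diamond: 50.55, Tomcat: 45.9 -- Worst plays no part
```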

I had forgotten to respond to this when I first read it, but I quite like this idea. It has long seemed strange to me that PL used APS for tie-breaking, considering the goals/philosophies of the two systems are so different.

Rednaxela20:50, 16 February 2012
 

Mad idea about rankings

Just a mad idea about rankings. What if a robot's rating were: <percent of 1st place's APS in 1-vs-1> + <percent of 1st place's APS in melee> * 0.6 + <percent of 1st place's APS in teams> * 0.3?

For example, for DrussGT it would be 100 + 0 * 0.6 + 0 * 0.3 = 100, for Diamond 99.23 + 99.8 * 0.6 + 0 * 0.3 = 159.11, and for Shadow 94.43 + 98.68 * 0.6 + 100 * 0.3 = 183.64.

It would encourage development in the melee and teams rumbles.

Jdev13:42, 7 October 2011
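The proposed weighted sum, worked out in code with the APS percentages quoted in the example (note how it rewards cross-divisional bots):

```python
# Weights 1.0 / 0.6 / 0.3 for 1v1 / melee / teams, per the proposal above.
def combined(one_v_one, melee, teams):
    return one_v_one + 0.6 * melee + 0.3 * teams

# Percent-of-first-place APS figures from the example.
ratings = {
    "DrussGT": (100.0, 0.0, 0.0),
    "Diamond": (99.23, 99.8, 0.0),
    "Shadow":  (94.43, 98.68, 100.0),
}
for bot, r in ratings.items():
    print(bot, round(combined(*r), 2))
```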

I think that's kind of looking at it backwards. I don't think the combined ranking would make people more interested in Melee/Teams, but I do think if more people were building cross-divisional bots, such a ranking would be more interesting. =)

I don't think it's the lack of rankings that makes people uninterested in Melee/Teams; I think people just like 1v1. And in that case, people will just focus on 1v1 rankings, even if combined rankings are available. I think the best thing you can do to get people interested in Melee/Teams is to work on them yourself!

I also think about other rule sets occasionally. Sometimes I think about building a MegaBot Twin Duel team, even if it couldn't compete in the rumble.

Voidious21:00, 7 October 2011
 

APS/Schulze demonstration

It's been a while, but I finally implemented a Schulze ranking using everything the Schulze paper offers. It has tie-breaking and path reconstruction, together with some statistics and coloring to make it easier to track what the method did. It uses pairing APS as the ballot.

I tried using % wins in Schulze, but it failed miserably: everyone from rank 8 to 817 was tied together, so the system deteriorated into a random-ballot ranking. Shadow tied with Worst, with the tie-break relying on a coin flip, was not what I had in mind. As far as I understand what went wrong, normalizing wins by dividing them by the number of battles sacrificed resolvability.

The results can be downloaded here (data using Darkcanuck 2011-08-09 backup): [1] [2] [3] [4]

I also added other ranking systems, like APS/Elo (batch), Premier League, and the APS everyone is used to. I also added a Simpson-Kramer ranking, at Skilgannon's request, to show who performs best against each bot. Simpson-Kramer also works well as a statistic for Schulze.

I took Chase's "for fun" advice seriously and tried to make something more appealing.

MN06:39, 7 October 2011
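For reference, the core of the Schulze method (without the tie-breaking and path reconstruction mentioned above) fits in a few lines. This is a generic strongest-path sketch with made-up pairwise tallies, not the implementation being described:

```python
def schulze_ranking(candidates, d):
    """Generic Schulze ranking via strongest paths (Floyd-Warshall style).
    d[(a, b)] = number of ballots preferring a over b."""
    # Direct path strength: the winning tally, or 0 if a does not beat b.
    p = {(a, b): (d[(a, b)] if d[(a, b)] > d[(b, a)] else 0)
         for a in candidates for b in candidates if a != b}
    # Strengthen paths through every intermediate candidate k.
    for k in candidates:
        for i in candidates:
            for j in candidates:
                if i != j and i != k and j != k:
                    p[(i, j)] = max(p[(i, j)], min(p[(i, k)], p[(k, j)]))
    # a beats b if a's strongest path to b is stronger than b's back to a.
    wins = {a: sum(1 for b in candidates if a != b and p[(a, b)] > p[(b, a)])
            for a in candidates}
    return sorted(candidates, key=wins.get, reverse=True)

# Hypothetical pairwise tallies for three bots (invented numbers).
d = {("A", "B"): 8, ("B", "A"): 2,
     ("B", "C"): 7, ("C", "B"): 3,
     ("A", "C"): 4, ("C", "A"): 6}
print(schulze_ranking(["A", "B", "C"], d))
# A ranks first via the path A->B->C, despite losing to C head-to-head.
```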

Mixed Rating Systems

You know, since there's a little bit of disagreement about which kind of rating system best reflects the kind of improvements valued in robots, I wonder if it would make sense to have a mixed system for the overall ranking. What about using a Condorcet voting method to resolve the rankings, where each ranking system is a "ballot" that ranks all robots? One "ballot" could be APS, another W%, and perhaps another S%. The resulting ranking would essentially be the consensus of the three component systems. It's natural for robots that excel in all three metrics to rank high, and this is perhaps a more universally agreed and more general notion of strength than any single metric alone.

Thoughts?

Rednaxela16:31, 21 September 2011
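A sketch of the "rating systems as ballots" idea, using Copeland-style pairwise majorities over three hypothetical orderings (the orderings are invented for illustration):

```python
# Each "ballot" is a full ranking produced by one rating system (made up here).
ballots = {
    "APS": ["DrussGT", "Diamond", "Shadow", "WaveSerpent"],
    "W%":  ["DrussGT", "Shadow", "Diamond", "WaveSerpent"],
    "S%":  ["Shadow", "DrussGT", "Diamond", "WaveSerpent"],
}

def consensus_ranking(ballots):
    """Condorcet-style consensus: a bot scores a point for each rival
    that a strict majority of the ballots rank it above."""
    bots = next(iter(ballots.values()))
    def above(ballot, a, b):
        return ballot.index(a) < ballot.index(b)
    score = {a: sum(1 for b in bots if a != b and
                    sum(above(r, a, b) for r in ballots.values()) * 2
                    > len(ballots))
             for a in bots}
    return sorted(bots, key=score.get, reverse=True)

print(consensus_ranking(ballots))
# DrussGT first (beats everyone on 2 of 3 ballots), then Shadow, Diamond.
```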

I like that approach myself. Though I think I would limit it to APS and W%.

Skotty17:30, 21 September 2011
 

Never mind, forget what I said; I didn't read it closely enough. I was thinking that I would like something that averages APS and W%, with the weight of each not necessarily being equal.

Skotty17:54, 21 September 2011
 

An average would be nice in that it would make small changes easier to differentiate, but on the other hand, Condorcet ranking has the property of not allowing one metric to dominate. I think there are advantages to each way of mixing.

Rednaxela18:56, 21 September 2011
 

Both Condorcet and averaging can allow one kind of metric to dominate through clones: several similar methods together will outweigh the other, unique methods. Ballots need to be independent of the candidates (methods) to avoid the clone problem.

MN06:53, 7 October 2011