View source for Talk:Darkcanuck/RRServer/Ratings

From Robowiki
Jump to navigation Jump to search

Explanations Behind Ratings

This page does a nice job of explaining what some of the ratings are, but it still assumes certain existing knowledge. Somewhere, perhaps on this page, there needs to be descriptions for all rating terms. This could be on the wiki or even just a legend at the darkcanuck ratings site.

Some examples of what is missing -- No where...anywhere...can I find out what "PBI" stands for or what it's significance is. I don't see anywhere that explains "Specialization" either. Also "LRP". Skotty 00:04, 26 May 2011 (UTC)

I'm pretty sure that sort of information mostly existed on the old wiki before, but yeah. Well, "PBI" means "ProblemBot Index", and it is the difference between your percent score, and the "expected" percent score based on the ELO scores of the two. (A system based on how it compares based on near-ranking-bots rather than ELO would be more accurate IMO, but ELO isn't that bad of a predictor when pairings are complete I guess)

"LRP" stands for "Linear Regression Plot". Well, the term is awfully vague IMO, but it refers to oldwiki:RoboRumble/LRP. It is a plot of PBI vs ELO score. Essentially, the plot can give you a quick visual display that can highlight outliers (i.e. things your bot does exceptionally well/poorly against) and show some general trends such as "Does it fare more favorably against high ranking bots, or against low ranking bots?". --Rednaxela 11:59, 26 May 2011 (UTC)

Battles per Pairing

I just wanted to comment on the statement, "It's uncertain how well it works with less battles or incomplete pairings." My experiment with the MC2K7 shows that separate runs of 75 battles can still show more than 1% variation for a given pairing. This affects any scoring system, and is a fact that we have to live with. The reliability of output can only be as good as input, no matter how fancy the interpolation is for incomplete pairings. The hope is that the variance will become a wash when seen over 600+ pairings. --Simonton 15:25, 26 September 2008 (UTC)

I think David Alves commented that targeting challenge scores also varied by almost 1% at 15 seasons, so I agree there's lots of evidence that more battles per pairing are needed, which would take a very, very long time in a 600+ competitor environment. You're right that as the number of competitors increases, variabilities cancel each other out. But at the same time, the bigger the competition, the more risk of a "black swan" competitor whose scores are all skewed in one direction. -- Darkcanuck 15:31, 26 September 2008 (UTC)

After scratching some things down on paper which are mostly intuition rather than statistics, I believe the odds of having such a "black swan" are either exactly the same or reduced by increasing the number of bots. --Simonton 16:05, 26 September 2008 (UTC)

Well, if there are 3 bots, the chance of one getting lucky against both others is 1/4th, multiply by 3 bots, and the chance of a "black swan" in 3 bots is 75% I believe. With 4 bots, the chance of one getting lucky against against all others is 1/8th, multiply by 4 bots, and the chance of a black swan is 50%. For 5 bots... it is 31.25% chance of a black swan. For 650 bots with one pairing each, the chance of a bot having above average score in every pairing is about 1 to 2.78*10^193. So if we presume getting lucky is anything above the mean score and there's a 50% chance of that in any pairing, and that a "black swan" is only when all pairings are lucky, then the chance of a black swan sharply decreases as the number of bots becomes larger. Of course perhaps what would be more useful than simply chance of there being a bot with all pairings lucky, would be the chance of luck making the score 1% different. I could calculate this, but only if I had a number of what the "standard deviation" of the percent score of an average robocode battle is. --Rednaxela 16:28, 26 September 2008 (UTC)

My intuitive hypothesis remains unshaken, but I don't have any numbers to prove it. But I can't argue with something to the power of 193. :) I'll look into adding standard deviation to some of the tables. What would be most useful, within a pairing, across all pairings, or across all final scores? -- Darkcanuck 16:43, 26 September 2008 (UTC)

Ah, now that you put the statistics that way I can see how to do it. With 3 bots each has 2 pairings, so the chance of both coin flips being "lucky" is indeed 25%. However, the chance of at least 1 of those bots hitting its 25% is (1 - 75%^3) ~= 57.8%. Generalized, this formula is 1 - (1 - .5^(bots - 1))^bots. If you graph that you can see it reduces to pretty much zero pretty quickly. --Simonton 17:18, 26 September 2008 (UTC)

Oh right, I got slightly mixed up and was multiplying by 3 when I should have been working with powers. --Rednaxela 17:35, 26 September 2008 (UTC)
You've got an extra leading paren, but that makes sense to me. Nothing to worry about! -- Darkcanuck 18:05, 26 September 2008 (UTC)

Glicko-2 Rating System

Looking at things as my recently added versions have gained battles, it's seeming like Glicko-2 seems FAR faster to converge to a realistic expected score far quicker than ELO or Glicko-1, and seems quite stable. Glicko-2's performance seems to really impress me. I wonder if maybe we should remove ELO and Glicko-1 at time point maybe, and just keep APS and Glicko-2? (Would that make uploading a little faster?) Also, maybe it would be good to make a modified APS that uses the Glicko-2 ratings to estimate the scores of missing pairings, in order to make the APS ranking less distorted by cases when there are incomplete pairings still? --Rednaxela 21:25, 25 November 2008 (UTC)

Is it possible to modify the 'deviation' so that we have a similar 'spread' in the rankings as the ELO does? And a second to the using G-2 ratings for estimating the score for missing pairings in the APS rankings.--Skilgannon 21:41, 25 November 2008 (UTC)

I'm glad you guys are comparing the ranking systems. From what I can tell, Elo and Glicko-2 ratings seem to settle to the same ranking order as APS, although I've never noticed which converges faster. The Glicko-1 scores haven't worked out so well, so they're not really viable -- I may try replacing that column with a Glicko-2 rating which updates only using the result of the last pairing result. This would speed up uploads since it would eliminate the full pairing query. The three current methods all rely on the same data so just removing one or two won't make a noticeable difference. I suppose we could fill in the APS using Glicko-2 expected scores, that would be interesting. And I could probably scale the Glicko-2 ratings to match the current Elo scores, if that's what you meant, Skilgannon. --Darkcanuck 06:25, 26 November 2008 (UTC)
Yes, that's it exactly. --Skilgannon 06:43, 26 November 2008 (UTC)

Premier League rating calculation

Is the PL score being calculated using only the last battle for each pairing? If so, I would suggest averaging all battles in the same pairing, like the APS system does. For example, if after 4 battles you win 3 times, (2+2+2+0)/4 = 1.5. The PL ranking would be more stable and there would be less ties. --MN 20:58, 16 July 2011 (UTC)

I believe the PL league works like the following (internally), every win adds one to a running total, every loss subtracts one, every actual tie does nothing. Then taking this total, if above zero is considered a win. If below zero considered a loss, and at zero is considered a tie. It then assigns points based on this win/tie/loss, win = 2, tie = 1, loss = 0. — Chase-san 21:16, 16 July 2011 (UTC)
A bot's rankings data (APS, Glicko2, PL, etc) gets updated every time a new battle involving that bot is uploaded. The PL score is based on the APS for each pairing: you get 2 points for every pair where your APS is > 50%. It's a winner-take-all system, so there's no credit given for losing at 49.999%. I didn't even implement the 1-point for a tie, since it's so unlikely to ever happen (except for crashing bots paired together). --Darkcanuck 22:44, 16 July 2011 (UTC)
A lot better than what I was suggesting... --MN 03:53, 17 July (UTC)

Also, can we have a ranking order for the PL league in the rankings page? --MN 20:58, 16 July 2011 (UTC)

In this case I would ask the actual rank column does not get sorted, so whatever it is we sort by we see by its rank. So we could see the ranks for the top worst survival scores as rank 1 to 800+. Just not sorting this column would be far simpler then asking for a rank for every sortable column. The sortable nature of the rank column I think would be used less often. — Chase-san 21:20, 16 July 2011 (UTC)
I could probably link to it (I think the sorting code already exists) but I think most of us just re-sort using the PL column. It gives an interesting view of APS rank vs PL score. --Darkcanuck 22:44, 16 July 2011 (UTC)
Re-sorting using the PL column works, but if the bot is not near the top you have to manually count the rows. Chase suggestion would be perfect. --MN 03:53, 17 July (UTC)
Please no special link for PL. I just got into the top-10, nobody needs to know that I get beaten by that much. ;-) --GrubbmGait 23:16, 16 July 2011 (UTC)

Offline_batch_ELO_rating_system

I added another page with a modified ELO system I made: Offline_batch_ELO_rating_system --MN 12:46, 12 August 2011 (UTC)

Premier League

Perhaps scores within .5 or 1 to 50 should be treated as a tie (since these are hard to hammer out taking many many seasons, and might still be wrong). So 49.5 to 50.5 OR 49.0 to 51.0 splits could be counted as 1 instead of one getting 2 and the other getting 0. This might motivate people to work to break the tie. After all unlike something with a low integer scoring (like soccer/football) ties are harder to come by. If no one agrees, I propose dividing all PL scores by 2 instead. :) — Chase-san 20:45, 26 August 2011 (UTC)

This tie range discussion is an APS-only issue. Elo naturally takes close matches in account by measuring how often each side wins. Also, voting systems, like Condorcet, clearly define 1 vote as tie range. --MN 23:31, 26 August 2011 (UTC)
Dividing PL scores by 2 makes it look a lot like round-robin chess scoring which I also prefer. :) --MN 23:31, 26 August 2011 (UTC)
Seems a bit artificial/arbitrary, you would still get pairings teetering on the same thresholds. I'm actually curious what PL would look like if instead of 1 point for APS > 50, each bot got points based on the fraction of battles won. So a close matchup would score ~0.5 points while a decisive series of battles would yield up to 1 point for the winner. (and then multiply the sum by 2) --Darkcanuck 21:43, 26 August 2011 (UTC)
That actually makes more sense, PL based off survival. — Chase-san 22:14, 26 August 2011 (UTC)
Averaged Winning Rate ranking system? One of the simplest ranking systems, at the same time one of the most fair/accurate, and still very king maker resistant. Only flaw is it doesn´t have a tie-breaker system on its pure form. But other tie-breaker systems like Neustadtl can be easily adapted for this "AWR" system. Also doesn´t handle incomplete pairings very well, but APS suffers from the same problem and is still the main ranking system until now. --MN 23:31, 26 August 2011 (UTC)
I added a column for AWR (W%), seems like a simple and fair ranking method. --Darkcanuck 20:11, 27 August 2011 (UTC)

I noticed a huge difference between PL and APS/Schulze in the Alternative Rankings page. I thought the difference between them was in the tie-breaker criteria only. Are they using the same battle data? --MN 12:16, 27 August 2011 (UTC)

Sorry for my ignorance in Condorcet methods. Schulze also differs in the circular ambiguity resolution criteria. --MN 13:48, 27 August 2011 (UTC)

Schulze & Tideman Condorcet

I like both of these methods. Looking at the scoring, they have some very interesting results. I do somewhat like Tideman's more because of results. — Chase-san 21:05, 26 August 2011 (UTC)

Well, I like Tideman because it catapults Pris to #3, but the result doesn't make much sense to me... --Darkcanuck 21:13, 26 August 2011 (UTC)

I read up on Schulze, sounds like a very cool ranking method. Still gotta investigate Tideman. Perhaps the most interesting result to me is Hydra and Phoenix dropping into the 60s or 80s - holy cow! But now I remember the likely reason: they lose to BulletCatcher. --Voidious 15:48, 27 August 2011 (UTC)

And don't forget the BasicGFSurfer bulletpower bug. I missed almost all discussion during my vacation, and I am reading up on it. Although I am much better ranked in APS, I do see the value of PL and similar rankings. I have mixed feelings though. If 3 BulletCatchers and a few BasicGFSurfer derivates have such an influence as stated above by Voidious, I don't know if I can trust such ranking. To me it seems that a bunch of anti-xxx routines thrown together is better rewarded than a sole generalist implementation. I have to read more and think more to make up my mind definately, although my focus probably won't change. --GrubbmGait 10:26, 28 August 2011 (UTC)

That is kind of like splitting hairs if you ask me. Tuning for any specific situation for point gain is how the game is played, the only variable is how prevalent that situation is. If every robot was bad against 0.05 bullet power then you would have to be insane to 'not' tune your robot to fire for that. The thing is that, if every robot tunes for it, then everyone has the benefit from it. It is like saying it isn't really fair to tune your bot in a more specific way, or to counter a specific set of enemy robots and get overly rewarded for that.
Though I say all this however. I believe you gain more score for tuning towards and improving your general ability then for a specific situation. Tuning for those can only improve your score, not make it. Tuning for enemies that are bad at 0.05 bullet power doesn't matter if you don't fire well enough to hit them in the first place. Nene for example has no anti-xxx routines, not even a flattener, she doesn't seem to do to badly. Seraphim however has a great deal of anti-xxx routines, but does rather badly overall. :(
Chase-san 11:49, 28 August 2011 (UTC)
Exactly, thats why Seraphim does much better in Tideman and Schulze (and PL), even though Nene has a far better base. There is nothing wrong to counteract on specific tactics/strategy (f.e. ramming, bulletcatchers, surfers), I just fear that it will end up to specifically. A bot with 10 specific routines to defeat the top-20 is something I really don't like. The PL ranking is a valuable and respected ranking, and it nags me that I have the worst PL ranking in the history of the top-10, but for me it is not the most important ranking. It will improve automatically when my code improves. The same with Schulze and Tideman Condorcet. It's just not my cup of tea. Still have to read what everybody said and think it over, with the focus on what is the best for the rumble, not for me personally. --GrubbmGait 13:12, 28 August 2011 (UTC)
Well, with Condorcet methods, the prevalence of them actually wouldn't have any increased impact - losing to 10 BulletCatchers is basically as bad as losing to the worst one, not 10x like with APS. But consider how much those specializations have an impact under the rating system. Phoenix losing to a Bullet Shielding implementation is 1 loss in the PL or .01 APS, while he drops from 5 APS/8 PL to 60 or 80 with these Condorcet methods. It's an order of magnitude (or more) impact from a single loss to a gimmicky (if brilliant) strategy. Do you consider WaveShark, MicroAspid, or Tron 2 stronger than Phoenix? I don't, and he crushes them all handily. If a ranking system ranks them above Phoenix, it's hard not to consider it a strike against that rating system. All that said, I still really like these methods and I think this is a really, really extreme case. --Voidious 13:38, 28 August 2011 (UTC)
This is a reverse king-maker scenario: one (or two?) bots are able to send Phoenix (an overall strong bot which rates very high in both APS & PL) spiralling down the rankings. It's not just an extreme case, theoretically anyone could release an anti-InsertBotNameHere and achieve a similar result. But it's more likely with bots who aren't actively maintained. I find these methods interesting, but don't think they're valid for the rumble. The only method I like so far is the AWR (W% column) proposed by MN: similar to PL but with very few ties and less impact from marginal wins. --Darkcanuck 14:22, 28 August 2011 (UTC)

%Wins/Schulze or Score/Schulze Condorcet

I would like to see one of these systems. They take out the APS formula entirely and follow Condorcet principles closer (majority rule instead of averaging). Score/Schulze in particular having no averaging, making the system the closest to Condorcet as I can think of. Tie-breaks are treated entirely inside the Schulze system. --MN 23:31, 26 August 2011 (UTC)

I'm working on %win variations now. Schulze without some sort of normalization will not work since it will give more weight to pairings with more battles. --Darkcanuck 01:03, 27 August 2011 (UTC)
It's done, see SchW and TRPW columns. I didn't implement Schulze tie-breaking (or even tie detection) so that may skew the results (algorithm was complicated enough as-is). You'll notice that TRPW generated a huge number of ties (marked by asterisks) vs using APS only. --Darkcanuck 20:15, 27 August 2011 (UTC)
The results were totally unexpected. Combat got 29 in %Wins and 75 in %Wins/Schulze... and I thought they were similar systems. Maybe I´ll try making a %Wins/Schulze ranking with tie-detection to see what happens. --MN 00:09, 28 August 2011 (UTC)
Reviewing my code, ties are handed to the bot with the higher overall APS. Even in the APS-based Schulze, the results are quite unexpected. I doubt that ties are a major factor... --Darkcanuck 03:59, 28 August 2011 (UTC)

Battle count in Alternative Rankings page

Can battle count per competitor be added as a column in the page? It would be useful to see how sensitive each system is to battle count. --MN 12:16, 27 August 2011 (UTC)

Done. I also added tie indication to the W% column (but unfortunately only marked the second of the tied pair; in the Tideman columns, all tied rankings are marked). --Darkcanuck 03:45, 28 August 2011 (UTC)

Contents

Thread titleRepliesLast modified
Outlier resistant APS system620:37, 16 February 2012
ISDA compliant tie-breaking for PL119:50, 16 February 2012
Mad idea about rankings120:00, 7 October 2011
APS/Schulze demonstration005:39, 7 October 2011
Mixed Rating Systems405:53, 7 October 2011

Outlier resistant APS system

Bad uploads are becoming a recurrent issue.

But there is a way to shield the ranking from these uploads using median instead of mean when calculating pairing APS, as long as bad uploads don´t outnumber the others. The drawback would be a performance hit in CPU/database during uploads.

MN14:18, 16 February 2012

I think an outlier resistant APS system may be good, but I do have a concern that using the median may 1) distort scores when valid data would cause a distribution that has a skew, and 2) In the cases where there are no outliers for it to fix anyway, it would generate more noisy values (The median generally has larger fluctuations than the mean as samples are added).

I'd think it may be worth considering statistical methods to calculate the probability of a data point being an outlier, and ignoring it if it's beyond a threshold. It may be possible for such methods to not alter the means of skew caused by valid data, and will have smaller score fluctuations.

Another thought is, regardless of if we change the APS system or not, it may make sense for the rumble to have a page that lists recent outlier results, to make it easier to spot them.

Rednaxela17:46, 16 February 2012

One way to see skewed distributions is median taking it into account while mean assuming all distributions are symmetric. So it is not "distortion", but it may affect APS as we are used to.

But yes, mean needs less battles than median when the true average is near 50% (symmetric distributions) and there are no outliers.

There are other more sofisticated statistical methods for dealing with outliers, like percentile, which is somewhere between mean and median. But for me, median is good enough and is fully automated.

(I would never even imagine these things exist if it were not for Robocode and the quest for the ultimate statistical gun)

MN19:05, 16 February 2012

About the skewed distributions, fair enough. I still am concerned about the greater noise of medians though.

The more sophisticated method that was coming to my mind, was calculating the z-score of each sample per pairing, tossing out results that have too extreme of a z-score value, and using the mean of the remaining samples. The reason this appeals to me, is because it changes the existing scoring system as little as possible.

Most bad results we see are near-zero scores which should be quite distinctly detected by a z-score test, so reliably tossing them out without changing the overall scoring system would be quite doable I think.

Rednaxela19:33, 16 February 2012

Using z-score as threshold will need tuning to work properly, because it has unpredictable robustness. Remembering sampled standard deviations are also affected by outliers.

Choosing a very low percentile and a very high percentile as boundaries and averaging everything in between may have the effect you want without relying on noisy sampled deviations. Like 25%/75% (1st/3rd quartiles).

MN20:37, 16 February 2012
 
 
 

Taking that a step further, you could take the probability of a data point being wrong and look at how many of those are coming from each user, and over a certain threshold, all their uploads (or all within X hours of this happening) are ignored. (Or an email is fired off to Darkcanuck to look into it. =)) I like the idea of public page listing outlier results, too.

The median idea also seems good though, and very simple. I'd be curious to see if/how much the Rumble scores change using median instead of mean. My guess is by no noticeable amount.

Voidious17:52, 16 February 2012

You do not have permission to edit this page, for the following reasons:

  • The action you have requested is limited to users in the group: Users.
  • You must confirm your email address before editing pages. Please set and validate your email address through your user preferences.

You can view and copy the source of this page.

Return to Thread:Talk:Darkcanuck/RRServer/Ratings/Outlier resistant APS system/reply (4).

 
 

ISDA compliant tie-breaking for PL

I was thinking in another tie-breaking for the PL ranking system (Copeland). Currently it uses full APS as tie-breaking, but it sacrifices ISDA, which is a key property from Copeland to prevent low ranked competitors from interfering with the top ones (king-making).

Tie-breaking could use only APS between tied competitors instead, to preserve ISDA.

Example 1: DrussGT, Diamond and Tomcat are all tied in first place. Tie-breaking score for DrussGT would use only Diamond and Tomcat. Score for Diamond would use DrussGT and Tomcat. Score for Tomcat would use DrussGT and Diamond. As if those 3 were the only competitors in the rumble.

Example 2: DrussGT and Tomcat are tied in first place. Tie-breaking score for DrussGT would use only Tomcat. Score for Tomcat would use DrussGT. It would be a head-to-head comparison between the 2.

Competitors not tied to anyone else can have any tie-breaking score, as it would not change the ranking.

MN01:20, 3 January 2012

I had forgotten to respond to this when I first read it. But I quite like this idea. It has for a long time seemed strange to me that PL used APS for tie breaking, considering the goals/philosophies of the systems are so different.

Rednaxela19:50, 16 February 2012
 

Mad idea about rankings

Just mad idea about rankings. What if robot's rating is: <percents from 1st place's APS in 1-vs-1> + <percents from 1st place's APS in melee> * 0.6 + <percents from 1st place's APS in temas> * 0.3?

For example for DrussGT it will be 100 + 0 * 0.6 + 0 * 0.3 = 100, for Diamond 99.23 + 99.8 * 0.6 + 0 * 0.3 = 158,88, for Shadow 94.43 + 98.68 * 0.6 + 100 * 0.3 = 183,64

It allow develop melee and teams rumbles

Jdev12:42, 7 October 2011

I think that's kind of looking at it backwards. I don't think the combined ranking would make people more interested in Melee/Teams, but I do think if more people were building cross-divisional bots, such a ranking would be more interesting. =)

I don't think it's the lack of rankings that make people uninterested in Melee/Teams, I think people just like 1v1. And in that case, people will just focus on 1v1 rankings, even if combined rankings are available. I think the best thing you can do to get people interested in Melee/Teams is to work on them yourself!

I also think about other rule sets occasionally. Sometimes I think about building a MegaBot Twin Duel team, even if it couldn't compete in the rumble.

Voidious20:00, 7 October 2011
 

APS/Schulze demonstration

It´s been a while, but I finally implemented a Schulze ranking using everything that Schulze paper offers. It has tie-breaking and path reconstruction, together with some statistics and coloring to make easier to track what the method did. It uses pairing APS as ballot.

Tried using % wins in Schulze but it failed miserably as everyone from rank 8 to 817 were tied together, making the system deteriorate into a random ballot ranking. Shadow tied together with Worst with tie-breaking relying on a coin-flip was not what I had in mind. As far as I understood what went wrong, normalizing wins dividing them by the number of battles sacrificed resolvability.

The results can be downloaded here (data using Darkcanuck 2011-08-09 backup): [1] [2] [3] [4]

I also added other ranking systems, like APS/Elo (batch), Premier League and APS that everyone is used to. Also added a Simpson-Kramer ranking, due to Skilgannon request to see who is performing best against each bot. Simpson-Kramer is also good as statistics for Schulze.

I took Chase "for fun" advice seriously and tried to make something more appealing.

MN05:39, 7 October 2011

Mixed Rating Systems

You know, since there's a little bit of disagreement about what kind of rating system best reflects the kind of improvements valued in robots, I wonder if it would make sense to have a mixed system for overall ranking? What about using a condorcet voting method to resolve the rankings, where each ranking system is a "ballot" that fills out the rank of all robots? One "ballot" could be APS, another "ballot" could be W%, and perhaps another "ballot" could be S%. The resulting rankings would essentially be based on the consensus of the three component ranking systems. I think it's natural for robots that excel in all three metrics to rank high, and is perhaps a more universally agreed and more general notion of strength than just one metric alone.

Thoughts?

Rednaxela15:31, 21 September 2011

I like that approach myself. Though I think I would limit it to APS and W%.

Skotty16:30, 21 September 2011
 

Nevermind, forget what I said. I didn't read it close enough. I was thinking that I would like something that averages APS and W%, with the weight of each not necessarily being equal.

Skotty16:54, 21 September 2011
 

Averaged would be nice in that it would allow easier differentiation between small changes, but on the other hand, condorcet ranking has the property of not allowing one metric to dominate. I think there are advantages to each way of mixing.

Rednaxela17:56, 21 September 2011
 

Both condorcet and averaging can allow one kind of metric to dominate through clones. Similar methods together will overweight other unique methods. Ballots need to be independent from the candidates (methods) to avoid the clone problem.

MN05:53, 7 October 2011