Talk:Offline batch ELO rating system

From Robowiki
Revision as of 22:25, 12 August 2011 by MN (talk | contribs) (Rating criteria reply)
Jump to navigation Jump to search

Quite interesting. I fail to see how this makes "less assumptions on results than APS" however, since I don't believe APS makes any assumptions whatsoever, being a simple averaging of all pairings a robot is involved in. I don't see this as being less biased. Similarly valid certainly, but not less biased.
One thing I think is important to note is, I'm pretty sure the fact that you're rounding the result to a win/loss/draw probably explains the *vast* majority of the difference between this and the APS/ELO/Glicko-2 on the rumble server. I suspect that APS or the iterative roborumble ELO result wouldn't be that different if they performed the same rounding.
--Rednaxela 13:12, 12 August 2011 (UTC)

APS assumes a bots strength is proportional to the proportional difference in scores.
If a bot scores 2 and the opponent scores 1 (67%/33%), it is stronger than if it scored 3 and bot B scored 2 (60%/40%). Both bots increased the score by the same amount, but the APS system assumes one is stronger than the other.
If a bot scores 100% against MyFirstRobot and 40% against DrussGT, it is stronger than if it scored 60% against MyFirstRobot and 60% against DrussGT. APS assumes that +40% against MyFirstRobot is worth more than the +20% against DrussGT, leading to king-making.
Arpad Elo original rating system inferred strength on frequency of wins/losses alone, not proportional score difference. If a bot beats another, it is only better, not a little better or a lot better. --MN 16:32, 12 August 2011 (UTC)
If it scores 100% against MyFirstRobot and 40% against DrussGT, it is stronger than if it scored 60% against MyFirstRobot and 60% against DrussGT. In my opinion anyway, getting 100% against any robot is more time consuming and bug hunting and hard work then getting an okayish score against many robots (which many rambots, mirror bots and and random targeters can do). Are you telling me you think a random targeter is stronger then a highly tuned statistical targeting? — Chase-san 17:36, 12 August 2011 (UTC)
A simple random targeter will probably perform badly because it is unstable and the frequency of wins will decrease. Remember, its about the win frequency, not the score average. Pushing the average score alone above 50% only works in PL.
But I believe a more sofisticated random targeting with max escape angle/precise prediction can do a nice job as an anti-surfing gun. The same is true for wave surfing, as strong bots usually switch to a simpler curve flattening movement when they face strong opposition. They are both in my bots to-do list. --MN 19:56, 12 August 2011 (UTC)
In practice "the frequency of wins" does not decrease substantially in that case when you use whole-battle score. Now... if you counted every *round* of the battle as a separate win/loss, I could agree with you perhaps, but because of the way that 35 rounds has a relatively low standard deviation... --Rednaxela 20:34, 12 August 2011 (UTC)
APS (and the server's ELO & Glicko-2 ratings which are based on it) is simply a measure of bot perfection. To get the best APS, you need to beat all competitors by the widest margin possible. That's one way of determining the "best" or "strongest" bot, but certainly not the only way. If you want to differentiate your scoring method from APS, you should outline what the criteria for "best" or "strongest" are, with realistic examples taken from the rumble (for example, there are bots ranked higher in PL than APS). Then it will be easier to compare versus the current system. --Darkcanuck 19:37, 12 August 2011 (UTC)
The main criteria is winning frequency. The stronger a competitor is, the more wins it should get from the opposition. Another criteria is wins against stronger competitors being worth more than wins against weaker ones. To get the best rating possible (infinite), the competitor needs to be "unbeatable". To climb the ranking consistently, you need to be better than stronger and stronger opponents as you go up. I noticed Shadow and SandbotDT climbed a lot of positions with this system. It feels a lot like PL, but logistical distribution is more popular and it is cool. :-P --MN 20:43, 12 August 2011 (UTC)
There is a catch though. Since the rating system makes the environment more competitive (strong competitors against strong competitors instead of everyone against weak competitors), it rewards strategical algorithms more, and cool but strategically inferior AI algorithms less. For example, random algorithms are more powerful strategically than deterministic ones, because randomness is an effective counter against learning algorithms. So, random targeting can be rewarded more than guess factor targeting as Chase complained.
Before anyone asks what I mean by "strategical", it is taking in account how your opponents will react to your algorithms and designing hard-to-counter algorithms in the first place. --MN 21:10, 12 August 2011 (UTC)
Of course, applying a boolean stronger/weaker to those two results is subjective. The only thing you can say objectively is that each has different strengths, like comparing the intelligence of chimps to dolphins to dogs. I often wonder how different our Robocode landscape would look if we hadn't settled on APS / "crush the weak" style measurements of robot strength. What if the thing that separated #1 from #2 is that #1 actually beat #2, and everyone else, over hundreds of rounds/battles? Adaptive, anti-adaptive, anti-anti-anti-adaptive movement and targeting would be a lot further along than they are now, I reckon. I'd like to put some focus on that front myself, sometime. I'll check these results when I am somewhere that I can unRAR. =) --Voidious 17:50, 12 August 2011 (UTC)
Thanks Voidious, at least someone understood the value of this rating system. As I remember, percentage score was only used by Albert a long time ago because he feared the rankings would be too slow to stabilize otherwise. --MN 19:56, 12 August 2011 (UTC)
The whole point of the Arpad Elo origignal rating system inferring strength based on frequency of wins/losses alone, is because in Chess wins/losses are the only measure of performance that the rules of the game provide. There is no (widely accepted) way of measuring the range between "barely won" and "obliterated" in Chess.
The same system is used in other games, like Go and soccer, which also have an obvious score. The difference is some of them increase the K adjust value when the scores are too far apart, but they still converge to the same setting. I made a bit of research on rating systems before making this one. --MN 19:56, 12 August 2011 (UTC)
It makes assumptions too. Original ELO assumes that your chances of winning against an opponent are proportional to rank difference, and that any deviations are due to random chance during measurement. Essentially it assumes all competitors are generalists, which is relatively true with humans in Chess, but is relatively untrue in Robocode with a variety of different strategies with all sorts of different performance characteristics against different subsets of enemies. It's by no means assumptionless.
Really though, any method you pick has subjective aspects. Also, the "score" number itself is not really objective either, and thus a win/loss (or anything else) based on it is not objective either anyway.
--Rednaxela 18:21, 12 August 2011 (UTC)
Time for that StrongestBotsSurvivalistPremierLeagueRumble? =) --Voidious 18:27, 12 August 2011 (UTC)
Well, I'm more curious about SurvivalistCondorcetRumble actually. I think such methods could be relatively fair and interesting. :P --Rednaxela 18:34, 12 August 2011 (UTC)
Oh, yes - even better! =) Funny - I'm half joking around, half totally in love with the idea. =) --Voidious 18:39, 12 August 2011 (UTC)

Practical Considerations

If you want a new rating system added to the server, it should:

  • use an online algorithm
  • execute on the order of < 1 second worst-case, preferably < 10 milliseconds
  • not require an excessive amount of data retrieval per scoring iteration

--Darkcanuck 19:43, 12 August 2011 (UTC)