Talk:Offline batch ELO rating system

From Robowiki
Revision as of 15:18, 4 September 2011 by Peltco (talk | contribs)

Quite interesting. I fail to see how this makes "less assumptions on results than APS" however, since I don't believe APS makes any assumptions whatsoever, being a simple averaging of all pairings a robot is involved in. I don't see this as being less biased. Similarly valid certainly, but not less biased.
One thing I think is important to note is, I'm pretty sure the fact that you're rounding the result to a win/loss/draw probably explains the *vast* majority of the difference between this and the APS/ELO/Glicko-2 on the rumble server. I suspect that APS or the iterative roborumble ELO result wouldn't be that different if they performed the same rounding.
--Rednaxela 13:12, 12 August 2011 (UTC)
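For concreteness, the rounding step being discussed might look like the sketch below (the class and method names are illustrative, not the rumble server's actual code):

```java
// Sketch of collapsing a battle's score pair into a win/draw/loss result.
// Names are illustrative; this is not the rumble server's actual code.
public class BattleRounding {
    /** Returns 1.0 for a win, 0.5 for a draw and 0.0 for a loss. */
    public static double roundToResult(int myScore, int opponentScore) {
        if (myScore > opponentScore) return 1.0;
        if (myScore < opponentScore) return 0.0;
        return 0.5; // exact tie
    }
}
```

Feeding these 1.0/0.5/0.0 results into an averaging or Elo-style system, instead of the raw score percentages, is the rounding in question.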

APS assumes a bot's strength is proportional to its share of the total score.
If a bot scores 2 and its opponent scores 1 (67%/33%), it is rated stronger than if it had scored 3 against the opponent's 2 (60%/40%). The margin is the same in both cases, but the APS system treats one result as stronger than the other.
If a bot scores 100% against MyFirstRobot and 40% against DrussGT, APS rates it stronger than if it had scored 60% against MyFirstRobot and 60% against DrussGT. In other words, APS assumes that +40% against MyFirstRobot is worth more than +20% against DrussGT, leading to king-making.
Arpad Elo's original rating system inferred strength from the frequency of wins/losses alone, not from the proportional score difference. If a bot beats another, it is only better, not a little better or a lot better. --MN 16:32, 12 August 2011 (UTC)
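The arithmetic behind the king-maker claim can be sketched directly. Here APS is taken to be a plain average of per-pairing score percentages, and the inputs are hypothetical:

```java
// Sketch: APS as a plain average of per-pairing score percentages (0..100).
// The scores used in the usage note below are hypothetical examples.
public class ApsSketch {
    public static double aps(double... pairingPercentages) {
        double sum = 0;
        for (double p : pairingPercentages) {
            sum += p;
        }
        return sum / pairingPercentages.length;
    }
}
```

With these inputs, `aps(100, 40)` gives 70 while `aps(60, 60)` gives 60: the +40% against the weak bot outweighs the -20% against the strong one in the average.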
If it scores 100% against MyFirstRobot and 40% against DrussGT, it is stronger than if it scored 60% against MyFirstRobot and 60% against DrussGT. In my opinion anyway, getting 100% against any robot takes more time, bug hunting and hard work than getting an okayish score against many robots (which many rambots, mirror bots and random targeters can do). Are you telling me you think a random targeter is stronger than a highly tuned statistical targeter? — Chase-san 17:36, 12 August 2011 (UTC)
A simple random targeter will probably perform badly because it is unstable and its frequency of wins will decrease. Remember, it's about the win frequency, not the score average. Pushing the average score alone above 50% only works in PL.
But I believe a more sophisticated random targeting gun with max escape angle/precise prediction can do a nice job as an anti-surfing gun. The same is true for wave surfing, as strong bots usually switch to a simpler curve-flattening movement when they face strong opposition. They are both on my bot's to-do list. --MN 19:56, 12 August 2011 (UTC)
In practice "the frequency of wins" does not decrease substantially in that case when you use whole-battle score. Now... if you counted every *round* of the battle as a separate win/loss, I could agree with you perhaps, but because of the way that 35 rounds has a relatively low standard deviation... --Rednaxela 20:34, 12 August 2011 (UTC)
I'm not sure I agree with your interpretation of what a 'strong' bot is. --Skilgannon 12:46, 14 August 2011 (UTC)
There is no interpretation of strong. But there is an assumption about what a stronger competitor is. If a competitor is stronger than another and you put them against each other, the stronger one will win. That's it. But since there are more than 2 competitors in the league and the ranking is uni-dimensional, one more assumption is necessary: if competitor A is stronger than competitor B, and competitor B is stronger than competitor C, then competitor A is stronger than competitor C. Both assumptions can't always be true, as there are close matches and rock-paper-scissors situations, but the rating system tries to minimize the number of situations where the assumptions fail and a competitor with a lower rating wins against a competitor with a higher rating. --MN 16:36, 14 August 2011 (UTC)
I didn't invent those assumptions. This is how rating systems work in most sports around the world and I wanted to bring it to RoboRumble. The only thing I really changed was trying to make all the calculations offline, a luxury viable in RoboRumble, but not in most sports. --MN 16:36, 14 August 2011 (UTC)
Y'know, while I'm unconvinced there are fundamental problems with APS, rating systems that work in the way you discuss are interesting as well. You may be interested to read about Condorcet method voting systems, for instance. By definition they are a class of voting systems where, if a candidate would win a run-off against any opponent, they win overall. The thing is, completely ignoring the magnitude of wins is not the only way to interpret "strong" with those types of assumptions. --Rednaxela 17:20, 14 August 2011 (UTC)
Why wasn't it done since the birth of RoboRumble? Hard to answer, but reading posts around this Wiki and the old one, I believe it was because, at the time, the limited infrastructure available couldn't handle all the necessary computation, online or offline. --MN 16:36, 14 August 2011 (UTC)
I do not see how computational limits would have been a factor. If one ignores the drift prevention mechanisms and such, I believe what you propose here is at its core rounding scores to win/tie/loss, which I cannot imagine being a substantial computational burden. --Rednaxela 17:20, 14 August 2011 (UTC)
If a bot only manages to get 60% against MyFirstRobot there is no way I could consider it strong, no matter how well it pounded DrussGT. You see, it is much easier to write MyFirstRobot, so in my mind, in the Robocode wild, there are likely to be thousands upon thousands of scrawny little MyFirstRobots crawling around. Any bot which has 40% damage inflicted upon it by each MyFirstRobot encountered wouldn't last long enough to even face a DrussGT, or a Shadow, or a YersiniaPestis. --Skilgannon 12:46, 14 August 2011 (UTC)
This is not entirely true. Being hurt more by weaker bots while trying to dodge shots from stronger bots is a problem for anyone using a curve-flattening movement. Stronger bots use curve flattening, weaker bots do not. But in melee your reasoning is accurate: being shot often by MyFirstRobot will kill you fast. I believe this pounding will be taken into account as a reduced winning rate, though. --MN 16:36, 14 August 2011 (UTC)
If I disable my curve flattening in DrussGT I will probably lose 1 or 2 pairings, maybe 3. That means that in 10 years of robocode only 3 bots have been created against which turning on my curve flattening would help enough to change the results in your ranking system. There are a lot of bots that are ranked highly against which turning on curve flattening actually hurts my score. Of course, my curve flattening is, by default, turned off so that I specifically do not try to dodge stronger bots when against weaker ones. Only a high hitrate against my regular surfing can turn my curve flattening on. There is much more of a challenge in getting a robot to perform well across a wide variety of enemies than specifically optimizing it for a single segment. Against Dookious, for example, using a random gun constrained by precise MEA is just as accurate as any targeting mechanism I've found yet, but using something like this against CunobelinDC is going to hurt my score compared to a low-rolling average coarsely segmented VCS gun. --Skilgannon 06:03, 15 August 2011 (UTC)
Let me put it this way: if you started with infinite energy, and had to face every single bot in the rumble once, then the current APS system does a pretty good job of seeing who would be the least damaged at the end of it all. --Skilgannon 12:46, 14 August 2011 (UTC)
I agree. The APS rating system pushes the 1v1 rumble towards being one big movement/targeting challenge. But it can be more than that. --MN 16:36, 14 August 2011 (UTC)
APS (and the server's ELO & Glicko-2 ratings which are based on it) is simply a measure of bot perfection. To get the best APS, you need to beat all competitors by the widest margin possible. That's one way of determining the "best" or "strongest" bot, but certainly not the only way. If you want to differentiate your scoring method from APS, you should outline what the criteria for "best" or "strongest" are, with realistic examples taken from the rumble (for example, there are bots ranked higher in PL than APS). Then it will be easier to compare versus the current system. --Darkcanuck 19:37, 12 August 2011 (UTC)
The main criteria is winning frequency. The stronger a competitor is, the more wins it should get from the opposition. Another criteria is wins against stronger competitors being worth more than wins against weaker ones. To get the best rating possible (infinite), the competitor needs to be "unbeatable". To climb the ranking consistently, you need to be better than stronger and stronger opponents as you go up. I noticed Shadow and SandbotDT climbed a lot of positions with this system. It feels a lot like PL, but logistic distribution is more popular and it is cool. :-P --MN 20:43, 12 August 2011 (UTC)
There is a catch though. Since the rating system would make the rumble more competitive (strong competitors against strong competitors instead of everyone against weak competitors), it rewards strategical algorithms more, and cool but strategically inferior AI algorithms less. For example, random algorithms are more powerful strategically than deterministic ones, because randomness is an effective counter against learning algorithms. So, random targeting can be rewarded more than guess factor targeting as Chase complained.
Before anyone asks what I mean by "strategical": it is taking into account how your opponents will react to your algorithms, and designing hard-to-counter algorithms in the first place. --MN 21:10, 12 August 2011 (UTC)
Ok, I think I see what you're trying to achieve. It wouldn't be very difficult to add an online ranking calculation (say Glicko-2 or logistic Elo) based on win % between pairings. That should give similar results to your batch run, but without the nice scaling. --Darkcanuck 03:52, 13 August 2011 (UTC)
Yes, it can be done online. Simply use the original ELO and Glicko-2 rating systems unmodified, as they were designed that way. But accuracy will be lower because of the dreaded problem bot rating drift. This problem is also a reality in the chess community. I tried to fix the data ordering dependence of online algorithms, as well as to suggest a more king-making-resistant system. --MN 14:01, 13 August 2011 (UTC)
Glicko-2 ratings (based on APS) have held up well so far, without significant ratings drift. --Darkcanuck 18:42, 13 August 2011 (UTC)
Albert's modification to ELO made it an incremental/batch hybrid system, which partially solves the rating drift problem. It processes all pairings of a given bot at the same time. And that's where the idea of going all the way and making a full batch system came from. --MN 19:49, 13 August 2011 (UTC)
Batch processing seemed a good solution, and it became possible in the rumble thanks to Darkcanuck's initiative of storing all battle history on the server. It wasn't possible in Albert's original server implementation. --MN 17:07, 23 August 2011 (UTC)
Rednaxela pointed out a historical contamination problem in melee, which will happen with any system that doesn't take data ordering into account. Standard incremental Elo/Glicko-2 using winning rates addresses the king-maker problem, and also addresses the historical contamination problem. The way the client/server protocol is implemented now, incremental Elo will be more accurate than batch Elo. Remaking the protocol from scratch with melee taken into account would also fix this issue. --MN 14:01, 13 August 2011 (UTC)
Of course, applying a boolean stronger/weaker to those two results is subjective. The only thing you can say objectively is that each has different strengths, like comparing the intelligence of chimps to dolphins to dogs. I often wonder how different our Robocode landscape would look if we hadn't settled on APS / "crush the weak" style measurements of robot strength. What if the thing that separated #1 from #2 is that #1 actually beat #2, and everyone else, over hundreds of rounds/battles? Adaptive, anti-adaptive, anti-anti-anti-adaptive movement and targeting would be a lot further along than they are now, I reckon. I'd like to put some focus on that front myself, sometime. I'll check these results when I am somewhere that I can unRAR. =) --Voidious 17:50, 12 August 2011 (UTC)
Thanks Voidious, at least someone understood the value of this rating system. As I remember, percentage score was only used by Albert a long time ago because he feared the rankings would be too slow to stabilize otherwise. --MN 19:56, 12 August 2011 (UTC)
The whole point of Arpad Elo's original rating system inferring strength from the frequency of wins/losses alone is that in Chess, wins/losses are the only measure of performance the rules of the game provide. There is no (widely accepted) way of measuring the range between "barely won" and "obliterated" in Chess.
The same system is used in other games, like Go and soccer, which also have an obvious score. The difference is that some of them increase the K adjustment value when the scores are far apart, but they still converge to the same ratings. I did a bit of research on rating systems before making this one. --MN 19:56, 12 August 2011 (UTC)
It makes assumptions too. Original ELO assumes that your chances of winning against an opponent are determined by the rating difference, and that any deviations are due to random chance during measurement. Essentially it assumes all competitors are generalists, which is relatively true of humans in Chess, but relatively untrue in Robocode, with its variety of strategies that have all sorts of different performance characteristics against different subsets of enemies. It's by no means assumptionless.
Really though, any method you pick has subjective aspects. Also, the "score" number itself is not really objective either, and thus a win/loss (or anything else) based on it is not objective either anyway.
--Rednaxela 18:21, 12 August 2011 (UTC)
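For reference, the assumption under discussion is the standard Elo expected-score curve, a logistic function of the rating difference. This is the textbook formula with Elo's conventional 400-point scale factor, not necessarily the exact constants the rumble server uses:

```java
// Textbook Elo: expected score is a logistic function of rating difference.
// The 400-point scale factor is Elo's conventional choice, assumed here.
public class EloSketch {
    /** Expected score (win probability) for a player rated ra against rb. */
    public static double expectedScore(double ra, double rb) {
        return 1.0 / (1.0 + Math.pow(10.0, (rb - ra) / 400.0));
    }

    /** One incremental update: move the rating by k times the surprise. */
    public static double update(double rating, double actual, double expected, double k) {
        return rating + k * (actual - expected);
    }
}
```

A 200-point gap maps to an expected score of roughly 0.76 no matter which strategies produced it, which is exactly the "all competitors are generalists" assumption.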
Time for that StrongestBotsSurvivalistPremierLeagueRumble? =) --Voidious 18:27, 12 August 2011 (UTC)
Well, I'm more curious about SurvivalistCondorcetRumble actually. I think such methods could be relatively fair and interesting. :P --Rednaxela 18:34, 12 August 2011 (UTC)
Oh, yes - even better! =) Funny - I'm half joking around, half totally in love with the idea. =) --Voidious 18:39, 12 August 2011 (UTC)
Haha, that is great! — Chase-san 07:15, 15 August 2011 (UTC)

Practical Considerations

If you want a new rating system added to the server, it should:

  • use an online algorithm
  • execute on the order of < 1 second worst-case, preferably < 10 milliseconds
  • not require an excessive amount of data retrieval per scoring iteration

--Darkcanuck 19:43, 12 August 2011 (UTC)

I'm curious, why is an online algorithm a hard requirement? As I see it, if an algorithm really was worth it, a daily cron job doesn't seem like a huge burden. (That said, live updates are nice of course :P) --Rednaxela 00:02, 13 August 2011 (UTC)

Well, I said "should", not "must". A cron job is always possible, of course. But an online algorithm would be far more efficient in terms of memory and cpu usage. You would have to do a really good sales job to convince me that a batch algorithm is worth it... --Darkcanuck 01:51, 13 August 2011 (UTC)
It doesn't need to be daily. It can run more often, say every 10 minutes. The win/draw/loss count per pairing, which took over 10 minutes, can easily be updated online with perfect accuracy. Only the logistic regression iterations are tricky to do online without losing accuracy. They took about a minute and a half in my implementation (215 iterations over 663410 pairings), and I abused the Java Collections framework. The pairing map is declared like this:
Map<BotData, Map<BotData, Pairing>> pairings = new HashMap<BotData, Map<BotData, Pairing>>();

--MN 14:01, 13 August 2011 (UTC)
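For illustration, one batch pass over that nested map might look like the sketch below. The `BotData` and `Pairing` fields are assumptions, and a plain fixed-step gradient update stands in for the Rprop step-size adaptation used in the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class BatchRatingSketch {

    static class BotData {
        final String name;
        double rating = 1500.0; // starting rating, an illustrative choice
        BotData(String name) { this.name = name; }
    }

    static class Pairing {
        final double winRate; // (wins + 0.5 * draws) / battles for this pairing
        Pairing(double winRate) { this.winRate = winRate; }
    }

    /** Expected win rate under the logistic (Elo) curve. */
    static double expected(double ra, double rb) {
        return 1.0 / (1.0 + Math.pow(10.0, (rb - ra) / 400.0));
    }

    /**
     * One batch pass: accumulate (actual - expected) over every pairing of
     * every bot, then move each rating by step * error. Returns the total
     * squared error, which approaches zero as the ratings stabilize.
     */
    static double iterate(Map<BotData, Map<BotData, Pairing>> pairings, double step) {
        Map<BotData, Double> errors = new HashMap<BotData, Double>();
        for (Map.Entry<BotData, Map<BotData, Pairing>> row : pairings.entrySet()) {
            BotData bot = row.getKey();
            double error = 0;
            for (Map.Entry<BotData, Pairing> cell : row.getValue().entrySet()) {
                error += cell.getValue().winRate - expected(bot.rating, cell.getKey().rating);
            }
            errors.put(bot, error);
        }
        double total = 0;
        for (Map.Entry<BotData, Double> e : errors.entrySet()) {
            e.getKey().rating += step * e.getValue(); // all updates after the full pass
            total += e.getValue() * e.getValue();
        }
        return total;
    }
}
```

The returned squared-error total doubles as a convergence measure; it approaches zero as the ratings stabilize.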

How much accuracy is really lost by doing this online? --Darkcanuck 18:42, 13 August 2011 (UTC)
Don't know. Accuracy depends on which K value is used, the ordering/quantity of battles and the competitors involved. The idea behind batch processing was not having to answer this question: instead of worrying about accuracy, you worry about performance. And the choice of Rprop was a performance improvement, as the original K-adjust algorithm was not stabilizing the rankings even after 2000 iterations. Probably a bad K value, but with Rprop you don't have to worry about guessing a good K value anymore. --MN 19:49, 13 August 2011 (UTC)
But I think there is a way to estimate the accuracy of a given ranking: add the squared PBI average of all competitors together and you have a good guess, the lower the better. With Rprop batch, after 217 iterations the total was near 0.0000000001. --MN 19:49, 13 August 2011 (UTC)

"King-making"

Could you elaborate a bit on what you mean by "king-making", and why it's bad? I hadn't heard the term before, and after a bit of reading up on its meanings, I'm still not sure how it applies to the RoboRumble and APS rankings. --Voidious 17:59, 14 August 2011 (UTC)

Sorry, I used the wrong term. It is king-maker, not king-making. I created another page using APS examples. --MN 02:50, 14 August 2011 (UTC)

Some snags, IMO

I think there is one minor problem with this ranking system. Surely, once you've beaten everybody in all your battles there is no room for improvement? =) And surely, by making the amount you beat them by not important, bots will be willing to give up strategies which help them against a vast majority of bots, just because they hurt against 1 or 2 'problem' bots? I think that the concept you have of some sort of scoring mechanism which gives higher weights to battles against strong bots certainly has merit, but having a binary win/lose is quite a bit of deadband, don't you think? There is a big difference between getting 49% against Shadow and getting 22% against Shadow... --Skilgannon 06:20, 15 August 2011 (UTC)

It doesn't give higher weight to stronger bots, it gives equal weight to both stronger and weaker bots. Since it's easier to increase your advantage over weaker bots, taking the score difference into account ends up putting more weight on pairings against them. --MN 14:00, 15 August 2011 (UTC)
Not sure I agree with you there; "increasing the advantage over weak bots" is very hard when you're already approaching an absolute maximum. Going from 98.5% to 99% against DevilFish is a lot harder than going from 38.5% to 39% against Shadow. Against DevilFish you have essentially cut the score you concede by a third, whereas against Shadow the same improvement is lost in the background noise, perhaps caused by tweaking a weighting value in a virtual gun. Not taking that hard-earned extra 0.5% against weak bots into account is my main problem with this scoring mechanism. --Skilgannon 09:40, 16 August 2011 (UTC)

Did you realize that the APS rating system also makes increasing from 98.5% to 99% against DevilFish worth the same as 38.5% to 39% against Shadow? But it also makes going from 50.5% to 70% against DevilFish worth more than going from 38.5% to 50.5% against Shadow. And going from 50.5% to 100% against SittingDuck worth more than going from 38.5% to 50.5% against Shadow... and 2 Shadows... and 3 Shadows... and 3 Shadows plus 7 hard-earned extra 0.5% against DevilFish. And that's why king-maker scenarios suck so much. --MN 01:37, 17 August 2011 (UTC)

When it's entirely possible to get above 94% average without even using wave surfing, that essentially gives a grand total of 6% worth of score variation that any complete wavesurfing algorithm should be able to eke out of it. I'm not sure where you're getting your 50.5% from. Similarly against SittingDuck, sitting still and firing directly at them gets 100%. That takes about 2 lines of code. I'm really not following your logic here. --Skilgannon 06:15, 17 August 2011 (UTC)

Sitting Duck has the strength of 3 Shadows! That's why I love the sample bots. :P --MN 01:37, 17 August 2011 (UTC)

Well... any bot getting 50.5% against SittingDuck ought to be ashamed of itself I'd say :P --Rednaxela 02:16, 17 August 2011 (UTC)
Or be proud. Actually, it is harder to score 50.5% against SittingDuck than 100%. You need a score control algorithm to avoid crushing it. --MN 03:46, 17 August 2011 (UTC)
So this is what your new scoring algorithm is promoting? I fail to see how this is relevant. --Skilgannon 06:15, 17 August 2011 (UTC)
And going from 38.5% to 39% against anything? That's a meaningless change, well within noise margins, whereas going from 98.5% to 99% against anything tends to be hard earned change, because getting hit 33% less generally takes work to accomplish no matter where the hitrate started. --Rednaxela 02:16, 17 August 2011 (UTC)
But any ideas how to value that hard-earned +0.5% when a bunch of +40% from king makers comes along with it? The 0.5% is just noise in the presence of king-maker interference. --MN 03:46, 17 August 2011 (UTC)
This bot IS one that, under your criteria, would be considered a king maker. The good bots are able to get 98% against it, yet it still beats the pants off of some bots. The thing is, everybody has the opportunity to get 98% against it, if they put in the effort and make a robust adaptive movement. As such, the playing field is levelled. Interestingly, the same algorithms that get 98% against this bot are also the ones that do the best against adaptive guns. In an APS scoring system, there is no requirement to score highly against 'king maker' bots. The thing about using a real dataset is that there are strong bots, and there are weak bots. Sure, against the weak bots you might get +90%. But if others are also getting +90% then it doesn't mean that they're being 'weighted more highly'. It only means that the average score against them is higher. Which is how the entire APS system works. If anything, I would say that, due to the absolute limit that is approached or met by many top bots, the APS system actually counts the weaker bots less because there is less available score variation. After all, it is the difference in scores between bots that differentiates the top bots from the weaker bots, not how high everybody scores relative to some arbitrary measure. --Skilgannon 06:15, 17 August 2011 (UTC)
Just to play Devil's Advocate... A reason this might not be the case could be taken from other sports. If all Shadow cared about was winning, once he got far ahead in score, he might employ the strategy that makes him most likely to win, which might have a lower average score % than what he would do otherwise. Like in basketball, if a team is up by 20, they will slow down the game in the 4th quarter - they might only win by 4 or 5 by the end, but that doesn't mean their chance of winning was ever in jeopardy; if they had kept the pedal to the floor and kept the rate of scoring high, it might have opened the door to a comeback. Then consider a game that's tied until the last minute, but spreads to 4 or 5 in the meaningless closing seconds when the game is already decided. Both have the same final score difference, but it's not indicative of how close the losing team was to winning. --Voidious 13:44, 15 August 2011 (UTC)
Increasing competition can lead to that kind of strategy. In game theory terms, it pushes strategies towards Nash equilibrium. And Nash equilibrium is often defensive, but not by any means weaker. --MN 14:00, 15 August 2011
Just a quick little comment. Regarding a robot employing a "most likely to win" strategy of that sort: I currently have a bullet-power selection/optimization algorithm in the works which could in theory quite easily be tuned to such goals (or just about any other metric chosen as the goal)... and... maybe even tune preferred distance too ;) --Rednaxela 03:27, 17 August 2011 (UTC)

Maybe slightly off topic, but since I can't really find a Robocode forum/chat room I thought I'd post it here anyway. 1) Has anyone ever considered separate leagues for advanced bots/robots/junior bots? I'm new to Robocode and still in the process of looking at the exact differences between these bots (other than advanced bots taking damage when hitting a wall). It might also be interesting for beginners to be able to compete in a JuniorRobot league. 2) Start a new league every year. It seems a lot of the Robocode discussion/bots are fairly old (I could be wrong here, but that's just my perception). It might be worth looking at delisting people who haven't touched their bot in x amount of time. Anyone could add their bot at any time, but it might get rid of some of the "I built a bot in 2003 and forgot about it" entries. Especially for getting the ranking stable, it would go faster if the set of active bots was smaller. --Peltco 4 September 2011