Difference between revisions of "Talk:King maker"

From Robowiki
Jump to navigation Jump to search
(reply)
(binomial distribution reply)
 
(29 intermediate revisions by 7 users not shown)
Line 131: Line 131:
  
 
: Funny, it's actually Schulze method that I'm dealing with tonight too. I'm rather curious exactly how you're translating the pairing data into input for the Schulze method, because there's a variety of ways you could do that. I'm pretty sure the way I'm implementing it here would be "immune" to being made unstable by close pairings. --[[User:Rednaxela|Rednaxela]] 00:02, 18 August 2011 (UTC)
 
: Funny, it's actually Schulze method that I'm dealing with tonight too. I'm rather curious exactly how you're translating the pairing data into input for the Schulze method, because there's a variety of ways you could do that. I'm pretty sure the way I'm implementing it here would be "immune" to being made unstable by close pairings. --[[User:Rednaxela|Rednaxela]] 00:02, 18 August 2011 (UTC)
 +
 
:: I wanted to focus on getting the method working first (some of the results above have me questioning that) then look at the input method.  RIght now it just uses the pairing's APS, straight from the database.  I plan to test win% (# wins / battles) later on.  If you (or anyone) has other suggestions... --[[User:Darkcanuck|Darkcanuck]] 00:33, 18 August 2011 (UTC)
 
:: I wanted to focus on getting the method working first (some of the results above have me questioning that) then look at the input method.  RIght now it just uses the pairing's APS, straight from the database.  I plan to test win% (# wins / battles) later on.  If you (or anyone) has other suggestions... --[[User:Darkcanuck|Darkcanuck]] 00:33, 18 August 2011 (UTC)
 +
 +
:::I would suggest total wins (full score) as main ranking, and total wins (survival score) as secondary ranking. And the opposite in Twin Duel. (total wins = # wins + 0.5 # draws). I think there is no need to normalize the score dividing it by # battles in Condorset. --[[User:MN|MN]] 03:12, 18 August 2011 (UTC)
 +
 +
::::Actually it would skew the vote counts if some pairings had many more battles.  Normalized is the way to go.  --[[User:Darkcanuck|Darkcanuck]] 05:00, 18 August 2011 (UTC)
 +
 +
::::But % wins is nice for statistics. --[[User:MN|MN]] 03:12, 18 August 2011 (UTC)
 +
 +
:::Total wins or % wins takes variance in account while APS doesn´t. --[[User:MN|MN]] 03:12, 18 August 2011 (UTC)
 +
 +
:::: I believe anything 50%-score-threshold-based (win count) is quite simply, doomed to be highly unstable when faced with close matches. The thresholding in the presence of close fights, inherently creates more instability than I consider acceptable. As such, the good things that come of such methods have to be achieved by different means instead. Hence my current line of investigation.. --[[User:Rednaxela|Rednaxela]] 03:41, 18 August 2011 (UTC)
 +
 +
:::::Arpard Elo way of dealing with this was taking the variance in account with win frequency and make the score shift gradually from 100%/0% to 0%/100%. --[[User:MN|MN]] 13:16, 19 August 2011 (UTC)
 +
 +
::: Anyway, Darkcanuck, that's not what I was exactly referring to by a variety of ways it could be done. I meant that regardless of what numbers you use (score, win counts, whatever) there are many ways you could construct the "ballots" of sorts. One "preferential vote ballot" per robot, or consider individual pairings to be separate partially-filled ballots, among a variety of other ways to model it. --[[User:Rednaxela|Rednaxela]] 03:41, 18 August 2011 (UTC)
 +
 +
:::: Oh, I see.  Well, I skipped the ballot step and went straight to pairing tallies using the raw pairing average score (stored as 0-100000 in the database) as the "vote count".  So if DrussGT beats Dookius 55.37 to 44.63, then d(DrussGT,Dookius) = 55370 and d(Dookius,DrussGT) = 44630.  --[[User:Darkcanuck|Darkcanuck]] 05:00, 18 August 2011 (UTC)
 +
 +
::::: Ahh. I just implemented the ballot step here (It treats each "RatingsDetails" page as a ballot essentially, with itself added in the "50% score" rank. The idea is, don't reduce to win/loss, instead reduce to ranks against each competitor. I think it may have interesting results this way...) I also have other methods of ballot construction to try later too. Sleep is necessary though, so getting the results from the tallies will wait till tomorrow. :) --[[User:Rednaxela|Rednaxela]] 05:24, 18 August 2011 (UTC)
 +
 +
::::::I think king-maker scenarios are still there, the unwilling king maker type at least. But it does solve the problem of a single king maker spoiling those "hard-earned extra 0.5%". Actually, the effect of those "hard-earned extra 0.5%" are greatly amplified with that ballot per bot setup, I think to the point that it will make the rumble tends towards the perfectionist crush-the-weak style even more. --[[User:MN|MN]] 13:23, 19 August 2011 (UTC)
 +
 +
::::::But I would prefer that Condorcet/ballot per bot system to APS. My voting ticket would be: 1-Score/Condorcet/Schulze (ballot per battle), 2-Winning rate/Condorcet/Schulze (ballot per pairing), 3-Win/draw/loss Elo (non-Condorcet), 4-Premier League (pairing APS/Condorcet/league APS, ballot per pairing), 5-Pairing APS/Condorcet/Schulze (ballot per bot), 6-APS (non-Condorcet). --[[User:MN|MN]] 13:23, 19 August 2011 (UTC)
 +
 +
::::::: Unless you're using a modified system that allows variable strength ballots, then "Score/Condorcet/Schulze (ballot per battle)" means you still have a threshold on 50% in effect, which means you'll still end up with unstable results no matter how you spin it. (If you ''do'' mean a modified system that allows variable strength ballots, then it might as well be per-pairing as to not have a bias in favor of pairings with a higher battle count)
 +
 +
::::::: The problem is, when you take any signal (be it roborumble battle scores, or RF signals, or wind speed), and threshold it, then whenever the true value is near the threshold, noise will be amplified drastically. This is, which is rather bad for accuracy of the processed result, and the lost information means that no subsequent algorithm in the process has any hope of restoring the accuracy. Now, if you apply the threshold per-battle (your way) instead of per pairing (PL score), the noise amplification will be diminished some after enough repetitions of measurement, however for that to come close to making fixing the noise amplification, hundreds of measurements would be needed, so for close battles we'd need many more battles per pairing for me to consider this acceptable.
 +
 +
::::::: Now, one alternative to a plain per-battle threshold comes to mind, which ''should'' solve the accuracy problem but give extremely similar results to what you want MN... We could take the standard deviation in score per pairing that a robot is involved in, then find the average standard deviation for that robot. This is based on the assumption that the a robot's performance roughly uniformly predictable/unpredictable across opponents. Then for each pairing, you take the average of the average standard deviation for competitor in the pairing. This is the "estimated standard deviation" for the pairing. If the above assumption is mostly true, this estimate of standard deviation will be more accurate than a direct computation of standard deviation for just one pairing's battles (after all, 3 or 4 battles doesn't give an accurate standard deviation). Using this estimated standard deviation, and the average pairing score, you should be able to calculate a ''much'' more accurate win rate for each pairing. There you go, a way to have your thresholding, and (given just one assumption) get more accuracy too ;) --[[User:Rednaxela|Rednaxela]] 14:47, 19 August 2011 (UTC)
 +
 +
::::::::I´m not going to create any (more) assumptions, but repeat Elo´s assumptions. For Elo, what have constant standard deviation is the competitor´s strength (1 gaussian curve), and the variance in matches results are a consequence of that (2 gaussian curves overlapping). So, the closer the competitor´s strength are, the more variance a pairing will have (more overlapping). And strength difference is then inferred from the measured variance (winning rate). It´s a quite different assumption, and instability in rankings (overlapping of strength´s average´s curves) and close matches (overlapping of strength´s curves) are simple accepted, although they diminish gradually with more data. --[[User:MN|MN]] 17:40, 19 August 2011 (UTC)
 +
 +
::::::::: Given the magnitude of the effect in practice, It's simply not reasonable accept that unless we set up a priority battle system that gives vastly more battles (100x at least) to all pairings that are closely matched. Simply thresholding carelessly throws away data that can hint what the win rate would be over a larger number of battles. Okay, perhaps the "uniform unpredictability of performance across opponents" assumption was too big of an assumption. How about... since you don't want any more assumptions than elo, let's take the assumption of the score being a gaussian curve. Let's say a pairing has 4 battles, with scores of say... 45, 45, 60, and 60. Thresholding would go with a "win rate" of 50%. If you're to making an assumption that score varies on a gaussian curve though, regardless of what the standard deviation is, the curve most likely to give the four numbers "45, 45, 60, and 60" does not center at 50%. ;) --[[User:Rednaxela|Rednaxela]] 18:23, 19 August 2011 (UTC)
 +
 +
::::::::::No, it doesn´t center at 50%. But what guarantee that the central limit theorem holds in APS? A Gaussian curve is infinite on both sides, but APS starts at 0% and ends at 100%. Why that 5% below 50% from 45% is worth less than that 10% above 50% from 60%? Why is the first 60% equal to the second 60%? There are infinite different score setups which all end in the same percentage score. But it is assumed they are all equal. Very strong assumptions. And averaging is meaningless without these assumptions. 45, 45, 60, 60 does not center at 50% simply because it is assumed not to. APS appears to have more data, but it has more assumptions instead. --[[User:MN|MN]] 08:28, 20 August 2011 (UTC)
 +
 +
::::::::::: Yes, it's true it "starts at 0% and ends at 100%". Well... let's see... Ignoring some minor artifacts of the scoring system, the result comes down to being a series of bernoulli trials (random hits/misses) with a typical hitrate that's mostly stable for each competitor (even if it's not mostly stable, the poisson limit theorem means the following will tend to be the case anyway). The number of bernoulli trails is large, but what exactly it is doesn't particularly matter. Because there are two separate hitrates for the most part, one for each competitor, the result is that difference between each bot's score can be thought of as the difference of of two binomial distributions. When the number of trials is large enough, the binomial distribution tends to be approximated by a gaussian distribution. The difference of two gaussian distributions is another gaussian distribution. So eh... under the circumstances of battles with reasonable length, there seems to be good reason to be fairly confident that wins by a 10% margin matter more than losses by a 5% margin. The only exceptions really, seem to be when a robot intentionally distorts it, such as by intentionally forgoing score (i.e. conditionally acting depending on how it is scoring so far), rather than trying to maximize score. Is that exception worth caring about? I don't see how a robot could gain from such manipulation, or be subject to it by accident. It's not a flawless system, but surely 50.1 is more likely a win just due to luck, than 55.0 or 60.0 would be.  --[[User:Rednaxela|Rednaxela]] 18:31, 20 August 2011 (UTC)
 +
 +
::::::::::::But APS is not a difference between 2 binomial distributions. APS1 = score1 / (score1 + score2). APS2 = score2 / (score1 + score2). What you can do is make a strong assumption that score1 is a log-normal distribution of competitor1 strength, the same for competitor2. Even then, each battle has a different (score1 + score2) value. 5% or 10% in one battle is different than 5% or 10% in another. Losing with 45% in the first battle and winning with 60% in the second doesn't mean it is performing above average. It only means the first battle probably had a higher (score1 + score2) value. It's like using total turns per battle as metric. There might be a correlation between (score1 + score2) or battle length and strengths difference. But measuring strength this way is inaccurate and it will converge to a wrong (suboptimal) value. The closer the competitors strengths are, the greater the chance it will point to a wrong stronger competitor, and no amount of battles will fix that. In other words, close matches will be stable, but the calculated ranking for them will be meaningless. --[[User:MN|MN]] 01:58, 22 August 2011 (UTC)
 +
 +
::::::::::::Using Score/Condorcet or Winning rate/Elo, the ranking WILL converge to an accurate setting, because those systems do not try to replace facts (data) with assumptions. Although they might not do it right after the first battle. The ranks might shift a few positions until enough battles are uploaded... but it is better than a system which calculates a stable rank on the first go, but with over 20 ranks error and no possibility of making it smaller with more data. --[[User:MN|MN]] 01:58, 22 August 2011 (UTC)
 +
 +
::::::::::In Elo, the score is not varying on 1 Gaussian curve, it is the result of 2 competitor performances varying on 2 Gaussian curves overlapping. And the problem is also assuming APS is a good estimator of the distance between those 2 performances. Which is not as APS was never designed as a 2 Gaussian curve variations distance metric in the first place. --[[User:MN|MN]] 08:28, 20 August 2011 (UTC)
 +
 +
::::::::::In Elo, apart from the strength Gaussian curves, if one performance is better than another, it scores higher, but there is no measure of distance yet. Then, count how often it happens and you can estimate which Gaussian curve center is above the other, and the distance between the centers. The more variance, the closer the centers are. In Elo, variance is a metric, not a flaw. There is no thresholding, there is probability. Average the estimated distances from all pairings, close and distant, and you have an accurate rating. Actually, the instability in pairings is necessary for the system to stay accurate. The worst case scenario is if all pairings get an 100% winning rate with no circular ambiguities, then all ratings diverge to +infinity and -infinity. --[[User:MN|MN]] 08:28, 20 August 2011 (UTC)
 +
 +
::::::::::But that 2 Gaussian curves model is for statistical systems only, like Elo. Condorcet is a voting system and follows a different philosophy, which is the majority being always right no matter what. So, it doesn´t matter if the difference is small or big, the competitor ahead is stronger and that´s it. Take away the sudden flip in pairing evaluation and you contradict the majority rule. Doing this simply kills the system. Condorcet is designed to be fair and simple, not stable. Stability is a statistical concept. But simplicity has its value, as it can be calculated online and no competitor feels cheated because of not understanding the rating system. That´s why I think Score/Condorcet/Schulze/(ballot per battle) is nice. --[[User:MN|MN]] 08:28, 20 August 2011 (UTC)
 +
 +
::::::::::: Simply kills the system? It seems to me the whole notion of the result having any meaning is predicated on the idea that there are enough ballots that the bias/randomness among them does not accumulate large enough to cause it's own unfairness in the results. The results from PL and your batch ELO show that the noise amplification of thresholding can affect the upper rankings substantially. Unless the noise amplification can be compensated for, with either 1) having many more battles per pairing at least for the close pairings, or 2) mechanisms that deal with the fact that 50.1 is less meaningful than 55.0. I'm fine either either, but with neither the result being so affected by randomness that it detracts from how interesting/meaningful it would be to compete in. --[[User:Rednaxela|Rednaxela]] 18:31, 20 August 2011 (UTC)
 +
 +
::::::::::Thinking better, my voting ticket would be:  1-Score/Condorcet/Schulze (ballot per battle), 2-Win/draw/loss Elo (non-Condorcet), 3-Winning rate/Condorcet/Schulze (ballot per pairing), 4-Premier League (pairing APS/Condorcet/league APS, ballot per pairing), 5-Pairing APS/Condorcet/Schulze (ballot per bot), 6-APS (non-Condorcet). --[[User:MN|MN]] 08:28, 20 August 2011 (UTC)
 +
 +
::::::: Now... melee is another story entirely though, because "Score/Condorcet/Schulze (ballot per battle)" no longer means threshold at 50% (instead, the thresholds in battle rank are numerous, and thus amplify noise much less), and it also fixes the glaring "historical contamination" problem with APS in melee, where unlike 1v1 the historic composition melee rumble affects current results. Because pairing scores in melee are contaminated with the historic average strength of other robots in the melee rumble, including bots no longer in the melee rumble, historic artifacts can affect relative APS significantly if the robots are released at different times. So for melee, I'd ''very'' strongly favor a "Score/Condorcet/Schulze (ballot per battle)" system actually, it would really be my first choice. For 1v1 however, it has the "50% threshold instability" problem. --[[User:Rednaxela|Rednaxela]] 14:47, 19 August 2011 (UTC)
 +
 +
::::::::The problem would be restoring the battle composition of competitors in melee, because each pairing is uploaded independently with the current protocol. :( --[[User:MN|MN]] 17:34, 19 August 2011 (UTC)
 +
 +
::::::::: Yeah, the rumble upload protocol does need to be changed for melee and new battles would need to be run to replace the old battles. I don't think that's the big problem though. The protocol is easy enough to get changed, and resetting the melee battles and running new ones merely takes the time of a few computers that would otherwise be idle. What will probably take more effort, is changing the server code and database format, because it currently uses the same pairings code between melee and 1v1 and that would have to be split/replaced and such. It would also significantly affect the robot details page and so on. It'll take work, but IMO it's important for the melee rumble in the long run. --[[User:Rednaxela|Rednaxela]] 18:23, 19 August 2011 (UTC)
 +
 +
::::::::::This hitorical contamination was a consequence of APS replacing the older incremental Elo system. It will also happen with my batch Elo customization and any Condorcet method. --[[User:MN|MN]] 16:39, 20 August 2011 (UTC)
 +
 +
What about linear score division in PL? Something like this:
 +
# APS - 20 / 80, PL - 2 / 0
 +
# APS - 30 / 70, PL - 2 / 0
 +
# APS - 40 / 60, PL - 1.5 / 0.5
 +
# APS - 50 / 50, PL - 1 / 1
 +
# APS - 60 / 40, PL - 1.5 / 0.5
 +
# APS - 70 / 30, PL - 2 / 0
 +
# APS - 80 / 20, PL - 2 / 0
 +
 +
I think it will be pretty stable and it's some compromis between APS and PL --[[User:Jdev|Jdev]] 07:00, 18 August 2011 (UTC)
 +
 +
I was thinking something similar, except using a smooth function. How about <code>atan(K*(APS_pairing - 50))/Pi</code> or something? Where K determines how smooth the transition is from 'win' to 'lose'.
 +
 +
Don't get me wrong, I have nothing against there being a new scoring metric that favours a more 'against-the-best' performance. I just feel, that as it currently is, this one still has problems. I also think that some of the advantages touted aren't necessarily unique to this system, and I don't want there to be misconceptions. The point about YersiniaPestis being the only reason DrussGT has the PL and not Shadow - isn't that a perfect example of king-making? --[[User:Skilgannon|Skilgannon]] 07:12, 18 August 2011 (UTC)
 +
 +
:YersiniaPestis screwed Shadow´s rank because it defeated Shadow. It is a rock-paper-scissor setup, not a rock-rock-scissor setup, with the scissor being the king maker. Rock-paper-scissor scenarios are a different concept. --[[User:MN|MN]] 19:55, 18 August 2011 (UTC)
 +
 +
I was thinking of something similar too, but I thought there should be a small bonus for getting more than 99% against a robot to encourage perfection.  Wow, three people independently coming up with such similar ideas hopefully means many more will like them.--[[User:AW|AW]] 15:43, 18 August 2011 (UTC)
 +
 +
As far as functions to create intermediates between APS and PL, I think [[wikipedia:Sigmoid function|sigmoid curves]] like Skilgannon's one make for good choices yeah.  Currently, intermediates between APS and PL aren't my favored approach as a way to have something more stable but with some PL-like characteristics, but out of APS-PL intermediates, I certainly favor sigmoids :) --[[User:Rednaxela|Rednaxela]] 16:07, 18 August 2011 (UTC)
 +
 +
: Yeah - personally, I have trouble getting motivated for anything besides pure APS or pure "beat every bot in existence". The closer a new ranking system approximates the latter (sigmoids seem good), the more I'll like it. --[[User:Voidious|Voidious]] 17:13, 18 August 2011 (UTC)
 +
 +
: Not crazy about revising the PL scoring in this way.  This sounds like a hybrid or APS + PL, which will just generate rankings somewhere in between the two systems.  Not sure what that would be measuring.  --[[User:Darkcanuck|Darkcanuck]] 13:25, 19 August 2011 (UTC)
 +
 +
On a related note... Maybe we could make a Strongest Bots Rumble - say top 20 PL qualify - where we really will get 100 battles per pairing with BATTLESPERBOT=2000? Then we could reasonably trust the PL/Condorcet ranking without any magic smoothing. It could even be a sub-rumble with battles prioritized after NanoRumble APS. I guess that's probably a lot of work to setup, and of course a lot of CPU time. But if ranking the strongest vs strongest is something we want, we might as well consider taking it all the way. --[[User:Voidious|Voidious]] 17:13, 18 August 2011 (UTC)
 +
 +
That could work quite nicely. Either that, or Darkcanuck makes pairings more likely if the results are in the 40-60 range. What I'd like to see is a constant ranking of [http://old.robowiki.net/robowiki?TheBestScoringBot The Best Scoring Bot] as another column =). Maybe instead of ELO? --[[User:Skilgannon|Skilgannon]] 07:20, 19 August 2011 (UTC)
 +
 +
: As a brief note, I'd expect the ballot-per-bot style condorcet-method ranking I mention above to give results fairly related to "The Best Score Bot" counts actually, though not entirely influenced by that ;) --[[User:Rednaxela|Rednaxela]] 12:10, 19 August 2011 (UTC)
 +
 +
: Hm, I think the Condorcet methods produce a similar result (notice LifeLongObsession gets a high rating here too).  Nice find!  I'm definitely in favour of giving close or unstable match-ups more battles, just need to figure out how to identify them fairly.  I like the strongest bot rumble idea too, would also be easier to do fancy rankings on a smaller subset.  --[[User:Darkcanuck|Darkcanuck]] 13:25, 19 August 2011 (UTC)
 +
 +
::The problem is not identifying unstable match-ups fairly, is identifying them before they actually happen. It is a [[Wikipedia:Multi-armed_bandit|bandit problem]]. --[[User:MN|MN]] 14:34, 19 August 2011 (UTC)

Latest revision as of 03:04, 22 August 2011

Ah, I see what you mean now. The "king-making" references I found were to non-winners intentionally manipulating results to dictate the winner, which is obviously not the case in the RoboRumble. I'm fairly confident DrussGT has the strongest APS in every demographic of RoboRumble participants - low, mid, high-end bots, surfers, Pattern matchers, etc - so simply altering the composition of the rumble would not knock him off his throne. His strength is quite clear and not all that subjective, if you ask me. Only submitting bots with hard-coded behaviors against DrussGT could have an impact, and such a move would probably not go un-noticed and the community as a whole would intervene.

But it is true that with a drastically different RoboRumble population, say only DrussGT's worst 5 matchups =), another bot like Shadow could conceivably be called #1. And it's also reasonable if you want to view results as "a win is a win" - I personally quite like that view, and agree that the APS RoboRumble is more of a shared "challenge" than a direct competition. Though I do consider it a fair challenge, and one in which I still aspire to be #1 again some day. ;)

An important point to make in any scoring system that applies a winner-take-all view of each matchup is that we'd have to significantly alter priority battles to get accurate rankings. For close matchups, you may need 100 or more battles to determine a winner. There really is quite a lot of variance. Given that, we'd probably want to just start a separate participants list with only current and/or strong bots. Or even run a weekly tournament where each match is like best-of-99 or something.

--Voidious 03:48, 15 August 2011 (UTC)

Oh no! That "we need 9999 battles per pairing" accuracy talk again. That's why I went all the way with that as-accurate-as-possible batch algorithm. And the priority battles algorithm was already improved. And I prefer improving the rating system instead of blaming weaker competitors and kicking them out. Leave the sample bots alone! :P --MN 04:51, 15 August 2011 (UTC)
I'm not saying we need that for every pairing. But if the difference between #1 and #2 in the RoboRumble comes down to the winner of the DrussGT vs Shadow matchup, we have a problem if there were only 2-5 battles run. The ranking system you propose puts about a million times as much weight on who wins that matchup, so it better be accurate, and you need at least 100 battles in a close matchup to be reasonably sure you get the right winner. A cool new ranking system is going to be ignored if it's so unstable. --Voidious 12:48, 15 August 2011 (UTC)

First, thank you for clarifying what you were referring to with that page. Honestly though, I don't believe this is a significant problem in the rumble as it stands. Let me explain why.

Looking over things on the wikipedia link, like Voidious also notices, it appears problems of "king-making" are usually about when weaker opponents have an agenda to selectively hurt/help the score of certain other competitors. In the rumble however, I believe there are no known cases of any robots with such biases/agendas, it is certainly not widespread in any case.

Now, presuming no such dirty play is happening, where could the harm be? Well, if your high ranking bot is performing worse against low ranking bots than another high ranking bot? Is that a case of "outcomes are not dictated by a competitor's own performance"? I may be misunderstanding, but I don't believe it is, because so long as no selective biases are present, it is always possible to work to gain that same performance edge that the other high ranking bot has.

Also as far as competitive innovations, say you have a situation where one high ranking bot has an innovation that allows it to score 80-90% against rambots where most other high ranking bots reliably score in the 60-70% range. Is that not a competitive innovation in Robocode? Means of "king-maker" prevention that round things to win/tie/loss also don't value innovation of that sort, which I believe is a big shame. Is it silly that I consider such matters to be notable/interesting innovations?

Ranking methods that are more immune the low-ranking bots are certainly interesting , indeed valuable, and I believe are quite worth having in the rumble, but I don't see them as objectively better or worse. Both seem like equally valid challenges to me, neither with acute problems.

--Rednaxela 03:53, 15 August 2011 (UTC)

King-maker scenarios can also happen involuntarily. Simply make a specialist bot and it's done, you will hurt everyone it's specialized against. Or miss any bug in the implementation.
And there are also the bots with pre-calculated data:
if ("MyFavoriteRobot".equals(bot.getName())) {
    loadPreCalculatedData();
} else {
    System.out.println("Oh no!");
}
This was discussed a long time ago and allowed in the rumble. (can't find the link now)
But yes, it will punish cool algorithms which aim in increasing score far above 50%, because at the other end there is a king-maker allowing to be pushed far below 50%. But I don't know any other way to stop king-maker scenarios from happening. I didn't even try figuring out one, I just copied what is being done in other places. --MN 04:51, 15 August 2011 (UTC)
Well, yes, such scenarios can be involuntary, but consider the magnitude of the effect. There are over 800 robots in the rumble. If one or two are specialized they couldn't affect the rankings substantially. If a large number are "specialized" in the same way, then I'd view it taking advantage of a generic enough weakness that it's just as worthy as rambots hurting those who are not protected against rambots. If a large number are "specialized" in diverse ways, it should tend to average out overall. It seems to me that the sheer size of the rumble provides some amount of protection. --Rednaxela 06:09, 15 August 2011 (UTC)
The huge differences in the main APS ranking, and Premier League or the one I offered for download tells otherwise. --MN 14:19, 15 August 2011 (UTC)
I disagree. The huge differences in say... where SandboxDT ranks for instance, are much more indicative of how SandboxDT specializes itself (Strong against adaptive opponants, not particularly strong against simple opponents), rather than specializations of the low ranking bots which are for the most part relatively generic. --Rednaxela 11:33, 16 August 2011 (UTC)
About precalculated data, this is an issue, and they are allowed in the rumble yes. They are however uncommon except as temporary-novelty-tests and I also doubt they impact the score much. As brief asrobustide, if the community were to decide to get rid of the chance of pre-calculated data though, Robocode does now have the capability to "anonymize" robot names in scan data. Since we as a community have the capability to robustly negate it, I do not feel the scoring algorithm is the proper place to negate pre-calculated data impacts.
Then again, I am mostly thinking in terms of the main rumble. In the nano-codesize rumble, those issues of robots being over-specialized would play a greater role. --Rednaxela 06:09, 15 August 2011 (UTC)
You touched the "community" aspect, so I´ll get very philosophical/political now. There are basically two ways to make things happen. Let people do what they want, guide them indirectly through rewards (rating system) and accept whatever comes out. Or restrict people choices (robust negation?) so they go in the way you want, they wanting it or not. I prefer the first approach. --MN 14:19, 15 August 2011 (UTC)
There are basically two ways to make things happen? Assuming you mean in robocode, I would point out that there is at least one more way to 'make' things happen (really get things to happen). If you convince people that robocode with or without something is more interesting, most of these people will use or not use that thing. Though saved data would probably help any robot that does not have it against over half of the robots in the rumble, most robots in the top 10 do not use it. Why? Robocode is more interesting without it!--AW 15:02, 16 August 2011 (UTC)
It easier if you split the world in only 2 parts. O.o' --MN 02:36, 17 August 2011 (UTC)
Forgive me if I'm misinterpreting, but I think what MN is saying relates to the idea of "soft bans" (eg, [1]). With a good game and rule set, you can expect competitors to do everything possible to maximize their score, and the game retains its depth and balance. With poor games/rules, you end up with extreme imbalance or the need to ban certain tactics to retain the competitive aspects desired by the community. For instance, we kind of have a soft ban on pre-loading data, particularly among the top bots in General 1v1. We would probably also soft ban intentionally building hard-coded Problem Bots for our enemies that tanked against our own bots, if it ever happened. In general, I tend to agree that rule sets should be carefully crafted and then competitors should be expected to take every advantage available. The Robocode community is kind of rare in that such issues rarely become problems and are almost always resolved quickly and peacefully. Though I think we are off on quite a tangent at this point... =) --Voidious 21:15, 16 August 2011 (UTC)
(We are sort of off on a tangent, but I return to the topic later.) Perhaps I was misinterpreting, considering the fact that MN says
"I left robocoding a long time ago after seeing bin-based statistical algorithms owning the rumble, which I never liked. Hard-coded segmentation felt too artificial. More recently, after seeing dynamic clustering owning the rumble, which is a much cooler algorithm, and finally understanding what wave surfing is all about, it brought me back."
He probably doesn't really think there are only two ways to get these things to happen in robocode. What I was getting at was:
1) People make decisions based on much more than those two factors.
2) If enough people find this idea (writing bots that are trying to win the most frequently instead of by the largest margin) interesting, they will probably write these bots anyway. More than 800 robots could be improved by using Diamonds source, but they aren't. I don't think the reason is that people think Diamond's license is too restrictive.
MN, perhaps you could try something like this: User:Chase-san/NeoRoboRumbleParticipants --AW 23:49, 16 August 2011 (UTC)
A new server? I did really think in doing this from the beginning. If the idea does´t catch, it will be wasted effort. If the idea catches, it will split the community in half and damage the accuracy of both servers. If the idea really catches, it will kill Darkcanuck´s server, which held RoboRumble alive for years, and being right will never taste so bitter. The idea is to increase bot competition, not server competition. Since the battle setup is exactly the same, only result evaluation being different, both servers will be a lot similar to each other. But this option is still not discarded entirely.
Option 2: Explain what I see as bad with the current main rumble. And maybe people will agree and RoboRumble as a whole will evolve. If not... then, it is what you are seeing in the discussion pages. But at least 2 new pages in RoboWiki and a lot of wisdom being exchanged. I bet a lot of robocoders noticed something strange in the APS rating system, but didn´t know what exactly. Creation of Premier League, and wondering what would happen if only top bots were in the league are a strong sign of that. "King maker" is the concept behind all that.
Option 3: Build a new server and also a new client which uploads data to more than one server, so all servers keep high accuracy. Why having to choose if you can have all? But it will be trickier to implement as all servers will need to be compatible and it will restrict innovations in the new server and client.
Option 4: Well, 3 options seems enough. --MN 02:36, 17 August 2011 (UTC)
I highly doubt the community would be split in half, or that people will abandon APS rankings in the near future. It's been the primary means of evaluating bot performance for almost 10 years now and I doubt it will ever disappear. I do, however, think some bot authors might take a stronger interest in head-to-head performance if we had a more stable way of ranking it. Even with our horribly unstable "PL ranking", many bot authors enjoy this aspect of Robocode, myself (and most current bot authors) included. Robocoders love having various ways to evaluate their bots - have you seen how many challenges we've built up over the years? =) And, hmm, for starters, maybe I will try running that best-of-99 bracket tourney I mentioned before... --Voidious 15:05, 17 August 2011 (UTC)

The thing about this concept is that makes the assumption that the high score bot B gets against bot C is not available to bot A. What if bot B has figured out a technique that is able to take advantage of an up-till-now unknown weakness in bot C? Surely then bot A will need to develop an equivalent (or better) algorithm to also take advantage of this weakness? Either that, or bot C will need to fix this problem so that it cannot be taken advantage of. Both inspire innovation in the robocode community. A perfect example lies in the robocode history. Paul Evans came up with the idea of adjusting the multiplier on his random movement decision to cause his profile to change depending on what GFs he was getting hit at. ABC figured out the basic wavesurfing idea and put it in Shadow. Jamougha responded by writing(and open-sourcing) RaikoMX, the first bot with a modern wavesurfing algorithm. None of these would have happened if the only thing that mattered was getting more than 50% against more bots than the competition (as the random movement algorithms worked quite well), yet they have led to bots which far out-perform them even by non-ELO/APS metrics. Now THAT is innovation. --Skilgannon 06:35, 17 August 2011 (UTC)

They are all generalist algorithms that are rewarded even more in a non-APS system. Compare the rank of these bots you mentioned in an APS ranking and a wins/draws/losses ranking. --MN 17:00, 17 August 2011 (UTC)

On the other hand, preventing bot-specific behaviour sounds like a good idea, as probably the most effective way of preventing king-making. As long as the bot doesn't know who they're fighting, it's very hard to give a bias that nobody else can figure out and exploit as well. Follow my thinking? --Skilgannon 06:35, 17 August 2011 (UTC)

I know, but the rumble tends towards everyone using the same algorithms against intermediate specialist bots, and get stuck with them until those same bots improve. If they´re development is abandoned as most intermediate bots are, the rumble stagnates. Unwilling king makers are also a problem. --MN 17:00, 17 August 2011 (UTC)
But, if king maker scenarios are dealt with in the ranking system, then if a bot development becomes abandoned, it will gradually lower in ranking and stop interfering with the other bots. And the competition tends towards active development. All innovations came from top bots improving against other top bots. Simple targeting to Anti-head-on/linear/circular to Pattern-matching to Anti-pattern-matching to GuessFactor to Wave Surfing to Segmentation to Dynamic Clustering... --MN 17:00, 17 August 2011 (UTC)
I totally agree that ignoring APS would have delayed or caused us never to find many important innovations, Wave Surfing among them, and that it's an important metric for measuring bot performance. At the same time, I can't help but think that things like Anti-Surfer Targeting and surfing vs strong guns could be a lot further along if people focused on them more. Of course, nothing's stopping anyone from doing so if they find them interesting, and indeed there has been a gradual increase in those activities in recent years. And the funny/ironic part is that I'd advise someone to polish their gun/movement with APS as a metric before bothering to tune it against surfers/strong guns. ;) But a little more glory to the bot that defeats all others doesn't seem so bad, either, and might result in some innovations in those areas. Sometimes I wonder if ABC would still be active if the community focused on PL prowess more. He always seemed kinda bored with APS and more interested in King-crushing. =) --Voidious 15:19, 17 August 2011 (UTC)
 :) Not only am I following this discussion with great interest, but also I tend to agree more with MN's oppinon than you guys seem to. Maybe this kind of rating would shift the focus from the "obsessively perfectionist" type of bot to the more experimental strategist. Seem like a theoretically sound way to mix the two main kinds of rating (ELO+PL). And most importantly, makes Shadow get back into the top 3 without me having to touch it again! ;) --ABC 17:22, 17 August 2011 (UTC)
Before Skilgannon becomes pissed off with all that Shadow fan club, DrussGT is currently the only bot which dethroned it in both APS and PL leagues. Unfortunately, not thanks to its own strength, but to YersiniaPestis´. But in this case because of a rock-paper-scissor setup in the top 20 and not really a king maker scenario. So, DrussGT PL rank is authentic. --MN 17:00, 17 August 2011 (UTC)
Just for clarity - and not to disparage Shadow at all, as it's probably my favorite all-time bot, nor YersiniaPestis, another personal fav... But there seems to be a misconception that Shadow was the undefeatable PL champ until YersiniaPestis came along, perhaps due to wikipedia:Robocode's misinformation. The truth is as far back as 2006/2007, versions of both Dookious and Phoenix were fighting Shadow to at least a draw and holding the PL throne just as often as Shadow. Before that, Ascendant opted to optimize for APS instead of a tweak that let it beat Shadow at the time. Last I checked, Diamond is ~48% vs Shadow (over enough battles to matter). I imagine DrussGT is also pretty close to even with Shadow - I've certainly seen undefeated versions of DrussGT. Now 500-round battles, that might be a different story... =) --Voidious 17:57, 17 August 2011 (UTC)
And as I remember, Shadow´s Wave Surfing was born as a counter to SandboxDT´s GuessFactor, king-crushing style. --MN 17:00, 17 August 2011 (UTC)

Not sure what I can say about this, since I have since changed my style from going for PL to the APS. Seraphim almost ranks the same as Etna in the PL. Robots that are certainly impressive such as Etna wouldn't be considered quite so impressive. It does somewhat bother me my old rank 50 would be considered nearly as good if not better then a top 10, 2100 bot. — Chase-san 19:29, 17 August 2011 (UTC)

As a note Chase, Etna ranks 11th in the win/loss/tie-based system that MN was experimenting with. (As a little side note, note that RougeDC ranks 14th while Midboss ranked 36th despite those two using the same movement. If Etna was switched to RougeDC targeting instead of Midboss targeting, I'm sure it's score in that system would increase further than 11th even. So... Etna is not particularly hurt by win/loss/tie rounding I guess?) --Rednaxela 19:43, 17 August 2011 (UTC)
Looking closer... the cause of the difference appears to primarily be that Etna had too few battles per pairing before. Both PL and Offline batch ELO rating system are very unstable when battles per pairing is low and a bot has many near-ties. Both seem equally subject to the problem of close fights making the rankings change drastically in either direction.
As a little note... I think tonight I'll experiment with some alternative ranking systems, which may have some of those benefits (because there are some), without the problems that result from close fights. --Rednaxela 19:56, 17 August 2011 (UTC)
I have a Schulze (Condorcet) ranking method put together already, just running on the same dataset that MN used as I type. Will probably post a comparison table with standerd rumble rankings & MN's batch ranking when I get it all sorted out. Note that on a small subset of the rumble, this ranking method with pairings APS as an input tended to sort the bots by wins rather than overall score. --Darkcanuck
Sneak peek of the top 20, to spur further discussion (from August 9 data set and still suffers from instability with close match-ups). Overall APS is the 'score' reported for comparison purposes (it's not used in any calculation):
00 = jk.mega.DrussGT 2.0.7 (score=88.01)
01 = abc.Shadow 3.83c (score=83.86)
02 = voidious.Dookious 1.573c (score=85.64)
03 = voidious.Diamond 1.5.38c (score=87.04)
04 = zyx.mega.YersiniaPestis 3.0 (score=83.04)
05 = kc.serpent.WaveSerpent 2.11 (score=86.17)
06 = darkcanuck.Pris 0.88 (score=82.73)
07 = mue.Ascendant 1.2.27 (score=84.15)
08 = florent.XSeries.X2 0.17 (score=82.39)
09 = ar.horizon.Horizon 1.2.2 (score=82.24)
10 = simonton.beta.LifelongObsession 0.5.1 (score=80.06)
11 = Krabb.sliNk.Garm 0.9u (score=82.83)
12 = cs.s2.Seraphim 2.1.4 (score=78.52)
13 = ags.rougedc.RougeDC willow (score=82.63)
14 = positive.Portia 1.26e (score=80.70)
15 = pe.SandboxDT 3.02 (score=78.11)
16 = simonton.mini.WeeksOnEnd 1.10.4 (score=80.49)
17 = florent.test.Toad 0.14t (score=81.00)
18 = darkcanuck.Holden 1.13a (score=81.78)
19 = stelo.Spread 0.3 (score=75.47)
20 = lancel.Lynx 1.09 (score=76.06)
Funny, it's actually Schulze method that I'm dealing with tonight too. I'm rather curious exactly how you're translating the pairing data into input for the Schulze method, because there's a variety of ways you could do that. I'm pretty sure the way I'm implementing it here would be "immune" to being made unstable by close pairings. --Rednaxela 00:02, 18 August 2011 (UTC)
I wanted to focus on getting the method working first (some of the results above have me questioning that) then look at the input method. RIght now it just uses the pairing's APS, straight from the database. I plan to test win% (# wins / battles) later on. If you (or anyone) has other suggestions... --Darkcanuck 00:33, 18 August 2011 (UTC)
I would suggest total wins (full score) as main ranking, and total wins (survival score) as secondary ranking. And the opposite in Twin Duel. (total wins = # wins + 0.5 # draws). I think there is no need to normalize the score dividing it by # battles in Condorset. --MN 03:12, 18 August 2011 (UTC)
Actually it would skew the vote counts if some pairings had many more battles. Normalized is the way to go. --Darkcanuck 05:00, 18 August 2011 (UTC)
But % wins is nice for statistics. --MN 03:12, 18 August 2011 (UTC)
Total wins or % wins takes variance in account while APS doesn´t. --MN 03:12, 18 August 2011 (UTC)
I believe anything 50%-score-threshold-based (win count) is quite simply, doomed to be highly unstable when faced with close matches. The thresholding in the presence of close fights, inherently creates more instability than I consider acceptable. As such, the good things that come of such methods have to be achieved by different means instead. Hence my current line of investigation.. --Rednaxela 03:41, 18 August 2011 (UTC)
Arpard Elo way of dealing with this was taking the variance in account with win frequency and make the score shift gradually from 100%/0% to 0%/100%. --MN 13:16, 19 August 2011 (UTC)
Anyway, Darkcanuck, that's not what I was exactly referring to by a variety of ways it could be done. I meant that regardless of what numbers you use (score, win counts, whatever) there are many ways you could construct the "ballots" of sorts. One "preferential vote ballot" per robot, or consider individual pairings to be separate partially-filled ballots, among a variety of other ways to model it. --Rednaxela 03:41, 18 August 2011 (UTC)
Oh, I see. Well, I skipped the ballot step and went straight to pairing tallies using the raw pairing average score (stored as 0-100000 in the database) as the "vote count". So if DrussGT beats Dookius 55.37 to 44.63, then d(DrussGT,Dookius) = 55370 and d(Dookius,DrussGT) = 44630. --Darkcanuck 05:00, 18 August 2011 (UTC)
Ahh. I just implemented the ballot step here (It treats each "RatingsDetails" page as a ballot essentially, with itself added in the "50% score" rank. The idea is, don't reduce to win/loss, instead reduce to ranks against each competitor. I think it may have interesting results this way...) I also have other methods of ballot construction to try later too. Sleep is necessary though, so getting the results from the tallies will wait till tomorrow. :) --Rednaxela 05:24, 18 August 2011 (UTC)
I think king-maker scenarios are still there, the unwilling king maker type at least. But it does solve the problem of a single king maker spoiling those "hard-earned extra 0.5%". Actually, the effect of those "hard-earned extra 0.5%" are greatly amplified with that ballot per bot setup, I think to the point that it will make the rumble tends towards the perfectionist crush-the-weak style even more. --MN 13:23, 19 August 2011 (UTC)
But I would prefer that Condorcet/ballot per bot system to APS. My voting ticket would be: 1-Score/Condorcet/Schulze (ballot per battle), 2-Winning rate/Condorcet/Schulze (ballot per pairing), 3-Win/draw/loss Elo (non-Condorcet), 4-Premier League (pairing APS/Condorcet/league APS, ballot per pairing), 5-Pairing APS/Condorcet/Schulze (ballot per bot), 6-APS (non-Condorcet). --MN 13:23, 19 August 2011 (UTC)
Unless you're using a modified system that allows variable strength ballots, then "Score/Condorcet/Schulze (ballot per battle)" means you still have a threshold on 50% in effect, which means you'll still end up with unstable results no matter how you spin it. (If you do mean a modified system that allows variable strength ballots, then it might as well be per-pairing as to not have a bias in favor of pairings with a higher battle count)
The problem is, when you take any signal (be it roborumble battle scores, or RF signals, or wind speed), and threshold it, then whenever the true value is near the threshold, noise will be amplified drastically. This is, which is rather bad for accuracy of the processed result, and the lost information means that no subsequent algorithm in the process has any hope of restoring the accuracy. Now, if you apply the threshold per-battle (your way) instead of per pairing (PL score), the noise amplification will be diminished some after enough repetitions of measurement, however for that to come close to making fixing the noise amplification, hundreds of measurements would be needed, so for close battles we'd need many more battles per pairing for me to consider this acceptable.
Now, one alternative to a plain per-battle threshold comes to mind, which should solve the accuracy problem but give extremely similar results to what you want MN... We could take the standard deviation in score per pairing that a robot is involved in, then find the average standard deviation for that robot. This is based on the assumption that the a robot's performance roughly uniformly predictable/unpredictable across opponents. Then for each pairing, you take the average of the average standard deviation for competitor in the pairing. This is the "estimated standard deviation" for the pairing. If the above assumption is mostly true, this estimate of standard deviation will be more accurate than a direct computation of standard deviation for just one pairing's battles (after all, 3 or 4 battles doesn't give an accurate standard deviation). Using this estimated standard deviation, and the average pairing score, you should be able to calculate a much more accurate win rate for each pairing. There you go, a way to have your thresholding, and (given just one assumption) get more accuracy too ;) --Rednaxela 14:47, 19 August 2011 (UTC)
I´m not going to create any (more) assumptions, but repeat Elo´s assumptions. For Elo, what have constant standard deviation is the competitor´s strength (1 gaussian curve), and the variance in matches results are a consequence of that (2 gaussian curves overlapping). So, the closer the competitor´s strength are, the more variance a pairing will have (more overlapping). And strength difference is then inferred from the measured variance (winning rate). It´s a quite different assumption, and instability in rankings (overlapping of strength´s average´s curves) and close matches (overlapping of strength´s curves) are simple accepted, although they diminish gradually with more data. --MN 17:40, 19 August 2011 (UTC)
Given the magnitude of the effect in practice, It's simply not reasonable accept that unless we set up a priority battle system that gives vastly more battles (100x at least) to all pairings that are closely matched. Simply thresholding carelessly throws away data that can hint what the win rate would be over a larger number of battles. Okay, perhaps the "uniform unpredictability of performance across opponents" assumption was too big of an assumption. How about... since you don't want any more assumptions than elo, let's take the assumption of the score being a gaussian curve. Let's say a pairing has 4 battles, with scores of say... 45, 45, 60, and 60. Thresholding would go with a "win rate" of 50%. If you're to making an assumption that score varies on a gaussian curve though, regardless of what the standard deviation is, the curve most likely to give the four numbers "45, 45, 60, and 60" does not center at 50%. ;) --Rednaxela 18:23, 19 August 2011 (UTC)
No, it doesn´t center at 50%. But what guarantee that the central limit theorem holds in APS? A Gaussian curve is infinite on both sides, but APS starts at 0% and ends at 100%. Why that 5% below 50% from 45% is worth less than that 10% above 50% from 60%? Why is the first 60% equal to the second 60%? There are infinite different score setups which all end in the same percentage score. But it is assumed they are all equal. Very strong assumptions. And averaging is meaningless without these assumptions. 45, 45, 60, 60 does not center at 50% simply because it is assumed not to. APS appears to have more data, but it has more assumptions instead. --MN 08:28, 20 August 2011 (UTC)
Yes, it's true it "starts at 0% and ends at 100%". Well... let's see... Ignoring some minor artifacts of the scoring system, the result comes down to being a series of bernoulli trials (random hits/misses) with a typical hitrate that's mostly stable for each competitor (even if it's not mostly stable, the poisson limit theorem means the following will tend to be the case anyway). The number of bernoulli trails is large, but what exactly it is doesn't particularly matter. Because there are two separate hitrates for the most part, one for each competitor, the result is that difference between each bot's score can be thought of as the difference of of two binomial distributions. When the number of trials is large enough, the binomial distribution tends to be approximated by a gaussian distribution. The difference of two gaussian distributions is another gaussian distribution. So eh... under the circumstances of battles with reasonable length, there seems to be good reason to be fairly confident that wins by a 10% margin matter more than losses by a 5% margin. The only exceptions really, seem to be when a robot intentionally distorts it, such as by intentionally forgoing score (i.e. conditionally acting depending on how it is scoring so far), rather than trying to maximize score. Is that exception worth caring about? I don't see how a robot could gain from such manipulation, or be subject to it by accident. It's not a flawless system, but surely 50.1 is more likely a win just due to luck, than 55.0 or 60.0 would be. --Rednaxela 18:31, 20 August 2011 (UTC)
But APS is not a difference between 2 binomial distributions. APS1 = score1 / (score1 + score2). APS2 = score2 / (score1 + score2). What you can do is make a strong assumption that score1 is a log-normal distribution of competitor1 strength, the same for competitor2. Even then, each battle has a different (score1 + score2) value. 5% or 10% in one battle is different than 5% or 10% in another. Losing with 45% in the first battle and winning with 60% in the second doesn't mean it is performing above average. It only means the first battle probably had a higher (score1 + score2) value. It's like using total turns per battle as metric. There might be a correlation between (score1 + score2) or battle length and strengths difference. But measuring strength this way is inaccurate and it will converge to a wrong (suboptimal) value. The closer the competitors strengths are, the greater the chance it will point to a wrong stronger competitor, and no amount of battles will fix that. In other words, close matches will be stable, but the calculated ranking for them will be meaningless. --MN 01:58, 22 August 2011 (UTC)
Using Score/Condorcet or Winning rate/Elo, the ranking WILL converge to an accurate setting, because those systems do not try to replace facts (data) with assumptions. Although they might not do it right after the first battle. The ranks might shift a few positions until enough battles are uploaded... but it is better than a system which calculates a stable rank on the first go, but with over 20 ranks error and no possibility of making it smaller with more data. --MN 01:58, 22 August 2011 (UTC)
In Elo, the score is not varying on 1 Gaussian curve, it is the result of 2 competitor performances varying on 2 Gaussian curves overlapping. And the problem is also assuming APS is a good estimator of the distance between those 2 performances. Which is not as APS was never designed as a 2 Gaussian curve variations distance metric in the first place. --MN 08:28, 20 August 2011 (UTC)
In Elo, apart from the strength Gaussian curves, if one performance is better than another, it scores higher, but there is no measure of distance yet. Then, count how often it happens and you can estimate which Gaussian curve center is above the other, and the distance between the centers. The more variance, the closer the centers are. In Elo, variance is a metric, not a flaw. There is no thresholding, there is probability. Average the estimated distances from all pairings, close and distant, and you have an accurate rating. Actually, the instability in pairings is necessary for the system to stay accurate. The worst case scenario is if all pairings get an 100% winning rate with no circular ambiguities, then all ratings diverge to +infinity and -infinity. --MN 08:28, 20 August 2011 (UTC)
But that 2 Gaussian curves model is for statistical systems only, like Elo. Condorcet is a voting system and follows a different philosophy, which is the majority being always right no matter what. So, it doesn´t matter if the difference is small or big, the competitor ahead is stronger and that´s it. Take away the sudden flip in pairing evaluation and you contradict the majority rule. Doing this simply kills the system. Condorcet is designed to be fair and simple, not stable. Stability is a statistical concept. But simplicity has its value, as it can be calculated online and no competitor feels cheated because of not understanding the rating system. That´s why I think Score/Condorcet/Schulze/(ballot per battle) is nice. --MN 08:28, 20 August 2011 (UTC)
Simply kills the system? It seems to me the whole notion of the result having any meaning is predicated on the idea that there are enough ballots that the bias/randomness among them does not accumulate large enough to cause it's own unfairness in the results. The results from PL and your batch ELO show that the noise amplification of thresholding can affect the upper rankings substantially. Unless the noise amplification can be compensated for, with either 1) having many more battles per pairing at least for the close pairings, or 2) mechanisms that deal with the fact that 50.1 is less meaningful than 55.0. I'm fine either either, but with neither the result being so affected by randomness that it detracts from how interesting/meaningful it would be to compete in. --Rednaxela 18:31, 20 August 2011 (UTC)
Thinking better, my voting ticket would be: 1-Score/Condorcet/Schulze (ballot per battle), 2-Win/draw/loss Elo (non-Condorcet), 3-Winning rate/Condorcet/Schulze (ballot per pairing), 4-Premier League (pairing APS/Condorcet/league APS, ballot per pairing), 5-Pairing APS/Condorcet/Schulze (ballot per bot), 6-APS (non-Condorcet). --MN 08:28, 20 August 2011 (UTC)
Now... melee is another story entirely though, because "Score/Condorcet/Schulze (ballot per battle)" no longer means threshold at 50% (instead, the thresholds in battle rank are numerous, and thus amplify noise much less), and it also fixes the glaring "historical contamination" problem with APS in melee, where unlike 1v1 the historic composition melee rumble affects current results. Because pairing scores in melee are contaminated with the historic average strength of other robots in the melee rumble, including bots no longer in the melee rumble, historic artifacts can affect relative APS significantly if the robots are released at different times. So for melee, I'd very strongly favor a "Score/Condorcet/Schulze (ballot per battle)" system actually, it would really be my first choice. For 1v1 however, it has the "50% threshold instability" problem. --Rednaxela 14:47, 19 August 2011 (UTC)
The problem would be restoring the battle composition of competitors in melee, because each pairing is uploaded independently with the current protocol. :( --MN 17:34, 19 August 2011 (UTC)
Yeah, the rumble upload protocol does need to be changed for melee and new battles would need to be run to replace the old battles. I don't think that's the big problem though. The protocol is easy enough to get changed, and resetting the melee battles and running new ones merely takes the time of a few computers that would otherwise be idle. What will probably take more effort, is changing the server code and database format, because it currently uses the same pairings code between melee and 1v1 and that would have to be split/replaced and such. It would also significantly affect the robot details page and so on. It'll take work, but IMO it's important for the melee rumble in the long run. --Rednaxela 18:23, 19 August 2011 (UTC)
This hitorical contamination was a consequence of APS replacing the older incremental Elo system. It will also happen with my batch Elo customization and any Condorcet method. --MN 16:39, 20 August 2011 (UTC)

What about linear score division in PL? Something like this:

  1. APS - 20 / 80, PL - 2 / 0
  2. APS - 30 / 70, PL - 2 / 0
  3. APS - 40 / 60, PL - 1.5 / 0.5
  4. APS - 50 / 50, PL - 1 / 1
  5. APS - 60 / 40, PL - 1.5 / 0.5
  6. APS - 70 / 30, PL - 2 / 0
  7. APS - 80 / 20, PL - 2 / 0

I think it will be pretty stable and it's some compromis between APS and PL --Jdev 07:00, 18 August 2011 (UTC)

I was thinking something similar, except using a smooth function. How about atan(K*(APS_pairing - 50))/Pi or something? Where K determines how smooth the transition is from 'win' to 'lose'.

Don't get me wrong, I have nothing against there being a new scoring metric that favours a more 'against-the-best' performance. I just feel, that as it currently is, this one still has problems. I also think that some of the advantages touted aren't necessarily unique to this system, and I don't want there to be misconceptions. The point about YersiniaPestis being the only reason DrussGT has the PL and not Shadow - isn't that a perfect example of king-making? --Skilgannon 07:12, 18 August 2011 (UTC)

YersiniaPestis screwed Shadow´s rank because it defeated Shadow. It is a rock-paper-scissor setup, not a rock-rock-scissor setup, with the scissor being the king maker. Rock-paper-scissor scenarios are a different concept. --MN 19:55, 18 August 2011 (UTC)

I was thinking of something similar too, but I thought there should be a small bonus for getting more than 99% against a robot to encourage perfection. Wow, three people independently coming up with such similar ideas hopefully means many more will like them.--AW 15:43, 18 August 2011 (UTC)

As far as functions to create intermediates between APS and PL, I think sigmoid curves like Skilgannon's one make for good choices yeah. Currently, intermediates between APS and PL aren't my favored approach as a way to have something more stable but with some PL-like characteristics, but out of APS-PL intermediates, I certainly favor sigmoids :) --Rednaxela 16:07, 18 August 2011 (UTC)

Yeah - personally, I have trouble getting motivated for anything besides pure APS or pure "beat every bot in existence". The closer a new ranking system approximates the latter (sigmoids seem good), the more I'll like it. --Voidious 17:13, 18 August 2011 (UTC)
Not crazy about revising the PL scoring in this way. This sounds like a hybrid or APS + PL, which will just generate rankings somewhere in between the two systems. Not sure what that would be measuring. --Darkcanuck 13:25, 19 August 2011 (UTC)

On a related note... Maybe we could make a Strongest Bots Rumble - say top 20 PL qualify - where we really will get 100 battles per pairing with BATTLESPERBOT=2000? Then we could reasonably trust the PL/Condorcet ranking without any magic smoothing. It could even be a sub-rumble with battles prioritized after NanoRumble APS. I guess that's probably a lot of work to setup, and of course a lot of CPU time. But if ranking the strongest vs strongest is something we want, we might as well consider taking it all the way. --Voidious 17:13, 18 August 2011 (UTC)

That could work quite nicely. Either that, or Darkcanuck makes pairings more likely if the results are in the 40-60 range. What I'd like to see is a constant ranking of The Best Scoring Bot as another column =). Maybe instead of ELO? --Skilgannon 07:20, 19 August 2011 (UTC)

As a brief note, I'd expect the ballot-per-bot style condorcet-method ranking I mention above to give results fairly related to "The Best Score Bot" counts actually, though not entirely influenced by that ;) --Rednaxela 12:10, 19 August 2011 (UTC)
Hm, I think the Condorcet methods produce a similar result (notice LifeLongObsession gets a high rating here too). Nice find! I'm definitely in favour of giving close or unstable match-ups more battles, just need to figure out how to identify them fairly. I like the strongest bot rumble idea too, would also be easier to do fancy rankings on a smaller subset. --Darkcanuck 13:25, 19 August 2011 (UTC)
The problem is not identifying unstable match-ups fairly, is identifying them before they actually happen. It is a bandit problem. --MN 14:34, 19 August 2011 (UTC)