Difference between revisions of "Talk:King maker"

From Robowiki

Revision as of 06:24, 18 August 2011

Ah, I see what you mean now. The "king-making" references I found were to non-winners intentionally manipulating results to dictate the winner, which is obviously not the case in the RoboRumble. I'm fairly confident DrussGT has the strongest APS in every demographic of RoboRumble participants - low, mid, high-end bots, surfers, pattern matchers, etc - so simply altering the composition of the rumble would not knock him off his throne. His strength is quite clear and not all that subjective, if you ask me. Only submitting bots with hard-coded behaviors against DrussGT could have an impact, and such a move would probably not go unnoticed; the community as a whole would intervene.

But it is true that with a drastically different RoboRumble population, say only DrussGT's worst 5 matchups =), another bot like Shadow could conceivably be called #1. And it's also reasonable if you want to view results as "a win is a win" - I personally quite like that view, and agree that the APS RoboRumble is more of a shared "challenge" than a direct competition. Though I do consider it a fair challenge, and one in which I still aspire to be #1 again some day. ;)

An important point to make in any scoring system that applies a winner-take-all view of each matchup is that we'd have to significantly alter priority battles to get accurate rankings. For close matchups, you may need 100 or more battles to determine a winner. There really is quite a lot of variance. Given that, we'd probably want to just start a separate participants list with only current and/or strong bots. Or even run a weekly tournament where each match is like best-of-99 or something.

--Voidious 03:48, 15 August 2011 (UTC)

Oh no! That "we need 9999 battles per pairing" accuracy talk again. That's why I went all the way with that as-accurate-as-possible batch algorithm. And the priority battles algorithm was already improved. And I prefer improving the rating system instead of blaming weaker competitors and kicking them out. Leave the sample bots alone! :P --MN 04:51, 15 August 2011 (UTC)
I'm not saying we need that for every pairing. But if the difference between #1 and #2 in the RoboRumble comes down to the winner of the DrussGT vs Shadow matchup, we have a problem if there were only 2-5 battles run. The ranking system you propose puts about a million times as much weight on who wins that matchup, so it better be accurate, and you need at least 100 battles in a close matchup to be reasonably sure you get the right winner. A cool new ranking system is going to be ignored if it's so unstable. --Voidious 12:48, 15 August 2011 (UTC)
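To put a rough number on that instability, here is a minimal sketch that treats every battle as an independent coin flip which the slightly stronger bot wins with a fixed probability. The independence, the 55% figure, and the names `CloseMatchupSketch` and `majorityProbability` are all assumptions made for illustration; none of them come from rumble data.

```java
final class CloseMatchupSketch {

    /** Binomial coefficient C(n, k), computed in floating point. */
    static double binomial(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) {
            c = c * (n - k + i) / i;
        }
        return c;
    }

    /**
     * Probability that a bot which wins each battle with probability p
     * wins strictly more than half of n independent battles.
     */
    static double majorityProbability(int n, double p) {
        double total = 0.0;
        for (int k = n / 2 + 1; k <= n; k++) {
            total += binomial(n, k) * Math.pow(p, k) * Math.pow(1.0 - p, n - k);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical close matchup: the better bot wins 55% of battles.
        for (int n : new int[] {1, 5, 25, 99}) {
            System.out.printf("%d battles: %.3f%n", n, majorityProbability(n, 0.55));
        }
    }
}
```

Under these assumptions the better bot comes out ahead in a 5-battle pairing only about 59% of the time, so a winner-take-all ranking would name the wrong winner in roughly two out of five such pairings. The chance improves as battles are added, but it is still not a certainty at 99.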

First, thank you for clarifying what you were referring to with that page. Honestly though, I don't believe this is a significant problem in the rumble as it stands. Let me explain why.

Looking over the Wikipedia link, as Voidious also notes, it appears that problems of "king-making" usually arise when weaker opponents have an agenda to selectively hurt or help the score of certain other competitors. In the rumble, however, I believe there are no known cases of robots with such biases or agendas; it is certainly not widespread in any case.

Now, presuming no such dirty play is happening, where could the harm be? Well, what if your high-ranking bot performs worse against low-ranking bots than another high-ranking bot does? Is that a case of "outcomes are not dictated by a competitor's own performance"? I may be misunderstanding, but I don't believe it is, because so long as no selective biases are present, it is always possible to work to gain the same performance edge that the other high-ranking bot has.

Also, as far as competitive innovations go, say you have a situation where one high-ranking bot has an innovation that allows it to score 80-90% against rambots, where most other high-ranking bots reliably score in the 60-70% range. Is that not a competitive innovation in Robocode? Means of "king-maker" prevention that round things to win/tie/loss also don't value innovation of that sort, which I believe is a big shame. Is it silly that I consider such matters to be notable/interesting innovations?

Ranking methods that are more immune to the low-ranking bots are certainly interesting, indeed valuable, and I believe are quite worth having in the rumble, but I don't see them as objectively better or worse. Both seem like equally valid challenges to me, neither with acute problems.

--Rednaxela 03:53, 15 August 2011 (UTC)

King-maker scenarios can also happen involuntarily. Simply make a specialist bot and it's done: you will hurt everyone it's specialized against. Or miss a bug in the implementation.
And there are also the bots with pre-calculated data:
if ("MyFavoriteRobot".equals(bot.getName())) {
    loadPreCalculatedData();
} else {
    System.out.println("Oh no!");
}
This was discussed a long time ago and allowed in the rumble. (can't find the link now)
But yes, it will punish cool algorithms which aim at increasing the score far above 50%, because at the other end there is a king-maker allowing itself to be pushed far below 50%. But I don't know any other way to stop king-maker scenarios from happening. I didn't even try to figure one out; I just copied what is being done in other places. --MN 04:51, 15 August 2011 (UTC)
Well, yes, such scenarios can be involuntary, but consider the magnitude of the effect. There are over 800 robots in the rumble. If one or two are specialized they couldn't affect the rankings substantially. If a large number are "specialized" in the same way, then I'd view it taking advantage of a generic enough weakness that it's just as worthy as rambots hurting those who are not protected against rambots. If a large number are "specialized" in diverse ways, it should tend to average out overall. It seems to me that the sheer size of the rumble provides some amount of protection. --Rednaxela 06:09, 15 August 2011 (UTC)
The huge differences between the main APS ranking and the Premier League, or the one I offered for download, tell otherwise. --MN 14:19, 15 August 2011 (UTC)
I disagree. The huge differences in say... where SandboxDT ranks for instance, are much more indicative of how SandboxDT specializes itself (strong against adaptive opponents, not particularly strong against simple opponents), rather than specializations of the low ranking bots which are for the most part relatively generic. --Rednaxela 11:33, 16 August 2011 (UTC)
About precalculated data, this is an issue, and yes, it is allowed in the rumble. It is however uncommon except as temporary novelty tests, and I also doubt it impacts the score much. As a brief aside, if the community were to decide to get rid of the chance of pre-calculated data though, Robocode does now have the capability to "anonymize" robot names in scan data. Since we as a community have the capability to robustly negate it, I do not feel the scoring algorithm is the proper place to negate pre-calculated data impacts.
Then again, I am mostly thinking in terms of the main rumble. In the nano-codesize rumble, those issues of robots being over-specialized would play a greater role. --Rednaxela 06:09, 15 August 2011 (UTC)
You touched on the "community" aspect, so I'll get very philosophical/political now. There are basically two ways to make things happen. Let people do what they want, guide them indirectly through rewards (rating system) and accept whatever comes out. Or restrict people's choices (robust negation?) so they go the way you want, whether they want it or not. I prefer the first approach. --MN 14:19, 15 August 2011 (UTC)
There are basically two ways to make things happen? Assuming you mean in robocode, I would point out that there is at least one more way to 'make' things happen (really get things to happen). If you convince people that robocode with or without something is more interesting, most of these people will use or not use that thing. Though saved data would probably help any robot that does not have it against over half of the robots in the rumble, most robots in the top 10 do not use it. Why? Robocode is more interesting without it!--AW 15:02, 16 August 2011 (UTC)
It's easier if you split the world into only 2 parts. O.o' --MN 02:36, 17 August 2011 (UTC)
Forgive me if I'm misinterpreting, but I think what MN is saying relates to the idea of "soft bans" (eg, [1]). With a good game and rule set, you can expect competitors to do everything possible to maximize their score, and the game retains its depth and balance. With poor games/rules, you end up with extreme imbalance or the need to ban certain tactics to retain the competitive aspects desired by the community. For instance, we kind of have a soft ban on pre-loading data, particularly among the top bots in General 1v1. We would probably also soft ban intentionally building hard-coded Problem Bots for our enemies that tanked against our own bots, if it ever happened. In general, I tend to agree that rule sets should be carefully crafted and then competitors should be expected to take every advantage available. The Robocode community is kind of rare in that such issues rarely become problems and are almost always resolved quickly and peacefully. Though I think we are off on quite a tangent at this point... =) --Voidious 21:15, 16 August 2011 (UTC)
(We are sort of off on a tangent, but I return to the topic later.) Perhaps I was misinterpreting, considering the fact that MN says
"I left robocoding a long time ago after seeing bin-based statistical algorithms owning the rumble, which I never liked. Hard-coded segmentation felt too artificial. More recently, after seeing dynamic clustering owning the rumble, which is a much cooler algorithm, and finally understanding what wave surfing is all about, it brought me back."
He probably doesn't really think there are only two ways to get these things to happen in robocode. What I was getting at was:
1) People make decisions based on much more than those two factors.
2) If enough people find this idea (writing bots that are trying to win the most frequently instead of by the largest margin) interesting, they will probably write these bots anyway. More than 800 robots could be improved by using Diamond's source, but they aren't. I don't think the reason is that people think Diamond's license is too restrictive.
MN, perhaps you could try something like this: User:Chase-san/NeoRoboRumbleParticipants --AW 23:49, 16 August 2011 (UTC)
A new server? I did really think about doing this from the beginning. If the idea doesn't catch on, it will be wasted effort. If the idea catches on, it will split the community in half and damage the accuracy of both servers. If the idea really catches on, it will kill Darkcanuck's server, which has kept RoboRumble alive for years, and being right will never taste so bitter. The idea is to increase bot competition, not server competition. Since the battle setup is exactly the same, with only the result evaluation being different, both servers will be very similar to each other. But this option is still not discarded entirely.
Option 2: Explain what I see as bad with the current main rumble. And maybe people will agree and RoboRumble as a whole will evolve. If not... then, it is what you are seeing in the discussion pages. But at least there are 2 new pages in RoboWiki and a lot of wisdom being exchanged. I bet a lot of robocoders noticed something strange in the APS rating system, but didn't know what exactly. The creation of the Premier League, and wondering what would happen if only top bots were in the league, are strong signs of that. "King maker" is the concept behind all that.
Option 3: Build a new server and also a new client which uploads data to more than one server, so all servers keep high accuracy. Why have to choose if you can have it all? But it will be trickier to implement, as all servers will need to be compatible, and it will restrict innovations in the new server and client.
Option 4: Well, 3 options seem enough. --MN 02:36, 17 August 2011 (UTC)
I highly doubt the community would be split in half, or that people will abandon APS rankings in the near future. It's been the primary means of evaluating bot performance for almost 10 years now and I doubt it will ever disappear. I do, however, think some bot authors might take a stronger interest in head-to-head performance if we had a more stable way of ranking it. Even with our horribly unstable "PL ranking", many bot authors enjoy this aspect of Robocode, myself (and most current bot authors) included. Robocoders love having various ways to evaluate their bots - have you seen how many challenges we've built up over the years? =) And, hmm, for starters, maybe I will try running that best-of-99 bracket tourney I mentioned before... --Voidious 15:05, 17 August 2011 (UTC)

The thing about this concept is that it makes the assumption that the high score bot B gets against bot C is not available to bot A. What if bot B has figured out a technique that is able to take advantage of an up-till-now unknown weakness in bot C? Surely then bot A will need to develop an equivalent (or better) algorithm to also take advantage of this weakness? Either that, or bot C will need to fix this problem so that it cannot be taken advantage of. Both inspire innovation in the robocode community. A perfect example lies in the robocode history. Paul Evans came up with the idea of adjusting the multiplier on his random movement decision to cause his profile to change depending on what GFs he was getting hit at. ABC figured out the basic wavesurfing idea and put it in Shadow. Jamougha responded by writing (and open-sourcing) RaikoMX, the first bot with a modern wavesurfing algorithm. None of these would have happened if the only thing that mattered was getting more than 50% against more bots than the competition (as the random movement algorithms worked quite well), yet they have led to bots which far out-perform them even by non-ELO/APS metrics. Now THAT is innovation. --Skilgannon 06:35, 17 August 2011 (UTC)

They are all generalist algorithms that are rewarded even more in a non-APS system. Compare the rank of these bots you mentioned in an APS ranking and a wins/draws/losses ranking. --MN 17:00, 17 August 2011 (UTC)

On the other hand, preventing bot-specific behaviour sounds like a good idea, as probably the most effective way of preventing king-making. As long as the bot doesn't know who they're fighting, it's very hard to give a bias that nobody else can figure out and exploit as well. Follow my thinking? --Skilgannon 06:35, 17 August 2011 (UTC)

I know, but the rumble tends towards everyone using the same algorithms against intermediate specialist bots, and getting stuck with them until those same bots improve. If their development is abandoned, as it is for most intermediate bots, the rumble stagnates. Unwilling king makers are also a problem. --MN 17:00, 17 August 2011 (UTC)
But, if king maker scenarios are dealt with in the ranking system, then if a bot's development is abandoned, it will gradually drop in the rankings and stop interfering with the other bots. And the competition tends towards active development. All innovations came from top bots improving against other top bots. Simple targeting to Anti-head-on/linear/circular to Pattern-matching to Anti-pattern-matching to GuessFactor to Wave Surfing to Segmentation to Dynamic Clustering... --MN 17:00, 17 August 2011 (UTC)
I totally agree that ignoring APS would have delayed or caused us never to find many important innovations, Wave Surfing among them, and that it's an important metric for measuring bot performance. At the same time, I can't help but think that things like Anti-Surfer Targeting and surfing vs strong guns could be a lot further along if people focused on them more. Of course, nothing's stopping anyone from doing so if they find them interesting, and indeed there has been a gradual increase in those activities in recent years. And the funny/ironic part is that I'd advise someone to polish their gun/movement with APS as a metric before bothering to tune it against surfers/strong guns. ;) But a little more glory to the bot that defeats all others doesn't seem so bad, either, and might result in some innovations in those areas. Sometimes I wonder if ABC would still be active if the community focused on PL prowess more. He always seemed kinda bored with APS and more interested in King-crushing. =) --Voidious 15:19, 17 August 2011 (UTC)
 :) Not only am I following this discussion with great interest, but I also tend to agree more with MN's opinion than you guys seem to. Maybe this kind of rating would shift the focus from the "obsessively perfectionist" type of bot to the more experimental strategist. Seems like a theoretically sound way to mix the two main kinds of rating (ELO+PL). And most importantly, it makes Shadow get back into the top 3 without me having to touch it again! ;) --ABC 17:22, 17 August 2011 (UTC)
Before Skilgannon becomes pissed off with all that Shadow fan club, DrussGT is currently the only bot which dethroned it in both APS and PL leagues. Unfortunately, not thanks to its own strength, but to YersiniaPestis'. But in this case it is because of a rock-paper-scissors setup in the top 20 and not really a king maker scenario. So, DrussGT's PL rank is authentic. --MN 17:00, 17 August 2011 (UTC)
Just for clarity - and not to disparage Shadow at all, as it's probably my favorite all-time bot, nor YersiniaPestis, another personal fav... But there seems to be a misconception that Shadow was the undefeatable PL champ until YersiniaPestis came along, perhaps due to wikipedia:Robocode's misinformation. The truth is as far back as 2006/2007, versions of both Dookious and Phoenix were fighting Shadow to at least a draw and holding the PL throne just as often as Shadow. Before that, Ascendant opted to optimize for APS instead of a tweak that let it beat Shadow at the time. Last I checked, Diamond is ~48% vs Shadow (over enough battles to matter). I imagine DrussGT is also pretty close to even with Shadow - I've certainly seen undefeated versions of DrussGT. Now 500-round battles, that might be a different story... =) --Voidious 17:57, 17 August 2011 (UTC)
And as I remember, Shadow's Wave Surfing was born as a counter to SandboxDT's GuessFactor, king-crushing style. --MN 17:00, 17 August 2011 (UTC)

I'm not sure what I can say about this, since I have since changed my style from going for PL to going for APS. Seraphim ranks almost the same as Etna in the PL. Robots that are certainly impressive, such as Etna, wouldn't be considered quite so impressive. It does somewhat bother me that my old rank 50 would be considered nearly as good as, if not better than, a top 10, 2100 bot. --Chase-san 19:29, 17 August 2011 (UTC)

As a note Chase, Etna ranks 11th in the win/loss/tie-based system that MN was experimenting with. (As a little side note, RougeDC ranks 14th while Midboss ranked 36th despite those two using the same movement. If Etna was switched to RougeDC targeting instead of Midboss targeting, I'm sure its score in that system would increase even further than 11th. So... Etna is not particularly hurt by win/loss/tie rounding I guess?) --Rednaxela 19:43, 17 August 2011 (UTC)
Looking closer... the cause of the difference appears to primarily be that Etna had too few battles per pairing before. Both the PL and the offline batch ELO rating systems are very unstable when the number of battles per pairing is low and a bot has many near-ties. Both seem equally subject to the problem of close fights making the rankings change drastically in either direction.
As a little note... I think tonight I'll experiment with some alternative ranking systems, which may have some of those benefits (because there are some), without the problems that result from close fights. --Rednaxela 19:56, 17 August 2011 (UTC)
I have a Schulze (Condorcet) ranking method put together already, just running on the same dataset that MN used as I type. Will probably post a comparison table with standard rumble rankings & MN's batch ranking when I get it all sorted out. Note that on a small subset of the rumble, this ranking method with pairings APS as an input tended to sort the bots by wins rather than overall score. --Darkcanuck
Sneak peek of the top 20, to spur further discussion (from August 9 data set and still suffers from instability with close match-ups). Overall APS is the 'score' reported for comparison purposes (it's not used in any calculation):
00 = jk.mega.DrussGT 2.0.7 (score=88.01)
01 = abc.Shadow 3.83c (score=83.86)
02 = voidious.Dookious 1.573c (score=85.64)
03 = voidious.Diamond 1.5.38c (score=87.04)
04 = zyx.mega.YersiniaPestis 3.0 (score=83.04)
05 = kc.serpent.WaveSerpent 2.11 (score=86.17)
06 = darkcanuck.Pris 0.88 (score=82.73)
07 = mue.Ascendant 1.2.27 (score=84.15)
08 = florent.XSeries.X2 0.17 (score=82.39)
09 = ar.horizon.Horizon 1.2.2 (score=82.24)
10 = simonton.beta.LifelongObsession 0.5.1 (score=80.06)
11 = Krabb.sliNk.Garm 0.9u (score=82.83)
12 = cs.s2.Seraphim 2.1.4 (score=78.52)
13 = ags.rougedc.RougeDC willow (score=82.63)
14 = positive.Portia 1.26e (score=80.70)
15 = pe.SandboxDT 3.02 (score=78.11)
16 = simonton.mini.WeeksOnEnd 1.10.4 (score=80.49)
17 = florent.test.Toad 0.14t (score=81.00)
18 = darkcanuck.Holden 1.13a (score=81.78)
19 = stelo.Spread 0.3 (score=75.47)
20 = lancel.Lynx 1.09 (score=76.06)
Funny, it's actually Schulze method that I'm dealing with tonight too. I'm rather curious exactly how you're translating the pairing data into input for the Schulze method, because there's a variety of ways you could do that. I'm pretty sure the way I'm implementing it here would be "immune" to being made unstable by close pairings. --Rednaxela 00:02, 18 August 2011 (UTC)
I wanted to focus on getting the method working first (some of the results above have me questioning that) then look at the input method. Right now it just uses the pairing's APS, straight from the database. I plan to test win% (# wins / battles) later on. If you (or anyone) has other suggestions... --Darkcanuck 00:33, 18 August 2011 (UTC)
I would suggest total wins (full score) as main ranking, and total wins (survival score) as secondary ranking. And the opposite in Twin Duel. (total wins = # wins + 0.5 # draws). I think there is no need to normalize the score by dividing it by # battles in Condorcet. --MN 03:12, 18 August 2011 (UTC)
Actually it would skew the vote counts if some pairings had many more battles. Normalized is the way to go. --Darkcanuck 05:00, 18 August 2011 (UTC)
But % wins is nice for statistics. --MN 03:12, 18 August 2011 (UTC)
Total wins or % wins takes variance into account while APS doesn't. --MN 03:12, 18 August 2011 (UTC)
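For anyone following along, the two candidate inputs argued over here can be sketched as below. The "total wins" formula is the one MN gives (# wins + 0.5 # draws). The class name, the method names and the battle counts in `main` are hypothetical and only illustrate why a raw count favours pairings that happen to have more battles.

```java
final class WinTallySketch {

    /** Total wins as defined in the discussion: wins plus half of the draws. */
    static double totalWins(int wins, int draws) {
        return wins + 0.5 * draws;
    }

    /** Total wins normalized by the number of battles, as a percentage. */
    static double winPercent(int wins, int draws, int battles) {
        if (battles <= 0) {
            throw new IllegalArgumentException("battles must be positive");
        }
        return 100.0 * totalWins(wins, draws) / battles;
    }

    public static void main(String[] args) {
        // Hypothetical pairings: one with 40 battles, one with only 4.
        System.out.println(totalWins(22, 0) + " vs " + totalWins(3, 0));
        System.out.println(winPercent(22, 0, 40) + "% vs " + winPercent(3, 0, 4) + "%");
    }
}
```

In this made-up case the raw count puts the 40-battle pairing far ahead (22 against 3), while the normalized figure puts the 4-battle pairing ahead (55% against 75%). That is the skew the normalization is meant to remove, although with only 4 battles the percentage is itself very noisy.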
I believe anything 50%-score-threshold-based (win count) is, quite simply, doomed to be highly unstable when faced with close matches. The thresholding, in the presence of close fights, inherently creates more instability than I consider acceptable. As such, the good things that come of such methods have to be achieved by different means instead. Hence my current line of investigation... --Rednaxela 03:41, 18 August 2011 (UTC)
Anyway, Darkcanuck, that's not what I was exactly referring to by a variety of ways it could be done. I meant that regardless of what numbers you use (score, win counts, whatever) there are many ways you could construct the "ballots" of sorts. One "preferential vote ballot" per robot, or consider individual pairings to be separate partially-filled ballots, among a variety of other ways to model it. --Rednaxela 03:41, 18 August 2011 (UTC)
Oh, I see. Well, I skipped the ballot step and went straight to pairing tallies using the raw pairing average score (stored as 0-100000 in the database) as the "vote count". So if DrussGT beats Dookious 55.37 to 44.63, then d(DrussGT,Dookious) = 55370 and d(Dookious,DrussGT) = 44630. --Darkcanuck 05:00, 18 August 2011 (UTC)
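As a rough illustration of how such pairing tallies can feed the Schulze method, the sketch below runs the usual strongest-path computation (the "winning votes" variant) on a matrix d where d[a][b] is the scaled pairing score of a against b. It is not the server's actual code. The class and method names are invented, and apart from the 55370/44630 pairing quoted above, every number in `main` is made up.

```java
import java.util.Arrays;

final class SchulzeSketch {

    /** p[a][b] = strength of the strongest path from a to b. */
    static long[][] strongestPaths(long[][] d) {
        int n = d.length;
        long[][] p = new long[n][n];
        // A direct link only counts if a actually beats b head to head.
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                if (a != b && d[a][b] > d[b][a]) {
                    p[a][b] = d[a][b];
                }
            }
        }
        // Floyd-Warshall style widest-path update through each intermediate bot.
        for (int via = 0; via < n; via++) {
            for (int a = 0; a < n; a++) {
                if (a == via) continue;
                for (int b = 0; b < n; b++) {
                    if (b == via || b == a) continue;
                    p[a][b] = Math.max(p[a][b], Math.min(p[a][via], p[via][b]));
                }
            }
        }
        return p;
    }

    /** For each bot, how many rivals it beats on strongest paths. */
    static int[] pairwiseWins(long[][] d) {
        long[][] p = strongestPaths(d);
        int[] wins = new int[d.length];
        for (int a = 0; a < d.length; a++) {
            for (int b = 0; b < d.length; b++) {
                if (a != b && p[a][b] > p[b][a]) {
                    wins[a]++;
                }
            }
        }
        return wins;
    }

    public static void main(String[] args) {
        // Index 0 = DrussGT, 1 = Dookious, 2 = a hypothetical third bot.
        long[][] d = {
            {0, 55370, 60000},
            {44630, 0, 52000},
            {40000, 48000, 0},
        };
        System.out.println(Arrays.toString(pairwiseWins(d)));
    }
}
```

Sorting bots by the returned win counts gives a ranking. Note that only the direction of each pairing decides whether a link exists at all, so a 50.1% pairing and a 49.9% pairing produce opposite links; that is one place where close match-ups can make the result unstable.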
Ahh. I just implemented the ballot step here (It treats each "RatingsDetails" page as a ballot essentially, with itself added in the "50% score" rank. The idea is, don't reduce to win/loss, instead reduce to ranks against each competitor. I think it may have interesting results this way...) I also have other methods of ballot construction to try later too. Sleep is necessary though, so getting the results from the tallies will wait till tomorrow. :) --Rednaxela 05:24, 18 August 2011 (UTC)
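One possible reading of that ballot step, sketched here with the caveat that the comment does not spell out the details: each bot's details page becomes a single ballot on which an opponent is ranked higher the lower the page owner scored against it, with the owner itself slotted in at 50%. Only the order matters, not the size of the scores. The direction of the ranking, the handling of ties (no vote either way) and all names below are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BallotSketch {

    /**
     * Adds one bot's ballot to the pairwise preference matrix d.
     * ownerScores maps opponent name to the owner's percentage score against it.
     * d[a][b] counts ballots that rank candidate a above candidate b.
     */
    static void tally(String owner, Map<String, Double> ownerScores,
                      List<String> candidates, long[][] d) {
        Map<String, Double> scores = new HashMap<>(ownerScores);
        scores.put(owner, 50.0); // the owner sits at the "50% score" rank

        for (String a : scores.keySet()) {
            for (String b : scores.keySet()) {
                // Assumption: a lower owner score against a means a is ranked above b.
                if (scores.get(a) < scores.get(b)) {
                    d[candidates.indexOf(a)][candidates.indexOf(b)]++;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> bots = List.of("BotA", "BotB", "BotC"); // hypothetical bots
        long[][] d = new long[3][3];
        // BotA scored 60% against BotB and 45% against BotC.
        tally("BotA", Map.of("BotB", 60.0, "BotC", 45.0), bots, d);
        System.out.println(java.util.Arrays.deepToString(d));
    }
}
```

BotA's ballot here reads BotC, then BotA, then BotB, so it adds one vote each to BotC over BotA, BotC over BotB, and BotA over BotB. Because a 51% and a 90% result against the same neighbours can yield the same ballot order, near-ties influence the tallies less abruptly than a hard win/loss threshold would.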