Talk:Darkcanuck/RRServer
Fire away...
Just a suggestion for an additional check: I have never seen a bot score more than 8000 points, so this could be checked too. When examining the results that messed up the original roborumble rating beyond repair, I saw results of 20000 against 16000 (that's what you get when running OneOnOne with MELEE=YES). For the time being I'll keep my client running (unattended) for ABC's server, as I don't really have the time for bughunting. Your effort, however, seems promising. Good luck. -- GrubbmGait
- Thanks! That's a good check, will be combining that with the survival >=35 (also your suggestion I think) once I rearrange the error handling and failure output to the client. Then I'll look into ELO... --Darkcanuck
- Your checks have both been implemented. -- Darkcanuck
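For the curious, those two checks amount to something like the following minimal sketch. The names and the exact survival rule are illustrative assumptions, not the server's actual code; it assumes a 35-round 1v1 battle where "survival" counts first-place rounds.

```java
// Minimal sketch of the two sanity checks above -- names and the survival
// rule are illustrative assumptions, not the actual RRServer code.
static boolean plausibleOneOnOneResult(int score1, int score2,
                                       int survival1, int survival2) {
    // No bot has ever been seen scoring more than 8000 points in 1v1;
    // melee results uploaded as 1v1 (MELEE=YES) blow far past this.
    if (score1 > 8000 || score2 > 8000) {
        return false;
    }
    // At most one first-place survivor per round, 35 rounds total.
    if (survival1 + survival2 > 35) {
        return false;
    }
    return true;
}
```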
Looking very nice! I have a couple of questions and thoughts I thought I'd mention. What does the "Ideal" column in the results mean? One thought I had about ratings: perhaps it would be best to have APS fill missing pairings with Glicko-based estimates? I'm thinking that would give the best long-term stability/accuracy once pairings are complete, while being more meaningful before the pairings are complete. --Rednaxela 01:18, 26 September 2008 (UTC)
Thanks! I've just posted a bit more about ratings here. The "Ideal" column is my attempt to reverse-calculate a rating based on a bot's APS. I just inverted the Glicko formula for "E" (expected probability) to yield a rating given E (i.e. APS) and a competitor's rating and RD. For the latter two I used the defaults (1500 and 350), so theoretically if the APS represents the score vs an average bot (and there's a uniform distribution?) then the rating might converge to the "ideal" value. But I have no idea if it works, I just wanted to see how close it might be. I'm not sure you could fill in the pairings using Glicko + APS -- the reason systems like Glicko exist is to get around the problem of incomplete pairings, so the Glicko rating should be enough in itself. If it's accurate, that is -- we'll see once the ratings catch up to the pairings already submitted... -- Darkcanuck 03:39, 26 September 2008 (UTC)
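For reference, that inversion follows directly from the published Glicko-1 formulas. Glicko's expected score is E = 1 / (1 + 10^(-g(RD_j)(r - r_j)/400)) with g(RD) = 1/sqrt(1 + 3*q^2*RD^2/pi^2) and q = ln(10)/400, which solves to r = r_j - (400/g(RD_j)) * log10(1/E - 1). A sketch (my reading of the published formulas, not necessarily the server's exact code):

```java
// Sketch: invert Glicko's expected-score formula to get an "ideal" rating
// from an APS value in (0, 1), against a default opponent (1500, RD 350).
// Follows the published Glicko-1 formulas; the server's code may differ.
static double idealRating(double aps) {
    final double q  = Math.log(10) / 400.0;  // Glicko's q constant
    final double rd = 350.0;                 // default opponent RD
    final double rj = 1500.0;                // default opponent rating
    double g = 1.0 / Math.sqrt(1.0 + 3.0 * q * q * rd * rd / (Math.PI * Math.PI));
    // E = 1 / (1 + 10^(-g*(r - rj)/400))  =>  r = rj - (400/g) * log10(1/E - 1)
    return rj - (400.0 / g) * Math.log10(1.0 / aps - 1.0);
}
```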
Ahh, I see. Thanks for the explanation. If the Glicko rating doesn't converge very, very close to the "Ideal" then I'd say it alone might not be the best fit for Robocode, given that complete pairings are not hard to get. The reason I suggest using APS and filling missing pairings with Glicko-based percent estimates is that my proposed method is guaranteed to converge to the exact APS ranking order when pairings are complete, and would quite surely be at least slightly better than APS when pairings are not complete. Perhaps I'm more picky than most, but I'd consider a hybrid necessary if Glicko doesn't in practice converge to "Ideal" within an accuracy that preserves the exact rankings of APS (which I think is very plainly and simply the most fair when there are complete pairings). I suppose we'll see how accurately Glicko converges :) --Rednaxela 04:25, 26 September 2008 (UTC)
- Be careful about the "ideal" convergence concept! Keep in mind that I made this value up and it doesn't really have a statistical basis of any sort. I was just curious what a naive reversal with a single data point might produce, in order to get an idea of what neighbourhood DrussGT's rating might be in, for example. I also wanted to get a sense of whether I had programmed the formulas correctly. I wonder, though, if we're abusing these rating systems by using %score instead of absolute win/loss values (1/0)? Would the Glicko rating converge more rapidly to match the APS scale if I had chosen win/loss? I'm very curious, but not so much as to interrupt the current rebuild, which may take longer than I thought. -- Darkcanuck 04:54, 26 September 2008 (UTC)
- Well, I'm not talking about convergence to that "Ideal" column. I'm talking about convergence of the relative rankings, as opposed to specific rating numbers. If the rankings don't converge to exactly the same order as APS, then I think there's issue enough to justify a hybrid that uses APS, with ELO or Glicko to estimate missing pairings. --Rednaxela 05:10, 26 September 2008 (UTC)
- Gotcha. I suppose you could keep track of the rating (Elo or Glicko) and just use it to calculate expected scores for missing pairings. Then generate an estimated APS for full pairings. We'll have to see how well the ratings stabilize. I'm thinking I should have used Glicko-2 instead, since it includes a volatility rating to account for erratic (read problem bot) performance. -- Darkcanuck 06:22, 26 September 2008 (UTC)
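A sketch of what that hybrid could look like, with a purely illustrative data layout: average the real pairing percentages where they exist, and substitute the rating-based expected score where a pairing is missing.

```java
// Sketch of the hybrid idea (data layout is illustrative, not the server's):
// scores[i] is the bot's average %score (0..1) vs opponent i, or NaN if the
// pairing is missing; expected[i] is the Elo/Glicko expected score vs i.
static double estimatedAps(double[] scores, double[] expected) {
    double sum = 0.0;
    for (int i = 0; i < scores.length; i++) {
        sum += Double.isNaN(scores[i]) ? expected[i] : scores[i];
    }
    return 100.0 * sum / scores.length;  // estimated APS over full pairings
}
```

By construction this reduces to plain APS once all pairings exist, which is exactly the convergence property discussed above.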
Started sending the results to your server, as long as you relay them to ABC's server. What is the delay btw? --GrubbmGait 10:08, 26 September 2008 (UTC)
- Thanks for joining in! I have no plans to stop relaying results and have been doing so for almost a week now. If by "delay" you mean occasional slow connections, it's due to the scoring update and I've posted it on the known issues page. I have this process cranked up at the moment while I try to get the ratings to catch up, but it will get faster soon. :) -- Darkcanuck 15:25, 26 September 2008 (UTC)
Great job with this server! You can always get the ranking/battles_* files from my server and submit them all into yours. I'm also experimenting with MySQL atm. My SQL skills are a little rusty but it's all coming back pretty fast :).
I also have a few doubts about the new rating method. The first one is: why? From what I understand, Glicko is an ELO extension for rankings where the match frequency is not uniform between participants, which is not the case in the rumble. As an experiment it's very cool, but for me the "old" ELO method is time-tested and proven to work great, and should be the default sorting method for the ranking table. --ABC 11:23, 26 September 2008 (UTC)
- I also have some doubts about whether Glicko will actually give better or much different results than ELO; however, I'm not sure ELO is really the best default ranking system when full pairings are easy to get. I suppose we'll see once your server gets to full pairings, but I strongly suspect there will be some ranking deviations from the APS ranking, which I think is hard to argue is in any way biased. --Rednaxela 13:26, 26 September 2008 (UTC)
- I have doubts as well, but I wouldn't have known until I tried it. My major objection against Elo is the lack of a clear, published implementation. It was easier to implement Glicko than to sort through the RR server code. If someone can clarify this for me, sure I'll try it out. Why not? -- Darkcanuck 15:25, 26 September 2008 (UTC)
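For what it's worth, the textbook Elo update itself is short; the old rumble server used its own variant of it, so the following is only a generic sketch, not that code:

```java
// Generic (textbook) Elo update -- not the roborumble server's exact
// variant. scoreA is A's fraction of the total score, in [0, 1].
static double[] eloUpdate(double ratingA, double ratingB, double scoreA, double k) {
    double expectedA = 1.0 / (1.0 + Math.pow(10.0, (ratingB - ratingA) / 400.0));
    double newA = ratingA + k * (scoreA - expectedA);
    double newB = ratingB + k * ((1.0 - scoreA) - (1.0 - expectedA));
    return new double[] { newA, newB };
}
```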
Bravo
I just want to leave a note saying you're awesome. :) It's really nice having someone put effort into improving the rumble itself. Good work! --Simonton 03:27, 11 October 2008 (UTC)
Oh, and FNL, if you're reading this, that goes double for you :). --Simonton 03:30, 11 October 2008 (UTC)
Contents
Thread title | Replies | Last modified
---|---|---
retiring ELO column | 6 | 15:51, 17 February 2012
FatalFlaw's uploads have suspicious APS for Tomcat | 0 | 04:58, 16 February 2012
kidmumu uploads | 3 | 17:16, 1 February 2012
Feature Request: average APS diff in bots compare | 6 | 15:55, 17 November 2011
Performance | 1 | 22:48, 13 November 2011
Now that everyone's ELO rating is subzero in General 1v1 =), is it maybe time to retire it altogether?
I'm all for it =) Although, doesn't the LRP depend on ELO data? Maybe shift that over to Glicko data instead? And if there was some way to make the LRP show the 'expected' option by default... that would make my day =)
I'd also support removal of ELO from the rumble, and replacing it with Glicko or Glicko2 in the places that use it (LRP).
Elo is working fine, even with negative scores, but keeping both Elo and Glicko-2 is redundant. So, removing one of them is fine by me.
FatalFlaw's uploads have suspicious APS for Tomcat:
- lxx.Tomcat 3.55 VS voidious.mini.Komarious 1.88
- lxx.Tomcat 3.55 VS baal.nano.N 1.42
- lxx.Tomcat 3.55 VS gf.Centaur.Centaur 0.6.7
Darkcanuck, can you rollback all his uploads?
I haven't had a chance to check if this could affect mn.Combat, but my #1 guess would be that perhaps it's a Java version issue (i.e. kidmumu is using Java 5 and Combat requires Java 6?).
Failing that, I'd have to think that kidmumu's client may be skipping turns.
Probably a Java version issue. I'll downgrade to 1.5 in future versions. But I didn't check other bots' scores.
I'm sure there are lots of bots that require Java 6, right? We might want to have Darkcanuck roll back all his uploads until we can get kidmumu onto Java 6.
I find that until all pairings are done, it's very useful to know the current average difference in APS between two versions - after about 100 random battles this number tells you fairly accurately whether the newer version is better than the older one.
Darkcanuck, can you schedule adding a row for the "% Score" and "% Survival" columns in the "+/- Difference" section of the bot compare page, with the average value of the corresponding columns? I think it's 1-2 hours of work at most.
I think this is already covered by the 'Common % Score (APS)' and 'Common % Survival' rows, the lowest two lines in the top table. At least, I use them to check whether my changes have a positive (or negative) result when the pairings are not complete yet.
No, maybe I wasn't clear.
I mean that I want to know the average difference across pairings between two versions. According to my tests, this number stabilizes much faster than APS. What's more, Common % Score doesn't make sense to me: while there's only 1 battle in every pairing it's exactly equal to APS, and otherwise there may be 10 battles against Walls and 1 battle against Druss.
As far as I know, when your new version has, for example, 100 pairings, you will see the average APS for those 100 pairings. AND for your older version you will also see the APS for those same 100 pairings. And you are right, this indicates much more reliably what your final score will be (relative to your older version) than plain APS does. The one who can really answer this question is Darkcanuck.
The common %score is calculated just like APS, but only for pairings that the old and new versions have in common. That makes it easier to compare two versions when the new one is still missing many pairings, or in the case where the old bot may have pairings against a lot of retired bots (and may be missing scores vs newer bots). I think that's what you're looking for...
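A sketch of that calculation, with an illustrative data layout: average a version's %scores over only the opponents both versions have fought. Note that the average per-pairing difference requested above then comes for free, since over the same common set the mean of the differences equals the difference of the means.

```java
// Sketch of the common-%score idea (data layout is illustrative): APS
// computed only over opponents that both versions have fought. Maps go
// from opponent name to that version's average %score vs that opponent.
static double commonAps(java.util.Map<String, Double> self,
                        java.util.Map<String, Double> other) {
    double sum = 0.0;
    int n = 0;
    for (java.util.Map.Entry<String, Double> e : self.entrySet()) {
        if (other.containsKey(e.getKey())) {  // pairing present for both versions
            sum += e.getValue();
            n++;
        }
    }
    return n == 0 ? 0.0 : sum / n;
}
// Average per-pairing difference is then simply
// commonAps(newScores, oldScores) - commonAps(oldScores, newScores).
```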
Yes, thank you:)
Can you turn on .htaccess browser caching for results?
```
# Caching
ExpiresActive on
ExpiresByType image/gif "access plus 1 year"
```
Other performance-enhancing things you can do: set a specific size for the images, inline or via CSS (CSS would be easier). This would speed up page loading and also be less annoying while all the requests are going through (default-sized images deform the table before they load). Minify the HTML/CSS/JS (less to send).
Not doable, for known reasons: serving identical files from the same URL (the flag images).