Difference between revisions of "Talk:Darkcanuck/RRServer/Ratings"

Revision as of 23:13, 26 August 2011

Explanations Behind Ratings

This page does a nice job of explaining what some of the ratings are, but it still assumes certain existing knowledge. Somewhere, perhaps on this page, there needs to be descriptions for all rating terms. This could be on the wiki or even just a legend at the darkcanuck ratings site.

Some examples of what is missing -- No where...anywhere...can I find out what "PBI" stands for or what it's significance is. I don't see anywhere that explains "Specialization" either. Also "LRP". Skotty 00:04, 26 May 2011 (UTC)

I'm pretty sure that sort of information mostly existed on the old wiki before, but yeah. Well, "PBI" means "ProblemBot Index", and it is the difference between your percent score, and the "expected" percent score based on the ELO scores of the two. (A system based on how it compares based on near-ranking-bots rather than ELO would be more accurate IMO, but ELO isn't that bad of a predictor when pairings are complete I guess)

"LRP" stands for "Linear Regression Plot". Well, the term is awfully vague IMO, but it refers to oldwiki:RoboRumble/LRP. It is a plot of PBI vs ELO score. Essentially, the plot can give you a quick visual display that can highlight outliers (i.e. things your bot does exceptionally well/poorly against) and show some general trends such as "Does it fare more favorably against high ranking bots, or against low ranking bots?". --Rednaxela 11:59, 26 May 2011 (UTC)

Battles per Pairing

I just wanted to comment on the statement, "It's uncertain how well it works with less battles or incomplete pairings." My experiment with the MC2K7 shows that separate runs of 75 battles can still show more than 1% variation for a given pairing. This affects any scoring system, and is a fact that we have to live with. The reliability of output can only be as good as input, no matter how fancy the interpolation is for incomplete pairings. The hope is that the variance will become a wash when seen over 600+ pairings. --Simonton 15:25, 26 September 2008 (UTC)

I think David Alves commented that targeting challenge scores also varied by almost 1% at 15 seasons, so I agree there's lots of evidence that more battles per pairing are needed, which would take a very, very long time in a 600+ competitor environment. You're right that as the number of competitors increases, variabilities cancel each other out. But at the same time, the bigger the competition, the more risk of a "black swan" competitor whose scores are all skewed in one direction. -- Darkcanuck 15:31, 26 September 2008 (UTC)

After scratching some things down on paper which are mostly intuition rather than statistics, I believe the odds of having such a "black swan" are either exactly the same or reduced by increasing the number of bots. --Simonton 16:05, 26 September 2008 (UTC)

Well, if there are 3 bots, the chance of one getting lucky against both others is 1/4th, multiply by 3 bots, and the chance of a "black swan" in 3 bots is 75% I believe. With 4 bots, the chance of one getting lucky against against all others is 1/8th, multiply by 4 bots, and the chance of a black swan is 50%. For 5 bots... it is 31.25% chance of a black swan. For 650 bots with one pairing each, the chance of a bot having above average score in every pairing is about 1 to 2.78*10^193. So if we presume getting lucky is anything above the mean score and there's a 50% chance of that in any pairing, and that a "black swan" is only when all pairings are lucky, then the chance of a black swan sharply decreases as the number of bots becomes larger. Of course perhaps what would be more useful than simply chance of there being a bot with all pairings lucky, would be the chance of luck making the score 1% different. I could calculate this, but only if I had a number of what the "standard deviation" of the percent score of an average robocode battle is. --Rednaxela 16:28, 26 September 2008 (UTC)

My intuitive hypothesis remains unshaken, but I don't have any numbers to prove it. But I can't argue with something to the power of 193. :) I'll look into adding standard deviation to some of the tables. What would be most useful, within a pairing, across all pairings, or across all final scores? -- Darkcanuck 16:43, 26 September 2008 (UTC)

Ah, now that you put the statistics that way I can see how to do it. With 3 bots each has 2 pairings, so the chance of both coin flips being "lucky" is indeed 25%. However, the chance of at least 1 of those bots hitting its 25% is (1 - 75%^3) ~= 57.8%. Generalized, this formula is 1 - (1 - .5^(bots - 1))^bots. If you graph that you can see it reduces to pretty much zero pretty quickly. --Simonton 17:18, 26 September 2008 (UTC)

Oh right, I got slightly mixed up and was multiplying by 3 when I should have been working with powers. --Rednaxela 17:35, 26 September 2008 (UTC)

You've got an extra leading paren, but that makes sense to me. Nothing to worry about! -- Darkcanuck 18:05, 26 September 2008 (UTC)

Glicko-2 Rating System

Looking at things as my recently added versions have gained battles, it's seeming like Glicko-2 seems FAR faster to converge to a realistic expected score far quicker than ELO or Glicko-1, and seems quite stable. Glicko-2's performance seems to really impress me. I wonder if maybe we should remove ELO and Glicko-1 at time point maybe, and just keep APS and Glicko-2? (Would that make uploading a little faster?) Also, maybe it would be good to make a modified APS that uses the Glicko-2 ratings to estimate the scores of missing pairings, in order to make the APS ranking less distorted by cases when there are incomplete pairings still? --Rednaxela 21:25, 25 November 2008 (UTC)

Is it possible to modify the 'deviation' so that we have a similar 'spread' in the rankings as the ELO does? And a second to the using G-2 ratings for estimating the score for missing pairings in the APS rankings.--Skilgannon 21:41, 25 November 2008 (UTC)

I'm glad you guys are comparing the ranking systems. From what I can tell, Elo and Glicko-2 ratings seem to settle to the same ranking order as APS, although I've never noticed which converges faster. The Glicko-1 scores haven't worked out so well, so they're not really viable -- I may try replacing that column with a Glicko-2 rating which updates only using the result of the last pairing result. This would speed up uploads since it would eliminate the full pairing query. The three current methods all rely on the same data so just removing one or two won't make a noticeable difference. I suppose we could fill in the APS using Glicko-2 expected scores, that would be interesting. And I could probably scale the Glicko-2 ratings to match the current Elo scores, if that's what you meant, Skilgannon. --Darkcanuck 06:25, 26 November 2008 (UTC)

Yes, that's it exactly. --Skilgannon 06:43, 26 November 2008 (UTC)

Premier League rating calculation

Is the PL score being calculated using only the last battle for each pairing? If so, I would suggest averaging all battles in the same pairing, like the APS system does. For example, if after 4 battles you win 3 times, (2+2+2+0)/4 = 1.5. The PL ranking would be more stable and there would be less ties. --MN 20:58, 16 July 2011 (UTC)

I believe the PL league works like the following (internally), every win adds one to a running total, every loss subtracts one, every actual tie does nothing. Then taking this total, if above zero is considered a win. If below zero considered a loss, and at zero is considered a tie. It then assigns points based on this win/tie/loss, win = 2, tie = 1, loss = 0. — Chase-san 21:16, 16 July 2011 (UTC)

A bot's rankings data (APS, Glicko2, PL, etc) gets updated every time a new battle involving that bot is uploaded. The PL score is based on the APS for each pairing: you get 2 points for every pair where your APS is > 50%. It's a winner-take-all system, so there's no credit given for losing at 49.999%. I didn't even implement the 1-point for a tie, since it's so unlikely to ever happen (except for crashing bots paired together). --Darkcanuck 22:44, 16 July 2011 (UTC)

A lot better than what I was suggesting... --MN 03:53, 17 July (UTC)

Also, can we have a ranking order for the PL league in the rankings page? --MN 20:58, 16 July 2011 (UTC)

In this case I would ask the actual rank column does not get sorted, so whatever it is we sort by we see by its rank. So we could see the ranks for the top worst survival scores as rank 1 to 800+. Just not sorting this column would be far simpler then asking for a rank for every sortable column. The sortable nature of the rank column I think would be used less often. — Chase-san 21:20, 16 July 2011 (UTC)

I could probably link to it (I think the sorting code already exists) but I think most of us just re-sort using the PL column. It gives an interesting view of APS rank vs PL score. --Darkcanuck 22:44, 16 July 2011 (UTC)

Re-sorting using the PL column works, but if the bot is not near the top you have to manually count the rows. Chase suggestion would be perfect. --MN 03:53, 17 July (UTC)

Please no special link for PL. I just got into the top-10, nobody needs to know that I get beaten by that much. ;-) --GrubbmGait 23:16, 16 July 2011 (UTC)

Offline_batch_ELO_rating_system

I added another page with a modified ELO system I made: Offline_batch_ELO_rating_system --MN 12:46, 12 August 2011 (UTC)

Premier League

Perhaps scores within .5 or 1 to 50 should be treated as a tie (since these are hard to hammer out taking many many seasons, and might still be wrong). So 49.5 to 50.5 OR 49.0 to 51.0 splits could be counted as 1 instead of one getting 2 and the other getting 0. This might motivate people to work to break the tie. After all unlike something with a low integer scoring (like soccer/football) ties are harder to come by. If no one agrees, I propose dividing all PL scores by 2 instead. :) — Chase-san 20:45, 26 August 2011 (UTC)

Schulze & Tideman Condorcet

I like both of these methods. Looking at the scoring, they have some very interesting results. I do somewhat like Tideman's more because of results. — Chase-san 21:05, 26 August 2011 (UTC)

Well, I like Tideman because it catapults Pris to #3, but the result doesn't make much sense to me... --Darkcanuck 21:13, 26 August 2011 (UTC)

Revision as of 23:05, 26 August 2011 (view source) Chase-san (talk \| contribs) (→‎Schulze & Tideman Condorcet: new section) ← Older edit		Revision as of 23:13, 26 August 2011 (view source) Darkcanuck (talk \| contribs) (→‎Schulze & Tideman Condorcet: reply) Newer edit →
Line 68:		Line 68:

	I like both of these methods. Looking at the scoring, they have some very interesting results. I do somewhat like Tideman's more because of results. — <span style="font-family: monospace">[[User:Chase-san\|Chase]]-[[User_talk:Chase-san\|san]]</span> 21:05, 26 August 2011 (UTC)		I like both of these methods. Looking at the scoring, they have some very interesting results. I do somewhat like Tideman's more because of results. — <span style="font-family: monospace">[[User:Chase-san\|Chase]]-[[User_talk:Chase-san\|san]]</span> 21:05, 26 August 2011 (UTC)
		+
		+	: Well, I like Tideman because it catapults Pris to #3, but the result doesn't make much sense to me... --[[User:Darkcanuck\|Darkcanuck]] 21:13, 26 August 2011 (UTC)

Difference between revisions of "Talk:Darkcanuck/RRServer/Ratings"

Revision as of 23:13, 26 August 2011

Contents

Explanations Behind Ratings

Battles per Pairing

Glicko-2 Rating System

Premier League rating calculation

Offline_batch_ELO_rating_system

Premier League

Schulze & Tideman Condorcet

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools