Difference between revisions of "Darkcanuck/RRServer/Ratings"

From Robowiki
(trying to summarize how current Elo ratings are calculated)
(clean up, add Glicko-2, notes on Elo calculation)

Revision as of 05:38, 8 October 2008



I'd like to open up a discussion on what ratings are meaningful in the rumble. There are various discussions scattered around the old wiki, but right now we have an opportunity to experiment with new things (on my server) and compare to existing results (ABC's server).

Here's what I've implemented on the new server and my reasons for doing so.


Average Percentage Score (APS)

This is probably the "purest" measure of a bot's performance. Under ideal conditions (i.e. full pairings and at least 20-50 battles per pairing to reduce variability), APS would allow an accurate comparison between all bots in the rumble. It's uncertain how well it works with fewer battles or incomplete pairings.

The server calculates APS for each bot by:

  • taking the average percentage score of all battles against each opponent separately to get an APS for each pairing,
  • then averaging all pairing scores to obtain the final average.
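The two steps above can be sketched in a few lines of Python. This is only an illustration, not the server's code: the function name, the `(opponent, my_score, opponent_score)` tuple layout and the 0-100 scale are assumptions, and the per-battle percentage is taken as scoreA / (scoreA + scoreB).

```python
from collections import defaultdict


def average_percentage_score(battles):
    """Return (overall APS, per-pairing APS) for one bot.

    `battles` is an iterable of (opponent, my_score, opponent_score) tuples.
    The tuple layout and the 0-100 scale are assumptions for this sketch.
    """
    per_opponent = defaultdict(list)
    for opponent, mine, theirs in battles:
        # Percentage score of a single battle.
        per_opponent[opponent].append(100.0 * mine / (mine + theirs))

    # Step 1: true average of all battles within each pairing.
    pairing_aps = {opp: sum(s) / len(s) for opp, s in per_opponent.items()}

    # Step 2: unweighted average over the pairings.
    overall = sum(pairing_aps.values()) / len(pairing_aps)
    return overall, pairing_aps
```

Because the second step averages pairings rather than battles, an opponent that has been fought 50 times counts exactly as much as one fought twice.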

This differs from the old/current rumble server which weights newer scores higher than previous ones. The old system uses something like: newaverage = 0.3 * newscore + 0.7 * oldaverage. The reasons behind this weighting were to account for learning/data-saving bots improving in performance over time and to discard "bad" results. However, I believe that this actually increases score variability so I've chosen to use a true average which will smooth out results. Learning bots should still see their APS go up over time, but now there is no special advantage given to a bot which saves data between battles.
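To make the difference concrete, here is a small sketch comparing the two ways of folding a sequence of battle scores into one pairing score. The function names are made up for illustration, and seeding the rolling average with the first score is an assumption (the old server's starting value isn't described here).

```python
def rolling_average(scores, weight=0.3):
    """Old-style weighting: newaverage = 0.3 * newscore + 0.7 * oldaverage.

    Seeding with the first score is an assumption of this sketch.
    """
    average = scores[0]
    for score in scores[1:]:
        average = weight * score + (1.0 - weight) * average
    return average


def true_average(scores):
    """New-style pairing score: every battle counts equally."""
    return sum(scores) / len(scores)
```

For the sequence 50, 50, 50, 90 the rolling version ends at 62 while the true average is 60. The latest battle always carries a fixed 30% weight in the rolling version, whereas its weight in the true average shrinks as more battles accumulate.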


Glicko Rating System

Created by Mark Glickman, the system is described here: http://math.bu.edu/people/mg/glicko/glicko.doc/glicko.html The main difference from the Elo system is that each competitor has a ratings deviation (RD) in addition to their rating. New competitors start out with a high RD, which gradually drops as the rating settles. In Elo, winner and loser receive equal but opposite ratings adjustments after a battle, whereas in the Glicko system the adjustment is based on the RD value. A high RD results in a bigger adjustment, so new competitors are adjusted more quickly; established competitors' ratings should change more slowly.

The server implements the system to the letter, with one exception: RD values in the rumble do not "decay" (increase) with inactivity. Ratings start at 1500 and RD values at 350.
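For reference, a single Glicko update as described in Glickman's document can be sketched as follows. This is not the server's code: the function names and the `(opponent_rating, opponent_rd, score)` layout are assumptions, scores are taken as fractions between 0 and 1, and there is no inactivity step, matching the "no RD decay" choice above.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def glicko_g(rd):
    """Attenuation factor: a larger opponent RD means the result counts for less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def glicko_update(rating, rd, results):
    """Return (new_rating, new_rd) after a set of results.

    `results` holds (opponent_rating, opponent_rd, score) tuples, with
    score in [0, 1]; this layout is an assumption of the sketch.
    """
    d2_inv = 0.0  # accumulates 1 / d^2
    total = 0.0   # accumulates g_j * (score_j - expected_j)
    for opp_rating, opp_rd, score in results:
        g_j = glicko_g(opp_rd)
        e_j = 1.0 / (1.0 + 10.0 ** (-g_j * (rating - opp_rating) / 400.0))
        d2_inv += Q ** 2 * g_j ** 2 * e_j * (1.0 - e_j)
        total += g_j * (score - e_j)

    denom = 1.0 / rd ** 2 + d2_inv
    return rating + (Q / denom) * total, math.sqrt(1.0 / denom)
```

With the worked example from Glickman's document (rating 1500, RD 200, a win against 1400/30 and losses against 1550/100 and 1700/300) this gives a rating of about 1464 and an RD of about 151.4. The step size `Q / denom` grows with the player's own RD, which is why new competitors move faster than established ones.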


Glicko-2 Rating System

Also by Mark Glickman: http://math.bu.edu/people/mg/glicko/glicko2.doc/example.html This is an enhancement of the original Glicko system which adds a volatility measure to each score. The calculations are more involved but it should (in theory) provide stable ratings even for competitors whose performance is erratic (i.e. high specialization index in rumble-speak).

Again the system has been implemented as described but without RD decay. Tau is set to 0.5 and the iterative volatility calculation has been bounded to at most 20 iterations.
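A rough sketch of one Glicko-2 update with tau = 0.5 and the volatility search capped at 20 iterations is shown below. Several things here are assumptions rather than a description of the server: the function names, the result tuple layout, the convergence tolerance, and the root-finder itself (a bracketing Illinois-style search, which may differ from the iteration the server uses). The sketch also keeps the standard step that adds the new volatility to the RD before the update; whether "without RD decay" removes that step on the server isn't stated.

```python
import math

SCALE = 173.7178  # conversion between rating points and the Glicko-2 scale


def g2(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def new_volatility(sigma, phi, v, delta, tau=0.5, max_iter=20, eps=1e-6):
    """Solve for the new volatility with a bounded number of iterations."""
    a = math.log(sigma ** 2)

    def f(x):
        ex = math.exp(x)
        return (ex * (delta ** 2 - phi ** 2 - v - ex)
                / (2.0 * (phi ** 2 + v + ex) ** 2)) - (x - a) / tau ** 2

    lo = a
    if delta ** 2 > phi ** 2 + v:
        hi = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        hi = a - k * tau

    f_lo, f_hi = f(lo), f(hi)
    for _ in range(max_iter):  # hard cap, as in the text above
        if abs(hi - lo) <= eps:
            break
        mid = lo + (lo - hi) * f_lo / (f_hi - f_lo)
        f_mid = f(mid)
        if f_mid * f_hi <= 0:
            lo, f_lo = hi, f_hi
        else:
            f_lo /= 2.0
        hi, f_hi = mid, f_mid
    return math.exp(lo / 2.0)


def glicko2_update(rating, rd, sigma, results, tau=0.5):
    """Return (rating, rd, sigma) after at least one (opp_rating, opp_rd, score)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    v_inv = 0.0
    total = 0.0
    for opp_rating, opp_rd, score in results:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        g_j = g2(phi_j)
        e_j = 1.0 / (1.0 + math.exp(-g_j * (mu - mu_j)))
        v_inv += g_j ** 2 * e_j * (1.0 - e_j)
        total += g_j * (score - e_j)

    v = 1.0 / v_inv
    sigma_new = new_volatility(sigma, phi, v, v * total, tau)
    phi_star_sq = phi ** 2 + sigma_new ** 2
    phi_new = 1.0 / math.sqrt(1.0 / phi_star_sq + 1.0 / v)
    mu_new = mu + phi_new ** 2 * total
    return 1500.0 + SCALE * mu_new, SCALE * phi_new, sigma_new
```

On Glickman's worked example (rating 1500, RD 200, volatility 0.06, a win against 1400/30 and losses against 1550/100 and 1700/300) this gives roughly 1464.06, 151.52 and 0.05999, with the search converging well inside the 20-iteration cap.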


RoboRumble Elo Ratings

I've tried to understand the old server's rating scheme and implement that system as faithfully as possible. Thanks mostly to the work of Nfwu and his commented EloSim code (http://robowiki.net/cgi-bin/robowiki?Nfwu/EloSim), plus details scattered about the old wiki, I've pieced together the following:

  • New competitors start at a rating of 1600
  • The expected % score outcome of a pairing between bots A and B is given by E(A,B) = 1.0 / (1 + 20^((ratingB - ratingA) / 800))
  • When a new pairing result is submitted to the server:
    1. The new % score for bot A vs bot B: New(A,B) = scoreA / (scoreA + scoreB)
    2. The running pairing score of A vs B: Pair(A,B)' = 0.7 * Pair(A,B) + 0.3 * New(A,B)
    3. Then calculate the rating change for A by iterating over all ranked bots Ri:
      • deltaRatingA += 3.0 * (Pair(A,Ri) - E(A,Ri))
    4. Do the same for B
    5. Update the ratings for A and B by adding the new delta to their current rating.

As mentioned above, this server uses the true average for Pair(A,B); otherwise the calculations are the same.
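The procedure above, with the true average substituted for the 0.7/0.3 running score, might look roughly like this in Python. The class and method names are invented for illustration, pairing scores are kept as fractions between 0 and 1 (the scale isn't spelled out above, so that is an assumption), and the expected score is written so that the higher-rated bot gets the higher expectation.

```python
START_RATING = 1600.0
K = 3.0  # per-pairing multiplier from the steps above


def expected_score(rating_a, rating_b):
    """Expected fraction of the score for A against B (base 20, scale 800)."""
    return 1.0 / (1.0 + 20.0 ** ((rating_b - rating_a) / 800.0))


class EloTable:
    """Hypothetical container for ratings and pairing totals."""

    def __init__(self):
        self.ratings = {}
        self.totals = {}  # (bot, opponent) -> [sum of scores, battle count]

    def _delta(self, bot):
        # Iterate over every opponent this bot has a pairing with,
        # not just the opponent from the latest battle.
        delta = 0.0
        for (a, opp), (score_sum, count) in self.totals.items():
            if a == bot:
                pair = score_sum / count  # true average for the pairing
                delta += K * (pair - expected_score(self.ratings[bot],
                                                    self.ratings[opp]))
        return delta

    def submit(self, a, b, score_a, score_b):
        for bot in (a, b):
            self.ratings.setdefault(bot, START_RATING)
        new_ab = score_a / (score_a + score_b)
        for key, value in (((a, b), new_ab), ((b, a), 1.0 - new_ab)):
            entry = self.totals.setdefault(key, [0.0, 0])
            entry[0] += value
            entry[1] += 1
        # Compute both deltas from the pre-update ratings, then apply them.
        delta_a, delta_b = self._delta(a), self._delta(b)
        self.ratings[a] += delta_a
        self.ratings[b] += delta_b
```

For two new bots where A scores 60 to B's 40, both expectations are 0.5, so A moves to 1600.3 and B to 1599.7 under these assumptions.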


Final Notes

One of Albert's most interesting changes to the standard Elo calculation is that each new battle result requires updating a bot's rating by iterating over the whole participant list, not just the one opponent involved in the battle. I have no idea how this might affect rankings inflation/deflation, but I think it contributes significantly to stabilizing the ratings faster, since every time a new result comes in we re-calculate that bot's rating compared to the entire field.

This is computationally expensive, but since I had to put it in for Elo, it was easy to do the same for Glicko and Glicko-2. Thus all three ratings systems are using the same method, just different calculations for the expected and ratings delta values.

Based on what I've read about the three systems, Glicko-2 would probably yield the fastest-stabilizing results, but that still needs to be determined by comparing the three side-by-side. It's quite possible that Elo is "good enough", thanks to Albert's modification.