Darkcanuck/RRServer/Ratings

From Robowiki
< Darkcanuck‎ | RRServer
Revision as of 06:38, 8 October 2008 by Darkcanuck (talk | contribs) (clean up, add Glicko-2, notes on Elo calculation)
Jump to navigation Jump to search

Navigation: About | Updates | Ratings | Query API | Roadmap | Design | Develop | Known Issues


I'd like to open up a discussion on what ratings are meaningful in the rumble. There are various discussions scattered around the old wiki, but right now we have an opportunity to experiment with new things (on my server) and compare to existing results (ABC's server).

Here's what I've implemented on the new server and my reasons for doing so.


Average Percentage Score (APS)

This is probably the "purest" measure of a bot's performance. Under ideal conditions (i.e. full pairings and at least 20-50 battles per pairing to reduce variability), APS would allow an accurate comparison between all bots in the rumble. It's uncertain how well it works with less battles or incomplete pairings.

The server calculates APS for each bot by:

  • taking the average percentage score of all battles against each opponent separately to get an APS for each pairing,
  • then averaging all pairing scores to obtain the final average.

This differs from the old/current rumble server which weights newer scores higher than previous ones. The old system uses something like: newaverage = 0.3 * newscore + 0.7 * oldaverage. The reasons behind this weighting were to account for learning/data-saving bots improving in performance over time and to discard "bad" results. However, I believe that this actually increases score variability so I've chosen to use a true average which will smooth out results. Learning bots should still see their APS go up over time, but now there is no special advantage given to a bot which saves data between battles.


Glicko Rating System

Created by Mark Glickman, the system is described here: http://math.bu.edu/people/mg/glicko/glicko.doc/glicko.html The main difference from the Elo system is that each competitor has a ratings deviation (RD) in addition to their rating. New competitors start out with a high RD, which gradually drops as the rating settles. In Elo, winner and loser receive equal but opposite ratings adjustments after a battle, whereas in the Glicko system the adjustment is based on the RD value. A high RD results in a bigger adjustment, so new competitors are adjusted more quickly; established competitors ratings' should change more slowly.

The server implements the system to the letter, with the one exception that RD values in the rumble do not "decay" (increase) with inactivity. Scores start at 1500, RD values start at 350.


Glicko-2 Rating System

Also by Mark Glickman: http://math.bu.edu/people/mg/glicko/glicko2.doc/example.html This is an enhancement of the original Glicko system which adds a volatity measure to each score. The calculations are more involved but it should (in theory) provide stable ratings for even competitors whose performance is erratic (ie. high specialization index in rumble-speak).

Again the system has been implemented as described but without RD decay. Tau is set to 0.5 and the iterative volatility calculation has been bounded to at most 20 iterations.


RoboRumble Elo Ratings

I've tried to understand the old server's rating scheme and implement that system as faithfully as possible. Thanks mostly to the work of Nfwu and his commented [| EloSim] code, plus details scattered about the old wiki, I've pieced together the following:

  • New competitors start at a rating of 1600
  • The expected % score outcome of a pairing between bots A and B is given by E(A,B) = 1.0 / (1 + 20^(ratingA-ratingB)/800))
  • When a new pairing result is submitted to the server:
    1. The new % score for bot A vs bot B New(A,B) = scoreA / (scoreA + scoreB)
    2. The running pairing score of A vs B Pair(A,B)' = 0.7 * Pair(A,B) + 0.3 * New(A,B)
    3. Then calculate the rating change for A by iterating over all ranked bots Ri:
      • deltaRatingA += 3.0 * (Pair(A,Ri) - E(A,Ri)
    4. Do the same for B
    5. Update the ratings for A and B by adding the new delta to their current rating.

As mentioned above, this server uses the true average for Pair(A,B); otherwise the calculations are the same.


Final Notes

One of Albert's most interesting changes to the standard Elo calculation is that each new battle result requires updating a bot's rating by iterating over the whole participant list, not just the one involved in the battle. I have no idea how this might affect rankings inflation/deflation, but I think it contributes significantly to stabilizing the ratings faster since every time a new result comes in, we re-calculate that bot's rating compared to the entire field.

This is computationally expensive, but since I had to put it in for Elo, it was easy to do the same for Glicko and Glicko-2. Thus all three ratings systems are using the same method, just different calculations for the expected and ratings delta values.

Based on what I've read about the three systems, Glicko-2 would probably yield the fastest-stabilizing results, but that still needs to be determined by comparing the three side-by-side. It's quite possible that Elo is "good enough", thanks to Albert's modification.