Offline batch ELO rating system

Using data handled by Darkcanuck on 09 August 2011, I tried a different rating system algorithm I had been thinking about for a long time. I hope people here like it.

How does the algorithm work?

  • It is heavily based on the chess ELO rating system. The score per battle is either a win (1), a draw (0.5) or a loss (0). Expected scores use the same logistic distribution formula as the chess ELO rating system, E = 1 / (1 + 10^((Rb - Ra) / 400)), i.e. Mi = log(10) and s = 400.
  • The difference is in how the ratings of each competitor are adjusted. I replaced the online K-factor-based adjustment with an offline Rprop batch learning algorithm similar to the ones used in neural networks (error = sum of scores - sum of expected scores). All battles are considered to happen in the same rating period, and the algorithm keeps iterating over all battles until the rankings converge to a stable setting (see the sketch after this list). The idea is to squeeze as much accuracy as possible out of the given battle data through the use of raw CPU power.
  • After the rankings stabilize, an anti-inflation algorithm shifts all ratings up or down so they all average to 1500.
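The sketch below illustrates the batch update loop described above. It is a minimal illustration, not the code that produced the posted rankings: the class and method names, the aggregated pairing input, the Rprop constants (growth 1.2, shrink 0.5, step bounds) and the convergence threshold are my own assumptions.

 import java.util.Arrays;
 
 public class BatchEloSketch {
 
     /** One aggregated pairing: bot a vs bot b with counted results (assumed input format). */
     public static class Pairing {
         final int a, b;
         final double wins, draws, losses;   // from a's point of view
         Pairing(int a, int b, double wins, double draws, double losses) {
             this.a = a; this.b = b;
             this.wins = wins; this.draws = draws; this.losses = losses;
         }
         double battles() { return wins + draws + losses; }
         double score()   { return wins + 0.5 * draws; } // 1 per win, 0.5 per draw, 0 per loss
     }
 
     /** Chess-style expected score: 1 / (1 + 10^((rb - ra) / 400)). */
     static double expected(double ra, double rb) {
         return 1.0 / (1.0 + Math.pow(10.0, (rb - ra) / 400.0));
     }
 
     /** Iterates Rprop over all pairings until ratings stabilize, then re-centers at 1500. */
     public static double[] rate(int bots, Iterable<Pairing> pairings, int maxIterations) {
         double[] rating = new double[bots];
         double[] step = new double[bots];
         double[] prevGrad = new double[bots];
         Arrays.fill(rating, 1500.0);
         Arrays.fill(step, 10.0);                     // initial Rprop step size (assumed)
 
         for (int it = 0; it < maxIterations; it++) {
             double[] grad = new double[bots];        // error = sum of scores - sum of expected scores
             for (Pairing p : pairings) {
                 double e = p.battles() * expected(rating[p.a], rating[p.b]);
                 double diff = p.score() - e;
                 grad[p.a] += diff;
                 grad[p.b] -= diff;                   // b's error is the mirror of a's
             }
             double moved = 0;
             for (int i = 0; i < bots; i++) {
                 double sign = Math.signum(grad[i]);
                 if (grad[i] * prevGrad[i] > 0)      step[i] = Math.min(step[i] * 1.2, 50.0);
                 else if (grad[i] * prevGrad[i] < 0) step[i] = Math.max(step[i] * 0.5, 1e-3);
                 rating[i] += sign * step[i];         // Rprop moves by the sign of the error only
                 moved = Math.max(moved, Math.abs(sign) * step[i]);
                 prevGrad[i] = grad[i];
             }
             if (moved < 1e-2) break;                 // ratings stopped moving: converged
         }
 
         // Anti-inflation: shift everything so the ratings average exactly 1500.
         double mean = Arrays.stream(rating).average().orElse(1500.0);
         for (int i = 0; i < bots; i++) rating[i] += 1500.0 - mean;
         return rating;
     }
 }

Because Rprop only uses the sign of each bot's accumulated error, not its magnitude, the same step-size rules work whether a bot has ten battles or ten thousand.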

What are the advantages over the current rating systems?

  • It is data ordering independent (no bot rating drift problem), contrary to classic ELO and Glicko-2 systems. This is one of the main strengths of offline batch learning algorithms, but it is only possible because all battle data is available for processing.
  • It is still accurate with incomplete pairings, contrary to APS and PL systems.
  • It makes fewer assumptions about results than the APS and PL systems, making it less biased. I believe it also makes the system less prone to king-making (weak competitors interfering with strong competitors' scores).
  • It has inflation control, contrary to classic ELO and Glicko-2 systems.

What are the disadvantages?

  • It is slow, as is any offline batch learning algorithm. On a 3.4 GHz CPU, it took 10 minutes to iterate once over all 27 million battles from the Darkcanuck/RRServer, counting all wins/draws/losses from all pairings. It took another minute and a half iterating a few hundred times over the pairings, and another half minute generating XHTML pages. This is for the 1v1 General league alone. There is a lot of room for optimization, though.
  • By making fewer assumptions about results, it needs more data for the same accuracy, although I believe the extra accuracy achieved with batch processing compensates for that.

The resulting XHTML files are available for download here: http://sites.google.com/site/mnrobocode/files/rankings.rar --MN 12:30, 12 August 2011 (UTC)

Philosophy behind the rating system

The idea is to bring to RoboRumble an ELO-like rating system close to what is used successfully in most sports, and with that to strengthen the competitive environment and push the evolution of competitive bots further.

I see the following assumptions being made in those rating systems:

  • Perception - If competitor A wins against competitor B, then competitor A is stronger than competitor B.
  • Simplicity - Strength is uni-dimensional.
  • Transitivity - If competitor A is stronger than competitor B, and competitor B is stronger than competitor C, then competitor A is stronger than competitor C. This is a consequence of the rankings being uni-dimensional.
  • Measurement - Stronger competitors have higher ratings than weaker competitors.
  • Accuracy - A good rating system tries to calculate a rating setup which minimizes the situations in which the above assumptions fail (mostly due to close matches and rock-paper-scissors situations).

I added the following:

  • Trust - The assumptions above, made in most sports, are good ones.
  • Experience - Sport federations knew what they were doing when they picked the logistic distribution formula.
  • Likeness - Trying to mimic the ratings of the chess federation is cool. (Mi = log(10), s = 400, and 1500 average)
  • Accuracy - Batch processing is more accurate than incremental processing. This is not really necessary to increase competitiveness, but past attempts at similar rating systems were abandoned because of accuracy issues.
  • Fairness - King-making is bad for the rumble. This is the most important assumption for increasing competitiveness, as it is where the difference between this rating system and the APS rating system lies.