
Ranking Systems 02 - The Elo Rating System

As we have said before, skill and ranking are never measured directly; instead, they are inferred from the observed performance of a player in previous matches. The idea of building an estimator based on pairwise wins and losses is quite old, but the Elo System represents its simplest and most popular implementation.

History of the Elo System

The Elo system takes its name from its inventor: the Hungarian-American physicist Arpad Elo, who wanted to improve the chess rating system of the time – the Harkness system – because it was inaccurate and not based on statistical models. Elo developed his system in the 1950s, and in 1960 it became the official system used by the USCF to rank chess players.

Ten years later, in 1970, the Elo system was adopted by the World Chess Federation (FIDE) and, from there, many other tournaments and sports started adopting it as well. In 1978, Elo published a book titled “The Rating of Chessplayers, Past and Present”, in which he explained his method in detail.

Figure 1. The book “The Rating of Chessplayers, Past and Present”

Part of its success comes from the fact that the system is simple: every player can understand it and predict how many points they will lose or gain with nothing more than a calculator. But how does it work?

Basic Mechanism

Elo’s central assumption was that the chess performance of each player in each game is a normally distributed random variable. As we have seen in the previous article, the true skill is unknown, but we can assume that we can model it with a specific probability distribution. If we assume that such a distribution is normal (and we assume the same variance for each player), we can encode the “skill” with a single number: the mean value. This value, in this context, is called the “Elo Score.”

However, a rating system is useless if we cannot update it when we observe the result of a match or tournament. The update mechanism proposed by Elo is very simple: if player A has a higher Elo score than player B but B wins over A, the system lowers A’s rating a bit while raising B’s.

The intensity of the update depends on the difference between A’s and B’s scores. If A’s score was much higher than B’s, then A was very likely to win. However, because A lost to an opponent who was much weaker (at least on paper), we can infer that the real difference between A and B must be smaller than the ratings suggest, and therefore the update will be large. On the other hand, if A won, that was exactly what was supposed to happen: the Elo Score predicted the result accurately, and there is no need for a big update.

Math Details!

In my opinion, it is much harder to explain the Elo system without going a bit deeper into everyone’s favorite part: the math details. Do not worry, I’ll go step by step.

1) Score Prediction

The first step of the Elo system is its core assumption: the difference in rating between two players should be able to predict the expected score of a match. As we briefly explained before, if A’s rating is higher than B’s, then A is more likely to win.

Figure 2. The shape of the score prediction function.

To transform the difference between A’s and B’s ratings into a “score estimation,” we can use any function that ranges from 0 (certain defeat) to 1 (certain victory). Usually, the Elo system is based on the logistic function.
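Writing \( R_A \) and \( R_B \) for the two ratings and \( c \) for a scaling constant, the expected score of player A takes the form:

$$ E_A = \frac 1 {1 + 10^{-(R_A - R_B)/c}} $$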

The constant c is just a scaling factor that determines how much a 1-point difference in rating affects the expected score. In his work, Elo suggested a scaling constant such that a 200-point difference in favor of player A corresponds to a 0.75 expected score for A.1

If you solve the equation for these values, you get \( c \approx 419 \), a value that is usually approximated to 400.

$$ E_A = \frac 1 {1 + 10^{(R_B - R_A)/400}} \qquad E_B = \frac 1 {1 + 10^{(R_A - R_B)/400}} = 1 - E_A $$

If a player plays multiple matches (even against different opponents), the expected score is just the sum of the individual expected scores.
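For instance, a small helper (the function name is just for illustration) could sum the per-match expectations directly from the formula above:

// Sum of the per-match expected scores of a player against a list of opponents,
// using the logistic formula with the usual 400 scaling constant.
function totalExpectedScore(playerRating, opponentRatings) {
    return opponentRatings.reduce(
        (sum, opponentRating) =>
            sum + 1 / (1 + Math.pow(10, (opponentRating - playerRating) / 400)),
        0
    );
}

// Example: a 1600-rated player facing opponents rated 1500 and 1700.
console.log(totalExpectedScore(1600, [1500, 1700])); // ≈ 0.64 + 0.36 = 1.0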

2) Update the Rating

Now we have an expectation. To update the rating for a player we just need to wait for the end of the match/tournament and compare the expected score with the Actual Score.

If the Actual Score is greater than the Expected Score, this means that the system undervalues the player: we need to increase their rating! On the other hand, if the Actual Score is lower than the Expected Score, then the system overvalues the player: we need to decrease their rating.

There are potentially infinite ways to do that. Elo, however, suggested a very simple linear adjustment.

$$ R_A^\prime = R_A + K(S_A - E_A) $$

The K value (usually called the K-factor) represents the maximum adjustment per game. It controls how quickly a player’s rating is adjusted after each game. There is no prescribed value for it. In chess, K = 16 is usually used for veteran players and K = 32 for novices.
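For example, suppose a player rated 1600 (with K = 32) beats an opponent rated 1500:

$$ E_A = \frac 1 {1 + 10^{(1500 - 1600)/400}} \approx 0.64 \qquad R_A^\prime = 1600 + 32\,(1 - 0.64) \approx 1611.5 $$

Had the same game ended in a draw (\( S_A = 0.5 \)) instead, the higher-rated player would have lost about 4.5 points, because the system expected slightly more than half a point from them.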

3) The Initial Score

Everything is fine, but what if a player has no score yet? You can just decide on a standard starting value. You can use zero if you like, but we tend to avoid negative numbers for players’ ratings, so the initial (and therefore average) rating is usually set around 1000 or 1500.

Demo

It is time to try. I have prepared a simple demo showing how we can calculate a new ranking given the player’s rating and the opponents’ ratings.

The scenario is simple: the player challenges two opponents (each with a specific Elo rating). Depending on the rating difference and the score, we can see how the player’s Elo rating is updated (the K-factor is 36).

The implementation is trivial, so I do not think it is particularly interesting. But if you are wondering, here it is.

// Expected score for the player rated `aRating` against an opponent rated `bRating`.
function expectedScore(aRating, bRating) {
    return 1 / (1 + Math.pow(10, (bRating - aRating) / 400));
}

// Rating adjustment: K-factor times the difference between actual and expected score.
function updateValue(score, expectedScore, kFactor) {
    return kFactor * (score - expectedScore);
}
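As a usage example (the ratings and results below are made up, just to show the mechanics), the two-opponent scenario of the demo could be computed like this:

// Hypothetical scenario: a 1000-rated player beats opponents rated 1100 and 900 (K = 36).
const playerRating = 1000;
const matches = [
    { opponentRating: 1100, score: 1 },  // win against a stronger opponent
    { opponentRating: 900,  score: 1 },  // win against a weaker opponent
];

let newRating = playerRating;
for (const { opponentRating, score } of matches) {
    newRating += updateValue(score, expectedScore(playerRating, opponentRating), 36);
}

console.log(newRating); // 1000 + 23.0 + 13.0 = 1036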

Elo Rating Issues

Nothing is perfect, not even the Elo System. In particular, two significant issues plague the Elo System:

Sitting on the Rating

Imagine you are a veteran player who, after a decade of tournaments, reached the highest rank with a huge Elo Rating. Now what? A new tournament is finally announced and you are invited but… is it worth it? As the player with the highest rating, you have little to gain and much to lose: every possible opponent is rated below you. Moreover, a moment of distraction can happen to anyone, and if you lose against a player with a much lower rating (a likely event given your top position), you are going to lose a lot of points.

This situation happens a lot: with the Elo System, high-ranked players often prefer not to play in order to conserve their rating. That’s a problem: we want to encourage people to play, not to stay home and sit on their rating.

There are many proposals to address this issue. One involves removing rating points over time, a kind of inactivity malus. However, this can lead to frustration: players feel they have to play even when they legitimately cannot (e.g., they are sick, on an extended vacation, or dealing with other things in their life). Another solution is to add bonus points for activity; mathematically this is not so different from the previous suggestion, but human beings react better to a bonus than to a malus.

However, if you want to make your players play, there is another solution: use another rating system. That’s what Wizards of the Coast did with Magic: The Gathering. They replaced the Elo system with a point-based ranking system called Planeswalker Points.2 And let’s be honest: Elo is a horrible system for Magic anyway (see the next section).

Ratings inflation and deflation

The Elo System is symmetrical: when a player loses X points, their opponent gains X points. This means that the total amount of points in the system is fixed. If the player population is composed of 10 players and the starting Elo is 1000, then at any point in time, even after a billion matches, the sum of all the players’ rating points will be 10 × 1000 = 10,000.
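This follows directly from the formulas above: \( E_A + E_B = 1 \) and, with wins, losses, and draws scored as 1, 0, and 0.5, \( S_A + S_B = 1 \), so the two updates cancel out (as long as both players use the same K-factor):

$$ \Delta R_A + \Delta R_B = K(S_A - E_A) + K(S_B - E_B) = K\left[(S_A + S_B) - (E_A + E_B)\right] = 0 $$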

In theory, then, no new points are ever added to the system. Except that, over a game’s history, players enter and leave the system. This causes two opposite problems for the Elo Rating: inflation and deflation.

Deflation

Let’s start with deflation. Imagine a group of 10 players again: the sum of all the points in the system is 10,000, independently of how they are distributed. Now a new player enters the game with their starting 1000 points. The new sum is 11,000. No problem so far.

Now imagine that the new player learns the game, reaches a 1200 rating, and then decides to leave competitive play. By leaving, the player removes their 1200 points from the system. The total for the remaining original players is now 9800. Uh oh. That’s an issue; do you see why?

We now have the same players, with the same relative skill, but suddenly they are sharing only 9800 points: 200 points have been drained out of the system. As a consequence, people still in the system may see their rating decrease over time even if their skill stays exactly the same. This leads to frustration and makes people quit what they perceive to be an unfair system.

This happens all the time. In general, every player enters as a noob and leaves as a competent player, draining points out of the system.

How do we solve this? It is not easy. When using an Elo System, you need a way to pump points back into the system. One possibility is to use higher K-factors for new players: this means that new players gain more points than they subtract from the more established players they beat, resulting in a positive net gain for the system (see the sketch below).
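A minimal sketch of that idea (the thresholds and values below are illustrative, not a standard):

// Hypothetical K-factor schedule: provisional players move faster than established ones.
function kFactor(gamesPlayed, rating) {
    if (gamesPlayed < 30) return 40;  // new players: large, fast adjustments
    if (rating >= 2400) return 10;    // top players: very stable ratings
    return 20;                        // everyone else
}

When a newcomer beats an established player, the newcomer’s gain (computed with the larger K) exceeds the veteran’s loss, so each such win injects a small surplus of points into the pool.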

Inflation

Inflation is the opposite problem: a certain score today (e.g., 2700) may represent a lower skill than the same score did 10 years ago. Inflation makes it hard to compare players from different periods. This is not supposed to happen, but it happens anyway, especially among high-ranked players.3

Figure 3. The number of Chess players rated more than 2700 in 1979, 1994 and 2009.

Why is there inflation? It is not entirely clear, but there is a possible explanation that I find remarkably similar to how a black hole evaporates. :)

To understand this, we need to note that high-ranked players usually play among themselves (there is a rating threshold for entering their tournaments). These groups are called rating islands.

Now imagine that an overrated player enters one of these tournaments. It can happen: after all, the Elo rating is just a statistical process, and a player may get lucky, win a series of matches, and climb over the threshold (for instance, 2200).

On the day of the tournament, the player gets destroyed and drops back to their true skill level, which is below 2200. The player is now out of the island and will probably never come back; however, the points they lost remain in the island, distributed among the high-ranked players. The sum of points in the high-ranked players’ island is now bigger than before, even though their relative strength is still the same.

Multiply this by hundreds of players over decades, and you have a possible explanation for this issue.

When NOT to use the Elo System

Suppose you are developing a game, or organizing a tournament for a game or sport, and you want to set up a ranking for competitive players. Is Elo the right choice?

In general, yes, it is. The Elo System is simple and effective: if you do not know where to start, the Elo System is the right choice. Many commercial games use Elo for their matchmaking rating without any issue.

BUT

There is a huge but. The Elo System was designed for Chess: a two-player, skill-based, zero-sum, perfect-information, deterministic game. If your game is not a game “like Chess,” you can start to have issues. Let’s go over them and see when we should avoid Elo (or, at least, a pure Elo system).

When the gameplay contains randomness

The main warning sign against using Elo for your rating system is when luck and randomness influence the outcomes of your game. The core foundation of the Elo System is that the Expected Score depends exclusively on the difference between the players’ skill. If you add randomness to the mix, this is no longer true.

Figure 4. If your players need to grind for rank, that’s a very important clue that your ranking system is broken.

A typical example is, again, Magic: The Gathering. In Magic, even the world champion can lose a game against a complete beginner just by getting unlucky and drawing too many or too few lands. According to the Elo System, that loss should cost a massive amount of points. That’s very unfair. If you couple this with a ladder system, a simple unlucky streak can cost a skilled player hundreds of positions that will be tough to win back.

In a game like that, the best strategy to keep (or even increase) your ladder position is not to play at all, and that brings us to the next issue.

When you need to encourage playing

If you are running an official international federation, the “sitting on the rating” issue may be just a minor annoyance. However, if you are writing a commercial game, like (yes, again) Magic Arena, you want people to play. You really want that. You do not want players to feel forced to play less in order to keep their rank.

In this case, the Elo System may be the wrong choice. You can use it internally, hiding the score from the player, just to provide fair matchmaking. But if you attach rewards to it, then you cannot avoid this unwanted effect.

That’s why Magic stopped using Elo for paper play (only to repeat the same mistake in the miserable experience of Magic Arena).

Multiplayer and Teams

Again, the Expected Score must depend exclusively on the difference between the players’ skill. This is not true if the game is a “team versus team” game: the individual skill is blended into the team’s, and a team can perform better (or worse) than the sum of its players’ individual skills due to many other factors.

While Elo can still predict the Expected Score reasonably well, it is incapable of updating the players’ ratings properly because it cannot separate the outcome into individual contributions.

This drawback has not stopped games from using Elo in team-based settings. There are several tweaks, but the most common one is that, when the team loses a game, every team member loses the same amount of points (and vice versa), as in the sketch below.
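A minimal sketch of that tweak, reusing the expectedScore function from the demo above and assuming each team is represented by the average of its members’ ratings (a common but not universal choice):

// Team-vs-team Elo tweak: rate each team by the average of its members' ratings,
// then apply the same delta to every member of the team.
function updateTeamRatings(teamRatings, opponentRatings, score, kFactor) {
    const average = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
    const expected = expectedScore(average(teamRatings), average(opponentRatings));
    const delta = kFactor * (score - expected);
    return teamRatings.map((rating) => rating + delta); // same gain/loss for everyone
}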

This trick works fine for fixed teams, but in a matchmaking scenario it is frustrating. Skilled players may get angry if they lose points just because a teammate is bad or is intentionally trolling and playing to lose.

Figure 5. Yep. We have all been there…

The Elo System is then perceived as unfair. Game communities have a name for that: Elo Hell. Unfortunately, there is no solution here: there is no team-based rating system that can truly evaluate individual player performance. This is still a pretty open problem.

Conclusions

This article ended up much longer than expected! At least we explored all the important points about the simplest of the rating systems. In the next article, we will go a step further by introducing an improved version of the Elo System.

I hope this can be useful! See you next time!


  1. Many people call it win probability. Note, however, that the expected score is defined as the “probability of winning plus half the probability of drawing”. Therefore, it is not possible to recover the individual probabilities of winning and drawing from the score (unless there are no draws in the game). ↩︎

  2. We will talk about point-based ranking systems in a future article. ↩︎

  3. The emblematic case is Chess (probably due to its long history of ratings). In 1979 there was only ONE player with a rating greater than 2700; now there are more than 44. ↩︎
