+++
title = "Guildmaster: Magic The Gathering pod ranking and game tracker"
date = "2022-04-11T17:51:17-07:00"
author = "Dylan Lott"
authorTwitter = "dylanlott" #do not include @
cover = ""
tags = ["mtg", "elo"]
keywords = ["", ""]
description = ""
showFullContent = false
readingTime = false
hideComments = false
+++
Guildmaster: MTG Pod ranking and game tracker
Code can be found here at
Elo scores
Elo scores are a measure of relative skill in zero-sum games. They’re probably best known from chess, where they are the de facto standard for measuring player ratings.
They’re also quite commonly used by online multiplayer video games like Starcraft 2. Most platforms don’t publish their exact algorithm, though, and most don’t use Arpad Elo’s exact formula, instead tweaking it to their own preferences and use cases, just like we’re going to do.
Objectively measuring and scoring skill is a very difficult problem. Magic in particular is a difficult game to score: even for the same player, different starting conditions and environments can produce drastically different outcomes.
Because Elo scores are calculated on a game-to-game basis, we must track player scores from game to game, and games must be recorded and analyzed in the order they were played. However, it also means that we can make (admittedly rough) predictions backed by data about who will win in a given matchup. Prediction is an entirely different animal, though, and this post won’t cover any of that. Just know that for Elo scores, you can calculate the delta between two scores and, from it, the probability of each player winning.
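For reference, here is the standard Elo expected-score formula as a small Python sketch. It is not taken from the script described in this post (which is written in Go and may tweak the formula); the function name is made up for illustration, and the 400-point scale is the conventional chess value.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of player A against player B.

    Uses the conventional logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Equal ratings give an even matchup.
    print(expected_score(1500, 1500))  # 0.5
    # A 100-point favorite is expected to win a bit under two thirds of the time.
    print(round(expected_score(1600, 1500), 2))  # 0.64
```

The two players' expected scores always sum to 1, so only the rating delta matters, not the absolute numbers.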
Elo scores start at some arbitrary point: some chess leagues start at 1500, others at 1000, and others still at 1250. The starting point doesn’t matter as much as one might think. A player’s score will rapidly approach where it really should be, often within only a few games.
Modeling MTG games
There are some immediate problems with using Elo to score Commander games. Commander pods are typically four players, and Elo only models a direct player-to-player comparison. We have to map our four-player games of Magic onto this two-player system, essentially flattening each game down to a series of wins and losses between two players.
A four-player match is interpreted as the 1st place player (last player standing) winning a game against the 2nd place player. 2nd place loses a game to 1st place, but wins a game against 3rd place; 3rd place loses and wins a game in the same fashion, but then 4th place strictly loses one game and wins none.
Because 2nd and 3rd place each pair a loss with a win that partly offsets it, last place actually takes a slightly larger point penalty than 2nd and 3rd, and 1st place gets a slightly larger point reward for winning.
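The flattening described above can be sketched in a few lines of Python. This is an illustration rather than the script's actual code; the function name and player names are hypothetical.

```python
def pairwise_games(placements):
    """Turn a finishing order (winner first) into (winner, loser) pairs.

    Each player "beats" only the player who finished directly below them.
    """
    return list(zip(placements, placements[1:]))


if __name__ == "__main__":
    pod = ["alice", "bob", "carol", "dave"]  # hypothetical finishing order
    for winner, loser in pairwise_games(pod):
        print(f"{winner} beats {loser}")
```

A four-player pod produces three two-player results, which can then be fed to an ordinary Elo update one at a time.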
Games where one player kills multiple other players at the same time, colloquially called “table zaps”, are difficult to record. In my first approach, I’ve scored table zaps as losses in turn order: starting from the player who won, each other player loses in turn order. This has obvious drawbacks, but I haven’t found a better way to handle it that I can retroactively apply to the game log.
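Here is one possible reading of that rule as a Python sketch. It assumes the winner zaps every other player at once, that the player seated right after the winner is treated as the first to lose (and so finishes last), and that turn order wraps around the table. The function name is hypothetical, and the actual script may order things differently.

```python
def zap_placements(turn_order, winner):
    """Build a finishing order (winner first) for a full table zap.

    Assumption: players "lose" in turn order starting after the winner,
    so the first player after the winner ends up in last place.
    """
    i = turn_order.index(winner)
    # Players after the winner in turn order, wrapping around the table.
    elimination_order = turn_order[i + 1:] + turn_order[:i]
    # The first player eliminated finishes last, so reverse the list.
    return [winner] + elimination_order[::-1]


if __name__ == "__main__":
    seats = ["alice", "bob", "carol", "dave"]  # hypothetical turn order
    print(zap_placements(seats, "bob"))  # ['bob', 'alice', 'dave', 'carol']
```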
K factor
The K factor in Elo ratings is essentially the sensitivity knob. A higher K factor means a bigger rating swing from the same result. Turn it too high, and your score could drastically drop after just a few losses, which doesn’t intuitively line up with our subjective expectations of skill.
On the other hand, set it too low and your score could lag behind your actual skill level, creating a frustrating lack of challenge for a player and making it difficult to see meaningful progression.
Some Elo implementations change the K factor based on the number of games a player has played. In our case, we’re going to simplify K factor handling and set it to a flat 40 all the time. As a reference point, 32 is commonly used for chess players with fewer than 30 games under their belt, and Elo set K equal to 10 in the original Elo formulation.
A K value of 40 keeps our algorithm springy and reactive, which we want in a game where the politics and meta matter as much as a player’s deck and play style, and where players might play a burst of 3 or 4 games in a day and then go months without any others.
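To make the effect of K concrete, here is a minimal Python sketch of the standard Elo update for a single two-player result. It is not the script's code; the function names are hypothetical, and the update rule is the textbook one (winner gains K times the "surprise" of the result, loser drops by the same amount).

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(winner_rating, loser_rating, k=40.0):
    """Return the (winner, loser) ratings after one game."""
    gain = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + gain, loser_rating - gain


if __name__ == "__main__":
    # Two evenly rated players: K directly sets the size of the swing.
    print(update(1500, 1500, k=40))  # (1520.0, 1480.0)
    print(update(1500, 1500, k=10))  # (1505.0, 1495.0)
    # An upset moves ratings further than an expected result does.
    print(update(1400, 1600, k=40))
```

With equal ratings, the swing is exactly half of K, so K = 40 moves each player 20 points per game while K = 10 moves them only 5.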
Two Headed Giant
Another interesting problem for our Elo scoring is Two Headed Giant. In EDH games, Two Headed Giant is a variant where two players team up and share one life total. For Two Headed Giant games, I have treated each pair as its own “player”. This is similar to how online games handle team ratings. For example, Starcraft 2 tracks a separate 2v2 ladder score for each partner you ladder with.
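One simple way to treat a pair as its own “player” is to give the team a rating key built from its members. The Python sketch below is an illustration under assumptions: the key format, the helper name, and the 1500 starting rating are all hypothetical and not taken from the script.

```python
from collections import defaultdict

START_RATING = 1500.0  # assumed starting point; the post does not fix one


def team_key(*players):
    """Build an order-independent rating key for a team of players."""
    return "+".join(sorted(players))


# Teams and solo players share one table; unseen keys start at START_RATING.
ratings = defaultdict(lambda: START_RATING)

if __name__ == "__main__":
    print(team_key("bob", "alice"))            # alice+bob
    print(ratings[team_key("alice", "bob")])   # 1500.0
```

Sorting the names means the same pair always maps to the same rating, no matter what order they were written down in.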
Turn order
This scoring system is ignorant of any concept of turn order, other than when a table zap is recorded and players lose in the respective turn order. Otherwise, we have no information about who went first in the match or whether the turn order ever changed. Both are good data points to consider for future improvement, as turn order has a notable effect in Commander, and can have an even bigger effect in competitive EDH, or cEDH.
Looking Forward
The next feature I want to add is more detailed tracking. It should be possible to gather metrics around first blood, commander, turn order, and so on from willing sources.
The script
I hacked together an Elo scoring script in about an hour using elo-go. The code can be found here. It reads from a CSV file and computes scores for each player from the sheet on a game-by-game basis, updating each player’s score as it comes across them and then outputting the final score list at the end.
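The overall flow can be sketched in Python as follows. This is not the actual script, which is written in Go on top of elo-go. The CSV layout (one game per row, players listed in finishing order), the 1500 starting rating, and all names here are assumptions made for illustration. Ratings are updated after each pairwise result, in order.

```python
import csv
import io
from collections import defaultdict

START_RATING = 1500.0  # assumed starting rating
K = 40.0               # flat K factor, as described in the post


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def score_games(csv_file):
    """Read games in order and return the final rating for every player.

    Assumed layout: each row is one game, with players in finishing order
    (winner first). Each player "beats" the player directly below them.
    """
    ratings = defaultdict(lambda: START_RATING)
    for row in csv.reader(csv_file):
        placements = [name.strip() for name in row if name.strip()]
        for winner, loser in zip(placements, placements[1:]):
            gain = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += gain
            ratings[loser] -= gain
    return dict(ratings)


# Hypothetical game log: two pods, recorded in the order they were played.
SAMPLE = "alice,bob,carol,dave\nbob,alice,dave,carol\n"

if __name__ == "__main__":
    final = score_games(io.StringIO(SAMPLE))
    for name, rating in sorted(final.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {rating:.1f}")
```

Because every update adds to one player exactly what it takes from another, the total number of points in the pool stays constant, which is a handy sanity check on a game log.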