About two years ago I started keeping score of who won our Magic games. I didn't know what I wanted to do with it, but after sitting on it for 2 years I think it's ripe for some initial analysis of my data set. I obviously will keep tracking games, but for now, I was curious how things were shaping up.
Elo scores are a measure of relative skill in zero-sum games. They are probably best known from chess, where they are the de facto standard for player ratings.
They're also quite commonly used by online multiplayer video games like Starcraft 2. However, most platforms don't publish their exact algorithm, and most don't use Arpad Elo's exact formula, instead tweaking it to their own preferences and use cases, like we're going to do.
Objectively measuring and scoring skill is a very difficult problem. Magic in particular is a difficult game to score, where even for the same player, different starting conditions and environments can produce drastically different outcomes.
Because Elo scores are calculated on a game-to-game basis, we must track player scores from game to game, and games must be recorded and analyzed in the proper order. However, it also means that we can make (admittedly rough) predictions backed by data about who will win in a given matchup. Prediction is an entirely different animal, though, and this post won't cover any of that. Just know that for Elo scores, you can calculate a delta between two scores and, from that, the probability of each player winning.
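To make the "delta between two scores" idea concrete, here is a minimal Python sketch of the standard Elo expected-score curve. It is an illustration only, not the elo-go code used for this post: the function name is made up, and the 400-point scale is the usual Elo convention, assumed here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve.

    Assumption: the conventional 400-point scale, where a 400-point
    lead corresponds to 10-to-1 odds.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give a coin flip; a 100-point lead gives roughly 64%.
print(round(expected_score(1500, 1500), 2))
print(round(expected_score(1600, 1500), 2))
```

The two probabilities for any pairing always add up to 1, which is what makes the system zero-sum.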
Elo scores start at some arbitrary point - some chess leagues start at 1500, others at 1000, and others still at 1250. The starting point doesn't matter as much as one might think. A player's score will rapidly approach where it really should be, often within only a few games.
For example, chess Grandmasters have Elo scores of 2500 and up, while a beginner might have a score of 1000-1200 and a novice might be closer to 1400, with good chess players starting around 1500 and excellent ones breaking the 1800 mark.
Modeling MTG games
There are some immediate problems with using Elo to score Commander games. Commander pods are typically 4 players, and Elo only models a direct player-to-player comparison. We have to map our games of 4 player Magic to this 2 player system, essentially flattening all of our games down to a win/loss between two players.
A 4 player match is interpreted as the 1st place player (last player standing) winning a game against the second place player. 2nd place loses a game to 1st place, but wins a game against 3rd place; 3rd place loses and wins a game in the same fashion, but then 4th place strictly loses one game and wins none.
This means that last place actually has a slightly larger point penalty than 2nd and 3rd, and 1st place has a slightly larger point reward for winning.
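The chained mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual script: the names `score_pod` and `expected_score` are hypothetical, the player labels are placeholders, and every pairing is scored from the ratings each player had before the game started (the post doesn't say whether updates are applied that way or one pairing at a time).

```python
def expected_score(rating_a, rating_b):
    # Standard Elo curve on the conventional 400-point scale (assumed).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def score_pod(ratings, finishing_order, k=40):
    """Score one pod: each place wins one game against the place below it.

    ratings: dict of player -> rating before the game
    finishing_order: players from 1st place to last place
    """
    before = dict(ratings)   # ratings going into the game
    after = dict(ratings)    # ratings coming out of it
    for winner, loser in zip(finishing_order, finishing_order[1:]):
        delta = k * (1.0 - expected_score(before[winner], before[loser]))
        after[winner] += delta
        after[loser] -= delta
    return after


pod = {"A": 1500, "B": 1500, "C": 1500, "D": 1500}
print(score_pod(pod, ["A", "B", "C", "D"]))
```

With four evenly rated players, 1st place gains 20 points, 4th place loses 20, and 2nd and 3rd each win one pairing and lose one, so they end up where they started.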
Games where one player kills multiple other players at the same time, colloquially called "table zaps", are difficult to record. In my first approach, I've scored table zaps as a loss in turn order, starting from the player who won and with each player losing in turn order. This has obvious drawbacks, but I haven't found a better way to handle it that I can retroactively apply to the game log.
The K factor in Elo ratings is essentially the sensitivity knob. A higher K factor means more reaction from the same inputs. Turn it too high, and your score could drastically drop after just a few losses, which doesn't intuitively line up with our subjective expectations of skill.
On the other hand, set it too low and your score could lag in representing what your actual skill level is, creating a frustrating lack of challenge for a player and making it difficult to see meaningful progression.
Some Elo implementations change the K factor based on the number of games a player has played. In our case, we're going to simplify K factor handling and set it to a straight 40 all the time. As a reference point, 32 is commonly used for chess players with fewer than 30 games under their belt, and Elo set K equal to 10 in the original Elo formulation.
Lowering our K value to 10 would make for much less difference between players in our rankings, which we don't want because of our limited pool of players and relative infrequency of matches. Even a K value of 32 seemed a little too resistant to movement.
A K value of 40 keeps our algorithm springy and reactive, which we want in a game where the politics and meta matter as much as a player's deck and play style, and where players might play a burst of 3 or 4 games in a day and then go months without any others.
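To see what the K knob does in practice, here is a small Python sketch that applies the standard Elo update to the same result with the three K values mentioned above. The helper names are made up for illustration, and the 400-point scale is the usual convention rather than something confirmed for elo-go.

```python
def expected_score(rating_a, rating_b):
    # Standard Elo curve on the conventional 400-point scale (assumed).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rating_change(k, rating, opponent, won):
    """Points gained (or lost, if negative) from one game at sensitivity k."""
    actual = 1.0 if won else 0.0
    return k * (actual - expected_score(rating, opponent))


# Same game, same evenly matched players, three different K values.
for k in (10, 32, 40):
    print(k, rating_change(k, 1500, 1500, won=True))
```

Between two evenly rated players, a win is worth 5 points at K=10, 16 at K=32, and 20 at K=40. With a small pool and long gaps between games, the larger steps are what let the rankings move at all.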
Two Headed Giant
Another interesting problem that our Elo scoring presents is Two Headed Giant. In EDH games (EDH being another name for Commander), Two Headed Giant is a variant where two players team up and share one life total. For Two Headed Giant games, I have treated the pairs as their own "player". This is similar to how online games handle team ratings. For example, Starcraft 2 tracks your 2v2 ladder score for each other player you ladder with.
This scoring system is ignorant of any concept of turn order other than when a table zap is recorded and players lose in the respective turn order. Otherwise, we have no information about who went first in the match and if the turn order ever changed. Both of these are good data points to consider for future improvement, as turn order has a notable effect in Commander, and can have an even more important effect in competitive EDH, or cEDH.
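One simple way to treat a pair as its own "player" is to give the team a single, order-independent name and rate that name like anyone else. The Python sketch below is only an illustration of that idea: `team_key`, the " & " separator, the 1500 starting score, and the player names are all assumptions, not details taken from the actual game log.

```python
START = 1500  # assumed starting score; the post doesn't state the one used


def team_key(*players):
    """Build one rating key for a team, independent of the order given."""
    return " & ".join(sorted(players))


ratings = {}
# Two hypothetical Two Headed Giant teams entering the ratings table.
for entrant in (team_key("Bob", "Alice"), team_key("Dan", "Carol")):
    ratings.setdefault(entrant, START)

print(ratings)
```

Because the team has its own entry, a Two Headed Giant result never touches either player's individual score.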
I hacked together an Elo scoring script in about an hour using elo-go. The code can be found here. It reads from a CSV file (comma separated values, the format that spreadsheet programs like Excel and Google Sheets can export) and computes scores for each player from the sheet on a game by game basis, updating each player's score as it comes across them and then outputting the final score list at the end.
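For readers who want a feel for that loop without reading the Go code, here is a rough Python sketch of the same idea. It is not the elo-go script: the column layout (date, then players from 1st place to last), the sample rows, the player names, and the 1500 starting score are all assumptions made for illustration. Only the K value of 40 and the chained scoring come from this post.

```python
import csv
import io

START = 1500  # assumed starting score
K = 40        # the K factor used in this post

# Hypothetical log: one row per game, players listed from 1st to last place.
SAMPLE = """date,first,second,third,fourth
2020-01-04,Ann,Ben,Cal,Dee
2020-01-04,Ben,Ann,Dee,Cal
"""


def expected_score(rating_a, rating_b):
    # Standard Elo curve on the conventional 400-point scale (assumed).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def score_log(csv_text):
    """Walk the log in order and return the final rating for each player."""
    ratings = {}
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        order = [name for name in row[1:] if name]  # drop empty seats
        for name in order:
            ratings.setdefault(name, START)  # first appearance starts here
        before = dict(ratings)  # score every pairing from pre-game ratings
        for winner, loser in zip(order, order[1:]):
            delta = K * (1.0 - expected_score(before[winner], before[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


final = score_log(SAMPLE)
for name, score in sorted(final.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.1f}")
```

Rows have to be processed in the order the games were played, since each game is scored against the ratings left behind by the one before it.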
Okay, finally enough background information and nerd drivel. Here are the numbers.
Other interesting points
Jacob and Marshall are tied for most wins at 50 each. Dylan is in third place at 43 total wins. This metric is subject to frequency bias, but is interesting nonetheless.
The largest game in my tracker was 6 people, and thank god they're rare. A 6 person game takes forever.
Brenden and Dylan are tied for most 4th place finishes, clocking in at 14 times each.
Dylan has more than double the number of 3rd place finishes of any other player in the records: 44, compared with 20 for Brenden.
The tracker has 225 games across 2 years, with 80 unique dates. There's a prominent gap after March of 2020, when the pandemic shook everything up. It took us several months to get back to playing regularly, which is reflected in the games.
I logged three different Two Headed Giant games.
Problems and areas for future improvement
There are some problems with our data model that I want to dive into here.
Table zaps, part 2.
First of all, table zaps are still a hard problem. I've accounted for them in the most idiomatic way possible, but let's consider some other options.
My current approach models them as everyone that dies losing in turn order. But what if we modeled a table zap as the 1st place player winning a game against 2nd, 3rd, and 4th place? In our current Elo model, it would create an incentive to table zap because it would be basically like winning three games at once instead of one.
What if we made it so that it was just 1st place winning a game against the average score of the other three players? This could work, but it would be skewed by the difference between the other three players' scores, and that doesn't necessarily mean that first place is a better player if there's a wider gap in skill between the other three players. One could even argue that it actually should mean less reward for the 1st place player if they beat a wide range of skills.
We could also consider it a win against 2nd place, and 3rd and 4th place would simply lose their games, but they wouldn't be counted as wins for 1st place. This could work, but it feels arbitrary and counter intuitive.
Another option is to mark the three players that lose as a draw, and first place as a win. This is maybe the most compelling option, because it equally punishes all three of the losers, but we still have to have a score for 1st player to win against, since Elo scores are sensitive to the difference in the score of the player that you beat.
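To compare how much first place would earn under some of these options, here is a small Python sketch. It is a thought experiment rather than anything from the actual script: the function names are made up, the ratings are invented, and the standard 400-point Elo curve is assumed.

```python
K = 40  # the K factor used in this post


def expected_score(rating_a, rating_b):
    # Standard Elo curve on the conventional 400-point scale (assumed).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def reward_chain(winner, others):
    """Current model: 1st place only wins one game, against 2nd place."""
    return K * (1.0 - expected_score(winner, others[0]))


def reward_three_wins(winner, others):
    """1st place wins a separate game against each zapped player."""
    return sum(K * (1.0 - expected_score(winner, o)) for o in others)


def reward_vs_average(winner, others):
    """1st place wins one game against the average of the zapped players."""
    average = sum(others) / len(others)
    return K * (1.0 - expected_score(winner, average))


even_pod = [1500, 1500, 1500]
wide_pod = [1300, 1500, 1700]  # same average, much wider spread

print(reward_chain(1500, even_pod))
print(reward_three_wins(1500, even_pod))
print(reward_vs_average(1500, even_pod))
print(reward_vs_average(1500, wide_pod))
```

Against an even pod, the three-wins model pays 60 points where the other two pay 20, which is the table zap incentive described above. The averaging model pays the same 20 points for the even pod and the wide pod, so it hides the spread in skill entirely.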
I manually entered all of these games myself, which is an error prone operation at best and a straight up biased source of truth at worst. I am only human, but I have tried to be objective at every possible turn.
In cases where I didn't have enough data to log a game, e.g. I only had the winner, or I had the setup but no clear ranking to the game, I discarded it completely. This definitely affects the data but there's no way to know how much. I estimate that my entries are probably 90% accurate, based on how consistent the data was.
I logged some data around what Commander each player was playing, but not regularly enough to make any meaningful analysis from it. In the future, I'd like to incorporate it, but Commanders come and go so frequently, and even among a single commander, the decklist can change drastically, so it's very difficult to extrapolate anything from just the Commander name.