Here are a few interesting questions:
How do you prove whether 0% luck makes for more strategic games than 16% luck?
How will we know when we've finally found a good non-Europe based template for 3v3? And is EU 4x5 0% WR just as good as EU 4x4 0% SR?
Is Rise of Rome too big to be a good 1v1 map? Is it a good 2v2 map then?
Are Poon Squad's settings really that bad?
How do I explain to someone that Guiroma 1v1 is actually not a "really weird and bad template" but in fact a solid and well-tested strategic 1v1 template?
If you've ever thought about any of these questions, then my ramblings here might be of interest to you.
I basically ran into this central question- how do I find out what's a good strategic template and what isn't?- while expanding the templates used in CORP Strategic League (our internal ladder system, which now has 56 templates in rotation, so it's a little unwieldy). We all have some qualitative notions of what makes a template good strategically- reasonably low luck, a balanced map, just the right size, no weird card situations, etc.- but I've always wondered what quantitative methods could be used to overcome the biases in our qualitative judgements. After all, if Ares 3v3 really turns out to be just as good as the gold standard of EU 3v3, how do we resolve the debate? We obviously have some entrenched opinions. Similarly, we've got a tendency toward conformity when it comes to template design- most people don't stray too far from Strategic 1v1-like settings.
So how do we do this with just numbers?
Well, first, let's consider some extreme examples of "bad" templates:
- Template A (Lottery): A significantly better player and I play each other in a 1v1. We have an equal chance of winning. (In a good strategic template, the better player should be much more likely to win)
- Template B: A really good team plays a really bad team in a 2v2. The worse team has a significantly higher chance of winning. (Again, the better team should have a higher chance).
- Template C: A slightly better player plays a slightly worse player in a 1v1. The slightly better player is almost certain to win- the other player, even though he's just a tad worse, probably won't win more than 2% of the time. (The chance of winning should be commensurate with the difference in skill between the two players- Templates A and B failed to reflect that difference by inflating the worse player's odds of winning, while this template fails to reflect it by deflating them.)
So, looking at this, we kind of get an idea of what a good strategic template looks like:
A strategic template is a template that accurately reflects the difference in skill level between the two players.
Where can we go with this core assumption? Well, the main idea here is that there needs to be some way to go between a relative measurement of skill level and the probability that a player wins.
... And that's where Elo ratings come in.
So, Elo, if you're not familiar with it, is just a system where each player has a rating based on their game history, with wins against tougher players counting for more points (i.e., a player who's played one game and beaten master of desaster will have a higher Elo rating than a player who's played one game and beaten someone ranked outside the top 100 of the ladder). Your Elo gain/loss from a game is a function of the difference in ratings, which is supposed to predict the % of times that you'll win. Elo assumed a normal distribution when coming up with his system, which (at least in chess) is inaccurate- however, his system still lets us get some basic quantitative data, which should at least be good enough to compare templates.
A difference in Elo ratings can be converted to an overdog win % using the formula:
Probability that player with higher rating wins = 1 - 1 / (1 + 10 ** (EloDifference/400))
Conversely, you can estimate the rating difference between two players using the formula:
EloDifference = -400 * log10(1/WinPercentage - 1)
Source: http://www.3dkingdoms.com/chess/elo.htm
As you can see by playing around on that site, this would mean that a player who wins 70% of games against their opponent should have an Elo rating that's 147 points higher. Conversely, a player whose rating is 147 points higher than that of their opponent should win about 70% of games.
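If it helps to see those two conversions in code, here's a minimal Python sketch of them (the function names are mine, not anything from that site):

```python
import math

def expected_win_probability(rating_diff):
    # Probability that the higher-rated player wins, given the rating gap.
    return 1 - 1 / (1 + 10 ** (rating_diff / 400))

def rating_difference(win_percentage):
    # Rating gap implied by the overdog's win rate (as a fraction, e.g. 0.70).
    return -400 * math.log10(1 / win_percentage - 1)

print(expected_win_probability(147))  # ~0.70
print(rating_difference(0.70))        # ~147
```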
Where did I go with this, then?
Well, since you can convert Elo differences to win probabilities, and since you also have actual win rates from the games themselves, here's the data I played around with:
- the % of time that the "overdog" (better-rated player) wins a game
- the average rating difference between the "overdog" and "underdog" on a template
I used the rating difference to come up with an expected % of time that the overdog should have won, and compared it to actual results. This is, of course, just one of many analyses that can be performed- I liked it because it's simple.
So, this relies on the following major assumptions:
- Elo ratings accurately reflect the relative strength of players in terms of how likely they are to win a head-to-head matchup.
- A good Warlight template should have win probabilities that are very close to those predicted by Elo.
Also, there's some risk in using the Elo ratings- if you're getting them based on games only played on the template that's being tested, they're going to be a little bit "off", since they'll be tainted by the inaccuracies in the template itself- i.e., a template that makes upsets more likely is probably also going to make you underestimate how good your overdogs are and overestimate how good your underdogs are. Conversely, if you get data from games played on multiple templates, then you're making the huge assumption that someone can be "good" across a wide range of templates and that the Elo rating you're using accurately reflects their skill across that entire range- a risky, albeit useful, assumption. On top of that, games played on the templates being tested are still going to be similarly "tainted." However, once you buy into these assumptions, you can start getting cool-ish data:
You can simply subtract the actual overdog win rate from the expected overdog win rate to get a template's "bias"- a measure of how much more (or less) likely it makes upsets than Elo predicts.
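To make the whole calculation concrete, here's a rough Python sketch of it under the assumptions above- it takes a list of games as (overdog rating, underdog rating, did the overdog win) and returns the same summary numbers I report below. The function and field names are just placeholders:

```python
def template_bias(games):
    # games: list of (overdog_rating, underdog_rating, overdog_won) tuples.
    total = len(games)
    overdog_wins = sum(1 for _, _, won in games if won)
    avg_difference = sum(o - u for o, u, _ in games) / total

    # Expected overdog win rate, taken from the *average* rating gap
    # (the same shortcut I use for the ladder numbers below).
    expected = 1 - 1 / (1 + 10 ** (avg_difference / 400))
    actual = overdog_wins / total

    return {
        "average rating difference": round(avg_difference, 2),
        "expected overdog win rate": round(expected, 2),
        "actual overdog win rate": round(actual, 2),
        # positive bias = upsets happen more often than Elo predicts
        "bias": round(expected - actual, 2),
    }
```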
First, I ran this on the CORP Strategic League templates (you can find the data in the "Templates" spreadsheet at http://www.tinyurl.com/csldata). However, CSL only has 88 finished games- and the average player has only played 2.3 games so far. And, well, with a small enough dataset, you can disprove gravity or evolution. So no go.
So I decided to just test this out on the 1v1 and 2v2 ladders (all completed games as of 12:55 AM EDT on 6/16/2015). Given the focus on evaluating templates, I also wanted to check out the Real-Time and Seasonal ladders, but I'll deal with those later, as I'm not sure about the usefulness and reliability of that data (given the higher boot/surrender rates on those ladders- you can also see a lot more upsets if you just look at that data).
Also, this is all based on the assumption that I can use the Bayeselo ratings in more or less the same way I would use regular Elo ratings. I don't know enough about the theory behind Coulom's Bayeselo system to be certain of this, but eh, this was interesting, so I did it anyway.
Here's what I got from the ladders (I got win/loss and rating data from all games, ignoring games where one or both of the players' ratings were expired and set to 0):
1v1 Ladder - Strategic ME 1v1 template
total games: 41732
overdog wins: 27091
total overdog rating (summed across games): 70866841
total underdog rating (summed across games): 62354453
average overdog rating: 1698.14
average underdog rating: 1494.16
average rating difference: 203.98
overdog expected win rate: 0.76
overdog actual win rate: 0.65
bias direction: underdog
bias strength: 0.11
2v2 Ladder - Final Earth 2v2 template
total games: 2388
overdog wins: 1788
total overdog rating (summed across games): 3902977
total underdog rating (summed across games): 3353609
average overdog rating: 1634.41
average underdog rating: 1404.36
average rating difference: 230.05
overdog expected win rate: 0.79
overdog actual win rate: 0.75
bias direction: underdog
bias strength: 0.04
So, as you can see from this, upsets are much more likely to happen on the 1v1 ladder than on the 2v2 ladder. I speculate that this might be due to some players not playing a whole lot of games and being rated lower than they actually are, but given the size of the dataset, maybe the 1v1 template really does make upsets more likely. Keep in mind that this data is better understood in relative terms- the 2v2 template might not actually be biased in favor of the underdog (that could just be a flaw in my assumptions or in the dataset), but it's probably less likely to yield upsets (in Elo-based terms) than the 1v1 ladder template, which I'd say is useful data.
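(As a quick sanity check, plugging the average rating differences above back into the conversion formula reproduces the expected win rates- e.g., in Python:)

```python
for avg_diff in (203.98, 230.05):
    print(round(1 - 1 / (1 + 10 ** (avg_diff / 400)), 2))  # 0.76, then 0.79
```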
Finally, here's an idea for how you can use this to test a new template:
1. Host a 20-player Round Robin with that template. Don't invite players who are going to get booted and ruin some of your data.
2. Great. Now you have 190 games' worth of data. That's 19 games/player- more than enough for reliable Elo ratings.
3. Use Elostat or Bayeselo to give Elo ratings to each player.
4. Analyze the game data the way I did- average rating difference, overdog win %, expected win %, and the difference between them (there's a rough sketch of this after the list). I'd love to see more data on this if you'd like to share.
5. Now you have a simplified quantitative reflection of how strategic the template is in the bias strength datapoint.
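Here's roughly what steps 3-4 could look like in Python once the ratings are computed, assuming you've exported one row per game with both players' Bayeselo ratings into a CSV- the file name and column names here are made up, so adjust them to whatever you actually export:

```python
import csv

games = []
with open("round_robin.csv") as f:    # hypothetical export: one row per game
    for row in csv.DictReader(f):     # assumed columns: winner_rating, loser_rating
        w = float(row["winner_rating"])
        l = float(row["loser_rating"])
        overdog, underdog = max(w, l), min(w, l)
        games.append((overdog, underdog, w >= l))  # True if the overdog won

# template_bias() is the sketch from earlier in the post
print(template_bias(games))
```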
Also, if someone wants to run this analysis on the Real-Time Ladder (template by template) for me, it'd be much appreciated.