I’ve been experimenting with TrueSkill as a potential replacement for the Bayesian ELO that the ladders use.
I’ve put together a small sample app (download at the bottom of this post) that calculates player ratings using the TrueSkill algorithm. It can be used to compare the results of the two algorithms side by side. Below are the top 30 results for the 1v1 ladder, as of the date of this blog post.
TrueSkill                                              |  Bayesian ELO
Rank  Player                     Rating  Wins  Losses  |  Rank  Name                 Elo
-------------------------------------------------------+--------------------------------
1     zaeban                   2421.536    41       6  |  1     AceWindu            2176
2     Gui                      2372.826    15       2  |  2     Gui                 2163
3     AceWindu                 2354.958    17       3  |  3     zaeban              2130
4     Rubik87                  2308.389    32       8  |  4     ????Chaos           2085
5     unknownsoldier           2287.282    20       7  |  5     NuckLuck            2060
6     Heyheuhei                2264.112    62      23  |  6     Rubik87             2042
7     Eitz                     2258.755    31      11  |  7     unknownsoldier      2039
8     NuckLuck                 2254.831    16       5  |  8     MonsenhorChacina    2029
9     ????V                    2210.151    29      11  |  9     zibik21             2021
10    ????Chaos                2205.284    20       2  |  10    WMMekBlaze          2004
11    chas                     2203.465    36      15  |  11    ????V               1995
12    Oliebol                  2187.576    21       8  |  12    Yeon                1992
13    20AquaHolic              2158.945    61      31  |  13    Eitz                1989
14    Yeon                     2151.634    18       6  |  14    Oliebol             1981
15    13CHRIS37                2143.782    44      16  |  15    20TheWindowCleaner  1977
16    DrTypeSomething          2098.338    18       6  |  16    Heyheuhei           1963
17    WMMekBlaze               2098.327    18       4  |  17    DrTypeSomething     1961
18    TheEmperorCornInMyTight  2079.722    61      29  |  18    chas                1954
19    MonsenhorChacina         2075.971     9       3  |  19    PaniX               1950
20    Mian                     2072.797    23      10  |  20    Troll               1943
21    Hroptatyr                 2071.86    26      12  |  21    13CHRIS37           1935
22    PaniX                    2069.742    16       9  |  22    TheImpaller         1933
23    bytjie                   2037.808    20      11  |  23    Mian                1927
24    Xyphistor                2018.273    71      39  |  24    fwiw                1917
25    alababi                  2010.574    18       8  |  25    alababi             1907
26    REGLMentysh              2005.368    28      17  |  26    Hroptatyr           1904
27    WMDazedInsane            1992.604    24       9  |  27    LilEitz             1892
28    20TheWindowCleaner       1988.566    15       5  |  28    bytjie              1886
29    JimH                     1985.033    33      18  |  29    WMDazedInsane       1882
30    Tor                      1978.475    24      17  |  30    Fizzer              1879
The rating values themselves are not important here, only the ordering of the players. Wins and losses are listed only in the left table, but both tables are computed from the same games, so each player's record is the same on both sides.
Even though Bayesian ELO is what the site uses now, you might notice some differences between the right table and what WarLight.net shows today. They’re not identical since WarLight doesn’t give ranks to players who have left the ladder, don’t have 10 games yet, or are on vacation.
Algorithm Differences
Each algorithm has advantages and disadvantages. The biggest difference is that the ripple effect in Bayesian ELO does not exist in TrueSkill. That is, when a game ends, Bayesian ELO applies the biggest changes to the two players who played that game, but it also applies smaller adjustments to everyone who has played either of those players, and so on.
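To make the ripple effect concrete, here is a minimal sketch of a batch rating fit. It is not the actual Bayeselo implementation: it assumes a plain logistic (Elo-style) win model with a Gaussian prior, fitted by gradient ascent over all games at once, and the names fit_ratings, prior_sd, steps and lr are made up for this illustration. Because every rating is re-estimated from the full game history, a later game between B and C also moves A, who did not play in it.

```python
import math

def fit_ratings(games, prior_sd=2.0, steps=2000, lr=0.5):
    """Fit all ratings jointly from the full list of (winner, loser) games.

    Assumption: logistic win model plus a Gaussian prior centered on 0.
    This is a simplified stand-in for a batch method, not Bayeselo itself.
    """
    players = sorted({name for game in games for name in game})
    theta = {p: 0.0 for p in players}  # strengths in natural-log units
    for _ in range(steps):
        # The prior pulls every rating toward 0 so the fit stays finite.
        grad = {p: -theta[p] / prior_sd ** 2 for p in players}
        for winner, loser in games:
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for p in players:
            theta[p] += lr * grad[p]
    scale = 400.0 / math.log(10.0)  # convert to Elo-like points
    return {p: theta[p] * scale for p in players}

# A beat B. Later, B beats C. A was not involved in the second game.
before = fit_ratings([("A", "B")])
after = fit_ratings([("A", "B"), ("B", "C")])
print(f"A before B's win over C: {before['A']:.1f}")
print(f"A after B's win over C:  {after['A']:.1f}")
```

In this sketch A's rating rises after B beats C, because B now looks stronger and so A's earlier win over B counts for more.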
The nice thing about TrueSkill is that when a game ends, you know immediately how many rating points you gained or lost. Your rating also changes only when you finish a game. This means you can see exactly how your rating reached its current value, as each game can show its effect on your rating.
The disadvantage of this is that *when* you defeat an opponent matters. Say player A rises from #30 to #1 on the ladder. If you defeat player A when they’re at #1, you’ll get a much bigger ratings boost than you would have if you had defeated them when they were #30. This isn’t true in Bayesian ELO, since the rating points you earned from defeating player A increase as player A climbs the ladder.
This is most visible in contrived examples. Say player A defeats B, B defeats C, and then C defeats A. In Bayesian ELO, all three players would be tied, as their victories form a perfect triangle, evening each other out. In TrueSkill, player C would be the highest ranked, since they defeated A, who was the #1 ranked player at the time of their game.
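Both the timing effect and the triangle example can be reproduced with a small sketch of the two-player TrueSkill update. The assumptions here are mine, not necessarily what the tool uses: the commonly published default parameters (mean 25, sigma 25/3, beta 25/6), no draws, and no dynamics factor between games. The function name update and the example ratings are hypothetical.

```python
import math

MU0 = 25.0        # assumed starting mean
SIGMA0 = MU0 / 3  # assumed starting uncertainty
BETA = MU0 / 6    # assumed per-game performance noise

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def update(winner, loser):
    """Return new (mu, sigma) for the winner and loser of one 1v1 game.

    Closed-form two-player update, assuming draws are impossible.
    """
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2 * BETA ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)  # size of the mean shift
    w = v * (v + t)        # how much the uncertainty shrinks
    new_w = (mu_w + sig_w ** 2 / c * v,
             sig_w * math.sqrt(1 - sig_w ** 2 / c ** 2 * w))
    new_l = (mu_l - sig_l ** 2 / c * v,
             sig_l * math.sqrt(1 - sig_l ** 2 / c ** 2 * w))
    return new_w, new_l

# Timing: the same win is worth more when the opponent is rated higher.
me = (MU0, SIGMA0)
gain_vs_low = update(me, (20.0, 3.0))[0][0] - MU0
gain_vs_high = update(me, (35.0, 3.0))[0][0] - MU0
print(f"gain vs low-rated A:  {gain_vs_low:.2f}")
print(f"gain vs high-rated A: {gain_vs_high:.2f}")

# Triangle: A beats B, B beats C, C beats A, processed in that order.
ratings = {name: (MU0, SIGMA0) for name in "ABC"}
for w_name, l_name in [("A", "B"), ("B", "C"), ("C", "A")]:
    ratings[w_name], ratings[l_name] = update(ratings[w_name], ratings[l_name])
for name in sorted(ratings, key=lambda n: ratings[n][0], reverse=True):
    print(f"{name}: mu={ratings[name][0]:.2f} sigma={ratings[name][1]:.2f}")
```

Under these assumptions C finishes with the highest mean even though each player has one win and one loss, because C's win came against A while A was rated highest. The exact numbers will differ with other parameter choices.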
Running TrueSkill Simulations
As mentioned above, I wrote a command-line tool that allows you to run simulations of WarLight ladders with the TrueSkill algorithm.
You can download this tool from this link: WLTrueSkill.exe.
On Windows, this requires the .NET 4.0 runtime to be installed. On Mac or Linux, the tool should work fine under a recent version of Mono.
To use the tool, simply feed it one of the Bayeselo Logs linked from the wiki. For example:
WLTrueSkill < BayeseloLog0.txt
By running the program with an argument of /?, you can see some additional options.
Feedback
I'm considering using TrueSkill for Season II as a trial run. Let me know your thoughts!
This is interesting. As you said, the biggest downside is “when” you beat an opponent. For that reason, I think that short-term results would be skewed, but with enough games under your belt the rankings would even out to where they should be. What I would propose, based on this reasoning, is that you keep B-ELO for the seasonal ladders, as they have short durations, while changing the permanent ones over to TrueSkill, as I think that over time it would give a fairer estimate of ratings.
Funny how zaeban, Gui, and AceWindu are ranked 1, 2, 3 for TrueSkill, but that's reversed in the Bayesian chart.
So on average, Gui is the best WarLight player.
Or does that mean that we have a 3-way tie for the best ladder player?
zaeban – (1+3)/2 = 2
Gui – (2+2)/2 = 2
AceWindu – (3+1)/2 = 2
Since ladder games are multi-day, how does that affect the timing under TrueSkill? A single game could theoretically stretch on for weeks, and the player’s ranks could drastically change while that game is ongoing. Does TrueSkill only count their ranks at the moment the game ends? Or the moment it begins? Or some combination over time?
When it ends.
Eagleblast’s idea seems to be the best solution. With the shortness of seasonal ladders, TrueSkill would be heavily skewed. In regular ladders, it feels like TrueSkill would be the best though. Thus we would have B-ELO for seasonals and TrueSkill for the regular ladders.
SebCorps, keep up the good work! A fat Chinese New Year’s red envelope is on the way! (Though you ignore the fact that how one plays on one map in 1v1s is not the sole measure of who is the best, as if other maps/settings, team games, and FFAs don’t matter.)
Fizzer, why not create your own ranking system, something that is a tweaked basket of ranking systems that best suits WL? Or, maybe make it 50%-50% or 70% B-ELO, 30% TrueSkill (or whatever you think is best)? An average (of some sort) of the two ranking systems would be best I think: get the best of both, reduce the problems either might have.
This makes the most sense to me. I think the trickle down effect of the Bayesian method is important so it rewards you for beating good players regardless of when that victory came. (and on the flip side it ‘punishes’ you for beating on weaker players)
I like this idea. You could just take both ratings for everyone on the ladder, average them (the ratings, not the rankings), and use that average as the score that rankings are based on.
In my opinion the ripple effect is very important and shouldn’t be changed.
TrueSkill is better. That way we will all have more points and be happier 🙂
I think that getting rid of the “ripple effect” is just fine. After all, players can get better over time… why should someone who beat me a week ago when I had a very poor understanding of the game, benefit from my increase in skill?
TrueSkill is better, but the ladders have gotten stagnant and boring. Most have stopped caring about it, and many people’s ratings have fallen b/c of lack of interest. Think the ladders need a face-lift and also a reset.
The ripple effect can be crippling when a player (e.g. HHH) who’s generally very good beats you and then goes inactive for a while, getting booted from several games in the process. HHH’s rating then drops a good 500 points, which in turn drags your rating down until your game with him expires (in three months’ time).
As one person commented, the ladders have become stagnant. TrueSkill will up the competition and render delaying less effective than previously (losing a game against a low ranked player won’t continue to impact you until it expires, but can be compensated for with two or three solid victories against good players).
It also requires you to focus more on the ladder games, but at the same time doesn’t punish you excessively when you lose against a player whose rating isn’t that great.