TrueSkill

I’ve been experimenting with TrueSkill as a potential replacement for the Bayesian ELO that the ladders use.

I’ve put together a small sample app (download at the bottom of this post) that calculates player ratings using the TrueSkill algorithm, so the results of the two algorithms can be compared side by side. Below are the top 30 players on the 1v1 ladder under each algorithm, as of the date of this blog post.

TrueSkill (first table) and Bayesian ELO (second table):
Rank Player                  Rating     Wins Losses
---------------------------------------------------
1    zaeban                  2421.536   41   6
2    Gui                     2372.826   15   2
3    AceWindu                2354.958   17   3
4    Rubik87                 2308.389   32   8
5    unknownsoldier          2287.282   20   7
6    Heyheuhei               2264.112   62   23
7    Eitz                    2258.755   31   11
8    NuckLuck                2254.831   16   5
9    ????V                   2210.151   29   11
10   ????Chaos               2205.284   20   2
11   chas                    2203.465   36   15
12   Oliebol                 2187.576   21   8
13   20AquaHolic             2158.945   61   31
14   Yeon                    2151.634   18   6
15   13CHRIS37               2143.782   44   16
16   DrTypeSomething         2098.338   18   6
17   WMMekBlaze              2098.327   18   4
18   TheEmperorCornInMyTight 2079.722   61   29
19   MonsenhorChacina        2075.971   9    3
20   Mian                    2072.797   23   10
21   Hroptatyr               2071.86    26   12
22   PaniX                   2069.742   16   9
23   bytjie                  2037.808   20   11
24   Xyphistor               2018.273   71   39
25   alababi                 2010.574   18   8
26   REGLMentysh             2005.368   28   17
27   WMDazedInsane           1992.604   24   9
28   20TheWindowCleaner      1988.566   15   5
29   JimH                    1985.033   33   18
30   Tor                     1978.475   24   17
Rank Name                        Elo
------------------------------------
   1 AceWindu                   2176
   2 Gui                        2163
   3 zaeban                     2130
   4 ????Chaos                  2085
   5 NuckLuck                   2060
   6 Rubik87                    2042
   7 unknownsoldier             2039
   8 MonsenhorChacina           2029
   9 zibik21                    2021
  10 WMMekBlaze                 2004
  11 ????V                      1995
  12 Yeon                       1992
  13 Eitz                       1989
  14 Oliebol                    1981
  15 20TheWindowCleaner         1977
  16 Heyheuhei                  1963
  17 DrTypeSomething            1961
  18 chas                       1954
  19 PaniX                      1950
  20 Troll                      1943
  21 13CHRIS37                  1935
  22 TheImpaller                1933
  23 Mian                       1927
  24 fwiw                       1917
  25 alababi                    1907
  26 Hroptatyr                  1904
  27 LilEitz                    1892
  28 bytjie                     1886
  29 WMDazedInsane              1882
  30 Fizzer                     1879

The rating values themselves are not important, only the ordering of the players. Wins and losses are listed only in the TrueSkill table, but both tables are computed from the same games, so the counts are the same for both.

Even though Bayesian ELO is what the site uses now, you might notice some differences between the Bayesian ELO table and what WarLight.net shows today. They’re not identical since WarLight doesn’t give ranks to players who have left the ladder, don’t have 10 games yet, or are on vacation.

Algorithm Differences

Each algorithm has advantages and disadvantages. The biggest difference is that the ripple effect in Bayesian ELO does not exist in TrueSkill. That is, when a game ends, Bayesian ELO applies the biggest changes to the two players who played that game, but also applies smaller adjustments to everyone who has played either of them, and so on outward.
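To make the ripple effect concrete, here is a rough sketch of a rating system that fits every game at once. It is not Bayeselo's code: it assumes a plain Bradley-Terry model, solved with a simple iterative update, and a hypothetical prior of one virtual draw per pair of opponents (the function name and the `prior_draws` parameter are made up for illustration).

```python
import math
from collections import defaultdict

def fit_bradley_terry(games, prior_draws=1.0, iterations=500):
    """Jointly fit ratings from a list of (winner, loser) games.

    Returns {player: rating} on an Elo-like scale with mean 0.
    prior_draws must be > 0 so that nobody has zero wins.
    """
    wins = defaultdict(float)    # fractional wins per player
    played = defaultdict(float)  # games per unordered pair of players
    for winner, loser in games:
        wins[winner] += 1.0
        played[tuple(sorted((winner, loser)))] += 1.0
    # Assumed prior: one virtual draw between every pair that has met.
    for a, b in list(played):
        played[(a, b)] += prior_draws
        wins[a] += prior_draws / 2
        wins[b] += prior_draws / 2
    strength = {p: 1.0 for p in wins}
    for _ in range(iterations):
        denom = defaultdict(float)
        for (a, b), n in played.items():
            share = n / (strength[a] + strength[b])
            denom[a] += share
            denom[b] += share
        strength = {p: wins[p] / denom[p] for p in strength}
        # Keep the geometric mean at 1 so the ratings average to 0.
        log_mean = sum(math.log(s) for s in strength.values()) / len(strength)
        strength = {p: s / math.exp(log_mean) for p, s in strength.items()}
    return {p: 400 * math.log10(s) for p, s in strength.items()}

# "you" beat X once; afterwards X beats Y a second time.
before = fit_bradley_terry([("you", "X"), ("X", "Y")])
after = fit_bradley_terry([("you", "X"), ("X", "Y"), ("X", "Y")])
print(f"your rating before: {before['you']:.0f}, after: {after['you']:.0f}")
```

In this sketch your rating rises after a game you did not play in, because the whole pool is re-fitted every time a result is added.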

The nice thing about TrueSkill is that when a game ends, you can immediately know how many rating points you gained or lost. Your rating also only changes when you finish a game. This also means that you can see exactly how your rating got to its current value, as each game can show its effect on your rating.
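For reference, the two-player TrueSkill update can be written out as below. This is a sketch that assumes the commonly published form with no draws and no extra uncertainty added between games; the performance-noise parameter \beta and the starting values are settings, and this post does not state which ones the sample app uses.

```latex
c = \sqrt{2\beta^2 + \sigma_{\text{win}}^2 + \sigma_{\text{lose}}^2},
\qquad
t = \frac{\mu_{\text{win}} - \mu_{\text{lose}}}{c},
\qquad
v(t) = \frac{\phi(t)}{\Phi(t)},
\qquad
w(t) = v(t)\,\bigl(v(t) + t\bigr)

\mu_{\text{win}} \leftarrow \mu_{\text{win}} + \frac{\sigma_{\text{win}}^2}{c}\,v(t),
\qquad
\mu_{\text{lose}} \leftarrow \mu_{\text{lose}} - \frac{\sigma_{\text{lose}}^2}{c}\,v(t)

\sigma_i^2 \leftarrow \sigma_i^2 \left(1 - \frac{\sigma_i^2}{c^2}\,w(t)\right),
\quad i \in \{\text{win}, \text{lose}\}
```

Here \phi and \Phi are the standard normal density and cumulative distribution. Only the two players' current means and uncertainties appear, so the change is fixed the moment the game ends. Because v(t) grows as t becomes more negative, beating an opponent who is currently rated above you moves your rating more.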

The disadvantage of this is that *when* you defeat an opponent matters. Say player A rises from #30 to #1 on the ladder. If you defeat player A when they’re at #1, you’ll get a much bigger rating boost than you would have if you defeated them when they were #30. This isn’t true in Bayesian ELO, since the rating points you earned for defeating player A increase as A climbs the ladder.

This is most visible in contrived examples. Say player A defeats B, B defeats C, and then C defeats A. In Bayesian ELO, all three players would be tied, as their victories form a perfect triangle, evening each other out. In TrueSkill, player C would be the highest ranked, since they defeated A, who was the #1 ranked player at the time of their game.
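The three-game cycle above can be replayed with a small sketch of the two-player TrueSkill update. This is not the code from the sample app: it assumes no draws, no extra uncertainty added between games, and the commonly used default scale (mean 25, uncertainty 25/3, performance noise 25/6), which is clearly not the scale of the tables in this post. The names `update` and `replay` are made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
MU0 = 25.0         # assumed starting mean
SIGMA0 = MU0 / 3   # assumed starting uncertainty
BETA = MU0 / 6     # assumed per-game performance noise

def update(winner, loser):
    """Return new (mu, sigma) pairs after a decisive two-player game."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = (2 * BETA**2 + sig_w**2 + sig_l**2) ** 0.5
    t = (mu_w - mu_l) / c
    v = STD_NORMAL.pdf(t) / STD_NORMAL.cdf(t)  # size of the mean shift
    w = v * (v + t)                            # size of the variance shrink
    new_w = (mu_w + sig_w**2 / c * v, sig_w * (1 - sig_w**2 / c**2 * w) ** 0.5)
    new_l = (mu_l - sig_l**2 / c * v, sig_l * (1 - sig_l**2 / c**2 * w) ** 0.5)
    return new_w, new_l

def replay(games):
    """Apply (winner, loser) games in order; only the two players change."""
    ratings = {}
    for winner, loser in games:
        w = ratings.get(winner, (MU0, SIGMA0))
        l = ratings.get(loser, (MU0, SIGMA0))
        ratings[winner], ratings[loser] = update(w, l)
    return ratings

triangle = [("A", "B"), ("B", "C"), ("C", "A")]
ratings = replay(triangle)
for name, (mu, sigma) in sorted(ratings.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: mu={mu:.2f} sigma={sigma:.2f}")
```

Under these assumptions C finishes with the highest mean, followed by B and then A. Replaying the same three results in a different order puts a different player on top, which is exactly the order dependence described above.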

Running TrueSkill Simulations

As mentioned above, I wrote a command-line tool that allows you to run simulations of WarLight ladders with the TrueSkill algorithm.

You can download this tool from this link: WLTrueSkill.exe.

On Windows, this requires the .NET 4.0 runtime to be installed. On Mac or Linux, the tool should work fine under a recent version of Mono.

To use the tool, simply feed it one of the Bayeselo Logs linked from the wiki. For example:

WLTrueSkill < BayeseloLog0.txt

By running the program with an argument of /?, you can see some additional options.

Feedback

I'm considering using TrueSkill for Season II as a trial run. Let me know your thoughts!

14 thoughts on “TrueSkill”

  1. This is interesting. As you said, the biggest downside is “when” you beat an opponent. For that reason, I think that short-term results would be skewed, but with enough games under you the rankings would even out to where they should be. What I would propose, based on this reasoning, is that you could keep the B-ELO for the seasonal ladders, as they have short durations, while changing the permanent ones over to the TrueSkill as I think that over time it would make a more fair estimation of ratings.

  2. Funny how zaeban, Gui, and AceWindu are ranked 1, 2, 3 for TrueSkill, but that’s reversed in the Bayesian chart.

    So by average, Gui is the best warlight player.

    1. Or does that mean that we have a 3-way tie for the best ladder player?

      zaeban – (1+3)/2 = 2
      Gui – (2+2)/2 = 2
      AceWindu – (3+1)/2 = 2

  3. Since ladder games are multi-day, how does that affect the timing under TrueSkill? A single game could theoretically stretch on for weeks, and the player’s ranks could drastically change while that game is ongoing. Does TrueSkill only count their ranks at the moment the game ends? Or the moment it begins? Or some combination over time?

  4. Eagleblast’s idea seems to be the best solution. With the shortness of seasonal ladders, TrueSkill would be heavily skewed. In regular ladders, it feels like TrueSkill would be the best though. Thus we have B-ELO for Seasonals and TS for the Regular ladders.

  5. SebCorps, keep up the good work! A fat Chinese New Year’s red envelope is on the way! (Though you ignore the fact that how one plays on one map in 1v1s is not the sole measure of who is the best, as if other maps/settings, team games, and FFAs don’t matter.)

    Fizzer, why not create your own ranking system, something that is a tweaked basket of ranking systems that best suits WL? Or, maybe make it 50%-50% or 70% B-ELO, 30% TrueSkill (or whatever you think is best)? An average (of some sort) of the two ranking systems would be best I think: get the best of both, reduce the problems either might have.

    1. This makes the most sense to me. I think the trickle down effect of the Bayesian method is important so it rewards you for beating good players regardless of when that victory came. (and on the flip side it ‘punishes’ you for beating on weaker players)

      1. I like this idea, you could just take both rankings for everyone on the ladder, average the two ratings (not rankings) and have that average as the score that rankings are based on.

  6. I think that getting rid of the “ripple effect” is just fine. After all, players can get better over time… why should someone who beat me a week ago when I had a very poor understanding of the game, benefit from my increase in skill?

  7. Trueskill is better, but the ladders have gotten stagnant and boring. Most have stopped caring about it, and many people’s ratings have fallen b/c of lack of interest. Think the ladders need a face-lift and also a reset.

  8. The ripple effect can be crippling when a player (e.g. HHH) who’s generally very good beats you and then goes inactive for a while, getting booted from several games in the process. HHH’s rating then drops a good 500 points, which in turn drags your rating down until your game with him expires (in three months’ time).

    As one person commented, the ladders have become stagnant. TrueSkill will up the competition and render delaying less effective than previously (losing a game against a low ranked player won’t continue to impact you until it expires, but can be compensated for with two or three solid victories against good players).

    It also requires you to focus more on the ladder games, but at the same time doesn’t punish you excessively when you lose against a player whose rating isn’t that great.
