Well, I've been playing around with the rating tool for a while, and I think I've nailed down what the issue is. I believe this tool is tailored toward calculating ratings for a set of players that each have a constant, unchanging playing strength. Why do I think this? First of all, notice the names of the "players" listed in the provided examples (
http://remi.coulom.free.fr/Bayesian-Elo): Comet B.68, Dragon 4.7.5, Gandalf 4.32h, etc. These are all fairly well-known chess engines (essentially AI programs that play chess).
Logically, it would absolutely make sense to retroactively adjust ratings based on the future performance of opponents if the "players" were actually specific versions of chess engines. Why? Because those engines have a constant, unchanging strength level. Say a chess engine plays one game today and another 99 games over the next six months. If we want to calculate the strength of that engine at the time of the first game, every one of the 100 games should be given equal weight, because *the strength of a chess engine does not change over time*! (See the sketch below for what that looks like in practice.)
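To make the equal-weighting point concrete, here is a minimal sketch of maximum-likelihood rating estimation under the standard Elo logistic model. This is my own illustration, not the tool's actual code, and the opponent ratings and results are made-up numbers. The point is that the log-likelihood is a plain sum over games, so reordering or shuffling the 100 games (or spreading them over six months) changes nothing about the fitted rating.

```python
import math

def win_prob(rating, opp_rating):
    """Elo logistic model: expected score against an opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def log_likelihood(rating, games):
    """Sum of per-game log-likelihoods; every game counts equally,
    regardless of when it was played."""
    total = 0.0
    for opp_rating, score in games:  # score: 1 = win, 0 = loss
        p = win_prob(rating, opp_rating)
        total += score * math.log(p) + (1 - score) * math.log(1 - p)
    return total

# Hypothetical record: one game today plus 99 more over six months.
# Shuffling this list leaves the fitted rating unchanged.
games = [(2500, 1)] + [(2450, 1)] * 60 + [(2550, 0)] * 39

# Brute-force the maximum-likelihood rating over a 1-point grid.
best = max(range(2000, 3001), key=lambda r: log_likelihood(r, games))
print(best)
```

For a constant-strength engine this is exactly the right thing to do; there is no reason to discount older games.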
But with a human player, whose strength rises and falls over time, that assumption of course does not hold. I think this is the fundamental flaw in using this method to rate human players.