Club 42 Ratings Experiment - Year 2001
(based on Robert Parker's research on scores and ratings)
by Steve Pellinen
Abstract
Robert Parker and the National SCRABBLE® Association (NSA) Ratings Committee
have suggested that using only wins and losses to rate the relative playing
strength of SCRABBLE® players is insufficient, or, more technically,
inefficient. While wins and losses are easily understood and tabulated, there
is much more information readily available from game results. Specifically,
players' scores can be used, either alone or together with win/loss data to
more accurately rate playing strength. Club 42, in Minneapolis, Minnesota,
USA, recorded all game results for calendar year 2001. A formula, based on
Parker's research, was used to calculate player ratings on a weekly basis.
Score-based ratings were found to be a good measure of relative playing
strength in this particular club context. More research is needed to assess
the viability of score-based ratings in a larger context, such as the North
American tournament circuit currently rated by the NSA. The Club 42 experiment
suggests that this research may be worthwhile, if there is a need to provide a
more stable, more comprehensive and, perhaps, more accurate method to rate
playing strength.
Introduction
Many tournament players in North America are less than happy with the way
relative playing strengths, or ratings, are determined. The dissatisfaction
ranges from mild to extreme, with various underlying reasons. Some don't like
the large fluctuations that occur. Some perceive ratings to be deflating, so
that meaningful comparisons over time can't be made. Some question the
accuracy of the ratings, critical to fair placement in tournament divisions.
Some question the validity of using ratings, with their perceived problems, as
fair qualifying criteria for events such as the World Scrabble Championships.
With ratings playing such a prominent role in the life of the tournament
community, it seems imperative to use the best possible rating system. The
current NSA rating calculation is a simple win/loss probability model, based on
the rating differences between players. This rating calculation uses only
win/loss information from NSA sanctioned tournaments. It is not uncommon for a
player's rating to fluctuate by as much as 100-200 points, or more, over a span
of just a few tournaments. It is not reasonable to assume that a player's true
playing strength changes as much, or as rapidly, as indicated by such rating
fluctuations. If ratings are a measure of playing strength, a large, rapid
change in rating should occur only for players whose skills have changed
significantly. Either they have significantly added to their word
knowledge and/or strategic abilities, or they have suffered some type of loss
of mental capacity.
Robert Parker, together with the NSA Ratings Committee, published research in
1998 that discussed the use of tournament game scores to rate players. This
research is archived by the NSA Ratings Committee. Parker's research was not
finished, and a rating formula based on scores was not fully developed.
However, enough work was done to suggest a direction for score-based ratings.
The formula used for Club 42 is based on this research. It consists of four
components: scoring offense, scoring defense, strength of opposition, and
winning percentage weighted by strength of opposition.
Caveats
Parker's research and its potential applications anticipated tournament
contexts. Club play, as it occurred in Club 42 in 2001, differs from
tournament play in what may be significant ways. Tournaments are usually 6-18
or more games in length, with player rating adjustments calculated after an
entire event. Most players play fewer than one tournament per month. Club 42
sessions, for purposes of rating calculations, are like weekly
mini-tournaments, with ratings calculated after each session of three or four
games. Hence rating adjustments occurred more frequently, and were based on
fewer games, than would occur for tournaments.
In tournaments, players are typically segregated into relatively narrow rating
groups, or divisions. Players only play other players within their division.
Club 42 players can, and do, play opponents of all available ratings. Three to
four games are played in a typical session. Round 1 is paired randomly. Round
2 pairs players rated next to each other, based on year-to-date ratings.
Rounds 3 and 4 are paired king-of-hill (adjusted for no repeats) based on
results after previous rounds. In practice, since three of the four rounds use
performance/rating based pairings, some segregation occurs over time. Over the
course of a year, players play many more games against their relative peers
than they do against much higher or lower rated players.
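The session pairing described above could be sketched roughly as follows. This is a minimal Python illustration, not the club's actual procedure: the function names, the greedy "nearest player not yet faced" rule used to avoid repeats, and the handling of byes are all assumptions made for the example.

```python
import random

def pair_in_order(order, played):
    """Pair each player with the nearest lower-ranked player not yet faced.

    `order` is best-first (by rating for round 2, by standings for a
    king-of-hill round). `played` is a set of frozenset({a, b}) holding
    earlier pairings. Returns (pairs, byes).
    """
    unpaired = list(order)
    pairs, byes = [], []
    while unpaired:
        top = unpaired.pop(0)
        for i, other in enumerate(unpaired):
            if frozenset((top, other)) not in played:
                pairs.append((top, unpaired.pop(i)))
                break
        else:
            # Nobody is left whom `top` has not already played (assumed bye).
            byes.append(top)
    return pairs, byes

def pair_random(players, rng=None):
    """Round 1: shuffle the field, then pair neighbours in the shuffled order."""
    if rng is None:
        rng = random.Random()
    order = list(players)
    rng.shuffle(order)
    return pair_in_order(order, played=set())
```

A greedy pass like this is simple but can leave more byes than strictly necessary in small fields; a real pairing program would likely backtrack.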
Unlike chess, which rates players by a method similar to the current NSA
approach, SCRABBLE® possesses a significant luck factor. Over a large enough
number of games, it is presumed that the luck factor, and its various
contributing components, will even out. It is not clear how many games are
needed to render the luck factor insignificant for purposes of rating
calculations. Indeed, current NSA ratings are not attenuated, so the luck
factor plays a significant role in each player's current rating, with resulting
large swings in ratings for most players. For Club 42 ratings, all games were
included. If score-based ratings for Club 42 are carried over in subsequent
years, ratings will be based on the most recently played games, with the number
of games yet to be determined, likely in the range of the most recent 50-200
games.
Parker concluded that there was "no simple combination of score-based and
win/loss-based rating systems that will stand the mathematical test of
fairness." However, he acknowledged that there may be some such combination
that players would find acceptable. Parker suggested adding points to winners'
scores, either a constant number or a number proportional to the opponent's
ability. This could be applied consistently in a tournament context, where
everyone plays the same number of games. In club play, some players don't stay
for all games, or receive an occasional bye on a given night, and hence play
different numbers of games per session. The Club 42 formula rewarded wins by
an amount proportional to a player's overall club winning percentage relative
to the quality of their opposition. This may overemphasize the win/loss
component.
Parker also considered the need to limit the effect of blowout games, possibly
trimming such games to a maximum 200-point differential. The Club 42 formula
makes no provision for this. It was felt that, over a large number of games,
the effect of blowouts would either even out, or not be significant.
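For illustration only, a 200-point trim of the kind Parker considered might look like the sketch below. The report does not say how the scores themselves would be adjusted, so moving both scores toward each other by equal amounts is an assumption, as is the function name. Club 42 did not apply any such trim.

```python
def trim_blowout(score_a, score_b, cap=200):
    """Return the two scores with their differential limited to `cap` points.

    Assumption: the excess spread is split evenly, lowering the winner's
    score and raising the loser's score by the same amount.
    """
    spread = abs(score_a - score_b)
    if spread <= cap:
        return score_a, score_b  # not a blowout, leave untouched
    shift = (spread - cap) / 2
    if score_a > score_b:
        return score_a - shift, score_b + shift
    return score_a + shift, score_b - shift
```

Under this reading, a 520-250 game (a 270-point spread) would be entered as 485-285.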
The size and nature of the data sample and population of players present some
problems. Only 37 players played the arbitrarily chosen minimum of 20 games to be
included in the analyses. The number of games played per player ranged from 20
to 194. The number of games between particular players ranged from zero to
more than 10. Familiarity between players, developed over years of
interaction, may lead to atypical playing approaches based on known tendencies,
strengths, weaknesses, etc. The club context, while competitive, is more
relaxed and less formal than tournament play. Some players may not play with
the same intensity or interest as they would in a tournament context.
The discussion in this report is more qualitative than quantitative.
Statistical analyses supporting or challenging the results and conclusions are
beyond the scope of this report, and are left as an exercise for those who are
interested. Data can be provided for those who are so inclined.
With these caveats, the following results and conclusions are offered.
The Rating Formula
Refer to Parker's work for support and elaboration. The score-based ratings
formula used for Club 42 was
Rating = O + D + S + P
where
O = Offense, the player's average score
D = Defense, the amount by which opponents are kept below their average score
S = Strength of opponents, equal to 2.2 x opponents' average score
P = Power of wins, equal to player's winning percentage x opponents' average score
Offense and Defense are straightforward calculations. The adjustment for
Strength of opponents is based on Parker's preliminary research, and may need
tweaking to determine the best multiplier. Instead of adding points to scores
for wins, as Parker suggested, Club 42 added points based on winning percentage
relative to opponents' scoring average.
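The formula can be sketched in Python as shown below. This is one possible reading of the definitions above rather than the spreadsheet actually used. The assumptions are: "opponents' average score" means the mean of each opponent's overall scoring average, taken over the games played; D is the mean of (opponent's overall average minus what that opponent scored in the game); winning percentage is expressed as a fraction between 0 and 1; and a tie counts as half a win. The function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

STRENGTH_MULTIPLIER = 2.2  # multiplier for S given in the report

def club_ratings(games):
    """games: iterable of (player_a, score_a, player_b, score_b) tuples."""
    scored = defaultdict(list)   # points each player scored, one entry per game
    results = defaultdict(list)  # (opponent, points allowed, win credit) per game
    for a, sa, b, sb in games:
        scored[a].append(sa)
        scored[b].append(sb)
        credit_a = 1.0 if sa > sb else 0.5 if sa == sb else 0.0  # tie = half win (assumed)
        results[a].append((b, sb, credit_a))
        results[b].append((a, sa, 1.0 - credit_a))

    average = {p: mean(s) for p, s in scored.items()}  # O for every player

    ratings = {}
    for p, rows in results.items():
        opp_avg = mean(average[opp] for opp, _, _ in rows)
        offense = average[p]
        # Positive when opponents score below their own usual average.
        defense = mean(average[opp] - allowed for opp, allowed, _ in rows)
        strength = STRENGTH_MULTIPLIER * opp_avg
        power = mean(credit for _, _, credit in rows) * opp_avg
        ratings[p] = {"O": offense, "D": defense, "S": strength, "P": power,
                      "rating": offense + defense + strength + power}
    return ratings
```

Calling `club_ratings(games)` after each session, with `games` holding every result to date, would mirror the weekly recalculation described in this report.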
Results
Each player's rating was calculated after each club session. A spreadsheet was
used to facilitate the calculations and maintain the ratings. Results were
presented in tabular and graphical chart formats, offering a variety of ways to
view the data. Right-click (control-click on Mac) here and choose "Save As" to
download the raw data in Excel spreadsheet format.
- Table 1 shows the ratings for all players who played at least one club game in
2001. These ratings are based on all games played. It is interesting to see
the range and volatility of ratings for players with relatively few games, and
how that volatility dramatically decreases as more games are played. This is
especially evident in the raw data, where game-by-game rating changes are
shown.
- Table 2 shows Club 42 players (minimum 20 games played) segregated into two
groups, A and B, divided by the median overall club rating. These groups are
used for subsequent analyses showing the effects of segregation (as is more
typical in tournament contexts).
Note the contribution of each component (O, D, S, P) to the overall rating.
The relative contribution of each component is more fairly assessed by the
range, rather than the numerical value, of each component. Thus, O is bounded
by a low value of 326 and a high value of 411, yielding a range of 85 points.
D is bounded by a low value of -49 and a high value of 32, yielding a range of
81 points. O+D, the total scoring component, is bounded by a low value of 280
and a high value of 440, yielding a range of 160 points. S is bounded by a low
value of 735 and a high value of 885, yielding a range of 150 points. P is
bounded by a low value of 74 and a high value of 287, yielding a range of 213
points.
- Table 3 shows adjusted ratings for Groups A and B, calculated from games played
only between members of each group. That is, when Group A or B players only
play other Group A or B players, respectively, what effect does it have on
their ratings?
- Table 4 shows the movement, up or down, for each player, when their rating is
calculated only from games played within their group, compared to their overall
rating.
- Table 5 shows overall club rating, adjusted group rating and average NSA rating
for 2001, with relative rankings. Average NSA rating is an unweighted average
of all ratings obtained in calendar year 2001.
- Table 6 shows players' relative ranking for each component (O, D, S, P).
Parker did not seem convinced that factoring in win/loss data was necessary, or
even a good idea. If the calculation is just as accurate, or more accurate,
without win/loss data, then it would seem a good idea to eliminate P and
simplify the calculation. However, in Club 42 play, P also factors in
opponents' strength, so it's not strictly win/loss data but contains some score
information within its value, perhaps making it redundant.
- Table 7 shows the offsetting effects of scores and strength of opposition when
play is segregated. These effects are perhaps intuitively obvious, but it is
useful to see what happens to rating components when play is limited to
narrower rating distributions. Parker's suggested strength of opposition
factor of 2.2 (applied to opponent average score) seems like a good starting
point, but needs more study across many different rating fields.
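The grouping behind Tables 2 through 4 can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the game tuples are assumed to be (player_a, score_a, player_b, score_b), and a player sitting exactly on the median is assumed to go into Group A, which the report does not specify.

```python
from statistics import median

def split_by_median(ratings):
    """ratings: dict of player -> overall club rating. Returns (group_a, group_b)."""
    cut = median(ratings.values())
    group_a = {p for p, r in ratings.items() if r >= cut}  # at or above the median (assumed)
    group_b = set(ratings) - group_a
    return group_a, group_b

def games_within(games, group):
    """Keep only the games in which both players belong to `group`."""
    return [g for g in games if g[0] in group and g[2] in group]

def movement(overall, within_group):
    """Rating change when only within-group games are counted (as in Table 4)."""
    return {p: within_group[p] - overall[p] for p in within_group}
```

The filtered game lists would then be fed back through whatever rating calculation produced the overall ratings, once for each group.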
Conclusions
Rigorous testing and statistical analysis for a very large number of games,
under tournament conditions, is needed before any valid conclusions can be
made. The Club 42 experiment, while gathering and presenting some data, is
very limited in scope and application. It merely attempts to show, in a
particular club context, whether score-based ratings offer potential for
improving the assessment of players' abilities. Most of the conclusions herein
are observational, even anecdotal, and should not be represented as more than
that.
Generally, the Club 42 ratings appear to order the players fairly and
accurately. As with most normal distributions, distinctions are clearer at the
upper and lower ends while more inconsistencies show up in the middle. There
are a few aberrant exceptions, which may relate to the paucity of data, or
possibly the inherent "ingrown" effects of playing against very familiar
opponents. Additional statistical analyses may help in explaining these data.
The relative contribution of each of the rating formula components to the final
ratings distribution can be questioned. Offense and Defense contribute about
the same (160 point range) as Strength of opponents (150 point range). Power
of Wins makes a heavier contribution (213 point range). This may not lead to
the most accurate set of ratings in all situations, although it represented
Club 42 quite well. Applying a different factor for wins should be considered,
or perhaps it should be dropped altogether.
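The range comparison is simple arithmetic on the low and high values reported for Table 2, and can be checked with a few lines of Python. Expressing each range as a share of the summed ranges is only a rough heuristic added here for illustration; it is not a measure used in the study.

```python
# (low, high) values for each component as reported for Table 2.
bounds = {
    "O": (326, 411),
    "D": (-49, 32),
    "O+D": (280, 440),
    "S": (735, 885),
    "P": (74, 287),
}

# Range of each component: high minus low.
ranges = {name: high - low for name, (low, high) in bounds.items()}
# ranges == {"O": 85, "D": 81, "O+D": 160, "S": 150, "P": 213}

# Rough share of the total spread for the three top-level pieces.
top_level = ("O+D", "S", "P")
total = sum(ranges[name] for name in top_level)
shares = {name: round(ranges[name] / total, 2) for name in top_level}
# shares == {"O+D": 0.31, "S": 0.29, "P": 0.41}
```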
The main reason for including a win/loss component seems to be psychological
rather than mathematical. That is, players are accustomed to seeing their wins
and losses affect their rating, and a rating system that does not incorporate
win/loss data may be seen as too different, or somehow not correct. The object
of the game is to win, and wins and losses determine placement in most
tournament competitions. At first, it may seem counterintuitive to experience
a slight rating decrease when winning, or a slight increase when losing. But
such occasions can be seen as correct when the scores and strength of opponents
are considered.
Based on comparisons with NSA ratings, the order of playing strength shown by
Club 42 ratings correlates fairly well. Based on historical club data,
including previous years, the players also appear to be ordered fairly well. A
more subjective assessment, the opinions of the players themselves, also
supports the relative accuracy of the ratings. That is, most players thought
that the ratings gave a good indication of where they stood in the club.
One of the goals of any rating system should be that it doesn't unfairly reward
or punish any player simply by virtue of who they play. That is, there should
be no rating incentive for high-rated players to play low-rated players, or
vice versa. The current NSA rating calculation has been shown to slightly
understate the probability of a lower-rated player beating a higher-rated
player. This has led to the unfortunate perception, perhaps rightly so, that
it is not in the best interests of higher-rated players to play lower-rated
players. This makes it more difficult for lower-rated players to increase
their ratings, as some higher-rated players choose to play only in tournaments
they perceive to be potentially beneficial to their ratings. Another primary
motivation of many players, to seek out the best possible competition
(regardless of ratings), is also thwarted by this seeming inequity.
In this study, it's not clear whether rating incentives existed. Generally,
divisional play hurt the ratings of higher division players while helping the
lower division ratings, but there were exceptions. Some possible reasons for
this include an improper strength of opposition factor, improper weighting of
wins, or the effects of playing familiar opponents with known tendencies.
More study needs to be done on what happens when players play primarily within
relatively narrow rating divisions. While this study looked at the effects of
dividing players into two groups, most tournaments divide players into three,
four or more rating divisions. When players play nearly all their games
against similarly rated peers, are the results as clear as they seem to be when playing
against a wider range? Much of Parker's research data did address this, but
more is needed.
The reality is that the luck factor can either exacerbate or hide the inherent
disincentive in the NSA rating calculation, rendering it difficult to see.
Thus, many players have had the opposite experiences of going up in rating when
playing a weak field and going down in rating when playing a strong field.
Nevertheless, it is the persistent perception of many players that there is
something inherently wrong with the existing calculation. Any improvements to
the rating calculation should attempt to minimize or eliminate ratings
incentives within the calculation. This would allow players to choose their
tournaments for reasons other than effect on ratings, i.e., the potential for
ratings gain/loss should be the same at any tournament.
However, a case could be made to incorporate a positive rating incentive into the
calculation. That is, if it is slightly advantageous for higher-rated players
to play lower-rated players, it would encourage wider rating ranges within
tournament divisions. Presumably, this would lead to more rapid improvement of
some lower-rated players who would get more opportunities to play better
players. This situation may be more appropriate for club play than tournament
play. A rating calculation that could be tweaked by adjusting one or two
parameters so that it is optimized for either club or tournament conditions may
be possible.
Finally, for any score-based rating system based on Parker's research, a
decision must be made regarding the number of games to include in a current
rating, and whether more recent games should be weighted more heavily. Ratings
volatility decreases with a higher number of included games, but at some point
the opportunity to see significant gain or loss in rating is lost. The
balancing point would appear to be somewhere between 50 and 200 games, and
could depend on whether a weighting factor is included.
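A minimal sketch of the two choices discussed here, a window of recent games and an optional recency weighting, is given below. The function names, the default window of 100 games, and the geometric `decay` weighting are all assumptions chosen for illustration; the study did not settle on any of them.

```python
def recent_games(games, max_games=100):
    """Return the most recent `max_games` entries (games are oldest-first)."""
    if max_games <= 0:
        raise ValueError("max_games must be positive")
    return games[-max_games:]

def weighted_average(values, decay=1.0):
    """Average of `values` (oldest-first) with geometric recency weights.

    The newest value gets weight 1, the one before it `decay`, then
    decay**2, and so on. decay=1.0 gives the plain unweighted average.
    """
    if not values:
        raise ValueError("values must not be empty")
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

For example, `weighted_average(recent_games(scores, 100), decay=0.99)` would give a scoring average over the last 100 games that leans slightly toward the newest results.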
The current NSA rating calculation puts an unweighted 10, 16 or 20 rating
points at risk in each game, depending on one's rating. In the Club 42 study,
the rating points at risk varied with the number of total games played. After
20 games, up to 30 rating points were at risk. After 50, 100 and 150 games, up
to 10, 6 and 4 points, respectively, were at risk for extreme events
(blowouts), with swings of 4, 2 and 1 points, respectively, more typical. Due to the
nature of the calculation, many games produce no change in rating, especially
after a large number of included games.
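The shrinking effect of a single game follows from how a running average behaves: adding one new value to an average of n values moves it by (new value - old average) / (n + 1). The sketch below shows this for a single averaged quantity only. It does not reproduce the point figures quoted above, which combine all four components; the function name is hypothetical.

```python
def mean_shift(old_mean, n_games, new_value):
    """Change in a running average when one more value is added to n_games values."""
    return (new_value - old_mean) / (n_games + 1)

# Effect of one game scored 100 points above a 350 average,
# after different numbers of games already counted.
for n in (20, 50, 100, 150):
    print(n, round(mean_shift(350, n, 450), 2))
```

The shift falls from just under 5 points after 20 games to well under 1 point after 150 games, which is consistent with the report's observation that many games produce no visible change once a large number of games is included.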
There is concern that if scores are incorporated into the rating calculation,
players may change their style of play. That is, they may play to maximize
score instead of playing to win. In the Club 42 experiment, it is difficult to
know if that happened, but it wasn't apparent. Anecdotal evidence indicates
that it happened very little, especially once players saw the rating
calculation in action week after week and realized there was very little to be
gained by focusing on scores instead of winning.
For one thing, wins are included in the calculation. Perhaps equally
important, since each club session is run like a small tournament, winning is
still the primary goal. Also, once players understand that defense is as
important as offense, there is little reason to change one's style of play.
Finally, after a sufficiently large number of games are included in the rating
calculation, players see that the effect of one or two games is very little,
and the difference between winning by a little and winning by a lot is even
less.
The author of this study performed the rating calculations week after week and
thus understood and saw the specific effects of every type of game. Knowing
that, the author could not conceive of any particular changes in his playing
style that would significantly enhance his rating. Playing to win was still
the primary goal, which presumably also would be true for NSA tournament play
in a score-based rating system. If one's rating is used to qualify for a
particular event, e.g., the World Championship, it is conceivable that someone
might try to alter their play in a way to enhance their rating. It remains to
be seen if that is possible or significant, but again, more research can
address the issue.
More research will clarify and solidify the value, or lack thereof, of using
scores to rate players. The Club 42 experience was largely positive, which
indicates that for many players, using scores to calculate ratings may be
acceptable.
Send your feedback to Steve Pellinen at pellinet@aol.com