Club 42 Ratings Experiment - Year 2001

  (based on Robert Parker's research on scores and ratings)
  by Steve Pellinen

Abstract

Robert Parker and the National SCRABBLE® Association (NSA) Ratings Committee have suggested that using only wins and losses to rate the relative playing strength of SCRABBLE® players is insufficient, or, more technically, inefficient. While wins and losses are easily understood and tabulated, there is much more information readily available from game results. Specifically, players' scores can be used, either alone or together with win/loss data, to rate playing strength more accurately. Club 42, in Minneapolis, Minnesota, USA, recorded all game results for calendar year 2001. A formula, based on Parker's research, was used to calculate player ratings on a weekly basis. Score-based ratings were found to be a good measure of relative playing strength in this particular club context. More research is needed to assess the viability of score-based ratings in a larger context, such as the North American tournament circuit currently rated by the NSA. The Club 42 experiment suggests that this research may be worthwhile, if there is a need to provide a more stable, more comprehensive and, perhaps, more accurate method to rate playing strength.

Introduction

Many tournament players in North America are less than happy with the way relative playing strengths, or ratings, are determined. The dissatisfaction ranges from mild to extreme, with various underlying reasons. Some don't like the large fluctuations that occur. Some perceive ratings to be deflating, so that meaningful comparisons over time can't be made. Some question the accuracy of the ratings, critical to fair placement in tournament divisions. Some question the validity of using ratings, with their perceived problems, as fair qualifying criteria for events such as the World Scrabble Championships.

With ratings playing such a prominent role in the life of the tournament community, it seems imperative to use the best possible rating system. The current NSA rating calculation treats each game as a simple win/loss probability event, based on the rating difference between the players. This rating calculation uses only win/loss information from NSA sanctioned tournaments. It is not uncommon for a player's rating to fluctuate by as much as 100-200 points, or more, over a span of just a few tournaments. It is not reasonable to assume that a player's true playing strength changes as much, or as rapidly, as indicated by such rating fluctuations. If ratings are a measure of playing strength, a large, rapid change in rating should occur only for players who have experienced a significant change in their skills. Either they have significantly added to their word knowledge and/or strategic abilities, or they have suffered some type of loss of mental capacity.

Robert Parker, together with the NSA Ratings Committee, published research in 1998 that discussed the use of tournament game scores to rate players. This research is archived by the NSA Ratings Committee. Parker's research was not finished, and a rating formula based on scores was not fully developed. However, enough work was done to suggest a direction for score-based ratings. The formula used for Club 42 is based on this research. It consists of four components: scoring offense, scoring defense, strength of opposition, and winning percentage weighted by strength of opposition.

Caveats

Parker's research and its potential applications anticipated tournament contexts. Club play, as it occurred in Club 42 in 2001, differs from tournament play in what may be significant ways. Tournaments are usually 6-18 or more games in length, with player rating adjustments calculated after an entire event. Most players play fewer than one tournament per month. Club 42 sessions, for purposes of rating calculations, are like weekly mini-tournaments, with ratings calculated after each session of three or four games. Hence rating adjustments occurred more frequently, and were based on fewer games, than would occur for tournaments.

In tournaments, players are typically segregated into relatively narrow rating groups, or divisions. Players only play other players within their division. Club 42 players can, and do, play opponents of all available ratings. Three to four games are played in a typical session. Round 1 is paired randomly. Round 2 pairs players rated next to each other, based on year-to-date ratings. Rounds 3 and 4 are paired king-of-hill (adjusted for no repeats) based on results after previous rounds. In practice, since three of the four rounds use performance/rating based pairings, some segregation occurs over time. Over the course of a year, players play many more games against their relative peers than they do against much higher or lower rated players.
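The three pairing rules described above can be sketched in Python as follows. This is an illustrative sketch only, not the procedure the club actually ran: the function names are hypothetical, an odd player out is assumed to receive a bye, and the "adjusted for no repeats" rule is approximated by a simple greedy search that falls back to a repeat pairing when no fresh opponent remains.

```python
import random


def pair_adjacent(ranked):
    """Pair 1st with 2nd, 3rd with 4th, and so on (Round 2 style).

    `ranked` is a list of players ordered best first.
    An odd player out is paired with None, meaning a bye (assumption).
    """
    pairs = [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]
    if len(ranked) % 2:
        pairs.append((ranked[-1], None))
    return pairs


def pair_random(players, rng=random):
    """Shuffle the players and pair them off (Round 1 style)."""
    shuffled = list(players)
    rng.shuffle(shuffled)
    return pair_adjacent(shuffled)


def pair_king_of_hill(ranked, already_played):
    """King-of-hill pairing that avoids repeats where it can (Rounds 3 and 4).

    `ranked` is ordered by results so far, best first.
    `already_played` is a set of frozenset({a, b}) for pairs that have met.
    Greedy simplification: each top unpaired player meets the highest-ranked
    unpaired opponent not yet faced; if none exists, a repeat is allowed.
    """
    remaining = list(ranked)
    pairs = []
    while len(remaining) > 1:
        top = remaining.pop(0)
        idx = next(
            (i for i, p in enumerate(remaining)
             if frozenset((top, p)) not in already_played),
            0,  # fall back to the next player even if it is a repeat
        )
        pairs.append((top, remaining.pop(idx)))
    if remaining:
        pairs.append((remaining[0], None))  # bye
    return pairs
```

For example, with four players ranked A, B, C, D, where A and B have already met, the king-of-hill step pairs A with C and B with D.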

Unlike chess, which rates players by a method similar to the current NSA approach, SCRABBLE® possesses a significant luck factor. Over a large enough number of games, it is presumed that the luck factor, and its various contributing components, will even out. It is not clear how many games are needed to render the luck factor insignificant for purposes of rating calculations. Indeed, current NSA ratings are not attenuated, so the luck factor plays a significant role in each player's current rating, with resulting large swings in ratings for most players. For Club 42 ratings, all games were included. If score-based ratings for Club 42 are carried over in subsequent years, ratings will be based on the most recently played games, with the number of games yet to be determined, likely in the range of the most recent 50-200 games.

Parker concluded that there was "no simple combination of score-based and win/loss-based rating systems that will stand the mathematical test of fairness." However, he acknowledged that there may be some such combination that players would find acceptable. Parker suggested adding points to winners' scores, either a constant number or a number proportional to the opponent's ability. This could be applied consistently in a tournament context, where everyone plays the same number of games. In club play, players may leave before all games are played or may receive an occasional bye on a given night, and hence play different numbers of games per session. The Club 42 formula rewarded wins by an amount proportional to a player's overall club winning percentage relative to the quality of their opposition. This may overemphasize the win/loss component.

Parker also considered the need to limit the effect of blowout games, possibly trimming such games to a maximum 200-point differential. The Club 42 formula makes no provision for this. It was felt that, over a large number of games, the effect of blowouts would either even out, or not be significant.

The size and nature of the data sample and population of players present some problems. Only 37 players played the arbitrarily chosen minimum 20 games to be included in the analyses. The number of games played per player ranged from 20 to 194. The number of games between particular players ranged from zero to more than 10. Familiarity between players, developed over years of interaction, may lead to atypical playing approaches based on known tendencies, strengths, weaknesses, etc. The club context, while competitive, is more relaxed and less formal than tournament play. Some players may not play with the same intensity or interest as they would in a tournament context.

The discussion in this report is more qualitative than quantitative. Statistical analyses supporting or challenging the results and conclusions are beyond the scope of this report, and are left as an exercise for those who are interested. Data can be provided for those who are so inclined.

With these caveats, the following results and conclusions are offered.

The Rating Formula

Refer to Parker's work for support and elaboration. The score-based ratings formula used for Club 42 was

    Rating = O + D + S + P

where

    O = Offense, the player's average score
    D = Defense, the amount by which opponents are kept below their average score
    S = Strength of opponents, equal to 2.2 x opponents' average score
    P = Power of wins, equal to player's winning percentage x opponents' average score

Offense and Defense are straightforward calculations. The adjustment for Strength of opponents is based on Parker's preliminary research, and may need tweaking to determine the best multiplier. Instead of adding points to scores for wins, as Parker suggested, Club 42 added points based on winning percentage relative to opponents' scoring average.
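As a rough illustration of how the four components combine, the following Python sketch computes a rating from one player's game records. It rests on assumptions the report does not spell out: "opponents' average score" is taken to be the mean, over the player's games, of each opponent's own overall scoring average; Defense is that figure minus the average score opponents actually made against the player; and a tie counts as half a win. The names `Game` and `club42_rating` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Game:
    own_score: int    # points scored by the player being rated
    opp_score: int    # points scored by the opponent in this game
    opp_avg: float    # opponent's own overall average score (assumed input)


def club42_rating(games, strength_multiplier=2.2):
    """Return the O, D, S, P components and their sum for one player.

    A sketch of Rating = O + D + S + P under the assumptions stated
    in the lead-in; it is not the author's spreadsheet.
    """
    n = len(games)
    if n == 0:
        raise ValueError("at least one game is required")

    offense = sum(g.own_score for g in games) / n          # O
    opp_avg = sum(g.opp_avg for g in games) / n            # opponents' average score
    allowed = sum(g.opp_score for g in games) / n          # what opponents scored here
    defense = opp_avg - allowed                            # D
    strength = strength_multiplier * opp_avg               # S

    # Ties counted as half a win (assumption).
    wins = sum(
        1.0 if g.own_score > g.opp_score
        else 0.5 if g.own_score == g.opp_score
        else 0.0
        for g in games
    )
    power = (wins / n) * opp_avg                           # P

    return {
        "O": offense,
        "D": defense,
        "S": strength,
        "P": power,
        "rating": offense + defense + strength + power,
    }


# Example: one win and one loss against opponents averaging 360 and 370.
example = club42_rating([Game(400, 350, 360.0), Game(340, 380, 370.0)])
# O = 370, D = 0, S = 803, P = 182.5, so the rating is about 1355.5
```

Keeping the multiplier as a parameter makes it easy to try values other than 2.2, which the report itself flags as possibly needing adjustment.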

Results

Each player's rating was calculated after each club session. A spreadsheet was used to facilitate the calculations and maintain the ratings. Results were presented in tabular and graphical chart formats, offering a variety of ways to view the data. The raw data were also made available for download in Excel spreadsheet format.

Conclusions

Rigorous testing and statistical analysis of a very large number of games, under tournament conditions, are needed before any valid conclusions can be drawn. The Club 42 experiment, while gathering and presenting some data, is very limited in scope and application. It merely attempts to show, in a particular club context, whether score-based ratings offer potential for improving the assessment of players' abilities. Most of the conclusions herein are observational, even anecdotal, and should not be represented as more than that.

Generally, the Club 42 ratings appear to order the players fairly and accurately. As with most normal distributions, distinctions are clearer at the upper and lower ends while more inconsistencies show up in the middle. There are a few aberrant exceptions, which may relate to the paucity of data, or possibly the inherent "ingrown" effects of playing against very familiar opponents. Additional statistical analyses may help in explaining these data. The relative contribution of each of the rating formula components to the final ratings distribution can be questioned. Offense and Defense contribute about the same (160 point range) as Strength of opponents (150 point range). Power of Wins makes a heavier contribution (213 point range). This may not lead to the most accurate set of ratings in all situations, although it represented Club 42 quite well. Applying a different factor for wins should be considered, or perhaps it should be dropped altogether.

The main reason for including a win/loss component seems to be psychological rather than mathematical. That is, players are accustomed to seeing their wins and losses affect their rating, and a rating system that does not incorporate win/loss data may be seen as too different, or somehow not correct. The object of the game is to win, and wins and losses determine placement in most tournament competitions. At first, it may seem counterintuitive to experience a slight rating decrease when winning, or a slight increase when losing. But such occasions can be seen as correct when the scores and strength of opponents are considered.

The order of playing strength shown by Club 42 ratings correlates fairly well with the order given by NSA ratings. Based on historical club data, including previous years, the players also appear to be ordered fairly well. A more subjective assessment, the opinions of the players themselves, also supports the relative accuracy of the ratings. That is, most players thought that the ratings gave a good indication of where they stood in the club.

One of the goals of any rating system should be that it doesn't unfairly reward or punish any player simply by virtue of who they play. That is, there should be no rating incentive for high-rated players to play low-rated players, or vice versa. The current NSA rating calculation has been shown to slightly understate the probability of a lower-rated player beating a higher-rated player. This has led to the unfortunate perception, perhaps rightly so, that it is not in the best interests of higher-rated players to play lower-rated players. This makes it more difficult for lower-rated players to increase their ratings, as some higher-rated players choose to play only in tournaments they perceive to be potentially beneficial to their ratings. Another primary motivation of many players, to seek out the best possible competition (regardless of ratings), is also thwarted by this seeming inequity.

In this study, it's not clear whether rating incentives existed. Generally, divisional play hurt the ratings of higher division players while helping the lower division ratings, but there were exceptions. Possible reasons include an improper strength-of-opposition factor, improper weighting of wins, or the effects of playing familiar opponents with known tendencies.

More study needs to be done on what happens when players play primarily within relatively narrow rating divisions. While this study looked at the effects of dividing players into two groups, most tournaments divide players into three, four or more rating divisions. When players play nearly all their games against rated peers, are the results as clear as they seem to be when playing against a wider range? Much of Parker's research data did address this, but more is needed.

The reality is that the luck factor can either exacerbate or hide the inherent disincentive in the NSA rating calculation, rendering it difficult to see. Thus, many players have had the opposite experiences of going up in rating when playing a weak field and going down in rating when playing a strong field. Nevertheless, it is the persistent perception of many players that there is something inherently wrong with the existing calculation. Any improvements to the rating calculation should attempt to minimize or eliminate ratings incentives within the calculation. This would allow players to choose their tournaments for reasons other than effect on ratings, i.e., the potential for ratings gain/loss should be the same at any tournament.

However, a case could be made to incorporate a positive rating incentive into the calculation. That is, if it is slightly advantageous for higher-rated players to play lower-rated players, it would encourage wider rating ranges within tournament divisions. Presumably, this would lead to more rapid improvement of some lower-rated players who would get more opportunities to play better players. This situation may be more appropriate for club play than tournament play. A rating calculation that could be tweaked by adjusting one or two parameters so that it is optimized for either club or tournament conditions may be possible.

Finally, for any score-based rating system based on Parker's research, a decision must be made regarding the number of games to include in a current rating, and whether more recent games should be weighted more heavily. Ratings volatility decreases with a higher number of included games, but at some point the opportunity to see significant gain or loss in rating is lost. The balancing point would appear to be somewhere between 50 and 200 games, and could depend on whether a weighting factor is included.
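The two choices named here, how many games to include and whether to weight recent games more heavily, can be expressed as two parameters of a single averaging function. The Python sketch below is one hypothetical way to do so, using a geometric weight per game of age; the function name, the default window of 100 games, and the weighting scheme are assumptions for illustration, not something the Club 42 study used.

```python
def windowed_average(values, window=100, decay=1.0):
    """Average the most recent `window` values, oldest first in `values`.

    decay = 1.0 gives a plain average over the window.
    decay < 1.0 gives the newest game weight 1, the one before it
    weight `decay`, the one before that `decay` squared, and so on.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    recent = values[-window:]
    if not recent:
        raise ValueError("no games to average")

    # Age 0 is the newest game, which sits last in the list.
    weights = [decay ** age for age in range(len(recent) - 1, -1, -1)]
    return sum(w * v for w, v in zip(weights, recent)) / sum(weights)
```

Applied to a player's scores, this would yield a windowed Offense figure; the same function could be applied to the other averaged quantities. A larger window or a decay closer to 1.0 gives steadier ratings, while a smaller window or lower decay lets recent form show through more quickly.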

The current NSA rating calculation puts an unweighted 10, 16 or 20 rating points at risk in each game, depending on one's rating. In the Club 42 study, the rating points at risk varied with the number of total games played. After 20 games, up to 30 rating points were at risk. After 50, 100 and 150 games, up to 10, 6 and 4 points, respectively, were at risk for extreme events (blowouts), with point swings of 4, 2 and 1 point more typical. Due to the nature of the calculation, many games produce no change in rating, especially after a large number of included games.
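The shrinking swings follow from the arithmetic of averages. Assuming each averaged quantity in the rating is a plain, unweighted mean over all n games played so far (all games were included in this study), adding game n with value x_n changes the mean as shown below. The symbols are introduced here for illustration only.

```latex
\bar{x}_n \;=\; \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n}
        \;=\; \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
```

So a result 100 points away from a player's average moves that average by 5 points after 20 games, 2 points after 50 games, and 1 point after 100 games. The rating sums several such averages, so the total swing from one game is larger than any single term, but it shrinks at roughly the same 1/n rate, which is broadly in line with the figures reported above.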

There is concern that if scores are incorporated into the rating calculation, players may change their style of play. That is, they may play to maximize score instead of playing to win. In the Club 42 experiment, it is difficult to know if that happened, but it wasn't apparent. Anecdotal evidence indicates that it happened very little, especially once players saw the rating calculation in action week after week and realized there was very little to be gained by focusing on scores instead of winning.

For one thing, wins are included in the calculation. Perhaps equally important, since each club session is run like a small tournament, winning is still the primary goal. Also, once players understand that defense is as important as offense, there is little reason to change one's style of play. Finally, after a sufficiently large number of games are included in the rating calculation, players see that the effect of one or two games is very little, and the difference between winning by a little and winning by a lot is even less.

The author of this study performed the rating calculations week after week and thus understood and saw the specific effects of every type of game. Knowing that, the author could not conceive of any particular changes in his playing style that would significantly enhance his rating. Playing to win was still the primary goal, which presumably also would be true for NSA tournament play in a score-based rating system. If one's rating is used to qualify for a particular event, e.g., the World Championship, it is conceivable that someone might try to alter their play in a way to enhance their rating. It remains to be seen if that is possible or significant, but again, more research can address the issue.

More research will clarify and solidify the value, or lack thereof, of using scores to rate players. The Club 42 experience was largely positive, which indicates that for many players, using scores to calculate ratings may be acceptable.

Send your feedback to Steve Pellinen at pellinet@aol.com