Judging Bias and Figure Skating: Part One – Nationalistic Bias - フィギュアスケートファンの資料庫

FS Judging Review | October 17, 2019 | Uncategorized

As the Senior Grand Prix season is about to begin, I thought now would be an appropriate time to take a systematic look at the judging records of our judges. Despite the fact that figure skating judges play a significant role in determining the career trajectory of young athletes, it is well known that they face little accountability. Only the most egregious cases of biased scoring are ever punished, in spite of the fact that biased scoring is rampant and well-documented. Nor does the International Skating Union (ISU) appear to keep any kind of running record of judges’ performances, so judges avoid all scrutiny as long as they keep their bias to a moderate amount at every competition rather than an extreme amount at a single competition.

In order to improve accountability, then, I’ve decided to compile and analyze the data myself. Over the course of this season I plan on looking at a few different issues:

1. Nationalistic bias.
2. “Bloc” judging: judges being systematically biased in favor of skaters from other particular nationalities, typified, for instance, by judges from former USSR countries scoring Russian skaters more favorably.
3. Underscoring of top competitors by rival countries’ judges. For instance, do American judges underscore Yuzuru Hanyu? Japanese judges and Nathan Chen? Russian judges and Rika Kihira? Etc.

My hope is that in addition to holding judges accountable, we may also be able to confirm or dispel some suspicions that fans have long held but don’t necessarily have evidence for.

In the first post, we will look at nationalistic bias. We will discuss which judges show statistically significant evidence of nationalistic bias, how we’re able to determine this, as well as address some concerns and limitations.

Part One: Nationalistic bias

First, using data available from skatingscores.com, I ran every score through a formula that measures how each judge scored each skater relative to the panel. From this I compiled a spreadsheet of relative-score judging records for each senior international level judge who has judged an ISU Challenger Series, Grand Prix, or Championship. I also included a couple of other competitions, namely 2019 Challenge Cup and WTT. (Note: I have decided since the publishing of this post that I will remove them in the next update so that the criteria for competition inclusion are more consistent–they were included in this version because of a misapprehension about their status. This will probably affect whether some judges are flagged. Stay tuned.) It may be found here. Note that the file is too big to display on Google Docs, so you will have to download it.

First, let’s take a look at the top line conclusions. The database currently contains 312 judges. I examined 177 of them for nationalistic bias (the other judges didn’t have an extensive enough judging record), and 92 of those showed statistically significant evidence of nationalistic bias. Of those 92, 74 showed strong evidence. On the whole, judges scored their home skaters quite differently than they scored other skaters. For instance, if we plot the average z-scores (this will be explained in detail later) of judges’ scores for their home skaters versus other skaters, we get the following density plot:

As you can see, this shows two very different patterns of judging!

Who are these judges? Well, here they are, divided by federation. Names are included if they meet the standard benchmark for statistical significance (p<0.05), bolded if they meet a stricter benchmark (p<0.01), and bolded and underlined if they meet an even stricter benchmark (p<0.001). In parentheses next to the federation is the number of judges of that federation *tested* (not the total number of judges recorded, some of whom may not be tested due to insufficient data). Federations that do not have any judges that show statistically significant evidence of biased judging are not listed.

Judges with biased judging records

Austria (3)	Adrienn Schadenbauer
Canada (23)	Andre-Marc Allain, Cynthia Benson, Leanne Caron, Reaghan Fawcett, Karen Howard, Leslie Keen, Patty Klein, Nicole Leblanc-Richard, Erica Topolski
China (5)	Dan Fang, Shi Wei, Fan Yang
Czech Republic (7)	Frantisek Baudys, Jana Baudysova
Spain (2)	David Munoz
Finland (7)	Merja Kosonen, Virpi Kunnas-Helminen, Leo Lenkola
France (8)	Ronald Beau, Jezabel Dabouis, Elisabeth Louesdon, Philippe Meriguet, David Molina, Florence Vuylsteker
Great Britain (5)	Christopher Buchanan, Stephen Fernandez, Sarah Hanrahan, Nicholas Russell
Georgia (1)	Salome Chigogidze
Germany (13)	Christian Baumann, Ulla Faig, Uta Limpert, Claudia Stahnke, Elke Treitz, Ekaterina Zabolotnaya
Hungary (2)	Attila Soos, Gyula Szombathelyi
Israel (2)	Anna Kantor, Albert Zaydman
Italy (11)	Matteo Bonfa, Rossella Ceccattini, Raffaella Locatelli, Isabella Micheli, Tiziana Miorini, Miriam Palange, Walter Toigo
Japan (16)	Miwako Ando, Tomie Fukudome, Ritsuko Horiuchi, Akiko Kobayashi, Takeo Kuno, Kaoru Takino, Sakae Yamamoto, Nobuhiko Yoshioka
Kazakhstan (2)	Yuriy Guskov, Nadezhda Paretskaia
South Korea (2)	Sung-hee Koh, Jung Sue Lee
Latvia (1)	Agita Abele
Lithuania (1)	Laimute Krauziene
Mexico (1)	Sasha Martinez
Poland (3)	Malgorzata Grajcar, Malgorzata Sobkow
Russia (14)	Maira Abasova, Julia Andreeva, Sviatoslav Babenko, Igor Dolgushin, Elena Fomina, Maria Gribonosova-Grebneva, Natalia Kitaeva, Olga Kozhemyakina, Lolita Labunskaiya, Igor Obraztsov, Tatiana Sharkina, Alla Shekovtsova
Switzerland (3)	Bettina Meier
Sweden (4)	Inger Andersson, Kristina Houwing
Ukraine (2)	Yury Balkov, Anastassiya Makarova
USA (21)	Samuel Auxier, Richard Dalley, Janis Engel, Kathleen Harmon, Taffy Holliday, Laurie Johnson, Hal Marron, Jennifer Mast, John Millier, Sharon Rogers, Kevin Rosenstein

Let me note that there are many judges who are just on the right side of the borderline of being flagged, who I suspect will be flagged as more data comes in and evidence mounts. There are also many judges who will almost certainly be flagged as soon as they’ve judged enough to meet my minimum threshold for testing. On the other hand, note that due to the sheer number of judges tested, there may be a few judges (chiefly among the non-bolded, maybe one or two bolded, extremely unlikely for bolded and underlined) who wind up on this list by pure luck, though the odds that any particular judge on this list is actually unbiased and just unlucky are pretty low. Generally speaking, the odds that I have missed a judge who deserves to be flagged are significantly higher than the odds that I have flagged someone who doesn’t deserve to be flagged.
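The “pure luck” point can be made concrete with a little arithmetic. The sketch below is a worst-case illustration, not part of the original analysis: it assumes every one of the 177 tested judges is actually unbiased, in which case each judge’s p is equally likely to land anywhere between 0 and 1. The band labels are my own shorthand for the three tiers of the list.

```python
# Expected number of lucky false flags if every tested judge were unbiased.
# Assumption: under that worst case each judge's p-value is uniform on [0, 1],
# so the chance of landing in a band [lo, hi) is simply hi - lo.
TESTED_JUDGES = 177  # number of judges tested, from the post

# (label, lower p bound, upper p bound) for the three tiers of the list
BANDS = [
    ("listed only (0.01 <= p < 0.05)", 0.01, 0.05),
    ("bolded (0.001 <= p < 0.01)", 0.001, 0.01),
    ("bolded and underlined (p < 0.001)", 0.0, 0.001),
]

def expected_false_flags(n_judges, lo, hi):
    """Expected count of unbiased judges whose p falls in [lo, hi) by luck."""
    return n_judges * (hi - lo)

expected = {label: expected_false_flags(TESTED_JUDGES, lo, hi)
            for label, lo, hi in BANDS}

for label, count in expected.items():
    print(f"{label}: about {count:.2f} judges")
```

Under that assumption the three tiers would collect about 7, 1.6 and 0.2 judges respectively, fewer than 9 in total, against the 92 judges actually flagged. Since many judges are evidently not unbiased, the real number of lucky false flags should be lower still.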

It is also important to recognize these determinations are made on the basis of averages. Someone who is marked as biased here does over-score their own skaters to a degree that is highly unlikely to occur by random chance, but that does not mean they will over-score every single one of their own skaters at every competition. Individual scores are influenced by a variety of factors, and therefore exhibit a significant amount of noise, and judges are, after all, not clairvoyant, and cannot necessarily predict how the rest of the panel will score. (Conversely, when reviewing judging records, it’s important to remember that just because a single score is out of whack with the panel and looks consistent with nationalistic bias, does not necessarily mean that the judge who issued it is biased in general or that that score was the result of nationalistic bias.)

We can also compare federations. Here we will examine federations with at least ten judges, looking at the degree (the ZDifference) by which their judges have historically been biased. The following bar chart shows what percentage of each federation’s judges have records that fall into 4 different categories of bias: no bias (anything equal to or lower than 0), low bias (0-0.5), medium bias (0.5-1) and high bias (greater than 1). Note that this includes judges who did not meet the minimum threshold of data quantity to be flagged on an individual basis.

As you can see, some federations, like Canada, are all over the place, with some judges falling into each of the bias categories, whereas for others (chiefly Russia), every judge has a record that favors home country skaters. However, though Russia is the most consistently biased of the federations examined, other federations have a larger percentage of judges whose records display a high degree of bias.
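For readers who want to reproduce the four categories from the ZDifference column, here is a minimal sketch. The function names and the example federation are hypothetical, and the handling of values that fall exactly on 0.5 or 1 is an assumption, since the ranges given above overlap at those points.

```python
from collections import Counter

def bias_category(z_difference):
    """Map a ZDifference to one of the four chart categories.

    Assumption: boundary values (exactly 0.5 or exactly 1) go to the lower
    category; the post does not say which side they fall on.
    """
    if z_difference <= 0:
        return "none"
    if z_difference <= 0.5:
        return "low"
    if z_difference <= 1:
        return "medium"
    return "high"

def category_shares(z_differences):
    """Percentage of a federation's judges in each category."""
    counts = Counter(bias_category(z) for z in z_differences)
    total = len(z_differences)
    return {cat: 100 * counts.get(cat, 0) / total
            for cat in ("none", "low", "medium", "high")}

# Made-up ZDifferences for a hypothetical five-judge federation.
example_federation = [-0.2, 0.3, 0.7, 1.4, 0.9]
shares = category_shares(example_federation)
print(shares)
```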

Of these larger federations, only Australia appears to have a good claim to be unbiased on any kind of systematic basis. This can be seen more clearly by examining the following box plot, which shows the distribution of ZDifferences of judges from each of these federations in a slightly different manner than the previous graph. (If you’ve never had to read a box plot before, the box is where the middle 50% of judges from each federation lie, and the line in the middle of the box is the median–or middle–judge from that federation. The lines extend out to the full range of each federation’s judges, excluding the judges represented by the dots, who are outliers within their federation.)

Of course, these numbers will change as the season progresses and more competitions are added to the data set. I will try to keep this post updated as time passes. (Everything is up to date as of 17 Oct 2019).

In order to understand how I came to these determinations (and how to read the data), let’s take a look at methodology. As this discussion will involve a lot of statistical concepts that are probably unfamiliar to many people reading this, I’ll try to explain and offer links to statistical concepts that will be invoked here. Rest assured I did not make these metrics up–these are standard statistical tools used to analyze all kinds of data.

Methodology

The basic idea behind determining judge bias was to take each judge, compare their scores for each skater to the scores the rest of the panel gave, quantify how far off the panel judge was, and then put all the numbers together in one place so I could compare how a given judge scored their own skaters in comparison to the rest of the panel, versus how they scored other skaters. I then ran a standard statistical test on the difference in order to figure out the strength of the evidence the judging record provides of bias.

In order to do all this, I started by recording all the scores given by all the judges in all Challenger Series, Grand Prix Series, and ISU Championship competitions since the scoring system changed in 2018. I also included Challenge Cup and World Team Trophy (Note: as I mentioned earlier these will be removed in an update in order to make the criteria for competition inclusion more consistent). This was done by looking up the scores for each judge on skatingscores.com and manually inputting them into a pre-formatted competition spreadsheet. You can find the competition spreadsheets here.

If you open these spreadsheets, you’ll notice they look something like this:

		Odhran Allen	Doug Williams	Maria Fortescue	Veronique Verrue	Andreas Waldeck	Lorna Schroder	Miwako Ando	Mean
		IRL	USA	ISL	FRA	GER	CAN	JPN
Yuzuru Hanyu	JPN	164.01	157.34	172.44	161.55	166.7	167.4	170.74	165.74

(This example from the 2018 ACI men’s free skate is completely random, obviously.)

As you can see, it lists judges, judge nationality codes, skaters, the skater’s nationality code, and the scores given by each judge (these are pulled from skatingscores.com, if you want to verify the numbers–I’m very happy for the existence of this website, as it made this whole process a lot quicker as I didn’t have to run the calculations myself!), as well as the average of all the judges’ scores.

From this data, I first determined how much higher or lower each judge scored each skater compared to the rest of the panel by subtracting the mean of all the scores from the individual judge’s score. This produced what I called the score deviation. Using our friend Yuzu as an example here, this produces:

		Odhran Allen	Doug Williams	Maria Fortescue	Veronique Verrue	Andreas Waldeck	Lorna Schroder	Miwako Ando
		IRL	USA	ISL	FRA	GER	CAN	JPN
Yuzuru Hanyu	JPN	-1.73	-8.4	6.7	-4.19	0.96	1.66	5
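The subtraction can be checked with a few lines of Python. This is only a sketch of the arithmetic using the example row; the variable names are mine, and the actual spreadsheets do this with formulas.

```python
# Scores given to Yuzuru Hanyu in the example row, keyed by judge.
scores = {
    "Odhran Allen": 164.01,
    "Doug Williams": 157.34,
    "Maria Fortescue": 172.44,
    "Veronique Verrue": 161.55,
    "Andreas Waldeck": 166.70,
    "Lorna Schroder": 167.40,
    "Miwako Ando": 170.74,
}

# Mean of all seven scores (the "Mean" column of the sheet).
panel_mean = sum(scores.values()) / len(scores)

# Score deviation: each judge's score minus the panel mean, rounded for display.
deviations = {judge: round(score - panel_mean, 2)
              for judge, score in scores.items()}

print(round(panel_mean, 2))
for judge, deviation in deviations.items():
    print(f"{judge}: {deviation:+.2f}")
```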

So Odhran Allen scored Yuzu 1.73 points below the panel mean, Doug Williams 8.4 points below, etc. This is what’s shown in the second block on the competition sheets. Now, unfortunately, we can’t just leave it at that, because we want to compare how judges score skaters across competitions, segments, and disciplines, in order to get the largest, most robust data sets. A deviation of -8.4, while already a lot even in the men’s free, would be absolutely massive in the short, and also bigger in ladies or pairs than in men’s. Therefore, in order to make these score deviations more comparable, I had to standardize them into z-scores. Note that this is an extremely common method of standardizing data.

Here’s how it’s calculated: first, I have to determine something called the standard deviation of the judges’ scores. Standard deviation is another one of those common statistical measures, and it quantifies how spread out a set of numbers is from the average of those numbers. So if judges are all over the place on someone’s scores, then the standard deviation will be relatively high, whereas if judges are all more or less in agreement, the standard deviation will be low. In this case, the standard deviation of Yuzu’s scores was a fairly typical (in men’s) 4.85.

To calculate the z-scores, we just need to divide each of our score deviations by the standard deviation of Yuzu’s scores. One way to think about the z-score, then, is that it tells you how many standard deviations a judge scored a skater above or below average. Here is what happens to Yuzu’s score deviations once they’re converted into z-scores.

		Odhran Allen	Doug Williams	Maria Fortescue	Veronique Verrue	Andreas Waldeck	Lorna Schroder	Miwako Ando
		IRL	USA	ISL	FRA	GER	CAN	JPN
Yuzuru Hanyu	JPN	-0.36	-1.73	1.38	-0.86	0.2	0.34	1.03
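Here is a minimal sketch of the same conversion in Python, using the example row. One detail worth noting: the 4.85 quoted above is reproduced by the population standard deviation (dividing by the number of judges); the sample version (dividing by one fewer) would give about 5.24. The function name is my own, and the zero-spread guard is an assumption about how a unanimous panel should be treated.

```python
import statistics

# The seven judges' scores for the example skate, in panel order.
scores = [164.01, 157.34, 172.44, 161.55, 166.70, 167.40, 170.74]

def panel_z_scores(panel_scores):
    """Return (z-scores, standard deviation) for one skater's panel scores."""
    mean = statistics.fmean(panel_scores)
    # Population SD: this is the variant that reproduces the post's 4.85.
    sd = statistics.pstdev(panel_scores)
    if sd == 0:
        # Assumption: if every judge gave the same score, nobody deviates.
        return [0.0 for _ in panel_scores], sd
    return [(s - mean) / sd for s in panel_scores], sd

z, sd = panel_z_scores(scores)
print(round(sd, 2))
print([round(v, 2) for v in z])
```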

Z-scores typically range from -2 to 2, though occasionally you’ll see numbers outside that if a judge *really* disagrees with the other judges (this occurs at roughly a 5% frequency). Underscoring (ie. scoring below the other judges) turns into a negative z-score, while overscoring (ie. scoring above the other judges) turns into a positive z-score.
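The 5% figure is what a normal distribution would give. The sketch below computes that tail probability, assuming z-scores are roughly normally distributed (my assumption for illustration). It also computes a ceiling that applies to small panels: when the standard deviation is the population one, as in the example above, a single judge’s z-score can never exceed the square root of the panel size minus one, however far they are from everyone else.

```python
import math

def normal_two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))

def max_abs_z(panel_size):
    """Largest |z| one judge can reach when z uses the panel's population SD.

    The extreme case is one judge alone against a panel that otherwise agrees.
    """
    return math.sqrt(panel_size - 1)

tail = normal_two_sided_tail(2.0)
bound = max_abs_z(7)

print(f"Normal-theory share of |z| > 2: {tail:.4f}")
print(f"Ceiling on |z| for a 7-judge panel: {bound:.3f}")
```

Because of that ceiling, the true share of z-scores beyond 2 on a seven-judge panel may differ somewhat from the normal-theory figure.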

Using z-scores actually builds in a measure of leniency for the judges. If there’s a lot of disagreement within the panel about a skate, then the z-score will be less extreme than the raw score difference, so a big difference with the average will “count” less. On the other hand, it does mean that if there’s a lot of agreement among panelists, someone who is a lone outlier may have a more extreme z-score in comparison to the raw score difference. But overall, the z-score makes it a bit easier for biased judges to hide. Oh well, it’s not like they hide very well in the first place.

Once these z-scores are computed for each judge, competition, and segment, all of the z-scores associated with a given judge are collected together into one sheet, the individual judge’s sheet. You can find these in the big judges database. If you click on any judge’s name in the sheet labeled “Judges”, you’ll be taken to the individual sheet for the judge you clicked, where you can see the collected z-scores for all the competitions they’ve judged.

On the left you’ll see a bunch of summary statistics. I’ll explain those in a second. On the right you’ll see z-scores. As you can see, they are labeled by skater and nationality, and at the top there’s a code which tells you which competition and segment is being shown in a given set of columns. This is composed of [Year][Competition Code] [Segment Code]. The key for competition codes may be found in the “Checklist” portion of the Judges Sheet, which also lists which competitions are included in the database.

Using a formula, all of this data is split into two groups–z-scores for home country skaters and z-scores for other skaters. The z-scores for home country skaters are averaged, producing Z-home on the left. Same thing for the z-scores for other skaters, producing Z-other. The difference between the two, which is what we’re ultimately interested in, is then calculated as ZDifference. (You’ll also see these metrics in the overall judge summary).
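The split-and-average step can be sketched as follows. The function name, the record layout, and the two example records are hypothetical; the real sheets do this with spreadsheet formulas.

```python
from statistics import fmean

def z_summary(records, judge_country):
    """Compute Z-home, Z-other and ZDifference for one judge.

    records: iterable of (skater_country, z_score) pairs for that judge.
    judge_country: the country code treated as "home".
    """
    home = [z for country, z in records if country == judge_country]
    other = [z for country, z in records if country != judge_country]
    if not home or not other:
        return None  # nothing to compare
    z_home = fmean(home)
    z_other = fmean(other)
    return {"z_home": z_home, "z_other": z_other,
            "z_difference": z_home - z_other}

# Hypothetical record of a judge who favors home (JPN) skaters.
favoring = [("JPN", 1.0), ("JPN", 0.6), ("USA", -0.2),
            ("RUS", 0.1), ("CAN", -0.3), ("FRA", 0.0)]
summary = z_summary(favoring, "JPN")

# Hypothetical record of a judge who is generous to everyone equally.
generous = [("JPN", 2.0), ("JPN", 2.0), ("USA", 2.0), ("RUS", 2.0)]
generous_summary = z_summary(generous, "JPN")

print(summary)
print(generous_summary)
```

The first record yields a ZDifference of about 0.9, while the uniformly generous judge yields 0: being far from the panel does not by itself produce a ZDifference. Passing a different country code as `judge_country` recomputes the same statistics as if the judge belonged to that country.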

You can think of the ZDifference as representing the degree of bias that a judge has shown to home country skaters. In terms of raw score as a rule of thumb, a ZDifference of 1 represents about 7-8 points in men’s, 6ish in ladies and pairs, and 6-7 in ice dance over the course of a competition. In other words, a judge who has a ZDifference of 1 will give, on average, 7-8 bonus points in men’s, etc., versus what they would typically give a skater who is not from their home country.

Of course, the degree of bias shown is not the only thing that matters when it comes to assessing a judge’s level of bias. If a judge shows a ZDifference of 1 but has only judged their home country skater a couple of times, it’s possible that that ZDifference is simply due to random chance or other factors. On the other hand, if the ZDifference persists across many competitions, we can be much more confident that the judge is biased.

This is where the metric p comes in. p is another standard statistical measurement, and in the context of our data it represents the chance out of 1 that an unbiased judge, ie. one that scores home country skaters no differently than other skaters, could arrive at a record that evidences equal or greater bias than the actual judge’s record purely by accident. In other words, the lower p is, the stronger the evidence that there is some kind of systematic difference between how a judge scores home country skaters and other country skaters.

By convention, a p value below 0.05 is considered statistically significant, and that is the standard I will be using to flag judges, though in many cases we’ll see that p will be far below that threshold. For instance, in the case of Russian judge Olga Kozhemyakina, p=0.000000000000003 (note that that’s a 0.0000000000003% chance an unbiased judge would produce a record equal to or worse than hers). It’s better not to see statistical significance as a binary thing, however. Instead, you should become more and more suspicious of a judge as p drops. Notice that many judges whose records were not flagged nonetheless have fairly low p values–I suspect that many of these judges’ records will start getting flagged as more scores start coming in.

By considering ZDifference and p jointly, we can make a full assessment of a judge’s judging record. ZDifference tells you the severity of the historical bias, whereas p tells you how likely it is that an unbiased judge would produce a record at least that lopsided by chance.

I used a standard statistical test for the difference between two means, the Welch’s t-test, in order to calculate p. I used the one tailed version of the test, because we’re only looking for bias in one direction. Welch’s rather than Student’s was used because I noticed that judges with extensive judging records tended to have different variances for home country skater scores versus other scores. (If you didn’t understand this paragraph, that’s okay–unfortunately, it requires a lot more effort to explain how calculating p works in depth, so I will have to pass on doing that. If you would like to learn, I would recommend you take an introductory statistics class.)
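For readers who do want to see the mechanics, here is a minimal standard-library sketch of a one-tailed Welch’s t-test. It is not the spreadsheet’s own formula: the function names and sample data are hypothetical, and the tail probability is obtained by numerically integrating the t density (Simpson’s rule), which is fine for illustration but too imprecise for extremely small p values like the one quoted above.

```python
import math
from statistics import fmean, variance

MIN_HOME_SCORES = 4  # the post only computes p with at least 4 home-skater scores

def t_survival(t, df, steps=4000):
    """P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |t|
    return 0.5 - area if t >= 0 else 0.5 + area

def welch_one_tailed(home, other):
    """Test whether home z-scores are higher on average than other z-scores.

    Returns (t, degrees of freedom, one-tailed p), or None if there is too
    little data or no variation at all.
    """
    if len(home) < MIN_HOME_SCORES or len(other) < 2:
        return None
    v_home = variance(home) / len(home)     # squared standard error, home
    v_other = variance(other) / len(other)  # squared standard error, other
    if v_home + v_other == 0:
        return None
    t = (fmean(home) - fmean(other)) / math.sqrt(v_home + v_other)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v_home + v_other) ** 2 / (
        v_home ** 2 / (len(home) - 1) + v_other ** 2 / (len(other) - 1))
    return t, df, t_survival(t, df)

# Made-up z-scores for a hypothetical judge.
home_z = [1.0, 0.8, 1.2, 0.9]
other_z = [0.0, -0.2, 0.1, 0.3, -0.1, 0.2]
t_stat, dof, p = welch_one_tailed(home_z, other_z)
print(f"t = {t_stat:.2f}, df = {dof:.2f}, p = {p:.6f}")
```

If SciPy is available, `scipy.stats.ttest_ind(home_z, other_z, equal_var=False, alternative="greater")` performs the same test with a more accurate tail calculation.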

Discussion

First, the most obvious and basic question: what does bias mean? Here, I’ve been using bias to mean a demonstrable, mathematical difference between how a judge scores their own skaters and other skaters. By claiming a judge is “biased,” I don’t mean to impute anything in particular about their psychology, nor am I making any claim about the origin of the bias. It may be conscious, or it may be unconscious. It may be a deliberate attempt to manipulate the results, or simply the same kind of lack of objectivity fans often display concerning the skating of their favorite skaters. Personally, I am not overly concerned about the causes of the bias, only that it exists–after all, let us remember that these judges, through their scoring, determine young athletes’ futures. I don’t know about you, but I don’t want the future of these young, incredibly hard-working people to be determined by a group of people who are unable to be objective, whether that lack of objectivity is the result of corrupt intent or simply clouded judgment.

However, let me address some specific explanations for bias which I do not believe are true or at least have problems, as well as other attempts to defend the judges from criticism.

1. The bias is just due to cultural preferences. People tend to look more favorably upon programs they are culturally familiar with and score them higher, and obviously a judge and a skater from the same country are more able to culturally understand each other.

First, skaters from the same country often have very different skating styles and skate very different types of programs, so it strains credulity to believe that there is something quintessentially “Russian” or “Canadian” or whatever about all skaters who skate under the same flag. Sasha Trusova’s programs look completely different from Alina Zagitova’s which are completely different from Alena Kostornaia’s, and they’re even coached and choreographed by the same people!

Second, if this were true, we would expect to see judges from culturally similar countries scoring each others’ skaters higher. For example, Canada and the US are two extremely culturally similar countries, so Canadian judges should overscore US skaters and vice versa. Fortunately, the sheet is built such that it’s easy to test this proposition (just change the country code inside each judge’s individual page to see how they score specific other countries’ skaters) and in fact, we do not see this. The vast majority of Canadian judges score US skaters like an unbiased (or even anti-biased, ie. biased in the other direction) judge would, and so, too, the other way around. (You can confirm this yourself by downloading this sheet here, which has each judge’s nationality switched to that of a large, at least somewhat culturally similar or at least geographically close figure skating country. Or you can switch around judges’ nationalities yourself on the original sheet, and it will recalculate all the stats for each judge. You have to do this inside each judge’s individual sheet, however, not on the summary sheet.)

There are a few exceptions to this. Former USSR countries’ judges tend to score Russian skaters higher, although the level of bias is not quite as severe as it is for their own skaters. Also, I believe South Korean judges score North Korean skaters higher, but whether those two countries have similar cultures seems quite debatable. We will look at this in more detail in a future post, but I believe there is a better explanation for the exceptions. In general, judges from culturally similar countries do *not* score each others’ skaters higher.

2. It’s just human nature to be biased. We should realize that judges are humans too and not robots.

Judges are not all the same. They do not all show evidence of bias. For instance, Glenn Fortin (CAN), Katharina Heusinger (GER), Andreas Waldeck (GER), Ayumi Kozuka (JPN), Shizuko Ugaki (JPN), and Linda Leaver (USA) all have reasonably substantial judging records that do not show any substantial evidence of bias. This clearly indicates that it is possible for judges to be unbiased, at least when it comes to nationality-related bias. Evidently, not all humans have this particular human-nature related flaw. Even among the judges who are biased, there are considerable variations in the degree of bias shown. The worst offenders, for example Salome Chigogidze (GEO), Nicholas Russell (GBR), and Elena Fomina (RUS), have ZDifferences in the range of 1.5-2, whereas the lowest statistically significant differences are in the range of 0.5. This shows that it is certainly possible to reduce the level of bias of the judges overall by getting rid of the worst offenders and replacing them with less biased judges, even if some low level of bias (say, less than 0.5) is difficult to get rid of and may not be practically significant enough to be worth dealing with.

3. Your metric looks at judges’ bias by comparing them to the mean of the judging panel. Doesn’t that assume that the mean of the judging panel is right? But sometimes it’s the outlier judge that is right, and the other judges that are wrong.

It may be true that the outlier judge is indeed “right”–I avoided making any assessment of what a skater “should objectively” have scored because those types of assessments lead to unproductive fan wars and involve a level of personal judgment that I did not want to introduce to this study. However, let me note that outlier scores only “count against” judges if they align with expected patterns of nationalistic bias. If a Japanese judge scores a Filipino skater way above average because only that judge was being objective and the other judges all underscored him due to some other form of bias (reputation, small country, etc.), then that will actually count very mildly in favor of the Japanese judge. Only if a Japanese judge scores a Japanese skater way above average does it count “against” that judge. But in that case, there are still at least 3 other data points to consider (I only start calculating p if a judge has scored her own skaters at least 4 times), and if the judge only shows a pattern of “correcting” the scores for her own skaters, one has to begin wondering whether she is truly being objective. Again, a judge is not labeled “biased” for having scores that deviate from the mean; a judge is labeled “biased” if there is a difference between how they score their own skaters and other skaters. A judge who scores both groups 2 standard deviations above the mean would not trigger the flagging formula, despite having scores that are way out of whack with the other judges. It is only the difference between the judge’s scores for her own skaters and for other skaters that matters.

The only situation in which I think this may be a substantial concern is for judges from a small federation who have only judged one or two unique home country skaters. In that case, it’s possible that a personal (rather than nationalistic) preference, or a genuine belief that a particular skater (but less a large group of skaters, as that would affect Z-other as well, and thereby decrease ZDifference) is underscored, which just so happens to coincide with a national flag, gets “wrongly” flagged as nationalistic bias. This being the case, we may wish to be a little bit more lenient on small-fed judges. However, this defense hardly applies to the judges of large and powerful federations like Russia and the United States, who will judge many different skaters from their own country through their judging career, and whose skaters cannot credibly claim to be underscored because of their nationality.

Limitations/Other considerations

Despite the fact that so many judges show evidence of nationalistic bias by the methods used here, I actually think that they are somewhat limited in their ability to catch nationalistic bias. (Which ought to indicate how bad the problem is.)

First, a judge can quite easily avoid detection by only being biased “when it counts” ie. in only a selected number of competitions, when there are medals or spots at stake and a tweak in judging can make the difference. Because this bias will be washed out in the average with all of the other competitions where the judge was not being biased, this type of bias, if detectable at all using these methods, will only be so after a judge has built up an extremely substantial judging record.
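A quick back-of-the-envelope sketch shows how fast this kind of selective bias is diluted. It assumes, purely for illustration, a judge who is exactly neutral except for a fixed z-score boost on a handful of home-skater scores; the function name and numbers are hypothetical.

```python
def diluted_z_difference(n_home_scores, n_boosted, boost):
    """Average home-skater z shift when only some home scores are boosted.

    Assumes the judge is otherwise exactly neutral, so Z-other and the
    unboosted home z-scores all average 0.
    """
    return n_boosted * boost / n_home_scores

# A judge who boosts every one of 20 home scores by 1 standard deviation...
always = diluted_z_difference(20, 20, 1.0)
# ...versus one who does so only in the 2 skates "that count".
selective = diluted_z_difference(20, 2, 1.0)

print(always, selective)
```

The selective judge ends up with a ZDifference of about 0.1, well below the roughly 0.5 at which the smallest statistically significant differences in this data set appear.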

Another related type of nationalistic bias this metric is not good at catching is when judges selectively underscore only direct competitors, but score everyone else normally. Because all non-home-country skaters are averaged together, this type of bias has little impact on the overall ZDifference, and consequently it is very difficult to detect it using the method here. I hope to address this in a future segment which will examine whether judges from federations with top competitors underscore the direct competitors of their skaters. Stay tuned for that.

Thirdly, if there is bloc judging going on, or any other score-trading or collusion scheme to increase a skater’s score, it will function to weaken the evidence for biased judging by reducing a biased judge’s apparent difference with the other judges on the panel when scoring their own skaters. On the flip side, if there is a score-trading or collusion scheme to lower a competitor skater’s score, that may wrongly introduce apparent bias on the part of the judge from the home country of that skater. (Nonetheless, I don’t think this creates a major concern about wrongly flagging judges, because unless there is some grand conspiracy against all of the skaters from a certain federation, one instance of apparently biased judging will be washed out when averaged with the rest of the scores that judge has given.) However, we can use this same dataset to take at least a partial look at bloc judging, and we will do so in a forthcoming post, so stay tuned for that too.

Finally, it is also possible for judges to “game the system” by overscoring non-direct competitor skaters, thereby inflating the “Z-other” portion of the calculation. I don’t think this is a major concern now, but if somehow this were adopted as the primary means of tracking judges’ bias, then it would be a concern in the future. (Although they would chiefly do this by overscoring lower ranking competitors, which might actually be a good thing, since it would combat reputation bias.)

Conclusions

There is, as always, more to say than I have said, but as I don’t want to spend literally forever on this or produce something so long no one wants to read it, I will end it here and leave it to others to raise questions if there is a gap you would like me to fill. The overall conclusion is pretty clear: figure skating judging has a massive problem with nationalistic bias, and many judges are extremely blatant in their favoritism for their own countries’ skaters.

This also raises questions about judges’ commitment to objectivity in general. Though we have only looked at a very specific type of bias, and arguably not even the most significant one (just the easiest one to tackle using statistical methods), one might suspect that lack of objectivity in one respect bleeds into lack of objectivity in others. What about other forms of bias that are also often discussed, like reputation bias and big fed bias? If judges are so demonstrably biased in one way, it seems reasonable to suspect that they are also biased in others.

This post has been edited since it was published to improve verbiage, update graphs, and clarify some points of methodology that were initially confusing or misleading. Also, I forgot to credit skatingscores.com for the underlying judge score numbers–this is now fixed. Also, thank you Veveco from planethanyu.com for doing the graphs!

20 thoughts on “Judging Bias and Figure Skating: Part One – Nationalistic Bias”

1. Patricia Orzechowski
  
  October 17, 2019 at 2:23 pm
  
  Your list of biased judges basically leaves all judges biased. So how biased were you when you came up with this methodology? Do you favor one skater over others? Is your personal bias getting in the way of fair, unbiased judging? This reeks of bias in and of itself.
  
  You have not explained why you personally are involved in this question of biased judging or with the judges.
  Are you a figure skater or former figure skater?
  Are you a coach of a skater, or a skater yourself, who has not achieved the PCS scores or calls you wanted?
  The judges mark according to how closely a performance fits the scoring sheet.
  Are you a skater who gets calls that you think you should not get?
  Are you a coach of a skater who gets calls you think they should not get?
  
  Are you a judge who is being told this, of which fans should be aware?
  
  What is in it for you?
  
  1. FS Judging Review
    
    October 17, 2019 at 3:12 pm
    
    Every judge is added, but some were not tested because they didn’t judge enough for the test to be reliable. This was, in fact, to protect judges from being victims of false positives due to a scanty judging record.
    
    Again, please review the methodology. You’ll find that at every step, all judges were treated equally according to objective criteria. I could tell you my favorite skater (Yuzuru Hanyu) and my own nationality (USA), but it’s neither here nor there, as someone with a different favorite and a different nationality would get the same results. In fact, I welcome anyone who wishes to follow the methodology here, or to do a similar study, to do so and inform me of the results. Also notice that neither Japan nor the USA was left out of my examination, and plenty of judges from both countries were called out. Many other skaters that I like, for instance Alena Kostornaia, Mikhail Kolyada, Evgenia Medvedeva, and Alexandra Trusova, are Russian, and I certainly did not spare the Russian judges. Nothing is in this for me but the integrity of the sport–it certainly took many, many hours I could have devoted to doing something that might actually earn me money! All I want out of this is to see all our young, hardworking athletes judged fairly, regardless of their nationality.
    
    This project was not about tech calls, and the issue of accurate, inaccurate, or biased tech calling is outside the purview of this project. The one exception is that when an official serves sometimes as a judge and sometimes as a tech caller, their judging record may shed light on whether we can expect them to be a fair tech caller. Of course, I have personal opinions about how scoring should be improved, but at no point did those judgments enter into the calculation.
    
    At this point, it is clear that you are not operating in good faith, as you have not put in any effort to actually understand the methodology of this project. If you would like clarification on any part of the write up, I am happy to provide it, but I will no longer reply to someone who makes baseless insinuations that show no understanding of what I have written.
    
2. Florica
  
  October 17, 2019 at 4:01 pm
  
  This is really interesting. I found that US judges are usually guilty of underscoring direct threats to their skater(s) (Hello, Sharon Rogers :))
  If you are really into statistics – would it be possible to compare PCS for warhorse music against less common/modern music?
  Thank you!
  
  1. FS Judging Review
    
    October 17, 2019 at 4:50 pm
    
    I will definitely be looking at federations underscoring direct threats to their skaters in future posts. This is pretty common, and unfortunately isn’t well accounted for by my methodology here, since the scores for all non-home-country skaters get lumped together.
    
    The music comparison would be a lot harder, and wouldn’t be possible with this data set, because this data set relies on comparing judges to the rest of the panel. If there’s some kind of effect that would affect all panelists, it’s not something I’ll be able to catch using what I did here. In order to design a study to look at the effect of music, I think you’d have to compare PCS scores of the same skater and how they change season by season in relation to the type of music they use (obviously you’d look at many skaters, but you’d compare each with themselves). You’d have to control for their PCS changing for other reasons, like improved reputation/skating skills/etc., though, which I imagine is quite difficult. This would require a lot of work and isn’t very related to the bias issues I’m trying to tackle here, so it’s not something that is on my radar, but I invite anyone who wants to do such a study to go ahead. The nice thing about figure skating is that there is a lot of data available, and it’s all public, so in theory you should be able to come up with ways to answer all kinds of questions about how the sport is scored.
    
3. Indigo
  
  October 17, 2019 at 7:48 pm
  
  “I will definitely be looking at federations underscoring direct threats to their skaters in future posts.”
  
  Are you going to use examples other than Yuzuru Hanyu?
  
  Also, in terms of outlier bias, at ACI, while the US judge was indeed lower, he wasn’t exactly super generous to other skaters in that competition, including Jason Brown.
  
  National bias isn’t just about deviation from a mean, but how it’s applied across the field and your analysis, while interesting, fails to exhibit how these judges applied their scoring towards the other 2018 ACI skaters and resultant SDs and z-scores.
  
  1. FS Judging Review
    
    October 17, 2019 at 8:15 pm
    
    Please read more carefully–I’m a bit baffled at how you arrived at the conclusion that I only analyzed Yuzuru Hanyu’s ACI 2018 score. Every single score was analyzed from the beginning of the 2018-2019 season to date. Yuzuru’s score at ACI was only a demonstration to explain how the math works.
    
4. KJM
  
  October 17, 2019 at 8:29 pm
  
  Thank you for this and all of the work that went into it. It needs to be said. Always has. I knew when the new system came in that they could still manipulate their scores and therefore the results. You’re doing great work.
  
5. Kim
  
  October 17, 2019 at 9:20 pm
  
  I don’t have an issue with this kind of research and it could be a good tool for judges training. However I do have a couple of issues with the outline and the way it is written.
  
  It starts off basically accusing judges of bias in a negative context. It would be better to start by explaining what bias is rather than launching into accusations. The problem for me is that the whole premise is not completely unbiased when it comes to methodologies and what is being sought in evaluating skating judging.
  
  It also names and shames, which I don’t think is helpful and could put people on the defensive. I don’t have an issue with countries being analysed, but putting people’s names in there is opening yourself up to legal trouble and could actually go against the ISU code of conduct and member protection.
  
  I do think the intention is good and it could be used as part of judges training however it doesn’t appear entirely objective in what it is trying to achieve.
  
  1. FS Judging Review
    
    October 18, 2019 at 4:25 pm
    
    I’m not a member of the ISU, so I am not bound by its code of conduct. And I am skeptical that any sort of lawsuit would be anything but a waste of time and money on the part of the person attempting to sue.
    
    I do have an objective here, which is to hold judges accountable for their judging records. I freely admit to that objective. This objective is not achievable without naming names, so that is what I have done. If the ISU would like to introduce some kind of accountability mechanism that obviates the need for this relatively public method of achieving accountability, I would of course welcome that, as I understand that it may be unpleasant for these judges to be called out. However, as I have made all the necessary qualifications and elaborations in the body of the text (that “bias” here means a statistically significant difference in how a judge scores home country and other country skaters, that it’s possible that, given the number of judges who are tested, there may turn out to be a few false positives, that the confidence should be modulated according to how low the p value is, etc.), I do not feel squeamish about doing so. I will be updating the list as more data comes in, so the fact that a judge is on it right now does not mean they will necessarily be on it forever. If the evidence starts to weaken, I will remove them.
    
    As for method of presentation, I wrote this article with the aim of maximizing its readability to a lay audience and its ability to hold interest. While it would be nice to start off with the methodology, I’m afraid that if I did that no one would read it, as it would be excessively confusing and a bit boring. The article makes abundantly clear what “bias” means, and after all “bias” does not in ordinary language mean anything much different from the way I have used it. It just means a lack of objectivity, that is, partiality.
    
    Kim
    
    October 18, 2019 at 9:29 pm
    
    I totally disagree. If you want to be taken seriously, then you do need to present the “research” a bit more objectively and professionally and that comes from the first impression. It seems to be preaching to the converted (ie those that already think judges are corrupt) rather than trying to make a convincing argument to those that might not see it that way. Just saying.
    
6. Kim
  
  October 17, 2019 at 10:02 pm
  
  Also I am wondering why you haven’t revealed your name and credentials. You have suggested that you want to bring accountability to the sport. Remaining anonymous kind of reduces your level of accountability.
  
  1. FS Judging Review
    
    October 18, 2019 at 4:40 pm
    
    If I were at all untransparent about my methodology, then you may have a credible complaint. However, I have fully explained every single step I took to get the numbers I presented in this post. At no point have I hidden anything about how I got these numbers or the conclusions, and every single document I used is linked. My identity is irrelevant–anyone who follows my methodology as I have outlined it will get the same numbers, and any attempt to attack my results on the basis of my identity is ad hominem and an attempt to avoid engaging the actual evidence. If you think that a particular conclusion is baseless, please indicate what particular problem you have with either the methodology or the argument. I certainly do not expect anyone to accept my arguments on authority–this is why I have painstakingly presented every step, with accompanying reasoning, on the way from the scores to the conclusions.
    
    Kim
    
    October 18, 2019 at 9:23 pm
    
    I disagree. I know you will defend it, but people do want to know what your involvement in the sport is and your qualifications. Unfortunately, my impression is that you do come across as a person with a bit of an axe to grind and are not purely objective in what you are trying to achieve, which is indicated by your first paragraph. I would assume that you were one of the people that hated anonymous judging. So is there a reason why you can’t share the same accountability?
    
7. Kim
  
  October 18, 2019 at 1:53 am
  
  Actually I am also wondering why you are just basing this on total score? To get an accurate reflection wouldn’t it be more effective to look at individual GOEs and Component Score against the average GOE or Component Score for each skater? The sum of a score doesn’t necessarily tell the story.
  
  1. FS Judging Review
    
    October 18, 2019 at 4:00 pm
    
    It’s not based on total score, but on the difference between each judge and the average total score, which is a product of differences in PCS and GOE scoring. This method automatically accounts for the different amount of GOE available per element, versus adding up raw GOEs, and automatically weights the impact of PCS vs GOE in accordance with how they influence the final score.
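
As a rough sketch of that idea (not the exact spreadsheet formula), the comparison might look like this in Python. The element values, the points per GOE step, the PCS factor, and the use of a plain panel mean that includes the judge being compared are all illustrative assumptions, not figures from the study or from the ISU rules.

```python
from statistics import mean


def judge_totals(elements, components, pcs_factor):
    """Return the total score each judge's marks alone would produce.

    elements:   list of (base_value, points_per_goe_step, [GOE mark per judge])
    components: list of [component mark per judge]
    pcs_factor: multiplier applied to the summed component marks
    All three inputs are hypothetical structures made up for this sketch.
    """
    n_judges = len(components[0])
    totals = []
    for j in range(n_judges):
        # Convert each GOE mark to points so big and small elements are weighted correctly.
        tes = sum(base + step * goes[j] for base, step, goes in elements)
        pcs = pcs_factor * sum(marks[j] for marks in components)
        totals.append(tes + pcs)
    return totals


def deviations_from_panel(totals):
    """Difference between each judge's total and the panel average (plain mean assumed)."""
    panel_mean = mean(totals)
    return [t - panel_mean for t in totals]


# Made-up three-judge panel scoring two elements and two components.
elements = [
    (10.0, 1.0, [3, 2, 4]),  # high-value element: one GOE step is worth 1.0 points
    (4.0, 0.4, [1, 1, 1]),   # low-value element: one GOE step is worth 0.4 points
]
components = [[9.0, 8.5, 9.5], [9.0, 9.0, 9.0]]

totals = judge_totals(elements, components, pcs_factor=1.0)
deviations = deviations_from_panel(totals)
print([round(d, 2) for d in deviations])
```

Because GOE marks are turned into points before anything is summed, a one-step disagreement on the high-value element moves a judge's total more than the same disagreement on the low-value one, which is the weighting effect described above.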
    
8. FS Peer Review
  
  October 18, 2019 at 10:13 am
  
  Who is the author of this study? Is there a way to contact you?
  
  1. FS Judging Review
    
    October 18, 2019 at 3:55 pm
    
    You can message me through this site. I might not respond immediately, but I’ll try to check once a day.
    
9. VW
  
  October 18, 2019 at 2:12 pm
  
  The International Skating Union has set up a special commission to evaluate judges (Article 23 ISU Constitution). This commission in turn consists of judges and evaluates their performance as subjectively as the judges in the competition. What is lacking is a scientifically sound analysis of the scores with comprehensible statistical methods. Your method seems to be a promising approach. Submit your work to the USFSA for review and ask them to forward it to the ISU. It may be that the ISU is interested in a scientifically based method to analyze and evaluate the judges’ scores.
