Sunday, August 17, 2014

More Difficulty = Better Execution Scores?

Let's face it. We are gym fans.

In case you missed it, fan is short for fanatic, and as fanatics, we, gymnastics fans, aren't the most logical people. We tend to select one routine or one example, and we craft an entire theory around that one data point.

Recently, I came across a new theory among gymnastics fanatics. The theory goes something like this: Judges are more lenient with their execution scores when gymnasts perform more difficulty.

In other words, the greater the difficulty, the higher the execution score.

Of course, I had to look into this. So, I took a looky-poo at the scores from the qualifying rounds at the 2013 World Championships.

Why did I choose that particular data set? 'Cause during those early sessions of the World Championships, you have a wide array of D scores. You have the best in the world on an event, and then, you have gymnasts who would be Level 8 gymnasts in the United States.

Let's see what I found…

A quick stats primer for the people who hate numbers

Regression analysis

Regression analysis is used in statistics to determine whether there is a correlation between two variables. You can look at correlations for many things – like whether there is a correlation between the wealth of a country and penis size.

In our case, we want to know whether there is a correlation between D and E scores. We're specifically interested in whether a bigger D tends to receive a bigger E.

To check for correlation

Step 1: You plot all your data in a scatter plot, which looks like this:

Step 2: Then, you find the line of best fit. Sometimes, the line is linear.

This means that you are growing progressively at the same rate. In the example above, the population is growing at roughly the same rate over time.

You can also have exponential lines of best fit.
These lines curve upward very quickly. That means that you are growing at a faster and faster rate over time.

In addition, you can have logarithmic lines of best fit.
These lines flatten over time. You're progressing at a slower and slower rate, and eventually, you might hit a "ceiling" where you can't expect to grow anymore.

Okay, so, there are more types of lines, but we won't go over all of them. 3 lines are sufficient for the statistically challenged.
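
For the less statistically challenged, here is a minimal sketch of how those three kinds of lines can be fitted with ordinary least squares. It's an illustration, not necessarily how the charts in this post were made: the function names (fit_line, fit_exponential, fit_logarithmic) are made up for this example, and the exponential and logarithmic fits use the common shortcut of taking logs first and then fitting a straight line.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by fitting a straight line to ln(y).
    Assumes every y is positive."""
    ln_a, b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b

def fit_logarithmic(xs, ys):
    """Fit y = a + b * ln(x) by fitting a straight line against ln(x).
    Assumes every x is positive."""
    return fit_line([math.log(x) for x in xs], ys)
```

Each function returns the two numbers that define its line, so you can plug in any x and see what the line predicts for y.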

Step 3: Calculate the value of R2.

This is a number between 0 and 1, and it tells you how well your line fits your data set.

To put it differently, R-squared tells you the extent to which the x-variable can be used to predict the value of the y-variable.

In our case, R2 indicates how well a D-score can predict an E-score. If it's true that gymnasts with high D scores receive high E-scores, the value of R2 should be closer to 1.
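
Here is a rough sketch of that calculation for a straight line of best fit. The function name r_squared and the D/E numbers are hypothetical, invented purely to show the arithmetic; they are not scores from the 2013 World Championships.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation the line fails to explain vs. total variation in y.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical D and E scores, made up for illustration only.
d_scores = [4.8, 5.2, 5.6, 6.0, 6.4]
e_scores = [8.1, 8.6, 7.9, 8.8, 8.4]
print(round(r_squared(d_scores, e_scores), 3))
```

A value near 1 means the D scores line up neatly with the E scores; a value near 0 means knowing the D score tells you almost nothing about the E score.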

When you're starting off in statistics, your teachers tell you that your findings aren't statistically significant unless the value of R2 is greater than 0.5. But if you get farther into stats, you learn that 0.5 isn't the definitive number.

Since I'm trying to keep this simple, let's keep our eyes out for numbers greater than 0.5.

MAG: 2013 World Championships - Qualifications

For the stats nerds, I removed outliers using the 1.5 × IQR rule.
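
For anyone who wants to try the same kind of cleanup, the 1.5 × IQR rule can be sketched like this. Treat it as an assumption-heavy sketch: the post doesn't say which quartile method was used or whether the rule was applied to D scores, E scores or both, so remove_outliers is a hypothetical helper that filters a single list using Python's default quartile method.

```python
import statistics

def remove_outliers(values, k=1.5):
    """Drop values more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical E scores, invented for illustration only.
e_scores = [8.1, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 4.2]
print(remove_outliers(e_scores))
```

In this made-up list, the 4.2 sits far below the rest of the pack, so it falls outside the fences and gets dropped.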

Men's Floor R2 = 0.05153

Men's Pommels R2 = 0.27368

Men's Rings R2 = 0.164

Men's Vault R2 = 0.00417

Men's Parallel Bars R2 = 0.06617

Men's High Bar R2 = 0.02535

WAG: 2013 World Championships - Qualifications

For the stats nerds, I removed outliers using the 1.5 × IQR rule.

Women's Vault R2 = 0.24592

Women's Uneven Bars R2 = 0.23437

Women's Beam R2 = 0.36433

Women's Floor R2 = 0.24592

A Few Observations

1. For both the men and the women, it is difficult to predict E scores based on D scores. For both MAG and WAG, it should not be taken for granted that a high D score correlates to a high E score.

2. On the men's side, pommel horse is the event where there is the strongest positive correlation between D-scores and E-scores. On the pig, you're more likely to see guys rewarded for their big Ds.

To me, this makes sense. Pommel horse requires a very specific type of swing. If you don't have a good swing, you're more likely to incur more execution deductions, and if you don't have a good swing, you probably will not be able to muster a routine with a lot of difficulty.

3. Compared to the men, the women tend to have a stronger positive correlation between D and E scores. As I said above, neither the men nor the women have extremely high correlations. (Their R2 values aren't over 0.5.) But we do see a slightly stronger correlation on the women's side - across all 4 events.

It's hard to say why that is. It could be the conspiracy theorists' favorite explanation: biased judging! In other words, the WAG execution judges are biased in favor of big difficulty scores, and they tend to be more lenient on gymnasts who perform more difficulty.

Then again, it could be a question of parity. Perhaps the women who perform more difficulty are really better gymnasts. As better gymnasts, they tend to perform better executed routines, and the female gymnasts who perform less difficulty just don't execute as well.

Or it could be a combination of the two?

What do you think? Are the WAG judges biased?


  1. I don't think there is a strong correlation between high D and high E scores. For MAG events, it would be difficult to have high D scores without pretty decent E scores on pommel horse, parallel bars and high bar. The gymnast simply couldn't do some of the skills without proper technique. The only events on which D and E could legitimately be studied separately are rings, floor and vault. You can have really sloppy form and still put a vault to your feet. Indeed, according to your statistics, vault is the place where D and E are least correlated.

  2. You really ought to see an anti-correlation. Gymnasts going for big scores and sacrificing form. Or the converse.

    I think publishing the judges' deductions would do a lot to fix judging. They have a long, long history of cheating. Also, of general impressionism versus following the code.

  3. It seems that gymnasts with higher E-scores win due to the help of the applause-o-meter and not of actual execution. Re: Epke Zonderland who happens to be my favorite, delicious, Dutch crunch of an athlete.

    P.S. I took stats in college twice (once in the math department and once again as a psych class in the last two years) and we learned that a statistically significant R-squared value was over 0.8. How times have changed, eh mate?

    1. You can have a statistically significant R2 of .2. It's a matter of high/medium/low.

  4. It's interesting that the women's lines are mainly exponential for each event, but the men's lines are linear or parabolas except for pommel horse, which looks to be exponential. Obviously, there is more of a correlation in WAG, and I guess this speaks to that. I feel like there isn't enough differentiation in WAG E scores for routines with higher D. MAG seems somewhat better, but there could still be improvement.

    Also, while lower D routines often mean the gymnast has flawed technique that inhibits adding harder elements and performing them well, thus leading to lower E scores even in easier routines, there are gymnasts who have beautiful execution in routines with decent difficulty who don't seem to get the E scores they merit. Noemi Makra is a great example of this. It would be interesting to see the differences in E scores for gymnasts from the Big 4 and non-Big 4 countries with similar D scores. Or maybe the rules for determining E scores need tweaking. Perhaps the deductions that judges are currently allowed to take don't allow for enough differentiation between excellent and good execution or between good and decent.

  5. I would say it's less that execution deductions aren't taken as much as good execution is not rewarded.

    For example, 1.2 in deductions is actually a good amount of deductions. Gymnast A performs with dreadful form and gets an 8.8 E-score with all of the deductions properly taken. Then Gymnast B performs with clean execution and gets a 9.2. It almost seems like judges take off an automatic .5 in E-score to begin with and make up the reasons why after. Gymnast B's deductions are obvious and easy to see, but it's always possible to nitpick a good routine until it's worth a 9.2.

    Not sure if that made sense, but I think the judges are simply more lenient with poorly performed routines, regardless of difficulty level, and too strict with well-executed ones. I would really like to see well-executed routines start scoring above 9.5 in E-score. Kyla Ross bores me, but when she hits her beam sets without a wobble, she really should be scoring about 9.7.

    It's really not worth it to have good execution when the disparity between superb execution and terrible is so small compared to the disparity that can be created by upping your difficulty.