Thursday, June 25, 2015

More Difficulty = Better Execution Score? - The 2014 AA Edition

Let's face it, gym fans. We're human. We like to find examples and then make sweeping generalizations.

The relationship between difficulty scores and execution scores is no exception. Many gymnastics fans feel that judges are more lenient on difficult routines and reward the big Ds with high execution scores.

I was curious about this theory of ours. So, after the 2013 Worlds, I looked at the data from the qualifying rounds to see whether there was a correlation between difficulty and execution, and generally speaking, there wasn't a strong correlation.

This year, I decided to look at a smaller sample of scores. I focused on what happens when the best-of-the-best compete against each other during the all-around finals. So, I ran the data from the all-around final in Nanning.

And for those who like tl;dr statements, here you go: once again, the results were similar. Typically, there isn't a strong relationship between D-scores and E-scores. In other words, more difficulty does not equal higher execution scores – with a few exceptions.

Let's take a look.


For the stats nerds, here's what you need to know:

  • All scores from the all-around competition have been considered in the initial analyses. You'll see why at the bottom of this post.
  • As you look at these scores, remember that you're looking at the upper echelon of gymnasts. So, there should be less variation in the scores, which will place the value of 'r' closer to 0.
For the non-stats nerds, here's what you are looking for:
  • On individual events, we're looking for a value of 'r' that is greater than or equal to 0.404.
  • What does it mean if the value of 'r' is equal to or greater than 0.404? This means that the relationship between the D and E scores is statistically significant.
  • More specifically, if D and E scores were truly unrelated, a correlation this strong would show up by chance fewer than 5 times out of 100.
  • Those are pretty good odds.
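
For the stats nerds who want to see where 0.404 comes from, here is a minimal sketch. It assumes a two-tailed test at the 0.05 level with 22 degrees of freedom (the r(22) notation used in this post), and it takes the critical t value of 2.074 from a standard t-table. The function name `critical_r` is made up for illustration.

```python
import math


def critical_r(t_crit: float, df: int) -> float:
    """Smallest |r| that reaches significance for a given critical t value."""
    # Rearranged from t = r * sqrt(df / (1 - r^2))
    return t_crit / math.sqrt(t_crit ** 2 + df)


# Assumption: r(22) means 22 degrees of freedom (n - 2).
DF = 22
# Assumption: two-tailed critical t for df = 22 at alpha = 0.05, from a t-table.
T_CRIT = 2.074

threshold = critical_r(T_CRIT, DF)
print(round(threshold, 3))  # 0.404
```

With a different number of gymnasts, the threshold moves: fewer scores means a higher bar for 'r'.
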
Okay, enough math babble. Let's take a look at some Ds, shall we? Woof.


On vault, r = -0.258. For the stats nerds, r(22) = -0.258, p > 0.05.

On uneven bars, r = 0.057. r(22) = -0.057, p > 0.05.

On beam, r is equal to 0.589. r(22) = 0.589, p < 0.05.

On floor, r is equal to 0.443. r(22) = 0.443, p < 0.05.

When you look at all the scores from the 2014 all-around finals, r is equal to 0.105.


On men's floor, r is equal to 0.333. r(22) = 0.333, p > 0.05.
On men's pommel horse, r is equal to 0.188. r(22) = 0.188, p > 0.05.
On rings, r is equal to 0.028. r(22) = 0.028, p > 0.05.

On vault, r is equal to -0.240. r(22) = -0.240, p > 0.05.

On parallel bars, r is equal to 0.268. r(22) = 0.268, p > 0.05.

On high bar, r is equal to 0.020. r(22) = 0.020, p > 0.05.
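
If you want to double-check the p > 0.05 and p < 0.05 labels above, here is a rough sketch. It converts each reported 'r' into a t statistic and compares it with the two-tailed critical value for 22 degrees of freedom (2.074, taken from a t-table). The helper name `r_to_t` is made up, and this is not necessarily the exact procedure used for this post.

```python
import math

DF = 22         # from the r(22) notation in the post
T_CRIT = 2.074  # assumption: two-tailed critical t for df = 22, alpha = 0.05


def r_to_t(r: float, df: int) -> float:
    """Convert a Pearson r into the equivalent t statistic."""
    return r * math.sqrt(df / (1 - r * r))


# Values of r as reported in the post. Only the magnitude matters for the
# two-tailed check, so uneven bars is entered as a magnitude (0.057).
reported = {
    "WAG vault": -0.258,
    "WAG uneven bars": 0.057,
    "WAG beam": 0.589,
    "WAG floor": 0.443,
    "MAG floor": 0.333,
    "MAG pommel horse": 0.188,
    "MAG rings": 0.028,
    "MAG vault": -0.240,
    "MAG parallel bars": 0.268,
    "MAG high bar": 0.020,
}

significant = [
    event for event, r in reported.items()
    if abs(r_to_t(r, DF)) >= T_CRIT
]
print(significant)  # ['WAG beam', 'WAG floor']
```
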


1. Vault

During the all-around finals in Nanning, the vault scores tended to trend downwards. In other words, the more difficult the vault, the lower the execution score.

This tends to contradict popular wisdom among gymnastics fans, which says that an Amanar will get a better E-score than a full-twisting Yurchenko. However, on both the men's and women's side, the correlation between D- and E-scores isn't all that strong. So, I wouldn't be too quick to jump to conclusions.

2. MAG

The numbers for the men are kind of boring. We didn't really prove any judging controversy. The value of 'r' didn't hit the 0.404 mark on any event. So, the strength of the correlation on the men's side is fairly weak.

At first glance, it's tempting to chalk this up as a win for the men. The MAG judges are far superior to WAG judges. The E-score judges don't let big Ds sway their opinions.

But before we jump to conclusions, there are other factors to consider. For example, on the men's side, there tends to be more parity in the all-around final. So, there's less variation in scores. Less variation in scores tends to lower the value of 'r', and when you look at the amount of variation in the men's scores, it's pretty consistent across all events – except for the execution score on rings.

Event       Standard Deviation - D Score    Standard Deviation - E Score
High Bar    0.563                           0.432
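
To see why less variation can drag 'r' down, here is a small simulation. Everything in it is made up: a hypothetical field of 2,000 gymnasts whose D and E scores both depend on a hidden "skill" value, with invented means and noise levels. It compares 'r' for the whole field with 'r' for only the top 10% by total score, as a stand-in for an all-around final.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


rng = random.Random(2014)  # fixed seed so the run is repeatable

# Hypothetical field: skill drives both scores, plus independent noise.
skill = [rng.gauss(0, 1) for _ in range(2000)]
d_scores = [5.5 + 0.5 * s + rng.gauss(0, 0.3) for s in skill]
e_scores = [8.5 + 0.4 * s + rng.gauss(0, 0.3) for s in skill]

r_full = pearson_r(d_scores, e_scores)

# Keep only the "finalists": the top 10% by D + E.
pairs = sorted(zip(d_scores, e_scores), key=lambda p: p[0] + p[1], reverse=True)
top = pairs[:200]
r_top = pearson_r([d for d, _ in top], [e for _, e in top])

print(f"whole field: r = {r_full:.3f}")
print(f"top 10%:     r = {r_top:.3f}")
```

In this toy setup, the correlation among the top scorers comes out much weaker than in the whole field, even though the "judging" is identical for everyone.
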

So, maybe the judges are better? Maybe not? I'll leave that up to you to decide.

3. WAG

For the women, we exceeded the magical number of 0.404 on two occasions: beam and floor. So, it seems like there tends to be a stronger connection between your D score and your E score on those events.

So, it's tempting to immediately point fingers at the floor and beam judges. Those lousy judges don't know how to judge!

But we shouldn't be so hasty. The judges don't exist in a vacuum. There could be other factors at play, like parity. Generally speaking, it is believed that there are fewer top contenders for the all-around title on the women's side, and when you look at the charts above, you can see the lack of parity, especially on beam and floor.

Lisa Hill competed a beam routine with a 4.0 difficulty score, which received a lower execution score of 7.366. Laura Waem competed a floor routine with a 4.4 difficulty score, which received a 6.233 in execution. When you look at the charts, both of those scores are obvious outliers.

So, what happens when you remove those scores from the data sets? Let's take a look.

When you remove Laura Waem's floor score, the value of 'r' drops to 0.147, and the correlation is weakened a lot.

r is equal to 0.147.

So, it appears that Laura's score was really affecting our analysis. Maybe the floor judges aren't so bad after all.
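
Here is a toy illustration of how a single score can do that. The numbers are made up and are not the actual Nanning scores: eight routines clustered close together, plus one routine with both a low D and a low E.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical (D, E) pairs for eight tightly bunched routines.
cluster = [
    (5.6, 8.4), (5.8, 8.2), (6.0, 8.5), (5.7, 8.6),
    (5.9, 8.3), (6.1, 8.4), (5.5, 8.3), (6.2, 8.5),
]
# Hypothetical outlier: low difficulty and low execution.
outlier = (4.4, 6.2)


def r_for(points):
    return pearson_r([d for d, _ in points], [e for _, e in points])


r_without = r_for(cluster)
r_with = r_for(cluster + [outlier])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

One point far from the pack in both directions is enough to turn a weak correlation into a strong one, which is why it is worth rerunning the numbers without it.
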

However, when you remove Lisa Hill's beam score, the value of 'r' is still pretty high. It's at 0.551…

… which raises some suspicions. And now compare the beam example to what happens when we remove Shang Chunsong's bar routine with a 5.600 E-score from the data set. As we saw with floor, we still don't hit the 0.404 mark.


Dear Nellie Kim,

What's up with beam?

If I were you, I'd be watching my beam judges during the all-around final very carefully.

Uncle Tim

P.S. I'm curious: Does anyone know if the FIG has a team of data scientists looking at these numbers? I want to know what their conclusions are.


  1. The value of r (I assume that you actually use r*r) indicates the strength of a possible correlation. To know whether the correlation is significant, you also need to look at the p-value, which for the analysis you considered should be smaller than 0.05. In other words, without reporting the p-value, the results and conclusion are meaningless.

    1. Hi there! Nope, not looking at r² – just Pearson (r).

      To appease the stat nerds like you, I've added the proper statistical notation for you.

      I assure you that I did look at p-values to determine if my findings were statistically significant. That's where I got the magical number of 0.404 from. :)

    2. I never realized how much I learned in college statistics until I read this... and understood it. 🙌

  2. Thanks for the stats. It is helpful to see the D scores and E scores compared across disciplines. I am not surprised that beam has the highest correlation. After all, the margin for error is so small on beam that a gymnast is only going to do skills she can do well. Basically, the more technical the apparatus, the higher the correlation should be.

    1. I thought about that theory, but wouldn't bars be the most technical of the WAG events?

  3. Is it possible that we see the beam correlation because it's the apparatus where an athlete's performance can affect their D score most significantly along with their E score? If an athlete is having a shaky routine then not only will they get E score deductions for balance checks but also will lose tenths on the D scores for missed connections?

    1. I think this may well be right, given that the effect of outliers who received very low D and E scores is to increase the correlation. Basically, it looks like it is because you lower your E score significantly, along with your D score, if you fall off or nearly fall off. That's way more common on beam than on any other apparatus.

  4. The problem with this type of analysis is that you can only look at the score the gymnast actually received and not the score they would have received had they been marked by some sort of perfect system. So you can say that there isn't a significant correlation between execution scores and difficulty, but it is possible that there really should be a negative correlation (because execution tends to go down as difficulty increases), and not seeing a correlation means judges are being more lenient towards routines with higher difficulty.

    I have certainly seen that happen once at a trampoline competition (only BUSA, now BUCS, not a ranking comp or anything). The top three performers all got very similar execution scores, but the least difficult was the most beautifully executed I've ever seen and totally wowed that hall, while the two more difficult routines were pretty soft really, by elite standards, and seriously overmarked. Those judges could also have been influenced by the rankings of the performers. They were all FIGA but the two with the high DD were ranked 1 and 2 in the UK respectively and competed internationally.

    Of course, that is purely anecdotal, and doesn't necessarily imply the same would happen more broadly.