The relationship between difficulty scores and execution scores is no exception. Many gymnastics fans feel that judges are more lenient on difficult routines and reward the big Ds with high execution scores.
I was curious about this theory of ours. So, after the 2013 Worlds, I looked at the data from the qualifying rounds to see whether there was a correlation between difficulty and execution, and generally speaking, there wasn't a strong correlation.
This year, I decided to look at a smaller sample of scores. I focused on what happens when the best-of-the-best compete against each other during the all-around finals. So, I ran the data from the all-around final in Nanning.
And for those who like tl;dr statements, here you go: once again, the results were similar. Typically, there isn't a strong relationship between D-scores and E-scores. In other words, more difficulty does not equal higher execution scores – with a few exceptions.
Let's take a look.
For the stats nerds, here's what you need to know:
- All scores from the all-around competition have been considered in the initial analyses. You'll see why at the bottom of this post.
- As you look at these scores, remember that you're looking at the upper echelon of gymnasts. So, there should be less variation in the scores, which tends to pull the value of 'r' closer to 0.
For the non-stats nerds, here's what you are looking for:
- On individual events, we're looking for a value of 'r' that is greater than or equal to 0.404.
- What does it mean if the value of 'r' is equal to or greater than 0.404? It means that the relationship between the D and E scores is statistically significant.
- More specifically, if there were no real relationship between D scores and E scores, we'd see a correlation this strong fewer than 5 times out of 100.
- Those are pretty good odds.
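For anyone wondering where 0.404 comes from, here's a rough sketch of the arithmetic. It assumes 24 gymnasts per event (which is what r(22) implies, since df = n - 2) and a two-tailed critical t of 2.074 at the 0.05 level, a value I'm taking from a standard t-table rather than computing. The names `critical_r` and `T_CRIT_DF22` are just mine for illustration.

```python
import math

# Two-tailed critical t for alpha = 0.05 and 22 degrees of freedom,
# taken from a standard t-table (an assumption, not computed here).
T_CRIT_DF22 = 2.074


def critical_r(df, t_crit):
    """Smallest |r| that counts as significant for the given df and critical t."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


n_gymnasts = 24            # implied by r(22): df = n - 2
df = n_gymnasts - 2
r_crit = critical_r(df, T_CRIT_DF22)
print(f"critical r for df={df}: {r_crit:.3f}")
```

The formula is just the usual t-test for a correlation, rearranged to solve for r.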
Okay, enough math babble. Let's take a look at some Ds, shall we? Woof.
On vault, r = -0.258. For the stats nerds, r(22) = -0.258, p > 0.05.
On uneven bars, r = 0.057. r(22) = -0.057, p > 0.05.
On beam, r is equal to 0.589. r(22) = 0.589, p < 0.05.
On floor, r is equal to 0.443. r(22) = 0.443, p < 0.05.
When you look at all the scores from the 2014 all-around finals, r is equal to 0.105.
On men's floor, r is equal to 0.333. r(22) = 0.333, p > 0.05.
On men's pommel horse, r is equal to 0.188. r(22) = 0.188, p > 0.05.
On rings, r is equal to 0.028. r(22) = 0.028, p > 0.05.
On men's vault, r is equal to -0.240. r(22) = -0.240, p > 0.05.
On parallel bars, r is equal to 0.268. r(22) = 0.268, p > 0.05.
On high bar, r is equal to 0.020. r(22) = 0.020, p > 0.05.
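If you'd like to double-check the p-value calls above, here's a small sketch that converts each reported r into a t statistic and compares it to the critical value. It assumes df = 22 for every event and the same table value of 2.074 for the two-tailed 0.05 cutoff. The dictionary labels are mine, and the combined all-events figure is left out because it rests on a different number of scores. Only the size of r matters for a two-tailed check, so bars is entered by magnitude.

```python
import math

DF = 22
T_CRIT = 2.074  # two-tailed, alpha = 0.05, df = 22 (from a t-table; an assumption)

# Correlations as reported in this post, per event.
reported_r = {
    "women's vault": -0.258,
    "women's bars": 0.057,   # magnitude only; sign doesn't change the two-tailed check
    "women's beam": 0.589,
    "women's floor": 0.443,
    "men's floor": 0.333,
    "men's pommel horse": 0.188,
    "men's rings": 0.028,
    "men's vault": -0.240,
    "men's parallel bars": 0.268,
    "men's high bar": 0.020,
}


def t_from_r(r, df):
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(df / (1.0 - r * r))


significant = []
for event, r in reported_r.items():
    t = t_from_r(r, DF)
    flag = abs(t) > T_CRIT
    if flag:
        significant.append(event)
    print(f"{event:20s} r = {r:+.3f}  t = {t:+.2f}  significant: {flag}")
```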
During the all-around finals in Nanning, the vault scores tended to trend downwards. In other words, the more difficult the vault, the lower the execution score.
This tends to contradict popular wisdom among gymnastics fans, which says that an Amanar will get a better E-score than a full-twisting Yurchenko. However, on both the men's and women's side, the correlation between D- and E-scores isn't all that strong. So, I wouldn't be too quick to jump to conclusions.
The numbers for the men are kind of boring. We didn't really prove any judging controversy. The value of 'r' didn't hit the 0.404 mark on any event. So, the strength of the correlation on the men's side is fairly weak.
At first glance, it's tempting to chalk this up as a win for the men. The MAG judges are far superior to WAG judges. The E-score judges don't let big Ds sway their opinions.
But before we jump to conclusions, we should note that there are other factors to consider. For example, on the men's side, there tends to be more parity in the all-around final. So, there's less variation in scores. Less variation in scores tends to lower the value of 'r', and when you look at the amount of variation in the men's scores, it's pretty consistent across all events – except for the execution score on rings.
| Event | Standard Deviation - D Score | Standard Deviation - E Score |
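To show why less variation can shrink 'r', here's a toy sketch. The numbers are completely made up (they are not gymnastics scores): 20 points with a built-in upward trend plus a fixed zigzag of noise. The same data gives a much weaker correlation when you only look at the top five values, which is roughly what happens when you only look at the best-of-the-best. `pearson_r` is a plain hand-rolled helper, not anything official.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y rises with x, with an alternating +3 / -3 wobble.
x = list(range(20))
y = [xi + (3 if xi % 2 == 0 else -3) for xi in x]

r_full = pearson_r(x, y)            # whole range of x
r_top = pearson_r(x[15:], y[15:])   # only the five highest x values

print(f"full range: r = {r_full:.3f}")
print(f"top five:   r = {r_top:.3f}")
```

The underlying trend is identical in both cases. The only thing that changed is how much of the range we kept.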
So, maybe the judges are better? Maybe not? I'll leave that up to you to decide.
For the women, we exceeded the magical number of 0.404 on two occasions: beam and floor. So, it seems like there tends to be a stronger connection between your D score and your E score on those events.
So, it's tempting to immediately point fingers at the floor and beam judges. Those lousy judges don't know how to judge!
But we shouldn't be so hasty. The judges don't exist in a vacuum. There could be other factors at play, like parity. Generally speaking, it is believed that there are fewer top contenders for the all-around title on the women's side, and when you look at the charts above, you can see the lack of parity, especially on beam and floor.
Lisa Hill competed a beam routine with a 4.0 difficulty score, which received a low execution score of 7.366. Laura Waem competed a floor routine with a 4.4 difficulty score, which received a 6.233 in execution. When you look at the charts, both of those scores are obvious outliers.
So, what happens when you remove those scores from the data sets? Let's take a look.
When you remove Laura Waem's floor score, the value of 'r' drops to 0.147, and the correlation is weakened a lot.
r is equal to 0.147.
So, it appears that Laura's score was really affecting our analysis. Maybe the floor judges aren't so bad after all.
However, when you remove Lisa Hill's beam score, the value of 'r' is still pretty high. It's at 0.551…
… which raises some suspicions. And now compare the beam example to what happens when we remove Shang Chunsong's bar routine with a 5.600 E-score from the data set. As we saw with floor, we still don't hit the 0.404 mark.
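For the stats nerds, here's a minimal sketch of how a single outlier can prop up 'r'. Only the 4.4 / 6.233 pair is a real score from this post (Laura Waem's floor routine). The other six pairs are invented for illustration, and they are deliberately built to have almost no D-to-E relationship on their own. `pearson_r` is a hand-rolled helper.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical D and E scores with no real trend between them.
d_scores = [5.0, 5.2, 5.4, 5.6, 5.8, 6.0]
e_scores = [8.5, 8.3, 8.6, 8.2, 8.5, 8.4]

# One low-D, low-E routine (the floor outlier mentioned in this post).
outlier_d, outlier_e = 4.4, 6.233

r_without = pearson_r(d_scores, e_scores)
r_with = pearson_r(d_scores + [outlier_d], e_scores + [outlier_e])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

One point sitting far away in the low-D, low-E corner is enough to make an otherwise flat cloud look like a strong positive trend, which is why it's worth rerunning the numbers with the outliers removed.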
Dear Nellie Kim,
What's up with beam?
If I were you, I'd be watching my beam judges during the all-around final very carefully.
P.S. I'm curious: Does anyone know if the FIG has a team of data scientists looking at these numbers? I want to know what their conclusions are.