Comparing Variability in Ultra Trail Running Results Across Countries

using coefficient of variation to compare variation in race results

By Brendan Graham in tidy tuesday

January 2, 2022

This post looks at a past TidyTuesday data set about ultra trail running. After looking at the data I attempt to quantify and compare the variation in rankings between countries.

Explore the Data

Race data is at the race level; events are made up of races. There can be multiple event observations per year.

race

## # A tibble: 1,207 × 14
##    race_year_id event   race   city  country date       start_time participation
##           <dbl> <chr>   <chr>  <chr> <chr>   <date>     <time>     <chr>        
##  1        68140 Peak D… Mills… Cast… United… 2021-09-03 19:00      solo         
##  2        72496 UTMB®   UTMB®  Cham… France  2021-08-27 17:00      Solo         
##  3        69855 Grand … Ultra… viel… France  2021-08-20 05:00      solo         
##  4        67856 Persen… PERSE… Asen… Bulgar… 2021-08-20 18:00      solo         
##  5        70469 Runfir… 100 M… uluk… Turkey  2021-08-20 18:00      solo         
##  6        66887 Swiss … 160KM  Müns… Switze… 2021-08-15 17:00      solo         
##  7        67851 Salomo… Salom… Foll… Norway  2021-08-14 07:00      solo         
##  8        68241 Ultra … 160KM  Spa   Belgium 2021-08-14 07:00      solo         
##  9        70241 Québec… QMT-1… Beau… Canada  2021-08-13 22:00      solo         
## 10        69945 Bunket… BBUT … LIND… Sweden  2021-08-07 10:00      solo         
## # … with 1,197 more rows, and 6 more variables: distance <dbl>,
## #   elevation_gain <dbl>, elevation_loss <dbl>, aid_stations <dbl>,
## #   participants <dbl>, year <dbl>

Example data for Run Rabbit Run event:

race %>% 
  filter(event == "RUN RABBIT RUN") %>%
  select(event, race, year) %>% 
  arrange(year) %>%
  add_table()

Some races can have thousands of participants:

race %>% 
  select(event, race, participants) %>%
  distinct() %>%
  arrange(desc(participants)) %>%
  head(10) %>%
  add_table()

Ranking data is at the racer level; racers can appear more than once:

ultra_rankings

## # A tibble: 137,803 × 8
##    race_year_id  rank runner     time     age gender nationality time_in_seconds
##           <dbl> <dbl> <chr>      <chr>  <dbl> <chr>  <chr>                 <dbl>
##  1        68140     1 VERHEUL J… 26H 3…    30 M      GBR                   95725
##  2        68140     2 MOULDING … 27H 0…    43 M      GBR                   97229
##  3        68140     3 RICHARDSO… 28H 4…    38 M      GBR                  103747
##  4        68140     4 DYSON Fio… 30H 5…    55 W      GBR                  111217
##  5        68140     5 FRONTERAS… 32H 4…    48 W      GBR                  117981
##  6        68140     6 THOMAS Le… 32H 4…    31 M      GBR                  118000
##  7        68140     7 SHORT Deb… 33H 3…    55 W      GBR                  120601
##  8        68140     8 CROSSLEY … 33H 3…    40 W      GBR                  120803
##  9        68140     9 BUTCHER K… 34H 5…    47 M      GBR                  125656
## 10        68140    10 Hendry Bi… 34H 5…    29 M      GBR                  125979
## # … with 137,793 more rows

ultra_rankings %>%
  group_by(runner, nationality) %>% 
  tally(sort = T) %>%
  head(10) %>%
  add_table()

Variation in Ranking Among Countries

Rankings seem interesting, let’s try and see which countries have runners with the most consistent rankings. First we need to prep the data a little bit. We’ll set a threshold to only include countries where runners from that given country participated in at least 15 races.

country_counts <- 
  ultra_rankings %>%
  select(nationality, race_year_id) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally()

runner_count <-
  ultra_rankings %>%
  select(nationality, runner) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally()

quantile(country_counts$n, probs = seq(0, 1, .10))

##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0   1.0   2.0   3.3   8.0  15.5  31.0  41.7  82.4 128.7 700.0

quantile(runner_count$n, probs = seq(0, 1, .10))

##      0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
##     1.0     1.0     1.0     3.0     6.0    16.0    49.4   143.1   261.6   873.0 
##    100% 
## 20345.0

top_countries <- 
  ultra_rankings %>%
  select(nationality, race_year_id) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally() %>%
  filter(n > 15) %>% 
  pull(nationality)

top_countries <- 
  ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  na.omit() %>%
  select(nationality, runner) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally() %>% 
  filter(n > 15) %>% 
  pull(nationality)

ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  filter(!is.na(rank)) %>%
  ggplot(aes(rank)) + 
  geom_histogram() + 
  ggtitle("race rankings") + 
  big_labels

Ranking is highly skewed, so we can use a modified formula for the coefficient of variation meant for log-normal data: $$ {cv_{raw}} = \sqrt{e^{s^2_{ln}} - 1} $$

where $s_{ln}$ is the sample standard deviation of the data after a natural log transformation. Another alternative we could use the Coefficient of Quartile Variation (see below), but we’ll stick with the modified cv instead.

Coefficient of Quartile Variation

$$ QCV = [(q3 - q1)/(q3 + q1))]*100 $$

After calculating the CV for log-normal data, we can compare each countries variation. France, Luxembourg and Korea are interesting in that they are relatively consistent, but their median ranks are very high. This suggests racers from these countries get consistently poor results

cv_log <- 
  ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  filter(!is.na(rank), !(is.na(gender))) %>%
  mutate(log_rank = log(rank)) %>% 
  group_by(nationality) %>% 
  summarise(
    mean_rank = mean(rank, na.rm = T),
    median_rank = median(rank, na.rm = T),
    q3 = quantile(x = rank, probs = .75, na.rm = T),
    q1 = quantile(x = rank, probs = .25, na.rm = T),
    sd_rank = sd(log_rank, na.rm = T),
    cv_log = sqrt((exp(1)^(sd_rank^2)) - 1),
    qcv = ((q3 - q1)/(q3 + q1))*100
  )

cv_log %>%
  ggplot(., aes(x = reorder(nationality, cv_log), y = cv_log, label = round(mean_rank, 2))) + 
  geom_point(size = 3.5, alpha = .85, aes(color = mean_rank)) + 
  scale_color_distiller(direction = -1, palette = "BuPu") +
  coord_flip() +
  labs(y = "Coefficient of Variation", x = "", color = "Mean Rank",
       title = "Coefficient of Variation in Race Rankings") + 
  big_labels

Let’s see if this CV aligns with the distributions of the top 2 and bottom 2 ranked countries. We can compare the ranking distributions of THA and USA with that of EDU and EST. The plot shows that ranking distribution for THA and USA are not as skewed as ECU and EST

ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  filter(!is.na(rank)) %>%
  filter(nationality %in% c("USA", "THA", "EST", "ECU")) %>% 
  ggplot(aes(x = rank, fill = nationality)) + 
  geom_histogram(position = "stack", show.legend = FALSE, alpha = .55) + 
  facet_wrap(vars(nationality), scales = 'free') +
  big_labels + 
  scale_fill_brewer(palette = "Dark2")

Plotting each country’s mean rank vs CV shows there is slight correlation; countries whose runners are more consistent tend to have better ranking on average.

cv_log %>%
  ggplot(., aes(x = cv_log, y = mean_rank, label = nationality, 2)) + 
  geom_text_repel() + 
  geom_point() + 
  geom_smooth(method = "gam", se = F) +
  labs(x = "Coefficient of Variation", y = "Mean Rank")  + 
  scale_y_continuous(breaks = seq(0, 800, 25)) +
  scale_x_continuous(limits = c(1.5, 10), breaks = seq(0, 10, 1)) + 
  big_labels

Posted on:: January 2, 2022

Length:: 6 minute read, 1149 words

Categories:: tidy tuesday

Tags:: tidy tuesday

See Also:: Classifying Bigfoot Encounters; Creating a Hex Bin Map to Show Changes Pell Grants; Getting started with topic modeling with dog breed traits