# Introduction

In previous posts (Fairbanks Race Predictor, Equinox from Santa Claus, Equinox from Gold Discovery) I’ve looked at predicting Equinox Marathon results based on results from earlier races. In all those cases I’ve looked at single race comparisons: how results from Gold Discovery can predict Marathon times, for example. In this post I’ll look at all the Usibelli Series races I completed this year to see how they can inform my expectations for next Saturday’s Equinox Marathon.

# Methods

I’ve been collecting the results from all Usibelli Series races since 2010. Using that data, grouped by runner name and year, I found all runners who completed the same set of Usibelli Series races that I finished in 2018, along with their Equinox Marathon finish pace. Between 2010 and 2017 there are 160 records that match.

The data looks like this. `crr` is that person’s Chena River Run pace in
minutes per mile, `msr` is the Midnight Sun Run pace for the same person and
year, `rotv` is the pace from Run of the Valkyries, `gdr` is the Gold Discovery
Run pace, and `em` is the Equinox Marathon pace for that same person and year.

crr | msr | rotv | gdr | em |
---|---|---|---|---|
8.1559 | 8.8817 | 8.1833 | 10.2848 | 11.8683 |
8.7210 | 9.1387 | 9.2120 | 11.0152 | 13.6796 |
8.7946 | 9.0640 | 9.0077 | 11.3565 | 13.1755 |
9.4409 | 10.6091 | 9.6250 | 11.2080 | 13.1719 |
7.3581 | 7.1836 | 7.1310 | 8.0001 | 9.6565 |
7.4731 | 7.5349 | 7.4700 | 8.2465 | 9.8359 |
... | ... | ... | ... | ... |
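
The selection step described above amounts to pivoting the results to one row per runner and year, then keeping only the rows that have a pace for every race. Here is a minimal Python sketch of that idea (the analysis itself was done in R; see the appendix). The runner names, the year, and the `complete_runner_years` helper are made up for illustration; the first runner's paces are copied from the first row of the table.

```python
from collections import defaultdict

# Hypothetical long-format records: (name, year, race code, pace in min/mile).
# "runner a" reuses the paces from the first table row; names and year are invented.
records = [
    ("runner a", 2016, "crr", 8.1559),
    ("runner a", 2016, "msr", 8.8817),
    ("runner a", 2016, "rotv", 8.1833),
    ("runner a", 2016, "gdr", 10.2848),
    ("runner a", 2016, "em", 11.8683),
    ("runner b", 2016, "crr", 9.10),   # ran only two of the five races
    ("runner b", 2016, "em", 13.00),
]

REQUIRED = ("crr", "msr", "rotv", "gdr", "em")


def complete_runner_years(records, required=REQUIRED):
    """Pivot to one row per (name, year); keep rows with every required race."""
    by_runner_year = defaultdict(dict)
    for name, year, race, pace in records:
        # Simplification: a repeated (name, year, race) overwrites the earlier
        # value, whereas the R code in the appendix drops such duplicates.
        by_runner_year[(name, year)][race] = pace
    return {
        key: tuple(paces[race] for race in required)
        for key, paces in by_runner_year.items()
        if all(race in paces for race in required)
    }


rows = complete_runner_years(records)
print(rows)
```

Only "runner a" survives, because "runner b" is missing three of the required races.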

I will use two methods to predict Equinox Marathon times from these records: multivariate linear regression and Random Forest.

The R code for the analysis appears at the end of this post.

# Results

## Linear regression

We start with linear regression, which isn’t entirely appropriate for this analysis because the independent variables (pre-Equinox race pace times) aren’t really independent of one another. A person who runs a 6 minute pace in the Chena River Run is likely to also be someone who runs Gold Discovery faster than the average runner. This relationship, in fact, is the basis for this analysis.

I started with a model that includes all the races I completed in 2018, but pace time for the Midnight Sun Run wasn’t statistically significant so I removed it from the final model, which included Chena River Run, Run of the Valkyries, and Gold Discovery.

This model is significant, as are all the coefficients except the intercept, and the model explains nearly 80% of the variation in the data:

    ## 
    ## Call:
    ## lm(formula = em ~ crr + gdr + rotv, data = input_pivot)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -3.8837 -0.6534 -0.2265  0.3549  5.8273 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   0.6217     0.5692   1.092 0.276420    
    ## crr          -0.3723     0.1346  -2.765 0.006380 ** 
    ## gdr           0.8422     0.1169   7.206 2.32e-11 ***
    ## rotv          0.7607     0.2119   3.591 0.000442 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.278 on 156 degrees of freedom
    ## Multiple R-squared:  0.786,  Adjusted R-squared:  0.7819 
    ## F-statistic:   191 on 3 and 156 DF,  p-value: < 2.2e-16

Using this model and my 2018 results, my overall pace and finish times for Equinox are predicted to be 10:45 and 4:41:50. The 95% confidence intervals for these predictions are 10:30–11:01 and 4:35:11–4:48:28.
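
To show how the fitted coefficients turn race paces into a prediction, here is a small Python sketch. My 2018 paces are not listed in this post, so the example plugs in the first row of the data table instead. The function names and the 26.2-mile distance constant are assumptions of the sketch, and the formatting helpers only loosely mirror `pace()` and `finish_time()` from the R appendix.

```python
EQUINOX_MILES = 26.2  # assumption: standard marathon distance


def predict_em_pace(crr, gdr, rotv):
    """Apply the coefficients from the lm() summary above (minutes per mile)."""
    return 0.6217 - 0.3723 * crr + 0.8422 * gdr + 0.7607 * rotv


def format_pace(minutes):
    """Decimal minutes per mile -> 'M:SS', rounded to the nearest second."""
    total_seconds = round(minutes * 60)
    mm, ss = divmod(total_seconds, 60)
    return f"{mm}:{ss:02d}"


def format_finish(minutes):
    """Decimal minutes -> 'H:MM:SS', rounded to the nearest second."""
    total_seconds = round(minutes * 60)
    hh, remainder = divmod(total_seconds, 3600)
    mm, ss = divmod(remainder, 60)
    return f"{hh}:{mm:02d}:{ss:02d}"


# Paces from the first row of the data table (not my own 2018 results).
pace = predict_em_pace(crr=8.1559, gdr=10.2848, rotv=8.1833)
print(format_pace(pace), format_finish(pace * EQUINOX_MILES))
```

Rounding whole seconds before splitting into minutes and seconds avoids ever printing a value like `10:60`.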

## Random Forest

Random Forest is another regression method but it doesn’t require independent variables be independent of one another. Here are the results of building 5,000 random trees from the data:

    ## 
    ## Call:
    ##  randomForest(formula = em ~ ., data = input_pivot, ntree = 5000) 
    ##                Type of random forest: regression
    ##                      Number of trees: 5000
    ## No. of variables tried at each split: 1
    ## 
    ##           Mean of squared residuals: 1.87325
    ##                     % Var explained: 74.82
    ##      IncNodePurity
    ## crr       260.8279
    ## gdr       321.3691
    ## msr       268.0936
    ## rotv      295.4250

This model, which includes all race results, explains just under 75% of the variation in the data. You can see from the importance result that Gold Discovery results factor more heavily in the prediction than earlier races in the season like the Chena River Run and the Midnight Sun Run.

Using this model, my predicted pace is 10:13 and my finish time is 4:27:46. The 95% confidence intervals are 9:23–11:40 and 4:05:58–5:05:34. You’ll notice that the confidence intervals are wider than with linear regression, probably because there are fewer assumptions with Random Forest and less power.
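
The interval quoted above comes from the spread of the individual tree predictions: the R code in the appendix takes the 2.5% and 97.5% quantiles of `rfp_all$individual`. Below is a minimal Python sketch of that step. The tree predictions here are simulated stand-ins, not the real forest output, and `quantile` and `tree_interval` are hypothetical helper names. The interpolation rule is the default ("type 7") rule of R's `quantile()`.

```python
import random


def quantile(values, p):
    """Linear-interpolation quantile (the default 'type 7' rule in R)."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * p
    lower = int(h)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


def tree_interval(tree_predictions, level=0.95):
    """Central interval covering `level` of the per-tree predictions."""
    tail = (1 - level) / 2
    return quantile(tree_predictions, tail), quantile(tree_predictions, 1 - tail)


# Simulated stand-in for 5,000 per-tree pace predictions (minutes per mile).
random.seed(1)
tree_predictions = [random.gauss(10.2, 0.6) for _ in range(5000)]
low, high = tree_interval(tree_predictions)
print(low, high)
```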

# Conclusion

My number one goal for this year’s Equinox Marathon is simply to finish without injuring myself, something I wasn’t able to do the last time I ran the whole race in 2013. I finished in 4:49:28 with an overall pace of 11:02, but the race or my training for it resulted in a torn hip labrum.

If I’m able to finish uninjured, I’d like to beat my time from 2013. These results suggest I should have no problem achieving that second goal. Knowing how much faster these predictions are than my 2013 time, I can perhaps race conservatively and still set a personal best.

# Appendix - R code

```
library(tidyverse)
library(RPostgres)
library(lubridate)
library(glue)
library(randomForest)
library(knitr)

races <- dbConnect(Postgres(),
                   host = "localhost",
                   dbname = "races")

all_races <- races %>%
    tbl("all_races")

usibelli_races <- tibble(race = c("Chena River Run",
                                  "Midnight Sun Run",
                                  "Jim Loftus Mile",
                                  "Run of the Valkyries",
                                  "Gold Discovery Run",
                                  "Santa Claus Half Marathon",
                                  "Golden Heart Trail Run",
                                  "Equinox Marathon"))

css_2018 <- all_races %>%
    inner_join(usibelli_races, copy = TRUE) %>%
    filter(year == 2018,
           name == "Christopher Swingley") %>%
    collect()

candidate_races <- css_2018 %>%
    select(race) %>%
    bind_rows(tibble(race = c("Equinox Marathon")))

input_data <- all_races %>%
    inner_join(candidate_races, copy = TRUE) %>%
    filter(!is.na(gender), !is.na(birth_year)) %>%
    collect()

input_pivot <- input_data %>%
    group_by(race, name, year) %>%
    mutate(n = n()) %>%
    filter(n == 1) %>%
    ungroup() %>%
    select(name, year, race, pace_min) %>%
    spread(race, pace_min) %>%
    rename(crr = `Chena River Run`,
           msr = `Midnight Sun Run`,
           rotv = `Run of the Valkyries`,
           gdr = `Gold Discovery Run`,
           em = `Equinox Marathon`) %>%
    filter(!is.na(crr), !is.na(msr), !is.na(rotv),
           !is.na(gdr), !is.na(em)) %>%
    select(-c(name, year))

kable(input_pivot %>% head)

css_2018_pivot <- css_2018 %>%
    select(name, year, race, pace_min) %>%
    spread(race, pace_min) %>%
    rename(crr = `Chena River Run`,
           msr = `Midnight Sun Run`,
           rotv = `Run of the Valkyries`,
           gdr = `Gold Discovery Run`) %>%
    select(-c(name, year))

pace <- function(minutes) {
    mm = floor(minutes)
    seconds = (minutes - mm) * 60
    glue('{mm}:{sprintf("%02.0f", seconds)}')
}

finish_time <- function(minutes) {
    hh = floor(minutes / 60.0)
    min = minutes - (hh * 60)
    mm = floor(min)
    seconds = (min - mm) * 60
    glue('{hh}:{sprintf("%02d", mm)}:{sprintf("%02.0f", seconds)}')
}

lm_model <- lm(em ~ crr + gdr + rotv,
               data = input_pivot)
summary(lm_model)

prediction <- predict(lm_model, css_2018_pivot,
                      interval = "confidence", level = 0.95)
prediction

rf <- randomForest(em ~ .,
                   data = input_pivot,
                   ntree = 5000)
rf
importance(rf)

rfp_all <- predict(rf, css_2018_pivot, predict.all = TRUE)
rfp_all$aggregate

rf_ci <- quantile(rfp_all$individual, c(0.025, 0.975))
rf_ci
```

# Introduction

A couple years ago I compared racing data between two races (Gold Discovery and Equinox, Santa Claus and Equinox) in the same season for all runners that ran in both events. The result was an estimate of how fast I might run the Equinox Marathon based on my times for Gold Discovery and the Santa Claus Half Marathon.

Several years have passed since then. I've run more races and collected more racing data for all the major Fairbanks races, so I wanted to run the same analysis for all combinations of races.

# Data

The data comes from a database I’ve built of race times for all competitors, mostly coming from the results available from Chronotrack, but including some race results from SportAlaska.

We started by loading the required R packages and reading in all the racing data, a small subset of which looks like this.

race | year | name | finish_time | birth_year | sex |
---|---|---|---|---|---|
Beat Beethoven | 2015 | thomas mcclelland | 00:21:49 | 1995 | M |
Equinox Marathon | 2015 | jennifer paniati | 06:24:14 | 1989 | F |
Equinox Marathon | 2014 | kris starkey | 06:35:55 | 1972 | F |
Midnight Sun Run | 2014 | kathy toohey | 01:10:42 | 1960 | F |
Midnight Sun Run | 2016 | steven rast | 01:59:41 | 1960 | M |
Equinox Marathon | 2013 | elizabeth smith | 09:18:53 | 1987 | F |
... | ... | ... | ... | ... | ... |

Next we loaded in the names and distances of the races and combined this with the individual racing data. The data from Chronotrack doesn’t include the mileage and we will need that to calculate pace (minutes per mile).

My database doesn’t have complete information about all the racers that competed, and in some cases the information for a runner in one race conflicts with the information for the same runner in a different race. In order to resolve this, we generated a list of runners, grouped by their name, and threw out racers where their name matches but their gender was reported differently from one race to the next. Please understand we’re not doing this to exclude those who have changed their gender identity along the way, but to eliminate possible bias from data entry mistakes.
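
As a rough illustration of this cleanup, the R code in the appendix counts how often each name was reported as `M` and as `F`, keeps the more common value, and drops the name entirely when the counts are tied. A minimal Python sketch of that rule follows. The `resolve_gender` helper and the sample names are hypothetical.

```python
from collections import Counter, defaultdict


def resolve_gender(results):
    """results: iterable of (name, sex) pairs, one per race result.

    Returns {name: sex} using the majority value; names whose 'M' and 'F'
    counts are tied are dropped.
    """
    counts = defaultdict(Counter)
    for name, sex in results:
        if sex in ("M", "F"):
            counts[name][sex] += 1
    resolved = {}
    for name, tally in counts.items():
        if tally["M"] != tally["F"]:
            resolved[name] = "M" if tally["M"] > tally["F"] else "F"
    return resolved


# Invented sample data: one clear majority, one tie, one single result.
sample = [
    ("pat doe", "M"), ("pat doe", "M"), ("pat doe", "F"),
    ("sam roe", "M"), ("sam roe", "F"),
    ("lee poe", "F"),
]
print(resolve_gender(sample))
```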

Finally, we combined the racers with the individual racing data, substituting
our corrected runner information for what appeared in the individual race’s
data. We also calculated minutes per mile (`pace`) and the age of the runner
during the year of the race (`age`). Because we’re assigning a birth year to
the minimum reported year from all races, our age variable won’t change during
the running season, which is closer to the way age categories are calculated in
Europe. Finally, we removed results where pace was greater than 20 minutes per
mile for races longer than ten miles, and greater than 16 minute miles for races
less than ten miles. These are likely to be outliers, or competitors not
running the race.

name | birth_year | gender | race_str | year | miles | minutes | pace | age |
---|---|---|---|---|---|---|---|---|
aaron austin | 1983 | M | midnight_sun_run | 2014 | 6.2 | 50.60 | 8.16 | 31 |
aaron bravo | 1999 | M | midnight_sun_run | 2013 | 6.2 | 45.26 | 7.30 | 14 |
aaron bravo | 1999 | M | midnight_sun_run | 2014 | 6.2 | 40.08 | 6.46 | 15 |
aaron bravo | 1999 | M | midnight_sun_run | 2015 | 6.2 | 36.65 | 5.91 | 16 |
aaron bravo | 1999 | M | midnight_sun_run | 2016 | 6.2 | 36.31 | 5.85 | 17 |
aaron bravo | 1999 | M | spruce_tree_classic | 2014 | 6.0 | 42.17 | 7.03 | 15 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
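
The derived columns and the outlier filter are simple arithmetic, sketched here in Python using the first row of the table above. The dictionary layout and the helper names `add_pace_and_age` and `is_plausible` are assumptions of this sketch, and the slow marathon row is invented to show the filter at work.

```python
def add_pace_and_age(row):
    """Add pace (minutes per mile) and age during the race year."""
    return {
        **row,
        "pace": row["minutes"] / row["miles"],
        "age": row["year"] - row["birth_year"],
    }


def is_plausible(row):
    """Keep paces under 20 min/mile for races of ten miles or more,
    and under 16 min/mile for shorter races."""
    limit = 20 if row["miles"] >= 10 else 16
    return row["pace"] < limit


results = [
    # First row of the table above.
    {"name": "aaron austin", "birth_year": 1983, "year": 2014,
     "miles": 6.2, "minutes": 50.60},
    # Invented entry: a marathon finished at roughly 22.9 minutes per mile.
    {"name": "hypothetical walker", "birth_year": 1970, "year": 2014,
     "miles": 26.2, "minutes": 600.0},
]
cleaned = [r for r in map(add_pace_and_age, results) if is_plausible(r)]
print(cleaned)
```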

We combined all available results for each runner in all years they participated such that the resulting rows are grouped by runner and year and columns are the races themselves. The values in each cell represent the pace for the runner × year × race combination.

For example, here’s the first six rows for runners that completed Beat Beethoven and the Chena River Run in the years I have data. I also included the column for the Midnight Sun Run in the table, but the actual data has a column for all the major Fairbanks races. You’ll see that two of the six runners listed ran BB and CRR but didn’t run MSR in that year.

name | gender | age | year | beat_beethoven | chena_river_run | midnight_sun_run |
---|---|---|---|---|---|---|
aaron schooley | M | 36 | 2016 | 8.19 | 8.15 | 8.88 |
abby fett | F | 33 | 2014 | 10.68 | 10.34 | 11.59 |
abby fett | F | 35 | 2016 | 11.97 | 12.58 | NA |
abigail haas | F | 11 | 2015 | 9.34 | 8.29 | NA |
abigail haas | F | 12 | 2016 | 8.48 | 7.90 | 11.40 |
aimee hughes | F | 43 | 2015 | 11.32 | 9.50 | 10.69 |
... | ... | ... | ... | ... | ... | ... |

With this data, we built a whole series of linear models, one for each race
combination. We created a series of formula strings and objects for all the
combinations, then executed them using `map()`. We combined the start and
predicted race names with the linear models, and used `glance()` and
`tidy()` from the `broom` package to turn the models into statistics and
coefficients.
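
To make the "one model per race combination" step concrete, here is a short Python sketch that generates the same kind of formula strings as the nested `lapply()` calls in the R appendix: each earlier race in the list predicts each later one, with gender and age as extra terms. The race list is copied from the appendix; `race_formulas` is a hypothetical helper name.

```python
from itertools import combinations

MAIN_RACES = [
    "beat_beethoven", "chena_river_run", "midnight_sun_run",
    "run_of_the_valkyries", "gold_discovery_run",
    "santa_claus_half_marathon", "golden_heart_trail_run",
    "equinox_marathon", "hoodoo_half_marathon",
]


def race_formulas(races=MAIN_RACES):
    """One 'later ~ earlier + gender + age' formula per ordered race pair."""
    return [
        f"{later} ~ {earlier} + gender + age"
        for earlier, later in combinations(races, 2)
    ]


formulas = race_formulas()
print(len(formulas))
print(formulas[0])
```

With nine races there are 36 pairs, so 36 full models are fitted before any terms are dropped.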

All of the models between races were highly significant, but many of them contain coefficients that aren’t significantly different from zero. That means that including that term (age, gender or first race pace) isn’t adding anything useful to the model. We used the significance of each term to reduce our models so they only contained coefficients that were significant, then regenerated the statistics and coefficients for these reduced models.

The full R code appears at the bottom of this post.

# Results

Here are the statistics from the ten best performing models (based on *R²*).

start_race | predicted_race | n | R² | p-value |
---|---|---|---|---|
run_of_the_valkyries | golden_heart_trail_run | 40 | 0.956 | 0 |
golden_heart_trail_run | equinox_marathon | 36 | 0.908 | 0 |
santa_claus_half_marathon | golden_heart_trail_run | 34 | 0.896 | 0 |
midnight_sun_run | gold_discovery_run | 139 | 0.887 | 0 |
beat_beethoven | golden_heart_trail_run | 32 | 0.886 | 0 |
run_of_the_valkyries | gold_discovery_run | 44 | 0.877 | 0 |
midnight_sun_run | golden_heart_trail_run | 52 | 0.877 | 0 |
gold_discovery_run | santa_claus_half_marathon | 111 | 0.876 | 0 |
chena_river_run | golden_heart_trail_run | 44 | 0.873 | 0 |
run_of_the_valkyries | santa_claus_half_marathon | 91 | 0.851 | 0 |

It’s interesting how many times the Golden Heart Trail Run appears on this list, since that run is something of an outlier in the Usibelli running series because it’s the only race entirely on trails. Maybe it’s because its distance (5K) is comparable with a lot of the earlier races in the season, but because it’s on trails it matches well with the later races that are at least partially on trails like Gold Discovery or Equinox.

Here are the ten worst models.

start_race | predicted_race | n | R² | p-value |
---|---|---|---|---|
midnight_sun_run | equinox_marathon | 431 | 0.525 | 0 |
beat_beethoven | hoodoo_half_marathon | 87 | 0.533 | 0 |
beat_beethoven | midnight_sun_run | 818 | 0.570 | 0 |
chena_river_run | equinox_marathon | 196 | 0.572 | 0 |
equinox_marathon | hoodoo_half_marathon | 90 | 0.584 | 0 |
beat_beethoven | equinox_marathon | 265 | 0.585 | 0 |
gold_discovery_run | hoodoo_half_marathon | 41 | 0.599 | 0 |
beat_beethoven | santa_claus_half_marathon | 163 | 0.612 | 0 |
run_of_the_valkyries | equinox_marathon | 125 | 0.642 | 0 |
midnight_sun_run | hoodoo_half_marathon | 118 | 0.657 | 0 |

Most of these models are shorter races like Beat Beethoven or the Chena River Run predicting longer races like Equinox or one of the half marathons. Even so, each model explains more than half the variation in the data, which isn’t terrible.

# Application

Now that we have all our models and their coefficients, we can use them to predict future performance. I’ve written an online calculator based on the reduced models that lets you predict your race results as you go through the running season. The calculator is here: Fairbanks Running Race Converter.

For example, I ran a 7:41 pace for Run of the Valkyries this year. Entering
that, plus my age and gender into the converter predicts an 8:57 pace for the
first running of the HooDoo Half Marathon. The *R²* for this model was a
respectable 0.71 even though only 23 runners ran both races this year (including
me). My actual pace for HooDoo was 8:18, so I came in quite a bit faster than
this. No wonder my knee and hip hurt after the race! Using my time from the
Golden Heart Trail Run, the converter predicts a HooDoo Half pace of 8:16.2,
less than a minute off my 1:48:11 finish.

# Appendix: R code

```
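library(tidyverse)
library(lubridate)
library(broom)
library(viridis)  # scale_color_viridis(), used in make_scatterplot()
library(scales)   # pretty_breaks(), used in make_scatterplot()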

races_db <- src_postgres(host="localhost", dbname="races")

combined_races <- tbl(races_db, build_sql(
    "SELECT race, year, lower(name) AS name, finish_time,
        year - age AS birth_year, sex
     FROM chronotrack
     UNION
     SELECT race, year, lower(name) AS name, finish_time,
        birth_year,
        CASE WHEN age_class ~ 'M' THEN 'M' ELSE 'F' END AS sex
     FROM sportalaska
     UNION
     SELECT race, year, lower(name) AS name, finish_time,
        NULL AS birth_year, NULL AS sex
     FROM other"))

races <- tbl(races_db, build_sql(
    "SELECT race,
        lower(regexp_replace(race, '[ ’]', '_', 'g')) AS race_str,
        date_part('year', date) AS year,
        miles
     FROM races"))

racing_data <- combined_races %>%
    inner_join(races) %>%
    filter(!is.na(finish_time))

racers <- racing_data %>%
    group_by(name) %>%
    summarize(races=n(),
              birth_year=min(birth_year),
              gender_filter=ifelse(sum(ifelse(sex=='M',1,0))==
                                   sum(ifelse(sex=='F',1,0)),
                                   FALSE, TRUE),
              gender=ifelse(sum(ifelse(sex=='M',1,0))>
                            sum(ifelse(sex=='F',1,0)),
                            'M', 'F')) %>%
    ungroup() %>%
    filter(gender_filter) %>%
    select(-gender_filter)

racing_data_filled <- racing_data %>%
    inner_join(racers, by="name") %>%
    mutate(birth_year=birth_year.y) %>%
    select(name, birth_year, gender, race_str, year, miles, finish_time) %>%
    group_by(name, race_str, year) %>%
    mutate(n=n()) %>%
    filter(!is.na(birth_year), n==1) %>%
    ungroup() %>%
    collect() %>%
    mutate(fixed=ifelse(grepl('[0-9]+:[0-9]+:[0-9.]+', finish_time),
                        finish_time,
                        paste0('00:', finish_time)),
           minutes=as.numeric(seconds(hms(fixed)))/60.0,
           pace=minutes/miles,
           age=year-birth_year,
           age_class=as.integer(age/10)*10,
           group=paste0(gender, age_class),
           gender=as.factor(gender)) %>%
    filter((miles<10 & pace<16) | (miles>=10 & pace<20)) %>%
    select(-fixed, -finish_time, -n)

speeds_combined <- racing_data_filled %>%
    select(name, gender, age, age_class, group, race_str, year, pace) %>%
    spread(race_str, pace)

main_races <- c('beat_beethoven', 'chena_river_run', 'midnight_sun_run',
                'run_of_the_valkyries', 'gold_discovery_run',
                'santa_claus_half_marathon', 'golden_heart_trail_run',
                'equinox_marathon', 'hoodoo_half_marathon')

race_formula_str <-
    lapply(seq(1, length(main_races)-1),
           function(i)
               lapply(seq(i+1, length(main_races)),
                      function(j) paste(main_races[[j]], '~',
                                        main_races[[i]],
                                        '+ gender', '+ age'))) %>%
    unlist()

race_formulas <- lapply(race_formula_str, function(i) as.formula(i)) %>%
    unlist()

lm_models <- map(race_formulas, ~ lm(.x, data=speeds_combined))

models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*',
                                        '\\1',
                                        race_formula_str),
                                   levels=main_races),
                 predicted_race=factor(gsub('([^ ]+).*',
                                            '\\1',
                                            race_formula_str),
                                       levels=main_races),
                 lm_models=lm_models) %>%
    arrange(start_race, predicted_race)

model_stats <- glance(models %>% rowwise(), lm_models)
model_coefficients <- tidy(models %>% rowwise(), lm_models)

reduced_formula_str <- model_coefficients %>%
    ungroup() %>%
    filter(p.value<0.05, term!='(Intercept)') %>%
    mutate(term=gsub('genderM', 'gender', term)) %>%
    group_by(predicted_race, start_race) %>%
    summarize(independent_vars=paste(term, collapse=" + ")) %>%
    ungroup() %>%
    transmute(reduced_formulas=paste(predicted_race, independent_vars, sep=' ~ '))

reduced_formula_str <- reduced_formula_str$reduced_formulas

reduced_race_formulas <- lapply(reduced_formula_str,
                                function(i) as.formula(i)) %>% unlist()

reduced_lm_models <- map(reduced_race_formulas, ~ lm(.x, data=speeds_combined))

n_from_lm <- function(model) {
    summary_object <- summary(model)
    summary_object$df[1] + summary_object$df[2]
}

reduced_models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*', '\\1', reduced_formula_str),
                                           levels=main_races),
                         predicted_race=factor(gsub('([^ ]+).*', '\\1', reduced_formula_str),
                                               levels=main_races),
                         lm_models=reduced_lm_models) %>%
    arrange(start_race, predicted_race) %>%
    rowwise() %>%
    mutate(n=n_from_lm(lm_models))

reduced_model_stats <- glance(reduced_models %>% rowwise(), lm_models)
reduced_model_coefficients <- tidy(reduced_models %>% rowwise(), lm_models) %>%
    ungroup()

coefficients_and_stats <- reduced_model_stats %>%
    inner_join(reduced_model_coefficients,
               by=c("start_race", "predicted_race", "n")) %>%
    select(start_race, predicted_race, n, r.squared, term, estimate)

write_csv(coefficients_and_stats,
          "coefficients.csv")

make_scatterplot <- function(start_race, predicted_race) {
    age_limits <- speeds_combined %>%
        filter_(paste("!is.na(", start_race, ")"),
                paste("!is.na(", predicted_race, ")")) %>%
        summarize(min=min(age), max=max(age)) %>%
        unlist()

    q <- ggplot(data=speeds_combined,
                aes_string(x=start_race, y=predicted_race)) +
        # plasma works better with a grey background
        # theme_bw() +
        geom_abline(slope=1, color="darkred", alpha=0.5) +
        geom_smooth(method="lm", se=FALSE) +
        geom_point(aes(shape=gender, color=age)) +
        scale_color_viridis(option="plasma",
                            limits=age_limits) +
        scale_x_continuous(breaks=pretty_breaks(n=10)) +
        scale_y_continuous(breaks=pretty_breaks(n=6))

    svg_filename <- paste0(paste(start_race, predicted_race, sep="-"), ".svg")
    height <- 9
    width <- 16
    resize <- 0.75
    svg(svg_filename, height=height*resize, width=width*resize)
    print(q)
    dev.off()
}

lapply(seq(1, length(main_races)-1),
       function(i)
           lapply(seq(i+1, length(main_races)),
                  function(j)
                      make_scatterplot(main_races[[i]], main_races[[j]])
           )
)
```

My last blog post compared the time for the men who ran both the 2012 Gold Discovery Run and the Equinox Marathon in order to give me an idea of what sort of Equinox finish time I can expect. Here, I’ll do the same thing for the 2012 Santa Claus Half Marathon.

Yesterday I ran the half marathon, finishing in 1:53:08, which is an average pace of 8.63 minutes per mile (8:38). I’m recovering from a mild calf strain, so I ran the race very conservatively until I felt like I could trust my legs.

I converted the SportAlaska PDF files the same way as before, and read the data in from the CSV files. Looking at the data, there are a few outliers in this comparison as well. In addition to being outside of most of the points, they are also times that aren’t close to my expected pace, so are less relevant for predicting my own Equinox finish. Here’s the code to remove them, and perform the linear regression:

```
combined <- combined[!(combined$sc_pace > 11.0 | combined$eq_pace > 14.5),]
model <- lm(eq_pace ~ sc_pace, data=combined)
summary(model)
Call:
lm(formula = eq_pace ~ sc_pace, data = combined)
Residuals:
Min 1Q Median 3Q Max
-1.08263 -0.39018 0.02476 0.30194 1.27824
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.11209 0.61948 -1.795 0.0793 .
sc_pace 1.44310 0.07174 20.115 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5692 on 45 degrees of freedom
Multiple R-squared: 0.8999, Adjusted R-squared: 0.8977
F-statistic: 404.6 on 1 and 45 DF, p-value: < 2.2e-16
```

There were fewer male runners in 2012 that ran both Santa Claus and Equinox, but we get similar regression statistics. The model and coefficient are significant, and the variation in Santa Claus pace times explains just under 90% of the variation in Equinox times. That’s pretty good.

Here’s a plot of the results:

As before, the blue line shows the model relationship, and the grey area surrounding it shows the 95% confidence interval around that line. This interval represents the range over which 95% of the expected values should appear. The red line is the 1:1 line. As you’d expect for a race twice as long, all the Equinox pace times are significantly slower than for Santa Claus.

There were fewer similar runners in this data set:

Runner | DOB | Santa Claus | Equinox Time | Equinox Pace |
---|---|---|---|---|
John Scherzer | 1972 | 8:17 | 4:49 | 11:01 |
Greg Newby | 1965 | 8:30 | 5:03 | 11:33 |
Trent Hubbard | 1972 | 8:31 | 4:48 | 11:00 |

This analysis predicts that I should be able to finish Equinox in just under five hours, which is pretty close to what I found when using Gold Discovery times in my last post. The model predicts a pace of 11:20 and an Equinox finish time of four hours and 57 minutes, and these results are within the range of the three similar runners listed above. Since I was running conservatively in the half marathon, and will probably try to do the same for Equinox, five hours seems like a good goal to shoot for.
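
The arithmetic behind that prediction is short enough to write out. This Python sketch plugs my 8.63 minutes-per-mile Santa Claus pace into the fitted coefficients from the regression output above. The function name and the 26.2-mile marathon distance constant are assumptions of the sketch.

```python
EQUINOX_MILES = 26.2  # assumption: standard marathon distance


def predict_equinox_pace(sc_pace):
    """Equinox pace (min/mile) from Santa Claus pace, using the lm() estimates."""
    return -1.11209 + 1.44310 * sc_pace


sc_pace = 8.63  # my Santa Claus Half Marathon pace from this post
eq_pace = predict_equinox_pace(sc_pace)
finish_minutes = eq_pace * EQUINOX_MILES
hours, minutes = divmod(int(finish_minutes), 60)
print(round(eq_pace, 2), hours, minutes)
```

The predicted pace works out to about 11.34 minutes per mile, which multiplied over the marathon distance gives the four hours and 57 minutes quoted above.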

This spring I ran the Beat Beethoven 5K and had such a good time that I decided to give running another try. I’d tried adding running to my usual exercise routines in the past, but knee problems always sidelined me after a couple months. It’s been three months of slow increases in mileage using a marathon training plan by Hal Higdon, and so far so good.

My goal for this year, beyond staying healthy, is to participate in the 51st running of the Equinox Marathon here in Fairbanks.

One of the challenges for a beginning runner is how to pace yourself during a race and how to know what your body can handle. Since Beat Beethoven I've run in the Lulu’s 10K, the Midnight Sun Run (another 10K), and last weekend I ran the 16.5 mile Gold Discovery Run from Cleary Summit down to Silver Gulch Brewery. I completed the race in two hours and twenty-nine minutes, at a pace of 9:02 minutes per mile. Based on this performance, I should be able to estimate my finish time and pace for Equinox by comparing the times for runners that participated in the 2012 Gold Discovery and Equinox.

The first challenge is extracting the data from the PDF files SportAlaska publishes after the race. I found that opening the PDF result files, selecting all the text on each page, and pasting it into a text file is the best way to preserve the formatting of each line. Then I process it through a Python function that extracts the bits I want:

```
import re


def parse_sportalaska(line):
    """ lines appear to contain:

    place, bib, name, town (sometimes missing), state (sometimes missing),
    birth_year, age_class, class_place, finish_time, off_win, pace,
    points (often missing) """
    fields = line.split()
    place = int(fields.pop(0))
    bib = int(fields.pop(0))

    # Keep appending tokens to the name until one is all capitals
    # (plus '.' or '-').
    name = fields.pop(0)
    while True:
        n = fields.pop(0)
        name = '{} {}'.format(name, n)
        if re.search('^[A-Z.-]+$', n):
            break

    # Collect tokens until a four-digit birth year turns up.
    pre_birth_year = []
    pre_birth_year.append(fields.pop(0))
    while True:
        try:
            f = fields.pop(0)
        except IndexError:
            # Ran out of fields without finding a birth year.
            print("Warning: couldn't parse: '{0}'".format(line.strip()))
            break
        else:
            if re.search('^[0-9]{4}$', f):
                birth_year = int(f)
                break
            else:
                pre_birth_year.append(f)

    # A trailing two-letter token is the state; the rest is the town.
    if re.search('^[A-Z]{2}$', pre_birth_year[-1]):
        state = pre_birth_year[-1]
        town = ' '.join(pre_birth_year[:-1])
    else:
        state = None
        town = None

    try:
        (age_class, class_place, finish_time, off_win, pace) = fields[:5]
        class_place = int(class_place[1:-1])
        finish_minutes = time_to_min(finish_time)
        fpace = strpace_to_fpace(pace)
    except (ValueError, IndexError):
        print("Warning: couldn't parse: '{0}', skipping".format(
            line.strip()))
        return None
    else:
        return (place, bib, name, town, state, birth_year, age_class,
                class_place, finish_time, finish_minutes, off_win,
                pace, fpace)
```

The function uses a couple of helper functions that convert pace and time strings into floating point numbers, which are easier to analyze.

```
def strpace_to_fpace(p):
    """ Converts a MM:SS pace to a float (minutes) """
    (mm, ss) = p.split(':')
    (mm, ss) = [int(x) for x in (mm, ss)]
    fpace = mm + (float(ss) / 60.0)
    return fpace


def time_to_min(t):
    """ Converts an HH:MM:SS time to a float (minutes) """
    (hh, mm, ss) = t.split(':')
    (hh, mm) = [int(x) for x in (hh, mm)]
    ss = float(ss)
    minutes = (hh * 60) + mm + (ss / 60.0)
    return minutes
```

Once I process the Gold Discovery and Equinox result files through this routine, I dump the results in a properly formatted comma-delimited file, read the data into R and combine the two race results files by matching the runner’s name. Note that these results only include the men competing in the race.

```
gd <- read.csv('gd_2012_men.csv', header=TRUE)
gd <- gd[,c('name', 'birth_year', 'finish_minutes', 'fpace')]
eq <- read.csv('eq_2012_men.csv', header=TRUE)
eq <- eq[,c('name', 'birth_year', 'finish_minutes', 'fpace')]
combined <- merge(gd, eq, by='name')
names(combined) <- c('name', 'birth_year', 'gd_finish', 'gd_pace',
'year', 'eq_finish', 'eq_pace')
```

When I look at a plot of the data I can see four outliers: two where the runners ran Equinox much faster than expected based on their Gold Discovery pace, and two where the opposite was the case. The two races are two months apart, so I think it’s reasonable to exclude these four rows from the data since all manner of things could happen to a runner in two months of hard training (or on race day!).

```
attach(combined)
combined <- combined[!((gd_pace > 10 & gd_pace < 11 & eq_pace > 15)
| (gd_pace > 15)),]
```

Let’s test the hypothesis that we can predict Equinox pace from Gold Discovery Pace:

```
model <- lm(eq_pace ~ gd_pace, data=combined)
summary(model)
Call:
lm(formula = eq_pace ~ gd_pace, data = combined)
Residuals:
Min 1Q Median 3Q Max
-1.47121 -0.36833 -0.04207 0.51361 1.42971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.77392 0.52233 1.482 0.145
gd_pace 1.08880 0.05433 20.042 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6503 on 48 degrees of freedom
Multiple R-squared: 0.8933, Adjusted R-squared: 0.891
F-statistic: 401.7 on 1 and 48 DF, p-value: < 2.2e-16
```

Indeed, we can explain 89% of the variation in Equinox Marathon pace times using Gold Discovery pace times (R² = 0.8933), and both the model and the model coefficient are significant.

Here’s what the results look like:

The red line shows a relationship where the Gold Discovery pace is identical to the Equinox pace for each runner. Because the actual data (and the predicted results based on the regression model) are above this line, all the runners were slower in the longer (and harder) Equinox Marathon.

As for me, my 9:02 Gold Discovery pace should translate into an Equinox pace around 10:30. Here are the 2012 runners who were born within ten years of me, and who finished within ten minutes of my 2013 Gold Discovery time:

Runner | DOB | Gold Discovery | Equinox Time | Equinox Pace |
---|---|---|---|---|
Dan Bross | 1964 | 2:24 | 4:20 | 9:55 |
Chris Hartman | 1969 | 2:25 | 4:45 | 10:53 |
Mike Hayes | 1972 | 2:27 | 4:58 | 11:22 |
Ben Roth | 1968 | 2:28 | 4:47 | 10:57 |
Jim Brader | 1965 | 2:31 | 4:09 | 9:30 |
Erik Anderson | 1971 | 2:32 | 5:03 | 11:34 |
John Scherzer | 1972 | 2:33 | 4:49 | 11:01 |
Trent Hubbard | 1972 | 2:33 | 4:48 | 11:00 |
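
The "similar runners" filter behind this table can be sketched in a few lines of Python. Two of the rows are taken from the table, with Gold Discovery times converted to minutes; the third runner is invented to show a rejection. The birth year passed in is a placeholder, not my actual birth year, and `similar_runners` is a hypothetical helper. Whether the ten-year and ten-minute limits are inclusive is also an assumption.

```python
def similar_runners(runners, my_birth_year, my_gd_minutes,
                    max_years=10, max_minutes=10):
    """Runners born within max_years of me who finished Gold Discovery
    within max_minutes of my time (both limits treated as inclusive)."""
    return [
        r for r in runners
        if abs(r["birth_year"] - my_birth_year) <= max_years
        and abs(r["gd_minutes"] - my_gd_minutes) <= max_minutes
    ]


runners = [
    {"name": "Dan Bross", "birth_year": 1964, "gd_minutes": 144},   # 2:24
    {"name": "Jim Brader", "birth_year": 1965, "gd_minutes": 151},  # 2:31
    # Invented runner: close in age but far too fast to be comparable.
    {"name": "hypothetical speedster", "birth_year": 1970, "gd_minutes": 110},
]

# 149 minutes is my 2:29 finish; 1970 is a placeholder birth year.
matches = similar_runners(runners, my_birth_year=1970, my_gd_minutes=149)
print([r["name"] for r in matches])
```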

Based on this, and the regression results, I expect to finish the Equinox Marathon in just under five hours if my training over the next two months goes well.

It’s now December 1st and the last time we got new snow was on November 11th. In my last post I looked at the lengths of snow-free periods in the available weather data for Fairbanks, now at 20 days. That’s a long time, but what I’m interested in looking at today is whether the monthly pattern of snowfall in Fairbanks is changing.

The Alaska Dog Musher’s Association holds a series of weekly sprint races starting at the beginning of December. For the past several years—and this year—there hasn’t been enough snow to hold the earliest of the races because it takes a certain depth of snowpack to allow a snow hook to hold a team back should the driver need to stop. I’m curious to know if scheduling a bunch of races in December and early January is wishful thinking, or if we used to get a lot of snow earlier in the season than we do now. In other words, has the pattern of snowfall in Fairbanks changed?

One way to get at this is to look at the earliest date in the “winter year” (which I’m defining as starting on September 1st, since we do sometimes get significant snowfall in September) when 12 inches of snow has fallen. Here’s what that relationship looks like:

And the results from a linear regression:

```
Call:
lm(formula = winter_doy ~ winter_year, data = first_foot)

Residuals:
    Min      1Q  Median      3Q     Max
-60.676 -25.149  -0.596  20.984  77.152

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -498.5005   462.7571  -1.077    0.286
winter_year    0.3067     0.2336   1.313    0.194

Residual standard error: 33.81 on 60 degrees of freedom
Multiple R-squared: 0.02793, Adjusted R-squared: 0.01173
F-statistic: 1.724 on 1 and 60 DF, p-value: 0.1942
```
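
The `first_foot` data used in this regression isn’t shown in the post. As a rough sketch of how it could be derived, the code below assumes a daily data frame (here called `daily`, with hypothetical columns `dte` and `snow_in`) and fills it with made-up snowfall so the example runs on its own. It is an illustration of the winter year and day-of-winter-year definitions, not necessarily how the original data was built.

```r
# Hypothetical daily record: 0.25 inches of snow every day from October
# through March, none otherwise. Real station data would replace this.
daily <- data.frame(dte = seq(as.Date("2000-09-01"), as.Date("2003-08-31"),
                              by = "day"))
yr <- as.integer(format(daily$dte, "%Y"))
mo <- as.integer(format(daily$dte, "%m"))
daily$snow_in <- ifelse(mo %in% c(10, 11, 12, 1, 2, 3), 0.25, 0)

# The winter year starts on September 1st, so January-August belong to the
# winter year that began in the previous calendar year.
daily$winter_year <- ifelse(mo >= 9, yr, yr - 1L)

# Day of the winter year, counting September 1st as day 1
start <- as.Date(paste0(daily$winter_year, "-09-01"))
daily$winter_doy <- as.integer(daily$dte - start) + 1L

# Running snowfall total within each winter year
daily <- daily[order(daily$dte), ]
daily$cum_snow <- ave(daily$snow_in, daily$winter_year, FUN = cumsum)

# First day on which the running total reaches 12 inches
reached <- daily[daily$cum_snow >= 12, ]
first_foot <- aggregate(winter_doy ~ winter_year, data = reached, FUN = min)
first_foot
```

With real data, the resulting `first_foot` data frame has the two columns that the `lm(winter_doy ~ winter_year, data = first_foot)` call expects.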

According to these results the date of the first foot of snow is getting later
in the year (by about 0.3 days per year), but the trend isn’t significant
(p = 0.19), so we can’t say with any authority that the pattern we see isn’t
just random. Worse, this analysis could be confounded by what appears to be a
decline in the total *yearly* snowfall in Fairbanks:

This relationship (less snow every year) has even less statistical significance. If we combine the two analyses, however, some of the individual terms become significant:

```
Call:
lm(formula = winter_year ~ winter_doy * snow, data = yearly_data)

Residuals:
   Min     1Q Median     3Q    Max
-35.15 -11.78   0.49  14.15  32.13

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.947e+03  2.082e+01  93.520   <2e-16 ***
winter_doy       4.297e-01  1.869e-01   2.299   0.0251 *
snow             5.248e-01  2.877e-01   1.824   0.0733 .
winter_doy:snow -7.022e-03  3.184e-03  -2.206   0.0314 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.95 on 58 degrees of freedom
Multiple R-squared: 0.1078, Adjusted R-squared: 0.06163
F-statistic: 2.336 on 3 and 58 DF, p-value: 0.08317
```

Here we’re “predicting” winter year based on the yearly snowfall, the first date by which a foot of snow had fallen, and the interaction between the two. Despite the near-significance of the model (p = 0.083) and its parameters, it doesn’t do a very good job of explaining the data: with an R-squared of 0.108, almost 90% of the variation is unexplained by this model.

One problem with boiling the data down to one or two values for each year is that we’re reducing the amount of data being analyzed, lowering our power to detect a significant relationship between the pattern of snowfall and year. Here’s what the overall pattern for all years looks like:

And the individual plots for each year in the record:

Because “winter month” isn’t a continuous variable, we can’t use normal linear regression to evaluate the relationship between year and monthly snowfall. Instead we’ll use multinomial logistic regression to investigate the relationship between which month is the snowiest, and year:

```
library(nnet)
model <- multinom(data = snowiest_month, winter_month ~ winter_year)
summary(model)

Call:
multinom(formula = winter_month ~ winter_year, data = snowiest_month)

Coefficients:
  (Intercept)  winter_year
3    30.66572 -0.015149192
4    62.88013 -0.031771508
5    38.97096 -0.019623059
6    13.66039 -0.006941225
7   -68.88398  0.034023510
8   -79.64274  0.039217108

Std. Errors:
   (Intercept)  winter_year
3 9.992962e-08 0.0001979617
4 1.158940e-07 0.0002289479
5 1.120780e-07 0.0002218092
6 1.170249e-07 0.0002320081
7 1.668613e-07 0.0003326432
8 1.955969e-07 0.0003901701

Residual Deviance: 221.5413
AIC: 245.5413
```

I’m not exactly sure how to interpret the results, but typically you look at whether the intercepts and coefficients differ significantly from zero. Here the coefficients are many times larger than their standard errors, which implies that they are statistically significant.
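
One common way to make that comparison explicit is a Wald test: divide each coefficient by its standard error and look up the resulting z-score in a normal distribution. The sketch below shows the mechanics on made-up data, since the real `snowiest_month` data frame isn’t included here. The random months, the `demo_model` name and the decision to express year in decades since 1980 are all assumptions for illustration. Rescaling the year is a guess at what might help: standard errors as tiny as the intercept errors in the output can be a sign of numerical trouble when the predictor is a raw four-digit year.

```r
library(nnet)

# Hypothetical stand-in for snowiest_month: one snowiest month (2 = October
# ... 8 = April) per winter year. The first seven values guarantee that
# every month appears at least once.
set.seed(1)
demo <- data.frame(
    winter_year  = 1949:2012,
    winter_month = factor(c(2:8, sample(2:8, 57, replace = TRUE)))
)

# Assumption: rescale year to decades since 1980 to keep the fit and the
# Hessian-based standard errors numerically well behaved.
demo$decade <- (demo$winter_year - 1980) / 10

demo_model <- multinom(winter_month ~ decade, data = demo, trace = FALSE)
s <- summary(demo_model)

# Wald z-scores and two-sided p-values for every intercept and slope
z <- s$coefficients / s$standard.errors
p <- 2 * pnorm(-abs(z))
round(p, 3)
```

Because the months here are random, the p-values from this example mean nothing in themselves. The same two lines that compute `z` and `p` can be applied to `summary(model)` from the real fit.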

In order to examine what they have to say, we’ll calculate the probability curves for whether each month will wind up as the snowiest month, and plot the results by year.

```
library(reshape2)  # provides melt()
library(ggplot2)

# Predicted probability of each month being the snowiest, for every winter year
fit_snowiest <- data.frame(winter_year = 1949:2012)
probs <- cbind(fit_snowiest,
               predict(model, newdata = fit_snowiest, type = "probs"))
probs.melted <- melt(probs, id.vars = 'winter_year')
names(probs.melted) <- c('winter_year', 'winter_month', 'probability')

# Relabel the numeric winter months (2 = October ... 8 = April)
probs.melted$month <- factor(probs.melted$winter_month)
levels(probs.melted$month) <-
    list('oct' = 2, 'nov' = 3, 'dec' = 4, 'jan' = 5, 'feb' = 6, 'mar' = 7, 'apr' = 8)

q <- ggplot(data = probs.melted,
            aes(x = winter_year, y = probability, colour = month))
q + theme_bw() + geom_line(size = 1) +
    scale_y_continuous(name = "Model probability") +
    scale_x_continuous(name = 'Winter year', breaks = seq(1945, 2015, 5)) +
    ggtitle('Snowiest month probabilities by year from logistic regression model,\nFairbanks Airport station') +
    scale_colour_manual(values =
        c("violet", "blue", "cyan", "green", "#FFCC00", "orange", "red"))
```

The result:

Here’s how to interpret this graph. Each line shows how likely it is that a month will be the snowiest month of the winter. November’s line is the highest in every year, so it is always the month most likely to be the snowiest. The order of the lines for any year indicates the monthly order of snowiness (in 1950, November, December and January were predicted to be the snowiest months, in that order), and months with a negative slope (November, December, January) are becoming less likely to be the snowiest.
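
The ordering for a single year can also be checked by hand from the coefficient table. In a multinomial logistic model the first factor level (here month 2, October) is the reference category with a linear predictor of zero, and each month’s probability is its exponentiated linear predictor divided by the sum over all months. The sketch below copies the coefficients from the `multinom()` output above; the `coefs` data frame and the `month_probs()` helper are names made up for this example.

```r
# Coefficients copied from the multinom() output; October (level 2) is the
# reference category, so its linear predictor is fixed at zero.
coefs <- data.frame(
    month     = c("nov", "dec", "jan", "feb", "mar", "apr"),
    intercept = c(30.66572, 62.88013, 38.97096, 13.66039,
                  -68.88398, -79.64274),
    slope     = c(-0.015149192, -0.031771508, -0.019623059, -0.006941225,
                  0.034023510, 0.039217108)
)

# Probability of each month being the snowiest in a given winter year
month_probs <- function(year) {
    eta <- c(oct = 0,
             setNames(coefs$intercept + coefs$slope * year, coefs$month))
    exp(eta) / sum(exp(eta))
}

p1950 <- month_probs(1950)
names(sort(p1950, decreasing = TRUE))[1:3]
# "nov" "dec" "jan"
```

Calling `month_probs()` for other years traces out the same curves that `predict(model, type = "probs")` produces, up to rounding in the printed coefficients.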

November is the snowiest month for all years, but it’s declining, as is snow in December and January. October, February, March and April are increasing. From these results, it appears that we’re getting more snow at the very beginning (October) and at the end of the winter, and less in the middle of the winter.