# Introduction

The latest forecast discussions for Northern Alaska have included warnings that we are likely to experience an extended period of below normal temperatures starting at the end of this week, and yesterday’s Deep Cold blog post discusses the similarity of model forecast patterns to patterns seen in the 1989 and 1999 extreme cold events.

Our dogs spend most of their time in the house when we’re home, but if both of us are at work they’re outside in the dog yard. They have insulated dog houses, but when it’s colder than −15° F, we put them into a heated dog barn. That means one of us has to come home in the middle of the day to let them out to go to the bathroom.

Since we’re past the Winter Solstice, and day length is now increasing, I was curious to see if that has an effect on daily temperature, hopeful that the frequency of days when we need to put the dogs in the barn is decreasing.

# Methods

We’ll use daily minimum and maximum temperature data from the Fairbanks International Airport station, keeping track of how many years the temperatures are below −15° F and dividing by the total to get a frequency. We live in a cold valley on Goldstream Creek, so our temperatures are typically several degrees colder than the Fairbanks Airport, and we often don’t warm up as much during the day as in other places, but minimum airport temperature is a reasonable proxy for the overall winter temperature at our house.

# Results

The following plot shows the frequency of minimum (the top of each line) and maximum (the bottom) temperature colder than −15° F at the airport over the period of record, 1904−2016. The curved blue line represents a best fit line through the minimum temperature frequency, and the vertical blue line is drawn at the date when the frequency is the highest.

The maximum frequency is January 12th, so we have a few more days before the likelihood of needing to put the dogs in the barn starts to decline. The plot also shows that we could still reach that threshold all the way into April.

For fun, here’s the same plot using −40° as the threshold:

The date when the frequency starts to decline is shifted slightly to January 15th, and you can see the frequencies are lower. In mid-January, we can expect minimum temperature to be colder than −15° F more than half the time, but temperatures colder than −40° are just under 15%. There’s also an interesting anomaly in mid to late December where the frequency of very cold temperatures appears to drop.

# Appendix: R code

```
library(tidyverse)
library(lubridate)
library(scales)
noaa <- src_postgres(host="localhost", dbname="noaa")
fairbanks <- tbl(noaa, build_sql("SELECT * FROM ghcnd_pivot
WHERE station_name='FAIRBANKS INTL AP'")) %>%
collect()
save(fairbanks, file="fairbanks_ghcnd.rdat")
for_plot <- fairbanks %>%
mutate(doy=yday(dte),
dte_str=format(dte, "%d %b"),
min_below=ifelse(tmin_c < -26.11,1,0),
max_below=ifelse(tmax_c < -26.11,1,0)) %>%
filter(dte_str!="29 Feb") %>%
mutate(doy=ifelse(leap_year(dte) & doy>60, doy-1, doy),
doy=(doy+31+28+31+30)%%365) %>%
group_by(doy, dte_str) %>%
mutate(n_min=sum(ifelse(!is.na(min_below), 1, 0)),
n_max=sum(ifelse(!is.na(max_below), 1, 0))) %>%
summarize(min_freq=sum(min_below, na.rm=TRUE)/max(n_min, na.rm=TRUE),
max_freq=sum(max_below, na.rm=TRUE)/max(n_max, na.rm=TRUE))
x_breaks <- for_plot %>%
filter(doy %in% seq(49, 224, 7))
stats <- tibble(doy=seq(49, 224),
pred=predict(loess(min_freq ~ doy,
for_plot %>%
filter(doy >= 49, doy <= 224))))
max_stats <- stats %>%
arrange(desc(pred)) %>% head(n=1)
p <- ggplot(data=for_plot,
aes(x=doy, ymin=min_freq, ymax=max_freq)) +
geom_linerange() +
geom_smooth(aes(y=min_freq), se=FALSE, size=0.5) +
geom_segment(aes(x=max_stats$doy, xend=max_stats$doy,
y=-Inf, yend=max_stats$pred),
colour="blue", size=0.5) +
scale_x_continuous(name=NULL,
limits=c(49, 224),
breaks=x_breaks$doy,
labels=x_breaks$dte_str) +
scale_y_continuous(name="Frequency of days colder than −15° F",
breaks=pretty_breaks(n=10)) +
theme_bw() +
theme(axis.text.x=element_text(angle=30, hjust=1))
# Minus 40
for_plot <- fairbanks %>%
mutate(doy=yday(dte),
dte_str=format(dte, "%d %b"),
min_below=ifelse(tmin_c < -40,1,0),
max_below=ifelse(tmax_c < -40,1,0)) %>%
filter(dte_str!="29 Feb") %>%
mutate(doy=ifelse(leap_year(dte) & doy>60, doy-1, doy),
doy=(doy+31+28+31+30)%%365) %>%
group_by(doy, dte_str) %>%
mutate(n_min=sum(ifelse(!is.na(min_below), 1, 0)),
n_max=sum(ifelse(!is.na(max_below), 1, 0))) %>%
summarize(min_freq=sum(min_below, na.rm=TRUE)/max(n_min, na.rm=TRUE),
max_freq=sum(max_below, na.rm=TRUE)/max(n_max, na.rm=TRUE))
x_breaks <- for_plot %>%
filter(doy %in% seq(63, 203, 7))
stats <- tibble(doy=seq(63, 203),
pred=predict(loess(min_freq ~ doy,
for_plot %>%
filter(doy >= 63, doy <= 203))))
max_stats <- stats %>%
arrange(desc(pred)) %>% head(n=1)
q <- ggplot(data=for_plot,
aes(x=doy, ymin=min_freq, ymax=max_freq)) +
geom_linerange() +
geom_smooth(aes(y=min_freq), se=FALSE, size=0.5) +
geom_segment(aes(x=max_stats$doy, xend=max_stats$doy,
y=-Inf, yend=max_stats$pred),
colour="blue", size=0.5) +
scale_x_continuous(name=NULL,
limits=c(63, 203),
breaks=x_breaks$doy,
labels=x_breaks$dte_str) +
scale_y_continuous(name="Frequency of days colder than −40°",
breaks=pretty_breaks(n=10)) +
theme_bw() +
theme(axis.text.x=element_text(angle=30, hjust=1))
```

# Introduction

So far this winter we’ve gotten only 4.1 inches of snow, well below the normal 19.7 inches, and there is only 2 inches of snow on the ground. At this point last year we had 8 inches and I’d been biking and skiing on the trail to work for two weeks. In his North Pacific Temperature Update blog post, Richard James mentions that winters like this one, with a combined strongly positive Pacific Decadal Oscillation phase and strongly negative North Pacific Mode phase tend to be a “distinctly dry” pattern for interior Alaska. I don’t pretend to understand these large scale climate patterns, but I thought it would be interesting to look at snowfall and snow depth in years with very little mid-November snow. In other years like this one do we eventually get enough snow that the trails fill in and we can fully participate in winter sports like skiing, dog mushing, and fat biking?

# Data

We will use daily data from the Global Historical Climate Data set for the Fairbanks International Airport station. Data prior to 1950 is excluded because of poor quality snowfall and snow depth data and because there’s a good chance that our climate has changed since then and patterns from that era aren’t a good model for the current climate in Alaska.

We will look at both snow depth and the cumulative winter snowfall.

# Results

The following tables show the ten years with the lowest cumulative snowfall and snow depth values from 1950 to the present on November 18th.

Year | Cumulative Snowfall (inches) |
---|---|

1953 | 1.5 |

2016 | 4.1 |

1954 | 4.3 |

2014 | 6.0 |

2006 | 6.4 |

1962 | 7.5 |

1998 | 7.8 |

1960 | 8.5 |

1995 | 8.8 |

1979 | 10.2 |

Year | Snow depth (inches) |
---|---|

1953 | 1 |

1954 | 1 |

1962 | 1 |

2016 | 2 |

2014 | 2 |

1998 | 3 |

1964 | 3 |

1976 | 3 |

1971 | 3 |

2006 | 4 |

2016 has the second-lowest cumulative snowfall behind 1953 and is tied for second with 2014 for snow depth with 1953, 1954 and 1962 all having only 1 inch of snow on November 18th.

It also seems like recent years appear in these tables more frequently than would be expected. Grouping by decade and averaging cumulative snowfall and snow depth yields the pattern in the chart below. The error bars (not shown) are fairly large, so the differences between decades aren’t likely to be statistically significant, but there is a pattern of lower snowfall amounts in recent decades.

Now let’s see what happened in those years with low snowfall and snow depth values in mid-November starting with cumulative snowfall. The following plot (and the subsequent snow depth plot) shows the data for the low-value years (and one very high snowfall year—1990), with each year’s data as a separate line. The smooth dark cyan line through the middle of each plot is the smoothed line through the values for all years; a sort of “average” snowfall and snow depth curve.

In all four mid-November low-snowfall years, the cumulative snowfall values remain below average throughout the winter, but snow did continue to fall as the season went on. Even the lowest winter year here, 2006–2007, still ended the winter with 15 inches of snow on the groud.

The following plot shows snow depth for the four years with the lowest snow depth on November 18th. The data is formatted the same as in the previous plot except we’ve jittered the values slightly to make the plot easier to read.

The pattern here is similar, but the snow depths get much closer to the average values. Snow depth for all four low snow years remain low throughout November, but start rising in December, dramatically in 1954 and 2014.

One of the highest snowfall years between 1950 and 2016 was 1990–1991 (shown on both plots). An impressive 32.8 inches of snow fell in eight days between December 21st and December 28th, accounting for the sharp increase in cumulative snowfall and snow depth shown on both plots. There are five years in the record where the cumulative total for the entire winter was lower than these eight days in 1990.

# Conclusion

Despite the lack of snow on the ground to this point in the year, the record shows that we are still likely to get enough snow to fill in the trails. We may need to wait until mid to late December, but it’s even possible we’ll eventually reach the long term average depth before spring.

# Appendix

Here’s the R code used to generate the statistics, tables and plots from this post:

```
library(tidyverse)
library(lubridate)
library(scales)
library(knitr)
noaa <- src_postgres(host="localhost", dbname="noaa")
snow <- tbl(noaa, build_sql(
"WITH wdoy_data AS (
SELECT dte, dte - interval '120 days' as wdte,
tmin_c, tmax_c, (tmin_c+tmax_c)/2.0 AS tavg_c,
prcp_mm, snow_mm, snwd_mm
FROM ghcnd_pivot
WHERE station_name = 'FAIRBANKS INTL AP'
AND dte > '1950-09-01')
SELECT dte, date_part('year', wdte) AS wyear, date_part('doy', wdte) AS wdoy,
to_char(dte, 'Mon DD') AS mmdd,
tmin_c, tmax_c, tavg_c, prcp_mm, snow_mm, snwd_mm
FROM wdoy_data")) %>%
mutate(wyear=as.integer(wyear),
wdoy=as.integer(wdoy),
snwd_mm=as.integer(snwd_mm)) %>%
select(dte, wyear, wdoy, mmdd,
tmin_c, tmax_c, tavg_c, prcp_mm, snow_mm, snwd_mm) %>% collect()
write_csv(snow, "pafa_data_with_wyear_post_1950.csv")
save(snow, file="pafa_data_with_wyear_post_1950.rdata")
cum_snow <- snow %>%
mutate(snow_na=ifelse(is.na(snow_mm),1,0),
snow_mm=ifelse(is.na(snow_mm),0,snow_mm)) %>%
group_by(wyear) %>%
mutate(snow_mm_cum=cumsum(snow_mm),
snow_na=cumsum(snow_na)) %>%
ungroup() %>%
mutate(snow_in_cum=round(snow_mm_cum/25.4, 1),
snwd_in=round(snwd_mm/25.4, 0))
nov_18_snow <- cum_snow %>%
filter(mmdd=='Nov 18') %>%
select(wyear, snow_in_cum, snwd_in) %>%
arrange(snow_in_cum)
decadal_avg <- nov_18_snow %>%
mutate(decade=as.integer(wyear/10)*10) %>%
group_by(decade) %>%
summarize(`Snow depth`=mean(snwd_in),
snwd_sd=sd(snwd_in),
`Cumulative Snowfall`=mean(snow_in_cum),
snow_cum_sd=sd(snow_in_cum))
decadal_averages <- ggplot(decadal_avg %>%
gather(variable, value, -decade) %>%
filter(variable %in% c("Cumulative Snowfall",
"Snow depth")),
aes(x=as.factor(decade), y=value, fill=variable)) +
theme_bw() +
geom_bar(stat="identity", position="dodge") +
scale_x_discrete(name="Decade", breaks=c(1950, 1960, 1970, 1980,
1990, 2000, 2010)) +
scale_y_continuous(name="Inches", breaks=pretty_breaks(n=10)) +
scale_fill_discrete(name="Measurement")
print(decadal_averages)
date_x_scale <- cum_snow %>%
filter(grepl(' (01|15)', mmdd), wyear=='1994') %>%
select(wdoy, mmdd)
cumulative_snowfall <-
ggplot(cum_snow %>% filter(wyear %in% c(1953, 1954, 2014, 2006, 1990),
wdoy>183,
wdoy<320),
aes(x=wdoy, y=snow_in_cum, colour=as.factor(wyear))) +
theme_bw() +
geom_smooth(data=cum_snow %>% filter(wdoy>183, wdoy<320),
aes(x=wdoy, y=snow_in_cum),
size=0.5, colour="darkcyan",
inherit.aes=FALSE,
se=FALSE) +
geom_line(position="jitter") +
scale_x_continuous(name="",
breaks=date_x_scale$wdoy,
labels=date_x_scale$mmdd) +
scale_y_continuous(name="Cumulative snowfall (in)",
breaks=pretty_breaks(n=10)) +
scale_color_discrete(name="Winter year")
print(cumulative_snowfall)
snow_depth <-
ggplot(cum_snow %>% filter(wyear %in% c(1953, 1954, 1962, 2014, 1990),
wdoy>183,
wdoy<320),
aes(x=wdoy, y=snwd_in, colour=as.factor(wyear))) +
theme_bw() +
geom_smooth(data=cum_snow %>% filter(wdoy>183, wdoy<320),
aes(x=wdoy, y=snwd_in),
size=0.5, colour="darkcyan",
inherit.aes=FALSE,
se=FALSE) +
geom_line(position="jitter") +
scale_x_continuous(name="",
breaks=date_x_scale$wdoy,
labels=date_x_scale$mmdd) +
scale_y_continuous(name="Snow Depth (in)",
breaks=pretty_breaks(n=10)) +
scale_color_discrete(name="Winter year")
print(snow_depth)
```

This morning’s weather forecast:

SUNNY. HIGHS IN THE UPPER 70S TO LOWER 80S. LIGHT WINDS.

May 13th seems very early in the year to hit 80 degrees in Fairbanks, so I
decided to check it out. What I’m doing here is selecting all the dates where
the temperature is above 80°F, then ranking those dates by year and date, and
extracting the “winner” for each year (where `rank` is 1).

```
WITH warm AS (
SELECT extract(year from dte) AS year, dte,
c_to_f(tmax_c) AS tmax_f
FROM ghcnd_pivot
WHERE station_name = 'FAIRBANKS INTL AP'
AND c_to_f(tmax_c) >= 80.0),
ranked AS (
SELECT year, dte, tmax_f,
row_number() OVER (PARTITION BY year
ORDER BY dte) AS rank
FROM warm)
SELECT dte,
extract(doy from dte) AS doy,
round(tmax_f, 1) as tmax_f
FROM ranked
WHERE rank = 1
ORDER BY doy;
```

And the results:

Date | Day of year | High temperature (°F) |
---|---|---|

1995-05-09 | 129 | 80.1 |

1975-05-11 | 131 | 80.1 |

1942-05-12 | 132 | 81.0 |

1915-05-14 | 134 | 80.1 |

1993-05-16 | 136 | 82.0 |

2002-05-20 | 140 | 80.1 |

2015-05-22 | 142 | 80.1 |

1963-05-22 | 142 | 84.0 |

1960-05-23 | 144 | 80.1 |

2009-05-24 | 144 | 80.1 |

… | … | … |

If we hit 80°F today, it’ll be the fourth earliest day of year to hit that temperature since records started being kept in 1904.

Update: We didn’t reach 80°F on the 13th, but got to 82°F on May 14th, tied with that date in 1915 for the fourth earliest 80 degree temperature.

# Introduction

One of the best sources of weather data in the United States comes from the National Weather Service's Cooperative Observer Network (COOP), which is available from NCDC. It's daily data, collected by volunteers at more than 10,000 locations. We participate in this program at our house (station id DW1454 / GHCND:USC00503368), collecting daily minimum and maximum temperature, liquid precipitation, snowfall and snow depth. We also collect river heights for Goldstream Creek as part of the Alaska Pacific River Forecast Center (station GSCA2). Traditionally, daily temperature measurements were collecting using a minimum maximum thermometer, which meant that the only way to calculate average daily temperature was by averaging the minimum and maximum temperature. Even though COOP observers typically have an electronic instrument that could calculate average daily temperature from continuous observations, the daily minimum and maximum data is still what is reported.

In an earlier post we looked at methods used to calculate average daily temperature, and if there are any biases present in the way the National Weather Service calculates this using the average of the minimum and maximum daily temperature. We looked at five years of data collected at my house every five minutes, comparing the average of these temperatures against the average of the daily minimum and maximum. Here, we will be repeating this analysis using data from the Climate Reference Network stations in the United States.

The US Climate Reference Network is a collection of 132 weather stations that are properly sited, maintained, and include multiple redundant measures of temperature and precipitation. Data is available from http://www1.ncdc.noaa.gov/pub/data/uscrn/products/ and includes monthly, daily, and hourly statistics, and sub-hourly (5-minute) observations. We’ll be focusing on the sub-hourly data, since it closely matches the data collected at my weather station.

A similar analysis using daily and hourly CRN data appears here.

# Getting the raw data

I downloaded all the data using the following Unix commands:

```
$ wget http://www1.ncdc.noaa.gov/pub/data/uscrn/products/stations.tsv
$ wget -np -m http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/
$ find www1.ncdc.noaa.gov/ -type f -name 'CRN*.txt' -exec gzip {} \;
```

The code to insert all of this data into a database can be found
here.
Once inserted, I have a table named `crn_stations` that has the
station data, and one named `crn_subhourly` with the five minute
observation data.

# Methods

Once again, we’ll use R to read the data, process it, and produce plots.

## Libraries

Load the libraries we need:

```
library(dplyr)
library(lubridate)
library(ggplot2)
library(scales)
library(grid)
```

Connect to the database and load the data tables.

```
noaa_db <- src_postgres(dbname="noaa", host="mason")
crn_stations <- tbl(noaa_db, "crn_stations") %>%
collect()
crn_subhourly <- tbl(noaa_db, "crn_subhourly")
```

Remove observations without temperature data, group by station and date, calculate average daily temperature using the two methods, remove any daily data without a full set of data, and collect the results into an R data frame. This looks very similar to the code used to analyze the data from my weather station.

```
crn_daily <-
crn_subhourly %>%
filter(!is.na(air_temperature)) %>%
mutate(date=date(timestamp)) %>%
group_by(wbanno, date) %>%
summarize(t_mean=mean(air_temperature),
t_minmax_avg=(min(air_temperature)+
max(air_temperature))/2.0,
n=n()) %>%
filter(n==24*12) %>%
mutate(anomaly=t_minmax_avg-t_mean) %>%
select(wbanno, date, t_mean, t_minmax_avg, anomaly) %>%
collect()
```

The two types of daily average temperatures are calculated in this step:

```
summarize(t_mean=mean(air_temperature),
t_minmax_avg=(min(air_temperature)+
max(air_temperature))/2.0)
```

Where `t_mean` is the value calculated from all 288 five minute
observations, and `t_minmax_avg` is the value from the daily minimum
and maximum.

Now we join the observation data with the station data. This attaches station information such as the name and latitude of the station to each record.

```
crn_daily_stations <-
crn_daily %>%
inner_join(crn_stations, by="wbanno") %>%
select(wbanno, date, state, location, latitude, longitude,
t_mean, t_minmax_avg, anomaly)
```

Finally, save the data so we don’t have to do these steps again.

```
save(crn_daily_stations, file="crn_daily_averages.rdata")
```

# Results

Here are the overall results of the analysis.

```
summary(crn_daily_stations$anomaly)
```

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -11.9000 -0.1028 0.4441 0.4641 1.0190 10.7900

The average anomaly across all stations and all dates is 0.44 degrees Celsius (0.79 degrees Farenheit). That’s a pretty significant error. Half the data is between −0.1 and 1.0°C (−0.23 and +1.8°F) and the full range is −11.9 to +10.8°C (−21.4 to +19.4°F).

# Plots

Let’s look at some plots.

## Raw data by latitude

To start, we’ll look at all the anomalies by station latitude. The plot only shows one percent of the actual anomalies because plotting 512,460 points would take a long time and the general pattern is clear from the reduced data set.

```
set.seed(43)
p <- ggplot(data=crn_daily_stations %>% sample_frac(0.01),
aes(x=latitude, y=anomaly)) +
geom_point(position="jitter", alpha="0.2") +
geom_smooth(method="lm", se=FALSE) +
theme_bw() +
scale_x_continuous(name="Station latitude", breaks=pretty_breaks(n=10)) +
scale_y_continuous(name="Temperature anomaly (degrees C)",
breaks=pretty_breaks(n=10))
print(p)
```

The clouds of points show the differences between the min/max daily average and the actual daily average temperature, where numbers above zero represent cases where the min/max calculation overestimates daily average temperature. The blue line is the fit of a linear model relating latitude with temperature anomaly. We can see that the anomaly is always positive, averaging around half a degree at lower latitudes and drops somewhat as we proceed northward. You also get a sense from the actual data of how variable the anomaly is, and at what latitudes most of the stations are found.

Here are the regression results:

```
summary(lm(anomaly ~ latitude, data=crn_daily_stations))
```

## ## Call: ## lm(formula = anomaly ~ latitude, data = crn_daily_stations) ## ## Residuals: ## Min 1Q Median 3Q Max ## -12.3738 -0.5625 -0.0199 0.5499 10.3485 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.7403021 0.0070381 105.19 <2e-16 *** ## latitude -0.0071276 0.0001783 -39.98 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9632 on 512458 degrees of freedom ## Multiple R-squared: 0.00311, Adjusted R-squared: 0.003108 ## F-statistic: 1599 on 1 and 512458 DF, p-value: < 2.2e-16

The overall model and coefficients are highly significant, and show a slight decrease in the positive anomaly as we move farther north. Perhaps this is part of the reason why the analysis of my station (at a latitude of 64.89) showed an average anomaly close to zero (−0.07°C / −0.13°F).

## Anomalies by month and latitude

One of the results of our earlier analysis was a seasonal pattern in the anomalies at our station. Since we also know there is a latitudinal pattern, in the data, let’s combine the two, plotting anomaly by month, and faceting by latitude.

Station latitude are binned into groups for plotting, and the plots themselves show the range that cover half of all anomalies for that latitude category × month. Including the full range of anomalies in each group tends to obscure the overall pattern, and the plot of the raw data didn’t show an obvious skew to the rarer anomalies.

Here’s how we set up the data frames for the plot.

```
crn_daily_by_month <-
crn_daily_stations %>%
mutate(month=month(date),
lat_bin=factor(ifelse(latitude<30, '<30',
ifelse(latitude>60, '>60',
paste(floor(latitude/10)*10,
(floor(latitude/10)+1)*10,
sep='-'))),
levels=c('<30', '30-40', '40-50',
'50-60', '>60')))
summary_stats <- function(l) {
s <- summary(l)
data.frame(min=s['Min.'],
first=s['1st Qu.'],
median=s['Median'],
mean=s['Mean'],
third=s['3rd Qu.'],
max=s['Max.'])
}
crn_by_month_lat_bin <-
crn_daily_by_month %>%
group_by(month, lat_bin) %>%
do(summary_stats(.$anomaly)) %>%
ungroup()
station_years <-
crn_daily_by_month %>%
mutate(year=year(date)) %>%
group_by(wbanno, lat_bin) %>%
summarize() %>%
group_by(lat_bin) %>%
summarize(station_years=n())
```

And the plot itself. At the end, we’re using a function called
`facet_adjust`, which adds x-axis tick labels to the facet on the
right that wouldn't ordinarily have them. The code comes from this
stack overflow
post.

```
p <- ggplot(data=crn_by_month_lat_bin,
aes(x=month, ymin=first, ymax=third, y=mean)) +
geom_hline(yintercept=0, alpha=0.2) +
geom_hline(data=crn_by_month_lat_bin %>%
group_by(lat_bin) %>%
summarize(mean=mean(mean)),
aes(yintercept=mean), colour="darkorange", alpha=0.5) +
geom_pointrange() +
facet_wrap(~ lat_bin, ncol=3) +
geom_text(data=station_years, size=4,
aes(x=2.25, y=-0.5, ymin=0, ymax=0,
label=paste('n =', station_years))) +
scale_y_continuous(name="Range including 50% of temperature anomalies") +
scale_x_discrete(breaks=1:12,
labels=c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')) +
theme_bw() +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1.25),
axis.title.x=element_blank())
facet_adjust(p)
```

Each plot shows the range of anomalies from the first to the third quartile (50% of the observed anomalies) by month, with the dot near the middle of the line at the mean anomaly. The orange horizontal line shows the overall mean anomaly for that latitude category, and the count at the bottom of the plot indicates the number of “station years” for that latitude category.

It’s clear that there are seasonal patterns in the differences between the mean daily temperature and the min/max estimate. But each plot looks so different from the next that it’s not clear if the patterns we are seeing in each latitude category are real or artificial. It is also problematic that three of our latitude categories have very little data compared with the other two. It may be worth performing this analysis in a few years when the lower and higher latitude stations have a bit more data.

# Conclusion

This analysis shows that there is a clear bias in using the average of minimum and maximum daily temperature to estimate average daily temperature. Across all of the CRN stations, the min/max estimator overestimates daily average temperature by almost a half a degree Celsius (0.8°F).

We also found that this error is larger at lower latitudes, and that there are seasonal patterns to the anomalies, although the seasonal patterns don’t seem to have clear transitions moving from lower to higher latitudes.

The current length of the CRN record is quite short, especially for the sub-hourly data used here, so the patterns may not be representative of the true situation.

# Abstract

The following is a document-style version of a presentation I gave at work a
couple weeks ago. It's a little less useful for a general audience because you
don't have access to the same database I have, but I figured it might be useful
for someone who is looking at using `dplyr` or in manipulating the GHCND data
from NCDC.

# Introduction

Today we’re going to briefly take a look at the GHCND climate database
and a couple new R packages (`dplyr` and `tidyr`) that make data
import and manipulation a lot easier than using the standard library.

For further reading, consult the vignettes for `dplyr` and `tidyr`,
and download the cheat sheet:

# GHCND database

The GHCND
database
contains daily observation data from locations around the world. The
README linked above describes the data set and the way the data is
formatted. I have written scripts that process the station data and the
yearly download files and insert it into a PostgreSQL database
(`noaa`).

The script for inserting a yearly file (downloaded from http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/) is here: ghcn-daily-by_year_process.py

# “Tidy” data

Without going into too much detail on the subject (read Hadley Wickham’s paper) for more information, but the basic idea is that it is much easier to analyze data when it is in a particular, “tidy”, form. A Tidy dataset has a single table for each type of real world object or type of data, and each table has one column per variable measured and one row per observation.

For example, here’s a tidy table representing daily weather observations with station × date as rows and the various variables as columns.

Station | Date | tmin_c | tmax_c | prcp | snow | ... |
---|---|---|---|---|---|---|

PAFA | 2014-01-01 | 12 | 24 | 0.00 | 0.0 | ... |

PAFA | 2014-01-01 | 8 | 11 | 0.02 | 0.2 | ... |

... | ... | ... | ... | ... | ... | ... |

Getting raw data into this format is what we’ll look at today.

# R libraries & data import

First, let’s load the libraries we’ll need:

```
library(dplyr) # data import
library(tidyr) # column / row manipulation
library(knitr) # tabular export
library(ggplot2) # plotting
library(scales) # “pretty” scaling
library(lubridate) # date / time manipulations
```

`dplyr` and `tidyr` are the data import and manipulation libraries
we will use, `knitr` is used to produce tabular data in report-quality
forms, `ggplot2` and `scales` are plotting libraries, and
`lubridate` is a library that makes date and time manipulation easier.

Also note the warnings about how several R functions have been “masked”
when we imported `dplyr`. This just means we'll be getting the
`dplyr` versions instead of those we might be used to. In cases where
we need both, you can preface the function with it's package:
`base::filter` would us the normal `filter` function instead of the
one from `dplyr`.

Next, connect to the database and the three tables we will need:

```
noaa_db <- src_postgres(host="mason",
dbname="noaa")
ghcnd_obs <- tbl(noaa_db, "ghcnd_obs")
ghcnd_vars <- tbl(noaa_db, "ghcnd_variables")
```

The first statement connects us to the database and the next two create table links to the observation table and the variables table.

Here’s what those two tables look like:

```
glimpse(ghcnd_obs)
```

## Observations: 29404870 ## Variables: ## $ station_id (chr) "USW00027502", "USW00027502", "USW00027502", "USW0... ## $ dte (date) 2011-05-01, 2011-05-01, 2011-05-01, 2011-05-01, 2... ## $ variable (chr) "AWND", "FMTM", "PRCP", "SNOW", "SNWD", "TMAX", "T... ## $ raw_value (dbl) 32, 631, 0, 0, 229, -100, -156, 90, 90, 54, 67, 1,... ## $ meas_flag (chr) "", "", "T", "T", "", "", "", "", "", "", "", "", ... ## $ qual_flag (chr) "", "", "", "", "", "", "", "", "", "", "", "", ""... ## $ source_flag (chr) "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", ... ## $ time_of_obs (int) NA, NA, 0, NA, NA, 0, 0, NA, NA, NA, NA, NA, NA, N...

```
glimpse(ghcnd_vars)
```

## Observations: 82 ## Variables: ## $ variable (chr) "AWND", "EVAP", "MDEV", "MDPR", "MNPN", "MXPN",... ## $ description (chr) "Average daily wind speed (tenths of meters per... ## $ raw_multiplier (dbl) 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0....

Each row in the observation table rows contain the `station_id`, date,
a variable code, the raw value for that variable, and a series of flags
indicating data quality, source, and special measurements such as the
“trace” value used for precipitation under the minimum measurable value.

Each row in the variables table contains a variable code, description and the multiplier used to convert the raw value from the observation table into an actual value.

This is an example of completely “normalized” data, and it’s stored this way because not all weather stations record all possible variables, and rather than having a single row for each station × date with a whole bunch of empty columns for those variables not measured, each row contains the station × data × variable data.

We are also missing information about the stations, so let’s load that data:

```
fai_stations <-
tbl(noaa_db, "ghcnd_stations") %>%
filter(station_name %in% c("FAIRBANKS INTL AP",
"UNIVERSITY EXP STN",
"COLLEGE OBSY"))
glimpse(fai_stations)
```

## Observations: 3 ## Variables: ## $ station_id (chr) "USC00502107", "USW00026411", "USC00509641" ## $ station_name (chr) "COLLEGE OBSY", "FAIRBANKS INTL AP", "UNIVERSITY ... ## $ latitude (dbl) 64.86030, 64.80389, 64.85690 ## $ longitude (dbl) -147.8484, -147.8761, -147.8610 ## $ elevation (dbl) 181.9656, 131.6736, 144.7800 ## $ coverage (dbl) 0.96, 1.00, 0.98 ## $ start_date (date) 1948-05-16, 1904-09-04, 1904-09-01 ## $ end_date (date) 2015-04-03, 2015-04-02, 2015-03-13 ## $ variables (chr) "TMIN TOBS WT11 SNWD SNOW WT04 WT14 TMAX WT05 DAP... ## $ the_geom (chr) "0101000020E6100000A5BDC117267B62C0EC2FBB270F3750...

The first part is the same as before, loading the `ghcnd_stations`
table, but we are filtering that data down to just the Fairbanks area
stations with long term records. To do this, we use the pipe operator
`%>%` which takes the data from the left side and passes it to the
function on the right side, the `filter` function in this case.

`filter` requires one or more conditional statements with variable
names on the left side and the condition on the right. Multiple
conditions can be separated by commas if you want all the conditions to
be required (AND) or separated by a logic operator (`&` for AND, `|`
for OR). For example: `filter(latitude > 70, longitude < -140)`.

When used on database tables, `filter` can also use conditionals that
are built into the database which are passed directly as part of a
`WHERE` clause. In our code above, we’re using the `%in%` operator
here to select the stations from a list.

Now we have the `station_id`s we need to get just the data we want
from the observation table and combine it with the other tables.

# Combining data

Here’s how we do it:

```
fai_raw <-
ghcnd_obs %>%
inner_join(fai_stations, by="station_id") %>%
inner_join(ghcnd_vars, by="variable") %>%
mutate(value=raw_value*raw_multiplier) %>%
filter(qual_flag=='') %>%
select(station_name, dte, variable, value) %>%
collect()
glimpse(fai_raw)
```

In order, here’s what we’re doing:

- Assign the result to
`fai_raw` - Join the observation table with the filtered station data, using
`station_id`as the variable to combine against. Because this is an “inner” join, we only get results where`station_id`matches in both the observation and the filtered station data. At this point we only have observation data from our long-term Fairbanks stations. - Join the variable table with the Fairbanks area observation data,
using
`variable`to link the tables. - Add a new variable called
`value`which is calculated by multiplying`raw_value`(coming from the observation table) by`raw_multiplier`(coming from the variable table). - Remove rows where the quality flag is not an empty space.
- Select only the station name, date, variable and actual value columns from the data. Before we did this, each row would contain every column from all three tables, and most of that information is not necessary.
- Finally, we “collect” the results.
`dplyr`doesn’t actually perform the full SQL until it absolutely has to. Instead it’s retrieving a small subset so that we can test our operations quickly. When we are happy with the results, we use`collect()`to grab the full data.

# De-normalize it

The data is still in a format that makes it difficult to analyze, with each row in the result containing a single station × date × variable observation. A tidy version of this data requires each variable be a column in the table, each row being a single date at each station.

To “pivot” the data, we use the `spread` function, and we'll also
calculate a new variable and reduce the number of columns in the result.

```
fai_pivot <-
fai_raw %>%
spread(variable, value) %>%
mutate(TAVG=(TMIN+TMAX)/2.0) %>%
select(station_name, dte, TAVG, TMIN, TMAX, TOBS, PRCP, SNOW, SNWD,
WSF1, WDF1, WSF2, WDF2, WSF5, WDF5, WSFG, WDFG, TSUN)
head(fai_pivot)
```

## Source: local data frame [6 x 18] ## ## station_name dte TAVG TMIN TMAX TOBS PRCP SNOW SNWD WSF1 WDF1 ## 1 COLLEGE OBSY 1948-05-16 11.70 5.6 17.8 16.1 NA NA NA NA NA ## 2 COLLEGE OBSY 1948-05-17 15.55 12.2 18.9 17.8 NA NA NA NA NA ## 3 COLLEGE OBSY 1948-05-18 14.40 9.4 19.4 16.1 NA NA NA NA NA ## 4 COLLEGE OBSY 1948-05-19 14.15 9.4 18.9 12.2 NA NA NA NA NA ## 5 COLLEGE OBSY 1948-05-20 10.25 6.1 14.4 14.4 NA NA NA NA NA ## 6 COLLEGE OBSY 1948-05-21 9.75 1.7 17.8 17.8 NA NA NA NA NA ## Variables not shown: WSF2 (dbl), WDF2 (dbl), WSF5 (dbl), WDF5 (dbl), WSFG ## (dbl), WDFG (dbl), TSUN (dbl)

`spread` takes two parameters, the variable we want to spread across
the columns, and the variable we want to use as the data value for each
row × column intersection.

# Examples

Now that we've got the data in a format we can work with, let's look at a few examples.

## Find the coldest temperatures by winter year

First, let’s find the coldest winter temperatures from each station, by winter year. “Winter year” is just a way of grouping winters into a single value. Instead of the 2014–2015 winter, it’s the 2014 winter year. We get this by subtracting 92 days (the days in January, February, March) from the date, then pulling off the year.

Here’s the code.

```
fai_winter_year_minimum <-
fai_pivot %>%
mutate(winter_year=year(dte - days(92))) %>%
filter(winter_year < 2014) %>%
group_by(station_name, winter_year) %>%
select(station_name, winter_year, TMIN) %>%
summarize(tmin=min(TMIN*9/5+32, na.rm=TRUE), n=n()) %>%
filter(n>350) %>%
select(station_name, winter_year, tmin) %>%
spread(station_name, tmin)
last_twenty <-
fai_winter_year_minimum %>%
filter(winter_year > 1993)
last_twenty
```

## Source: local data frame [20 x 4] ## ## winter_year COLLEGE OBSY FAIRBANKS INTL AP UNIVERSITY EXP STN ## 1 1994 -43.96 -47.92 -47.92 ## 2 1995 -45.04 -45.04 -47.92 ## 3 1996 -50.98 -50.98 -54.04 ## 4 1997 -43.96 -47.92 -47.92 ## 5 1998 -52.06 -54.94 -54.04 ## 6 1999 -50.08 -52.96 -50.98 ## 7 2000 -27.94 -36.04 -27.04 ## 8 2001 -40.00 -43.06 -36.04 ## 9 2002 -34.96 -38.92 -34.06 ## 10 2003 -45.94 -45.94 NA ## 11 2004 NA -47.02 -49.00 ## 12 2005 -47.92 -50.98 -49.00 ## 13 2006 NA -43.96 -41.98 ## 14 2007 -38.92 -47.92 -45.94 ## 15 2008 -47.02 -47.02 -49.00 ## 16 2009 -32.98 -41.08 -41.08 ## 17 2010 -36.94 -43.96 -38.02 ## 18 2011 -47.92 -50.98 -52.06 ## 19 2012 -43.96 -47.92 -45.04 ## 20 2013 -36.94 -40.90 NA

See if you can follow the code above. The pipe operator makes is easy to see each operation performed along the way.

There are a couple new functions here, `group_by` and `summarize`.
`group_by` indicates at what level we want to group the data, and
`summarize` uses those groupings to perform summary calculations using
aggregate functions. We group by station and winter year, then we use
the minimum and n functions to get the minimum temperature and number of
days in each year where temperature data was available. You can see we
are using `n` to remove winter years where more than two weeks of data
are missing.

Also notice that we’re using `spread` again in order to make a single
column for each station containing the minimum temperature data.

Here’s how we can write out the table data as a restructuredText document, which can be converted into many document formats (PDF, ODF, HTML, etc.):

```
sink("last_twenty.rst")
print(kable(last_twenty, format="rst"))
sink()
```

winter_year | COLLEGE OBSY | FAIRBANKS INTL AP | UNIVERSITY EXP STN |
---|---|---|---|

1994 | -43.96 | -47.92 | -47.92 |

1995 | -45.04 | -45.04 | -47.92 |

1996 | -50.98 | -50.98 | -54.04 |

1997 | -43.96 | -47.92 | -47.92 |

1998 | -52.06 | -54.94 | -54.04 |

1999 | -50.08 | -52.96 | -50.98 |

2000 | -27.94 | -36.04 | -27.04 |

2001 | -40.00 | -43.06 | -36.04 |

2002 | -34.96 | -38.92 | -34.06 |

2003 | -45.94 | -45.94 | NA |

2004 | NA | -47.02 | -49.00 |

2005 | -47.92 | -50.98 | -49.00 |

2006 | NA | -43.96 | -41.98 |

2007 | -38.92 | -47.92 | -45.94 |

2008 | -47.02 | -47.02 | -49.00 |

2009 | -32.98 | -41.08 | -41.08 |

2010 | -36.94 | -43.96 | -38.02 |

2011 | -47.92 | -50.98 | -52.06 |

2012 | -43.96 | -47.92 | -45.04 |

2013 | -36.94 | -40.90 | NA |

## Plotting

Finally, let’s plot the minimum temperatures for all three stations.

```
q <-
fai_winter_year_minimum %>%
gather(station_name, tmin, -winter_year) %>%
arrange(winter_year) %>%
ggplot(aes(x=winter_year, y=tmin, colour=station_name)) +
geom_point(size=1.5, position=position_jitter(w=0.5,h=0.0)) +
geom_smooth(method="lm", se=FALSE) +
scale_x_continuous(name="Winter Year", breaks=pretty_breaks(n=20)) +
scale_y_continuous(name="Minimum temperature (degrees F)", breaks=pretty_breaks(n=10)) +
scale_color_manual(name="Station",
labels=c("College Observatory",
"Fairbanks Airport",
"University Exp. Station"),
values=c("darkorange", "blue", "darkcyan")) +
theme_bw() +
# theme(legend.position = c(0.150, 0.850)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
print(q)
```

To plot the data, we need the data in a slightly different format with each row containing winter year, station name and the minimum temperature. We’re plotting minimum temperature against winter year, coloring the points and trendlines using the station name. That means all three of those variables need to be on the same row.

To do that we use `gather`. The first parameter is the name of
variable the columns will be moved into (the station names, which are
currently columns, will become values in a row named `station_name`).
The second is the name of the column that stores the observations
(`tmin`) and the parameters after that are the list of columns to
gather together. In our case, rather than specifying the names of the
columns, we're specifying the inverse: all the columns *except*
`winter_year`.

The result of the `gather` looks like this:

```
fai_winter_year_minimum %>%
gather(station_name, tmin, -winter_year)
```

## Source: local data frame [321 x 3] ## ## winter_year station_name tmin ## 1 1905 COLLEGE OBSY NA ## 2 1907 COLLEGE OBSY NA ## 3 1908 COLLEGE OBSY NA ## 4 1909 COLLEGE OBSY NA ## 5 1910 COLLEGE OBSY NA ## 6 1911 COLLEGE OBSY NA ## 7 1912 COLLEGE OBSY NA ## 8 1913 COLLEGE OBSY NA ## 9 1915 COLLEGE OBSY NA ## 10 1916 COLLEGE OBSY NA ## .. ... ... ...

### ggplot2

The plot is produced using `ggplot2`. A full introduction would be a
seminar by itself, but the basics of our plot can be summarized as
follows.

```
ggplot(aes(x=winter_year, y=tmin, colour=station_name)) +
```

`aes` defines variables and grouping.

```
geom_point(size=1.5, position=position_jitter(w=0.5,h=0.0)) +
geom_smooth(method="lm", se=FALSE) +
```

`geom_point` draws points, `geom_smooth` draws fitted lines.

```
scale_x_continuous(name="Winter Year", breaks=pretty_breaks(n=20)) +
scale_y_continuous(name="Minimum temperature (degrees F)",
breaks=pretty_breaks(n=10)) +
scale_color_manual(name="Station",
labels=c("College Observatory", "Fairbanks Airport",
"University Exp. Station"),
values=c("darkorange", "blue", "darkcyan")) +
```

Scale functions define how the data is scaled into a plot and controls labelling.

```
theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))
```

Theme functions controls the style.

For more information:

## Linear regression, winter year and minimum temperature

Finally let’s look at the significance of those regression lines:

```
summary(lm(data=fai_winter_year_minimum, `COLLEGE OBSY` ~ winter_year))
```

## ## Call: ## lm(formula = `COLLEGE OBSY` ~ winter_year, data = fai_winter_year_minimum) ## ## Residuals: ## Min 1Q Median 3Q Max ## -19.0748 -5.8204 0.1907 3.8042 17.1599 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -275.01062 105.20884 -2.614 0.0114 * ## winter_year 0.11635 0.05311 2.191 0.0325 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.599 on 58 degrees of freedom ## (47 observations deleted due to missingness) ## Multiple R-squared: 0.07643, Adjusted R-squared: 0.06051 ## F-statistic: 4.8 on 1 and 58 DF, p-value: 0.03249

```
summary(lm(data=fai_winter_year_minimum, `FAIRBANKS INTL AP` ~ winter_year))
```

## ## Call: ## lm(formula = `FAIRBANKS INTL AP` ~ winter_year, data = fai_winter_year_minimum) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.529 -4.605 -1.025 4.007 19.764 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -171.19553 43.55177 -3.931 0.000153 *** ## winter_year 0.06250 0.02221 2.813 0.005861 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.037 on 104 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.07073, Adjusted R-squared: 0.06179 ## F-statistic: 7.916 on 1 and 104 DF, p-value: 0.005861

```
summary(lm(data=fai_winter_year_minimum, `UNIVERSITY EXP STN` ~ winter_year))
```

## ## Call: ## lm(formula = `UNIVERSITY EXP STN` ~ winter_year, data = fai_winter_year_minimum) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.579 -5.818 -1.283 6.029 19.977 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -158.41837 51.03809 -3.104 0.00248 ** ## winter_year 0.05638 0.02605 2.164 0.03283 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 8.119 on 100 degrees of freedom ## (5 observations deleted due to missingness) ## Multiple R-squared: 0.04474, Adjusted R-squared: 0.03519 ## F-statistic: 4.684 on 1 and 100 DF, p-value: 0.03283

Essentially, all the models show a significant increase in minimum temperature over time, but none of them explain very much of the variation in minimum temperature.

# RMarkdown

This presentation was produced with the RMarkdown package. Allows you to mix text and R code, which is then run through R to produce documents in Word, PDF, HTML, and presentation formats.