We’re going to load a handful of packages to pull in BLS data, manipulate data, and demo a little bit of forecasting.
The fpp3
package corresponds to this book, which uses the tidyverts
series of tidy packages for time series.
library(blscrapeR)
library(tidyverse)
library(slider)
library(fpp3)
library(seasonal)
Next up we’ll fetch our data using the bls_api()
function from the blscrapeR
package. The BLS API only allows for up to 20 years of data in a single request, so we’ll break our range of years into two chunks. We can then request the two ranges of data and bind the rows together to make a single dataframe.
codes <- c("LNS14000000", "CES0000000001", "CUSR0000SA0")
d1 <- bls_api(codes, registrationKey = my_key,
startyear = 2021-19, endyear = 2021)
## REQUEST_SUCCEEDED
d2 <- bls_api(codes, registrationKey = my_key,
startyear = 2021-39, endyear = 2021-20)
## REQUEST_SUCCEEDED
d <- bind_rows(d1, d2)
Not sure it’s even necessary to stitch together API calls like this if you work for the BLS, but what do I know?
This approach works fine for a short range of data, but if you have to get a bigger range that requires more requests being stitched together, it can get inefficient. I’ve written a short function that will get any range of data for any number of codes. We won’t use it for this lesson, but you can check it out if you want:
get_last_n_years <-
function(n = 50,
codes = c("LNS14000000", "CES0000000001", "CUSR0000SA0"),
registration_key) {
this_year <- Sys.Date() %>% year()
d <- seq(from = this_year - n, to = this_year)
dsplit <- split(d, ceiling(seq_along(d) / 20))
d <- tibble(min = map_dbl(dsplit, min),
max = map_dbl(dsplit, max))
d %>%
rowwise() %>%
mutate(data = list(
bls_api(
codes,
registrationKey = registration_key,
startyear = min,
endyear = max
)
)) %>%
unnest(data) %>%
select(-min, -max) %>%
distinct() %>%
arrange(year, period)
}
d <- get_last_n_years(n = 50)
Let’s take a look at our dataframe:
d
## # A tibble: 1,419 x 7
## year period periodName latest value footnotes seriesID
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2021 M05 May true 5.8 "" LNS14000000
## 2 2021 M04 April <NA> 6.1 "" LNS14000000
## 3 2021 M03 March <NA> 6 "" LNS14000000
## 4 2021 M02 February <NA> 6.2 "" LNS14000000
## 5 2021 M01 January <NA> 6.3 "" LNS14000000
## 6 2020 M12 December <NA> 6.7 "" LNS14000000
## 7 2020 M11 November <NA> 6.7 "" LNS14000000
## 8 2020 M10 October <NA> 6.9 "" LNS14000000
## 9 2020 M09 September <NA> 7.8 "" LNS14000000
## 10 2020 M08 August <NA> 8.4 "" LNS14000000
## # … with 1,409 more rows
Maybe BLS folks can easily remember what the different codes refer to, but nonetheless, it might be nice to give them slighly more informative names. We’ll make a new column and use the case_when()
function to give informative names according to the code column. The case_when()
function is very handy- you write a conditional statement that uses existing columns, then a tilde, then the value the new column should have if that conditional statement is TRUE
.
The lefthand side of a case_when()
statement can use multiple columns, like period_name == “May” & year > 2000
. Any conditional statement that returns TRUE
or FALSE
will work.
d <- d %>%
mutate(metric = case_when(
seriesID == "LNS14000000" ~ "unemp",
seriesID == "CES0000000001" ~ "total_emp",
seriesID == "CUSR0000SA0" ~ "cpi_u",
))
d
## # A tibble: 1,419 x 8
## year period periodName latest value footnotes seriesID metric
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 2021 M05 May true 5.8 "" LNS14000000 unemp
## 2 2021 M04 April <NA> 6.1 "" LNS14000000 unemp
## 3 2021 M03 March <NA> 6 "" LNS14000000 unemp
## 4 2021 M02 February <NA> 6.2 "" LNS14000000 unemp
## 5 2021 M01 January <NA> 6.3 "" LNS14000000 unemp
## 6 2020 M12 December <NA> 6.7 "" LNS14000000 unemp
## 7 2020 M11 November <NA> 6.7 "" LNS14000000 unemp
## 8 2020 M10 October <NA> 6.9 "" LNS14000000 unemp
## 9 2020 M09 September <NA> 7.8 "" LNS14000000 unemp
## 10 2020 M08 August <NA> 8.4 "" LNS14000000 unemp
## # … with 1,409 more rows
Another issue is that we have separate columns for years and months, but we’d like to stitch those together into a date
column that R can work with.
lubridate
You can download the official lubridate
cheat sheet for more information.
lubridate
is a package in the tidyverse
that makes handling dates and times slightly less painful. There are a series of functions that allow you to parse components of dates and times into single date or datetime columns. They are named according to the order of components: ymd()
will take dates ordered year-month-day, and mdy_hs()
will take datetimes ordered month-day-year-hour-second. We’ll use the my()
function to create a date column from our periodName
and year
columns.
d <- d %>%
mutate(date = lubridate::my(paste(periodName, year, sep = "/")))
d
## # A tibble: 1,419 x 9
## year period periodName latest value footnotes seriesID metric date
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <date>
## 1 2021 M05 May true 5.8 "" LNS14000000 unemp 2021-05-01
## 2 2021 M04 April <NA> 6.1 "" LNS14000000 unemp 2021-04-01
## 3 2021 M03 March <NA> 6 "" LNS14000000 unemp 2021-03-01
## 4 2021 M02 February <NA> 6.2 "" LNS14000000 unemp 2021-02-01
## 5 2021 M01 January <NA> 6.3 "" LNS14000000 unemp 2021-01-01
## 6 2020 M12 December <NA> 6.7 "" LNS14000000 unemp 2020-12-01
## 7 2020 M11 November <NA> 6.7 "" LNS14000000 unemp 2020-11-01
## 8 2020 M10 October <NA> 6.9 "" LNS14000000 unemp 2020-10-01
## 9 2020 M09 September <NA> 7.8 "" LNS14000000 unemp 2020-09-01
## 10 2020 M08 August <NA> 8.4 "" LNS14000000 unemp 2020-08-01
## # … with 1,409 more rows
Notice that all of our dates occur on the first day of the month. Date objects in R require year, month, and day, but since we haven’t specified a day, my()
defaults to the first of the month.
The way Excel handles dates is particularly weird.
It’s worth noting that dates and times are often represented differently on different operating systems or in different software. When storing dates in a file, such as a CSV, it’s almost always best to keep the components in separate columns. That way Excel or other programs can’t mess with them, and you can easily use a lubridate
function to stitch the components together into a date when you need to.
Let’s take a look at our time series, just to get a feel for what they look like. We’ll use ggplot
to make our plots. We’ll put date
on the x axis and the value
s on the y axis to make a time series plot. Since the 3 metrics we’re looking at have very different scales, it doesn’t make much sense to plot them all together. Instead, we’ll use facet_wrap()
to create separate facets for each metric
. We need to set scales = "free_y"
so that each metric’s y axis is scaled independently. Finally, we’ll add theme_minimal()
for a little cleaner appearance.
d %>%
ggplot(aes(x = date, y = value)) +
geom_line() +
facet_wrap(vars(metric), ncol = 1, scales = "free_y") +
theme_minimal()
You can check out the slider
website for more information.
One thing we might want to look at is a moving-window or “rolling” average. We’ll use the slider
package, which has a very flexible interface to make calculations with rolling windows, and its syntax is very similar to the tidyverse
’s purrr
package.
There have been a number of similar packages and functions such as zoo::rollapply()
, tibbletime::rollify()
, and tsibble::slide()
, but slider
is the latest and most flexible approach.
Let’s look at the rolling average of unemployment for the last 10 years. First we’ll filter()
to get the unemployment values and years from 2011 on. Next we’ll select()
only the date and value columns, renaming the value
column to unemployment
. Then we’ll use arrange()
to sort our rows by date.
unemp_last10 <- d %>%
filter(metric == "unemp", year >= 2011) %>%
select(date, unemployment = value) %>%
arrange(date)
unemp_last10
## # A tibble: 125 x 2
## date unemployment
## <date> <dbl>
## 1 2011-01-01 9.1
## 2 2011-02-01 9
## 3 2011-03-01 9
## 4 2011-04-01 9.1
## 5 2011-05-01 9
## 6 2011-06-01 9.1
## 7 2011-07-01 9
## 8 2011-08-01 9
## 9 2011-09-01 9
## 10 2011-10-01 8.8
## # … with 115 more rows
The slider
package also has functions called slide_index()
and slide_period()
that allow for even more flexibility in the way you determine your window. They’re not quite necessary for these monthly time series.
Now we can use a slide_
function to apply the mean()
function to a window of unemployment values, with the window sliding along our dates. We’ll use slide_dbl()
because we want to return a numeric, or double, vector. To get a 5-month average, we set the .before
and .after
values to 2.
unemp_last10 <- unemp_last10 %>%
mutate(five_month_avg = slide_dbl(unemployment, mean,
.before = 2, .after = 2))
Let’s plot it to take a look at the 5-month average time series.
unemp_last10 %>%
ggplot(aes(x = date, y = five_month_avg)) +
geom_line() +
theme_minimal()