Working With BLS Time Series Data


Michael Culshaw-Maurer

June 30, 2021

Load Packages

We’re going to load a handful of packages to pull in BLS data, manipulate data, and demo a little bit of forecasting.

The fpp3 package corresponds to the book Forecasting: Principles and Practice, which uses the tidyverts series of tidy packages for time series.

library(blscrapeR)
library(tidyverse)
library(slider)
library(fpp3)
library(seasonal)

Get Data

Next up we’ll fetch our data using the bls_api() function from the blscrapeR package. The BLS API only allows for up to 20 years of data in a single request, so we’ll break our range of years into two chunks. We can then request the two ranges of data and bind the rows together to make a single dataframe.

# my_key holds your personal BLS API registration key
codes <- c("LNS14000000", "CES0000000001", "CUSR0000SA0")

d1 <- bls_api(codes, registrationKey = my_key, 
              startyear = 2021 - 19, endyear = 2021)
## REQUEST_SUCCEEDED
d2 <- bls_api(codes, registrationKey = my_key, 
              startyear = 2021 - 39, endyear = 2021 - 20)
## REQUEST_SUCCEEDED
d <- bind_rows(d1, d2)

Not sure it’s even necessary to stitch together API calls like this if you work for the BLS, but what do I know?

This approach works fine for a short range of data, but if you have to get a bigger range that requires more requests being stitched together, it can get inefficient. I’ve written a short function that will get any range of data for any number of codes. We won’t use it for this lesson, but you can check it out if you want:

Function to get N years of data
get_last_n_years <-
  function(n = 50,
           codes = c("LNS14000000", "CES0000000001", "CUSR0000SA0"),
           registration_key) {
    
    this_year <- Sys.Date() %>% year()
    
    # Build the full sequence of years, then split it into chunks of
    # at most 20 years (the BLS API limit per request)
    d <- seq(from = this_year - n, to = this_year)
    dsplit <- split(d, ceiling(seq_along(d) / 20))
    
    # One row per chunk, holding that chunk's start and end year
    d <- tibble(min = map_dbl(dsplit, min),
                max = map_dbl(dsplit, max))
    
    # Make one API request per chunk, unnest the results into a single
    # dataframe, and drop any duplicate rows
    d %>%
      rowwise() %>%
      mutate(data = list(
        bls_api(
          codes,
          registrationKey = registration_key,
          startyear = min,
          endyear = max
        )
      )) %>%
      unnest(data) %>%
      select(-min, -max) %>%
      distinct() %>% 
      arrange(year, period)
  }

# Example usage (not run in this lesson):
# d <- get_last_n_years(n = 50, registration_key = my_key)

Modifying our dataframe

Let’s take a look at our dataframe:

d
## # A tibble: 1,419 x 7
##     year period periodName latest value footnotes seriesID   
##    <dbl> <chr>  <chr>      <chr>  <dbl> <chr>     <chr>      
##  1  2021 M05    May        true     5.8 ""        LNS14000000
##  2  2021 M04    April      <NA>     6.1 ""        LNS14000000
##  3  2021 M03    March      <NA>     6   ""        LNS14000000
##  4  2021 M02    February   <NA>     6.2 ""        LNS14000000
##  5  2021 M01    January    <NA>     6.3 ""        LNS14000000
##  6  2020 M12    December   <NA>     6.7 ""        LNS14000000
##  7  2020 M11    November   <NA>     6.7 ""        LNS14000000
##  8  2020 M10    October    <NA>     6.9 ""        LNS14000000
##  9  2020 M09    September  <NA>     7.8 ""        LNS14000000
## 10  2020 M08    August     <NA>     8.4 ""        LNS14000000
## # … with 1,409 more rows

Maybe BLS folks can easily remember what the different codes refer to, but it might still be nice to give them slightly more informative names. We’ll make a new column and use the case_when() function to assign informative names based on the seriesID column. The case_when() function is very handy: you write a conditional statement that uses existing columns, then a tilde, then the value the new column should have when that conditional statement is TRUE.

The left-hand side of a case_when() statement can use multiple columns, like periodName == "May" & year > 2000. Any conditional statement that returns TRUE or FALSE will work, as in the quick sketch below.
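
Here’s a throwaway sketch (purely illustrative, not part of our workflow) of a case_when() that combines two columns in a single condition:

# Illustrative only: combine two columns in one case_when() condition
d %>% 
  mutate(recent_may = case_when(
    periodName == "May" & year > 2000 ~ "yes",
    TRUE ~ "no"
  ))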

d <- d %>% 
  mutate(metric = case_when(
    seriesID == "LNS14000000" ~ "unemp",
    seriesID == "CES0000000001" ~ "total_emp",
    seriesID == "CUSR0000SA0" ~ "cpi_u"
  ))

d
## # A tibble: 1,419 x 8
##     year period periodName latest value footnotes seriesID    metric
##    <dbl> <chr>  <chr>      <chr>  <dbl> <chr>     <chr>       <chr> 
##  1  2021 M05    May        true     5.8 ""        LNS14000000 unemp 
##  2  2021 M04    April      <NA>     6.1 ""        LNS14000000 unemp 
##  3  2021 M03    March      <NA>     6   ""        LNS14000000 unemp 
##  4  2021 M02    February   <NA>     6.2 ""        LNS14000000 unemp 
##  5  2021 M01    January    <NA>     6.3 ""        LNS14000000 unemp 
##  6  2020 M12    December   <NA>     6.7 ""        LNS14000000 unemp 
##  7  2020 M11    November   <NA>     6.7 ""        LNS14000000 unemp 
##  8  2020 M10    October    <NA>     6.9 ""        LNS14000000 unemp 
##  9  2020 M09    September  <NA>     7.8 ""        LNS14000000 unemp 
## 10  2020 M08    August     <NA>     8.4 ""        LNS14000000 unemp 
## # … with 1,409 more rows

Another issue is that we have separate columns for years and months, but we’d like to stitch those together into a date column that R can work with.

Intro to lubridate

You can download the official lubridate cheat sheet for more information.

lubridate is a package in the tidyverse that makes handling dates and times slightly less painful. There is a series of functions that allow you to parse components of dates and times into single date or datetime columns. They are named according to the order of components: ymd() will take dates ordered year-month-day, and mdy_hms() will take datetimes ordered month-day-year-hour-minute-second. We’ll use the my() function to create a date column from our periodName and year columns.
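
Here’s a quick illustration of how these parsers behave (the input strings are made up purely for demonstration):

# Made-up strings, just to show the parsers in action
lubridate::ymd("2021-06-30")               # a date: 2021-06-30
lubridate::mdy_hms("06/30/2021 14:30:00")  # a datetime: 2021-06-30 14:30:00 UTC
lubridate::my("May/2021")                  # a date: 2021-05-01 (day defaults to the 1st)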

d <- d %>% 
  mutate(date = lubridate::my(paste(periodName, year, sep = "/")))

d
## # A tibble: 1,419 x 9
##     year period periodName latest value footnotes seriesID    metric date      
##    <dbl> <chr>  <chr>      <chr>  <dbl> <chr>     <chr>       <chr>  <date>    
##  1  2021 M05    May        true     5.8 ""        LNS14000000 unemp  2021-05-01
##  2  2021 M04    April      <NA>     6.1 ""        LNS14000000 unemp  2021-04-01
##  3  2021 M03    March      <NA>     6   ""        LNS14000000 unemp  2021-03-01
##  4  2021 M02    February   <NA>     6.2 ""        LNS14000000 unemp  2021-02-01
##  5  2021 M01    January    <NA>     6.3 ""        LNS14000000 unemp  2021-01-01
##  6  2020 M12    December   <NA>     6.7 ""        LNS14000000 unemp  2020-12-01
##  7  2020 M11    November   <NA>     6.7 ""        LNS14000000 unemp  2020-11-01
##  8  2020 M10    October    <NA>     6.9 ""        LNS14000000 unemp  2020-10-01
##  9  2020 M09    September  <NA>     7.8 ""        LNS14000000 unemp  2020-09-01
## 10  2020 M08    August     <NA>     8.4 ""        LNS14000000 unemp  2020-08-01
## # … with 1,409 more rows

Notice that all of our dates occur on the first day of the month. Date objects in R require year, month, and day, but since we haven’t specified a day, my() defaults to the first of the month.

The way Excel handles dates is particularly weird.

It’s worth noting that dates and times are often represented differently on different operating systems or in different software. When storing dates in a file, such as a CSV, it’s almost always best to keep the components in separate columns. That way Excel or other programs can’t mess with them, and you can easily use a lubridate function to stitch the components together into a date when you need to.
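
As a minimal sketch of that workflow, assume a hypothetical CSV with separate year, month, and day columns (the file name and column names here are made up):

# Hypothetical file and columns, purely for illustration
raw <- read_csv("monthly_data.csv")

raw %>% 
  mutate(date = lubridate::make_date(year = year, month = month, day = day))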

Plotting time series

Let’s take a look at our time series, just to get a feel for what they look like. We’ll use ggplot to make our plots. We’ll put date on the x axis and the values on the y axis to make a time series plot. Since the 3 metrics we’re looking at have very different scales, it doesn’t make much sense to plot them all together. Instead, we’ll use facet_wrap() to create separate facets for each metric. We need to set scales = "free_y" so that each metric’s y axis is scaled independently. Finally, we’ll add theme_minimal() for a little cleaner appearance.

d %>% 
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  facet_wrap(vars(metric), ncol = 1, scales = "free_y") +
  theme_minimal()

Moving windows

You can check out the slider website for more information.

One thing we might want to look at is a moving-window or “rolling” average. We’ll use the slider package, which has a very flexible interface for rolling-window calculations and a syntax very similar to that of the tidyverse’s purrr package.

There have been a number of similar packages and functions such as zoo::rollapply(), tibbletime::rollify(), and tsibble::slide(), but slider is the latest and most flexible approach.

Let’s look at the rolling average of unemployment for the last 10 years. First we’ll filter() to get the unemployment values and years from 2011 on. Next we’ll select() only the date and value columns, renaming the value column to unemployment. Then we’ll use arrange() to sort our rows by date.

unemp_last10 <- d %>% 
  filter(metric == "unemp", year >= 2011) %>% 
  select(date, unemployment = value) %>% 
  arrange(date) 

unemp_last10
## # A tibble: 125 x 2
##    date       unemployment
##    <date>            <dbl>
##  1 2011-01-01          9.1
##  2 2011-02-01          9  
##  3 2011-03-01          9  
##  4 2011-04-01          9.1
##  5 2011-05-01          9  
##  6 2011-06-01          9.1
##  7 2011-07-01          9  
##  8 2011-08-01          9  
##  9 2011-09-01          9  
## 10 2011-10-01          8.8
## # … with 115 more rows

The slider package also has functions called slide_index() and slide_period() that allow for even more flexibility in the way you determine your window. They’re not quite necessary for these monthly time series.
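
Just to give a flavor, here’s a hedged sketch of slide_index_dbl(), where the window is defined by the date index rather than by row position; slider accepts lubridate periods like months(2) for .before and .after (months() here is lubridate’s period helper, attached via fpp3). For evenly spaced monthly data this should match the slide_dbl() result we compute next.

# Sketch only: a rolling mean whose window is defined by the date index
unemp_last10 %>% 
  mutate(five_month_avg_idx = slide_index_dbl(
    unemployment, date, mean,
    .before = months(2), .after = months(2)
  ))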

Now we can use a slide_ function to apply the mean() function to a window of unemployment values, with the window sliding along our dates. We’ll use slide_dbl() because we want to return a numeric (double) vector. To get a 5-month average, we set .before and .after to 2: two months before, the current month, and two months after make up each window.

unemp_last10 <- unemp_last10 %>% 
  mutate(five_month_avg = slide_dbl(unemployment, mean, 
                                    .before = 2, .after = 2))
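
One caveat: at the very start and end of the series the window is incomplete, and by default slide_dbl() just averages over however many months are available. If you’d rather get NA for those partial windows, slider’s .complete argument does that (a small sketch, not used in the rest of this post):

# Sketch only: return NA wherever a full 5-month window isn't available
unemp_last10 %>% 
  mutate(five_month_avg_full = slide_dbl(unemployment, mean, 
                                         .before = 2, .after = 2, 
                                         .complete = TRUE))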

Plotting moving window averages

Let’s plot it to take a look at the 5-month average time series.

unemp_last10 %>% 
  ggplot(aes(x = date, y = five_month_avg)) +
  geom_line() +
  theme_minimal()

It’s probably more useful to plot them together, so we can use pivot_longer() to bring the unemployment and five_month_avg values into a single column, and then color the lines according to each type.

unemp_last10 %>% 
  pivot_longer(cols = c(unemployment, five_month_avg)) %>% 
  ggplot(aes(x = date, y = value, color = name)) +
  geom_line() +
  scale_color_manual(values = c("red", "black")) +
  theme_minimal()

tidyverts demo

The tidyverts series of packages is designed to work with time series data in a “tidy” way, meaning it plays well with the broader tidyverse set of packages. The developers of these packages have a fantastic book on time series forecasting with tidyverts, called Forecasting: Principles and Practice. It goes into great depth and I recommend checking it out, but I’ll do a quick demo of how slick these packages are for working with time series.

Making a tsibble object

The yearmonth() function is a bit idiosyncratic to the tidyverts packages, and our ym column isn’t a “date” in the normal R sense, which is why we’re only using it now.

First, we have to convert our dataframe into a tsibble object, which is basically a dataframe that knows it’s a time series. To do that, we make a new column that holds just the “year-month” combination, using the tsibble function yearmonth(). Then we convert the dataframe to a tsibble with the ym column as the time series index.

unemp_last10 <- unemp_last10 %>% 
  mutate(ym = yearmonth(date)) %>% 
  as_tsibble(index = ym)

unemp_last10
## # A tsibble: 125 x 4 [1M]
##    date       unemployment five_month_avg       ym
##    <date>            <dbl>          <dbl>    <mth>
##  1 2011-01-01          9.1           9.03 2011 Jan
##  2 2011-02-01          9             9.05 2011 Feb
##  3 2011-03-01          9             9.04 2011 Mar
##  4 2011-04-01          9.1           9.04 2011 Apr
##  5 2011-05-01          9             9.04 2011 May
##  6 2011-06-01          9.1           9.04 2011 Jun
##  7 2011-07-01          9             9.02 2011 Jul
##  8 2011-08-01          9             8.98 2011 Aug
##  9 2011-09-01          9             8.88 2011 Sep
## 10 2011-10-01          8.8           8.78 2011 Oct
## # … with 115 more rows

X-13ARIMA-SEATS time series decomposition

The first thing we’ll do is a quick time series decomposition using the X-13ARIMA-SEATS methods.

The X-13ARIMA-SEATS methods rely on the seasonal package, which also installs the X-13 binary on your computer.
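
Since we already loaded seasonal up top, you can run its diagnostic directly if you want to confirm the binary is in place (as far as I know, checkX13() is the function for this):

checkX13()  # diagnostic check of the X-13 binary installation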

Next, we use several functions to create two decomposition models. We send our data into the model() function, where we define two models: x11 and seats. Both are created with the X_13ARIMA_SEATS() function, one using x11() and the other using seats().

x13_unemp_models <- unemp_last10 %>% 
  model(x11 = X_13ARIMA_SEATS(unemployment ~ x11()),
        seats = X_13ARIMA_SEATS(unemployment ~ seats()))

Next we take the object containing both models and pass it to the components() function, which extracts the model components for plotting. We then send that to autoplot(), which generates a premade ggplot, and add theme_minimal() and scale_colour_brewer() so it looks nice.

x13_unemp_models %>% 
  components() %>% 
  autoplot() +
  theme_minimal() +
  scale_colour_brewer(type = "qual")

That’s a pretty small amount of code to generate two fairly sophisticated time series decompositions!

Simple forecasting workflow

The Forecasting: Principles and Practice book demonstrates some far more sophisticated forecasting methods; I just chose a simple one to show the basic workflow.

Finally, we’ll do a very simple forecast of the data. We again use the model() function, but this time we use RW() to specify a random walk model, with drift() adding a drift term. This basically just draws a line from the starting point to the end point of our time series and projects it forward in time. We then send this to the forecast() function and ask it to forecast 3 years into the future, and pass the result to autoplot(), adding in the original unemp_last10 dataframe so we see the original time series as well as the forecast.

unemp_last10 %>% 
  model(RW(unemployment ~ drift())) %>% 
  forecast(h = "3 years") %>% 
  autoplot(unemp_last10) +
  theme_minimal()

Not too bad for only a few lines of code!