Working With BLS Time Series Data

Back To Blog Page

Michael Culshaw-Maurer

June 30, 2021

Load Packages

We’re going to load a handful of packages to pull in BLS data, manipulate data, and demo a little bit of forecasting.

The fpp3 package corresponds to this book, which uses the tidyverts series of tidy packages for time series.

library(blscrapeR)
library(tidyverse)
library(slider)
library(fpp3)
library(seasonal)

Get Data

Next up we’ll fetch our data using the bls_api() function from the blscrapeR package. The BLS API only allows for up to 20 years of data in a single request, so we’ll break our range of years into two chunks. We can then request the two ranges of data and bind the rows together to make a single dataframe.

codes <- c("LNS14000000", "CES0000000001", "CUSR0000SA0")

d1 <- bls_api(codes, registrationKey = my_key, 
             startyear = 2021-19, endyear = 2021)
## REQUEST_SUCCEEDED
d2 <- bls_api(codes, registrationKey = my_key, 
                    startyear = 2021-39, endyear = 2021-20)
## REQUEST_SUCCEEDED
d <- bind_rows(d1, d2)

Not sure it’s even necessary to stitch together API calls like this if you work for the BLS, but what do I know?

This approach works fine for a short range of data, but if you have to get a bigger range that requires more requests being stitched together, it can get inefficient. I’ve written a short function that will get any range of data for any number of codes. We won’t use it for this lesson, but you can check it out if you want:

Function to get N years of data
get_last_n_years <-
  function(n = 50,
           codes = c("LNS14000000", "CES0000000001", "CUSR0000SA0"),
           registration_key) {
    
    this_year <- Sys.Date() %>% year()
    
    d <- seq(from = this_year - n, to = this_year)
    dsplit <- split(d, ceiling(seq_along(d) / 20))
    
    d <- tibble(min = map_dbl(dsplit, min),
                max = map_dbl(dsplit, max))
    
    d %>%
      rowwise() %>%
      mutate(data = list(
        bls_api(
          codes,
          registrationKey = registration_key,
          startyear = min,
          endyear = max
        )
      )) %>%
      unnest(data) %>%
      select(-min, -max) %>%
      distinct() %>% 
      arrange(year, period)
  }

d <- get_last_n_years(n = 50)

Modifying our dataframe

Let’s take a look at our dataframe:

d
## # A tibble: 1,419 x 7
##     year period periodName latest value footnotes seriesID   
##    <dbl> <chr>  <chr>      <chr>  <dbl> <chr>     <chr>      
##  1  2021 M05    May        true     5.8 ""        LNS14000000
##  2  2021 M04    April      <NA>     6.1 ""        LNS14000000
##  3  2021 M03    March      <NA>     6   ""        LNS14000000
##  4  2021 M02    February   <NA>     6.2 ""        LNS14000000
##  5  2021 M01    January    <NA>     6.3 ""        LNS14000000
##  6  2020 M12    December   <NA>     6.7 ""        LNS14000000
##  7  2020 M11    November   <NA>     6.7 ""        LNS14000000
##  8  2020 M10    October    <NA>     6.9 ""        LNS14000000
##  9  2020 M09    September  <NA>     7.8 ""        LNS14000000
## 10  2020 M08    August     <NA>     8.4 ""        LNS14000000
## # … with 1,409 more rows

Maybe BLS folks can easily remember what the different codes refer to, but nonetheless, it might be nice to give them slighly more informative names. We’ll make a new column and use the case_when() function to give informative names according to the code column. The case_when() function is very handy- you write a conditional statement that uses existing columns, then a tilde, then the value the new column should have if that conditional statement is TRUE.

The lefthand side of a case_when() statement can use multiple columns, like period_name == “May” & year > 2000. Any conditional statement that returns TRUE or FALSE will work.

d <- d %>% 
  mutate(metric = case_when(
    seriesID == "LNS14000000" ~ "unemp",
    seriesID == "CES0000000001" ~ "total_emp",
    seriesID == "CUSR0000SA0" ~ "cpi_u",
  ))

d
## # A tibble: 1,419 x 8
##     year period periodName latest value footnotes seriesID    metric
##    <dbl> <chr>  <chr>      <chr>  <dbl> <chr>     <chr>       <chr> 
##  1  2021 M05    May        true     5.8 ""        LNS14000000 unemp 
##  2  2021 M04    April      <NA>     6.1 ""        LNS14000000 unemp 
##  3  2021 M03    March      <NA>     6   ""        LNS14000000 unemp 
##  4  2021 M02    February   <NA>     6.2 ""        LNS14000000 unemp 
##  5  2021 M01    January    <NA>     6.3 ""        LNS14000000 unemp 
##  6  2020 M12    December   <NA>     6.7 ""        LNS14000000 unemp 
##  7  2020 M11    November   <NA>     6.7 ""        LNS14000000 unemp 
##  8  2020 M10    October    <NA>     6.9 ""        LNS14000000 unemp 
##  9  2020 M09    September  <NA>     7.8 ""        LNS14000000 unemp 
## 10  2020 M08    August     <NA>     8.4 ""        LNS14000000 unemp 
## # … with 1,409 more rows

Another issue is that we have separate columns for years and months, but we’d like to stitch those together into a date column that R can work with.

Intro to lubridate

You can download the official lubridate cheat sheet for more information.

lubridate is a package in the tidyverse that makes handling dates and times slightly less painful. There are a series of functions that allow you to parse components of dates and times into single date or datetime columns. They are named according to the order of components: ymd() will take dates ordered year-month-day, and mdy_hs() will take datetimes ordered month-day-year-hour-second. We’ll use the my() function to create a date column from our periodName and year columns.

d <- d %>% 
  mutate(date = lubridate::my(paste(periodName, year, sep = "/")))

d
## # A tibble: 1,419 x 9
##     year period periodName latest value footnotes seriesID    metric date      
##    <dbl> <chr>  <chr>      <chr>  <dbl> <chr>     <chr>       <chr>  <date>    
##  1  2021 M05    May        true     5.8 ""        LNS14000000 unemp  2021-05-01
##  2  2021 M04    April      <NA>     6.1 ""        LNS14000000 unemp  2021-04-01
##  3  2021 M03    March      <NA>     6   ""        LNS14000000 unemp  2021-03-01
##  4  2021 M02    February   <NA>     6.2 ""        LNS14000000 unemp  2021-02-01
##  5  2021 M01    January    <NA>     6.3 ""        LNS14000000 unemp  2021-01-01
##  6  2020 M12    December   <NA>     6.7 ""        LNS14000000 unemp  2020-12-01
##  7  2020 M11    November   <NA>     6.7 ""        LNS14000000 unemp  2020-11-01
##  8  2020 M10    October    <NA>     6.9 ""        LNS14000000 unemp  2020-10-01
##  9  2020 M09    September  <NA>     7.8 ""        LNS14000000 unemp  2020-09-01
## 10  2020 M08    August     <NA>     8.4 ""        LNS14000000 unemp  2020-08-01
## # … with 1,409 more rows

Notice that all of our dates occur on the first day of the month. Date objects in R require year, month, and day, but since we haven’t specified a day, my() defaults to the first of the month.

The way Excel handles dates is particularly weird.

It’s worth noting that dates and times are often represented differently on different operating systems or in different software. When storing dates in a file, such as a CSV, it’s almost always best to keep the components in separate columns. That way Excel or other programs can’t mess with them, and you can easily use a lubridate function to stitch the components together into a date when you need to.

Plotting time series

Let’s take a look at our time series, just to get a feel for what they look like. We’ll use ggplot to make our plots. We’ll put date on the x axis and the values on the y axis to make a time series plot. Since the 3 metrics we’re looking at have very different scales, it doesn’t make much sense to plot them all together. Instead, we’ll use facet_wrap() to create separate facets for each metric. We need to set scales = "free_y" so that each metric’s y axis is scaled independently. Finally, we’ll add theme_minimal() for a little cleaner appearance.

d %>% 
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  facet_wrap(vars(metric), ncol = 1, scales = "free_y") +
  theme_minimal()

Moving windows

You can check out the slider website for more information.

One thing we might want to look at is a moving-window or “rolling” average. We’ll use the slider package, which has a very flexible interface to make calculations with rolling windows, and its syntax is very similar to the tidyverse’s purrr package.

There have been a number of similar packages and functions such as zoo::rollapply(), tibbletime::rollify(), and tsibble::slide(), but slider is the latest and most flexible approach.

Let’s look at the rolling average of unemployment for the last 10 years. First we’ll filter() to get the unemployment values and years from 2011 on. Next we’ll select() only the date and value columns, renaming the value column to unemployment. Then we’ll use arrange() to sort our rows by date.

unemp_last10 <- d %>% 
  filter(metric == "unemp", year >= 2011) %>% 
  select(date, unemployment = value) %>% 
  arrange(date) 

unemp_last10
## # A tibble: 125 x 2
##    date       unemployment
##    <date>            <dbl>
##  1 2011-01-01          9.1
##  2 2011-02-01          9  
##  3 2011-03-01          9  
##  4 2011-04-01          9.1
##  5 2011-05-01          9  
##  6 2011-06-01          9.1
##  7 2011-07-01          9  
##  8 2011-08-01          9  
##  9 2011-09-01          9  
## 10 2011-10-01          8.8
## # … with 115 more rows

The slider package also has functions called slide_index() and slide_period() that allow for even more flexibility in the way you determine your window. They’re not quite necessary for these monthly time series.

Now we can use a slide_ function to apply the mean() function to a window of unemployment values, with the window sliding along our dates. We’ll use slide_dbl() because we want to return a numeric, or double, vector. To get a 5-month average, we set the .before and .after values to 2.

unemp_last10 <- unemp_last10 %>% 
  mutate(five_month_avg = slide_dbl(unemployment, mean, 
                                    .before = 2, .after = 2))

Plotting moving window averages

Let’s plot it to take a look at the 5-month average time series.

unemp_last10 %>% 
  ggplot(aes(x = date, y = five_month_avg)) +
  geom_line() +
  theme_minimal()