And you may find yourself
Behind the keys of a large computing machine
And you may find yourself
Copy-pasting tons of code
And you may ask yourself, well
How did I get here?
It’s pretty common that you’ll want to run the same basic bit of code a bunch of times with different inputs. Maybe you want to read in a bunch of data files with different names or calculate something complex on every row of a dataframe. A general rule of thumb is that any code you want to run 3+ times should be iterated instead of copy-pasted. Copy-pasting code and replacing the parts you want to change is generally a bad practice for several reasons:
Lots of functions (including many base functions) are vectorized, meaning they already work on vectors of values. Here’s an example:
x <- 1:10
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
The log() function already knows we want to take the log of each element in x, and it returns a vector that’s the same length as x. If a vectorized function already exists to do what you want, use it! It’s going to be faster and cleaner than trying to iterate everything yourself.
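Arithmetic operators and many other base functions behave the same way. Here’s a small sketch; the object names (doubled, shifted_roots, name_lengths) are made up for illustration.

```r
x <- 1:10

# Arithmetic operators work element by element
doubled <- x * 2

# Vectorized functions can be combined in a single expression
shifted_roots <- sqrt(x) + 1

# Many character functions are vectorized too
name_lengths <- nchar(c("a", "bb", "ccc"))

doubled
name_lengths
```

In each case you get back one result per input element without writing any iteration code yourself.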
However, we may want to do more complex iterations, which brings us to our first main iterating concept.
A for loop will repeat some bit of code, each time with a new input value. Here’s the basic structure:
for(i in 1:10) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
You’ll often see i used in for loops; you can think of it as the iteration value. For each value of i in the vector 1:10, we print that value. You can use i more than once in a loop:
for (i in 1:10) {
print(i)
print(i^2)
}
## [1] 1
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 9
## [1] 4
## [1] 16
## [1] 5
## [1] 25
## [1] 6
## [1] 36
## [1] 7
## [1] 49
## [1] 8
## [1] 64
## [1] 9
## [1] 81
## [1] 10
## [1] 100
What’s happening is the value of i gets inserted into the code block, the block gets run, the value of i changes, and the process repeats. For loops can be a way to explicitly lay out fairly complicated procedures, since you can see exactly where your i value is going in the code.
You can also use the i value to index a vector or dataframe, which can be very powerful!
for (i in 1:10) {
print(letters[i])
print(mtcars$wt[i])
}
## [1] "a"
## [1] 2.62
## [1] "b"
## [1] 2.875
## [1] "c"
## [1] 2.32
## [1] "d"
## [1] 3.215
## [1] "e"
## [1] 3.44
## [1] "f"
## [1] 3.46
## [1] "g"
## [1] 3.57
## [1] "h"
## [1] 3.19
## [1] "i"
## [1] 3.15
## [1] "j"
## [1] 3.44
Here we printed out the first 10 letters of the alphabet from the letters vector, as well as the first 10 car weights from the mtcars dataframe.
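The iteration value doesn’t have to be a number. A for loop can also step through the values of a vector directly. This is a minimal sketch; car_names and car are names chosen just for this example.

```r
# Take the first three car names from the row names of mtcars
car_names <- head(rownames(mtcars), 3)

# Loop over the values themselves rather than over index positions
for (car in car_names) {
  print(paste("Next car:", car))
}
```

Looping over values is handy when you only need each element once. Looping over an index is the better fit when you need to look up the matching position in several objects, as in the letters and mtcars example.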
If you want to store your results somewhere, it is important that you create an empty object to hold them before you run the loop. If you grow your results vector one value at a time, it will be much slower, because R has to copy the vector each time it gets longer. Here’s how to make that empty vector first. We’ll also use the function seq_along to create a sequence that’s the proper length, instead of explicitly writing out something like 1:10.
results <- rep(NA, nrow(mtcars))
for (i in seq_along(mtcars$wt)) {
results[i] <- mtcars$wt[i] * 1000
}
results
## [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780
## [15] 5250 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513
## [29] 3170 2770 3570 2780
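One reason to prefer seq_along over writing 1:length(x) shows up when the input is empty. The sketch below uses made-up object names (empty, bad_index, good_index) and also shows numeric(), another common way to pre-allocate a results vector.

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts down and gives two values: 1 and 0
bad_index <- 1:length(empty)

# seq_along(empty) has length zero, so a loop over it never runs
good_index <- seq_along(empty)

bad_index
good_index

# numeric(n) creates a vector of n zeros to fill in
results <- numeric(nrow(mtcars))
for (i in seq_along(mtcars$wt)) {
  results[i] <- mtcars$wt[i] * 1000
}
head(results, 3)
```

A loop over bad_index would run twice on an input with nothing in it, which is usually a bug that is hard to spot.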
apply Functions
R includes another way to iterate, using the apply family of functions. These functions all do the same basic thing: take a series of values and apply a function to each of them. That function could be a function from a package, or it could be one you write to do something specific.
Here we’ll use sapply, which will return the simplest form it can. Since we give it a vector, it’ll give us back a vector.
sapply(1:10, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
This is not a useful example, since sqrt is vectorized already; we could just call sqrt(1:10) and get the same result. However, apply functions become useful when we want to do something more complicated.
Oftentimes, the translation from a for loop to apply is this: turn the body of the loop into a function, then use apply to apply that function across the range of values you want.
Let’s do this for a simple example. First our for loop:
result <- rep(NA, 10)
for (i in 1:10) {
result[i] <- sqrt(i) / 2
}
result
## [1] 0.5000000 0.7071068 0.8660254 1.0000000 1.1180340 1.2247449 1.3228757
## [8] 1.4142136 1.5000000 1.5811388
We’ll use what’s called an “anonymous function”, which is a function that we only define within the call to sapply. With a simple function like sqrt(x)/2, it’s easier to use an anonymous function than write a whole new function.
sapply(1:10, function(x) sqrt(x)/2)
## [1] 0.5000000 0.7071068 0.8660254 1.0000000 1.1180340 1.2247449 1.3228757
## [8] 1.4142136 1.5000000 1.5811388
Notice that the code here is cleaner and we didn’t have to create a result vector to store the output. If we wanted to save our output, we could assign it to an object.
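If the function gets longer or you want to reuse it, you can give it a name instead. The sketch below does that with a hypothetical name, half_sqrt, and also shows vapply, a base R relative of sapply that makes you state what each call should return.

```r
# A named version of the anonymous function used above
half_sqrt <- function(x) {
  sqrt(x) / 2
}

from_sapply <- sapply(1:10, half_sqrt)

# vapply() works like sapply(), but FUN.VALUE declares the type and
# length of each result, so unexpected output raises an error
from_vapply <- vapply(1:10, half_sqrt, FUN.VALUE = numeric(1))

all.equal(from_sapply, from_vapply)
```

Both calls give the same numbers here. The difference only matters when a function might return something unexpected, in which case vapply fails loudly instead of quietly changing the shape of the output.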
With the apply family of functions, you can also pass other arguments to the functions you apply. Here we’ll try applying mean to a dataframe with some missing values.
mtcars_na <- mtcars
mtcars_na[1, 1:4] <- NA
sapply(mtcars_na, mean)
## mpg cyl disp hp drat wt qsec
## NA NA NA NA 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
sapply(mtcars_na, mean, na.rm = TRUE)
## mpg cyl disp hp drat wt
## 20.061290 6.193548 233.003226 147.870968 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
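To check where the missing values are before deciding how to handle them, you can apply an anonymous function that counts them. This is a small sketch; na_counts is a name chosen for this example.

```r
# Recreate the dataframe with missing values in the first row
mtcars_na <- mtcars
mtcars_na[1, 1:4] <- NA

# Count the missing values in each column
na_counts <- sapply(mtcars_na, function(col) sum(is.na(col)))
na_counts
```

The first four columns each contain one NA, which is why mean returned NA for them until na.rm was set.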
purrr
If you’ve been digging the tidyverse, well rest assured, they’ve got some slick iteration functions too! The map series of functions work very similarly to the apply functions, but they’re a bit more tidyverse-friendly and allow you to more explicitly say what kinds of values you want returned.
For a wonderful and thorough exploration of the purrr package, check out Jenny Bryan’s tutorial.
map
When using the map family of functions, the first argument (as in most tidyverse functions) is the data. One nice feature is that you can specify the format of the output explicitly by using a different member of the family.
mtcars %>% purrr::map(mean) # gives a list
## $mpg
## [1] 20.09062
##
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $drat
## [1] 3.596563
##
## $wt
## [1] 3.21725
##
## $qsec
## [1] 17.84875
##
## $vs
## [1] 0.4375
##
## $am
## [1] 0.40625
##
## $gear
## [1] 3.6875
##
## $carb
## [1] 2.8125
mtcars %>% purrr::map_dbl(mean) # gives a numeric vector
## mpg cyl disp hp drat wt
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
mtcars %>% purrr::map_chr(mean) # gives a character vector
## mpg cyl disp hp drat
## "20.090625" "6.187500" "230.721875" "146.687500" "3.596563"
## wt qsec vs am gear
## "3.217250" "17.848750" "0.437500" "0.406250" "3.687500"
## carb
## "2.812500"
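The family has variants for other output types as well. As a brief sketch (numeric_cols and col_lengths are names made up for this example), map_lgl returns a logical vector and map_int returns an integer vector:

```r
library(purrr)

# map_lgl() returns a logical vector: which iris columns are numeric?
numeric_cols <- map_lgl(iris, is.numeric)
numeric_cols

# map_int() returns an integer vector: how many values are in each column?
col_lengths <- map_int(mtcars, length)
col_lengths
```

If the function returns something that doesn’t fit the requested type, these variants throw an error rather than guessing, which makes mistakes easier to catch.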
map2
You can use the map2 series of functions if you need to map across two lists in parallel. Here, we’ll map across both the names of cars and their mpg values, using an anonymous function to paste the two together into a sentence.
map2_chr(rownames(mtcars), mtcars$mpg, function(x,y) paste(x, "gets", y, "miles per gallon")) %>%
head()
## [1] "Mazda RX4 gets 21 miles per gallon"
## [2] "Mazda RX4 Wag gets 21 miles per gallon"
## [3] "Datsun 710 gets 22.8 miles per gallon"
## [4] "Hornet 4 Drive gets 21.4 miles per gallon"
## [5] "Hornet Sportabout gets 18.7 miles per gallon"
## [6] "Valiant gets 18.1 miles per gallon"
You can use the pmap series of functions if you need to use more than two input lists.
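As a rough sketch of pmap in action, here is the sentence example extended to three inputs. The list car_info and its element names are assumptions made for this illustration; pmap matches the names in the list to the arguments of the function.

```r
library(purrr)

# Bundle the inputs into one named list
car_info <- list(
  name = rownames(mtcars),
  mpg  = mtcars$mpg,
  cyl  = mtcars$cyl
)

# The function's argument names match the names in car_info
sentences <- pmap_chr(car_info, function(name, mpg, cyl) {
  paste(name, "gets", mpg, "miles per gallon with", cyl, "cylinders")
})

head(sentences, 2)
```

Because a dataframe is a list of columns, you can also hand a dataframe straight to pmap to work through it row by row.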
Here we’ll take a look at a cool way to use map: applying a linear model across different sets of data. We’ll take the mtcars dataset, use split to split it into a list, with one data frame for each value of cyl, and then map the same linear model to each entry in the list.
mtcars %>%
split(.$cyl) %>%
purrr::map(~ lm(mpg ~ wt, data = .x))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = .x)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = .x)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = .x)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
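If you only need one number from each model, you can map again over the list of models. The sketch below pulls out the wt slope with coef and map_dbl; the names by_cyl, models, and wt_slopes are made up for this example.

```r
library(purrr)

# One dataframe per cyl value, then one model per dataframe
by_cyl <- split(mtcars, mtcars$cyl)
models <- map(by_cyl, function(df) lm(mpg ~ wt, data = df))

# Extract the wt coefficient from each model as a named numeric vector
wt_slopes <- map_dbl(models, function(mod) coef(mod)[["wt"]])
wt_slopes
```

The names of the result come from the names of the list, so each slope stays labelled with its cylinder count.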
map_df
What if we want to do the same thing, but extract the useful information from the model object? We can use another tidyverse package, broom, and its function tidy to pull out the information from the lm model object. We’ll repeat what we did last time, but then we’ll use map_dfr to map across each model object, tidy it up, and include an id column called cyl, so we know which cylinder value the linear model terms correspond to. The “dfr” portion of map_dfr will make sure the output is a dataframe, bound together by rows. map_dfc would bind columns together into a dataframe output.
mtcars %>%
split(.$cyl) %>%
purrr::map(~ lm(mpg ~ wt, data = .x)) %>%
map_dfr(broom::tidy, .id = "cyl")
## # A tibble: 6 x 6
## cyl term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 (Intercept) 39.6 4.35 9.10 0.00000777
## 2 4 wt -5.65 1.85 -3.05 0.0137
## 3 6 (Intercept) 28.4 4.18 6.79 0.00105
## 4 6 wt -2.78 1.33 -2.08 0.0918
## 5 8 (Intercept) 23.9 3.01 7.94 0.00000405
## 6 8 wt -2.19 0.739 -2.97 0.0118
walk
Sometimes you want to use a function for its “side effect”, such as when using the plot function. Using plot alone doesn’t return anything useful, but its side effect is to generate a plot. We use the exact same format as with map, but instead we use the function walk.
mtcars %>%
select(cyl, mpg, wt) %>%
split(.$cyl) %>%
walk(plot)
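Writing files is another common side effect. Here’s a minimal sketch that saves one CSV per cylinder group using iwalk, a purrr variant that passes each element along with its name. The output folder (a temporary directory) and the file name pattern are assumptions for illustration.

```r
library(purrr)

# One dataframe per cyl value, keeping three columns
by_cyl <- split(mtcars[, c("cyl", "mpg", "wt")], mtcars$cyl)

# Write into a temporary folder so nothing lands in your project
out_dir <- tempdir()

# iwalk() calls the function with each dataframe and its list name
iwalk(by_cyl, function(df, cyl_value) {
  write.csv(df, file.path(out_dir, paste0("mtcars_cyl_", cyl_value, ".csv")))
})

list.files(out_dir, pattern = "^mtcars_cyl_")
```

Like walk, iwalk is called for what the function does rather than for what it returns.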
Sometimes, you want to do something with your code, but only if a certain condition is true. There are a couple main ways to do this.
if and else
You can use combinations of if and else to create conditional statements. Here’s a quick example:
for (i in 1:10) {
if (i < 5) {
print(paste(i, "is less than 5"))
} else {
print(paste(i, "is greater than or equal to 5"))
}
}
## [1] "1 is less than 5"
## [1] "2 is less than 5"
## [1] "3 is less than 5"
## [1] "4 is less than 5"
## [1] "5 is greater than or equal to 5"
## [1] "6 is greater than or equal to 5"
## [1] "7 is greater than or equal to 5"
## [1] "8 is greater than or equal to 5"
## [1] "9 is greater than or equal to 5"
## [1] "10 is greater than or equal to 5"
Here we’ve combined a couple techniques: we’ve used a for loop to go through a sequence of values, and for each value we’ve printed a statement based on a condition that our value meets.
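An if statement checks a single TRUE or FALSE value, which is why the loop was needed above. Base R also has a vectorized version, ifelse, that checks a whole vector at once. This sketch repeats the example without a loop; values and labels are names chosen for illustration.

```r
values <- 1:10

# ifelse() tests every element and picks the matching result for each
labels <- ifelse(values < 5, "less than 5", "greater than or equal to 5")

paste(values, "is", labels)
```

This produces the same ten sentences as the loop, returned as one character vector instead of printed one at a time.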
case_when
Sometimes you might want to do a bunch of conditional statements together, but typing out a ton of nested if-else statements can be unwieldy and prone to typos. A really useful function is the tidyverse’s case_when. Inside a call like mutate, you give it a series of two-sided formulas, where the left-hand side is a condition and the right-hand side supplies the value to use when that condition is true. The conditions are checked in order, and each row gets the value from the first one it meets. Here’s an example where we take the mtcars dataframe and add a column called car_size. If the car’s weight is over 3.5 or it has 8 cylinders, we call it “big”. Otherwise, if its weight is over 2.5, we call it “medium”. If none of these conditions is met (denoted by TRUE), then we call it “small”.
mtcars %>%
mutate(
car_size = case_when(
wt > 3.5 | cyl == 8 ~ "big",
wt > 2.5 ~ "medium",
TRUE ~ "small"
)
)
## mpg cyl disp hp drat wt qsec vs am gear carb car_size
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 medium
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 medium
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 small
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 medium
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 big
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 medium
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 big
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 medium
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 medium
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 medium
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 big
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 big
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 big
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 big
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 big
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 big
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 small
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 small
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 small
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 small
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 big
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 big
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 big
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 big
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 small
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 small
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 small
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 big
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 medium
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 big
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 medium
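The order of the conditions matters, because case_when stops at the first one that is true. The sketch below shows this on a tiny vector of made-up weights; specific_first and general_first are hypothetical names.

```r
library(dplyr)

wt <- c(2, 3, 4)

# Most specific condition first: the heaviest value is labelled "big"
specific_first <- case_when(
  wt > 3.5 ~ "big",
  wt > 2.5 ~ "medium",
  TRUE ~ "small"
)

# General condition first: 4 already matches wt > 2.5,
# so the "big" condition is never reached
general_first <- case_when(
  wt > 2.5 ~ "medium",
  wt > 3.5 ~ "big",
  TRUE ~ "small"
)

specific_first
general_first
```

A good habit is to list conditions from most specific to most general, with the TRUE catch-all last.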
map
Let’s throw it back to the map family for a sec. Sometimes you might only want to map a function to part of a dataframe. map_if allows you to give the data, a condition for the data to meet, and the function you want to apply to the data that meet the condition. Here, we’ll map as.character to the columns of the iris dataset that meet the condition is.factor.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris %>%
map_if(is.factor, as.character) %>%
str()
## List of 5
## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
map_at does something similar, but it allows you to directly specify the locations you’d like to map the function to, using either names or positions.
mtcars %>%
map_at(c("cyl", "am"), as.character) %>%
str()
## List of 11
## $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr [1:32] "6" "6" "4" "6" ...
## $ disp: num [1:32] 160 160 108 258 360 ...
## $ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : chr [1:32] "1" "1" "1" "0" ...
## $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
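Notice in the str output that map_if and map_at hand back a list, not a dataframe. If you want a dataframe again, one option is to convert the list afterwards. This is a sketch using base R’s as.data.frame; iris_list and iris_chr are names made up for this example.

```r
library(purrr)

# map_if() returns a plain list of columns
iris_list <- map_if(iris, is.factor, as.character)

# Convert back to a dataframe, keeping character columns as characters
iris_chr <- as.data.frame(iris_list, stringsAsFactors = FALSE)

str(iris_chr)
```

Setting stringsAsFactors = FALSE keeps the conversion from turning Species back into a factor on older versions of R, where the default was TRUE.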
The first thing we’ll do is figure out if we can do the right thing once! We want to rescale a vector of values to a 0-1 scale. We’ll try it out on the weights in mtcars. Our heaviest vehicle will have a scaled weight of 1, and our lightest will have a scaled weight of 0. We’ll do this by taking our weight, subtracting the minimum car weight from it, and dividing this by the range of the car weights (max minus min). We’ll have to be careful about our order of operations…
(mtcars$wt[1] - min(mtcars$wt, na.rm = TRUE)) /
  (max(mtcars$wt, na.rm = TRUE) - min(mtcars$wt, na.rm = TRUE))
## [1] 0.2830478
Great! We got a scaled value out of the deal. Because we’re working with base functions like max, min, and /, we can vectorize. This means we can give it the whole weight vector, and we’ll get a whole scaled vector back.
mtcars$wt_scaled <- (mtcars$wt - min(mtcars$wt, na.rm = TRUE)) /
  diff(range(mtcars$wt, na.rm = TRUE))
Now let’s replace our reference to a specific vector of data with something generic: x.
x_scaled <- (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
Now that we’ve got a generalized bit of code, we can turn it into a function. All we need is a name, function, and a list of arguments. In this case, we’ve just got one argument: x.
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}
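Before using the function on real data, it’s worth trying it on a few tiny vectors where you already know the answer. The inputs below are made up for illustration.

```r
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

# Evenly spaced values: the middle one should land at 0.5
simple <- rescale_0_1(c(2, 4, 6))
simple

# Missing values are ignored when finding the min and range,
# but stay missing in the output
with_na <- rescale_0_1(c(1, NA, 3))
with_na

# If every value is the same, the range is 0 and the result is NaN
constant <- rescale_0_1(c(5, 5))
constant
```

The last case is an edge to keep in mind: a column with no variation can’t be rescaled this way, since it means dividing zero by zero.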
Now that we’ve got a function that’ll rescale a vector of values, we can use one of the map functions to iterate across all the columns in a dataframe, rescaling each one. We’ll use map_df since it returns a dataframe, and we’re feeding it a dataframe.
map_df(mtcars, rescale_0_1)
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.451 0.5 0.222 0.205 0.525 0.283 0.233 0 1 0.5 0.429
## 2 0.451 0.5 0.222 0.205 0.525 0.348 0.3 0 1 0.5 0.429
## 3 0.528 0 0.0920 0.145 0.502 0.206 0.489 1 1 0.5 0
## 4 0.468 0.5 0.466 0.205 0.147 0.435 0.588 1 0 0 0
## 5 0.353 1 0.721 0.435 0.180 0.493 0.3 0 0 0 0.143
## 6 0.328 0.5 0.384 0.187 0 0.498 0.681 1 0 0 0
## 7 0.166 1 0.721 0.682 0.207 0.526 0.160 0 0 0 0.429
## 8 0.596 0 0.189 0.0353 0.429 0.429 0.655 1 0 0.5 0.143
## 9 0.528 0 0.174 0.152 0.535 0.419 1 1 0 0.5 0.143
## 10 0.374 0.5 0.241 0.251 0.535 0.493 0.452 1 0 0.5 0.429
## # … with 22 more rows, and 1 more variable: wt_scaled <dbl>
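The output has 12 columns because we added wt_scaled to mtcars earlier. As a sketch of an alternative, the same iteration can be done in base R with lapply on a fresh copy of the data. The name cars_scaled is made up for this example, and the function is repeated so the block runs on its own.

```r
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

# Start from the original dataset, without the wt_scaled column
cars_scaled <- datasets::mtcars

# Assigning into cars_scaled[] replaces every column
# while keeping the dataframe structure and row names
cars_scaled[] <- lapply(cars_scaled, rescale_0_1)

# Every column should now run from 0 to 1
sapply(cars_scaled, range)
```

Unlike map_df, this approach keeps the car names as row names, since the dataframe itself is never rebuilt.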
There you have it! We went from some code that calculated one value to being able to iterate it across any number of columns in a dataframe. It can be tempting to jump straight to your final iteration code, but it’s often better to start simple and work your way up, verifying that things work at each step, especially if you’re trying to do something even moderately complex.
This lesson was contributed by Michael Culshaw-Maurer, with ideas from Mike Koontz and Brandon Hurr’s D-RUG presentation.