And you may find yourself
Behind the keys of a large computing machine
And you may find yourself
Copy-pasting tons of code
And you may ask yourself, well
How did I get here?
It’s pretty common that you’ll want to run the same basic bit of code a bunch of times with different inputs. Maybe you want to read in a bunch of data files with different names or calculate something complex on every row of a dataframe. A general rule of thumb is that any code you want to run 3+ times should be iterated instead of copy-pasted. Copy-pasting code and replacing the parts you want to change is generally bad practice: it is easy to miss one of the replacements, and any later change has to be repeated in every copy.
Lots of functions (including many base functions) are vectorized, meaning they already work on vectors of values. Here’s an example:
x <- 1:10
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
The log() function already knows we want to take the log of each element in x, and it returns a vector that’s the same length as x. If a vectorized function already exists to do what you want, use it! It’s going to be faster and cleaner than trying to iterate everything yourself.
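As a quick sketch of how far vectorization goes beyond log(), here are a few more base operations applied to the same x. The object names (doubled, roots, sums, labels) are just placeholders chosen for this example.

```r
x <- 1:10

# Arithmetic operators work element by element
doubled <- x * 2

# So do many other base functions, such as sqrt() and round()
roots <- round(sqrt(x), 2)

# Two vectors of the same length are combined pairwise
sums <- x + doubled

# paste() is vectorized too, building one string per element
labels <- paste("value", x)
```

In each case the result has the same length as x, with no explicit iteration needed.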
However, we may want to do more complex iterations, which brings us to our first main iterating concept.
A for loop will repeat some bit of code, each time with a new input value. Here’s the basic structure:
for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
You’ll often see i used in for loops; you can think of it as the iteration value. For each value of i in the vector 1:10, we print that value. You can use i more than once in a loop:
for (i in 1:10) {
  print(i)
  print(i^2)
}
## [1] 1
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 9
## [1] 4
## [1] 16
## [1] 5
## [1] 25
## [1] 6
## [1] 36
## [1] 7
## [1] 49
## [1] 8
## [1] 64
## [1] 9
## [1] 81
## [1] 10
## [1] 100
What’s happening is that the value of i gets inserted into the code block, the block gets run, the value of i changes, and the process repeats. For loops can be a way to explicitly lay out fairly complicated procedures, since you can see exactly where your i value is going in the code.
You can also use the i value to index a vector or dataframe, which can be very powerful!
for (i in 1:10) {
  print(letters[i])
  print(mtcars$wt[i])
}
## [1] "a"
## [1] 2.62
## [1] "b"
## [1] 2.875
## [1] "c"
## [1] 2.32
## [1] "d"
## [1] 3.215
## [1] "e"
## [1] 3.44
## [1] "f"
## [1] 3.46
## [1] "g"
## [1] 3.57
## [1] "h"
## [1] 3.19
## [1] "i"
## [1] 3.15
## [1] "j"
## [1] 3.44
Here we printed out the first 10 letters of the alphabet from the letters vector, as well as the first 10 car weights from the mtcars dataframe.
If you want to store your results somewhere, it is important that you create an empty object to hold them before you run the loop. If you grow your results vector one value at a time, it will be much slower. Here’s how to make that empty vector first. We’ll also use the function seq_along to create a sequence that’s the proper length, instead of explicitly writing out something like 1:10.
results <- rep(NA, nrow(mtcars))
for (i in seq_along(mtcars$wt)) {
  results[i] <- mtcars$wt[i] * 1000
}
results
## [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780
## [15] 5250 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513
## [29] 3170 2770 3570 2780
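Here is a small sketch of the same preallocation pattern using vector() to create a typed, pre-sized result, plus a look at why seq_along is the safer way to build the loop sequence. The names wt_lbs and empty are made up for this example.

```r
# Pre-size a numeric vector with one slot per row of mtcars
wt_lbs <- vector("numeric", length = nrow(mtcars))

for (i in seq_along(mtcars$wt)) {
  wt_lbs[i] <- mtcars$wt[i] * 1000
}

# Because `*` is vectorized, the loop matches a one-line version
all.equal(wt_lbs, mtcars$wt * 1000)

# seq_along() is also safer than 1:length(x) when x is empty
empty <- numeric(0)
seq_along(empty)   # integer(0), so a loop body would never run
1:length(empty)    # 1 0, so a loop body would run twice
```

The last two lines show the edge case: with an empty input, 1:length(x) counts backwards from 1 to 0 instead of producing an empty sequence.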
purrr
For loops are very handy and important to understand, but they can involve writing a lot of code and can generally look fairly messy.
The tidyverse includes another way to iterate, using the map family of functions. These functions all do the same basic thing: take a series of values and apply a function to each of them. That function could be a function from a package, or it could be one you write to do something specific.
For a wonderful and thorough exploration of the purrr package, check out Jenny Bryan’s tutorial.
map
When using the map family of functions, the first argument (as in most tidyverse functions) is the data. One nice feature is that you can specify the format of the output explicitly by using a different member of the family.
mtcars %>% map(mean) # gives a list
## $mpg
## [1] 20.09062
##
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $drat
## [1] 3.596563
##
## $wt
## [1] 3.21725
##
## $qsec
## [1] 17.84875
##
## $vs
## [1] 0.4375
##
## $am
## [1] 0.40625
##
## $gear
## [1] 3.6875
##
## $carb
## [1] 2.8125
mtcars %>% map_dbl(mean) # gives a numeric vector
## mpg cyl disp hp drat wt
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
mtcars %>% map_chr(mean) # gives a character vector
## mpg cyl disp hp drat
## "20.090625" "6.187500" "230.721875" "146.687500" "3.596563"
## wt qsec vs am gear
## "3.217250" "17.848750" "0.437500" "0.406250" "3.687500"
## carb
## "2.812500"
You can pass additional arguments to functions that you map across your data. For example, if you have some NAs in your data, you might want to use mean() with na.rm = TRUE.
mtcars2 <- mtcars # make a copy of the mtcars dataset
mtcars2[3,c(1,6,8)] <- NA # make one of the cars have NAs for several columns
mtcars2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 NA 4 108.0 93 3.85 NA 18.61 NA 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mtcars2 %>% map_dbl(mean) # returns NA for mpg, wt, and vs columns
## mpg cyl disp hp drat wt
## NA 6.187500 230.721875 146.687500 3.596563 NA
## qsec vs am gear carb
## 17.848750 NA 0.406250 3.687500 2.812500
mtcars2 %>% map_dbl(mean, na.rm = TRUE)
## mpg cyl disp hp drat wt
## 20.0032258 6.1875000 230.7218750 146.6875000 3.5965625 3.2461935
## qsec vs am gear carb
## 17.8487500 0.4193548 0.4062500 3.6875000 2.8125000
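The mapped function doesn’t have to come from a package. As a rough sketch, here is a function we write ourselves being mapped across the columns, with an extra argument passed along in the same way as above. The helper value_spread is hypothetical, invented just for this example.

```r
library(purrr)

# A hypothetical helper: the gap between the largest and smallest value
value_spread <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

# Map our own function across every column, passing na.rm along
spreads <- map_dbl(mtcars, value_spread, na.rm = TRUE)

spreads[["cyl"]]  # cyl runs from 4 to 8, so the spread is 4
```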
map2
You can use the map2 series of functions if you need to map across two sets of inputs in parallel. Here, we’ll map across both the names of cars and their mpg values, pasting the two together into a sentence.
To do this, we’ll use what’s called an “anonymous function”, which is a small function we define within the map function call. Our function will take 2 arguments, x and y, and paste them together with some other text.
map2_chr(rownames(mtcars), mtcars$mpg, function(x,y) paste(x, "gets", y, "miles per gallon")) %>%
head()
## [1] "Mazda RX4 gets 21 miles per gallon"
## [2] "Mazda RX4 Wag gets 21 miles per gallon"
## [3] "Datsun 710 gets 22.8 miles per gallon"
## [4] "Hornet 4 Drive gets 21.4 miles per gallon"
## [5] "Hornet Sportabout gets 18.7 miles per gallon"
## [6] "Valiant gets 18.1 miles per gallon"
You can use the pmap series of functions if you need to use more than two input lists.
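As a minimal sketch of what that might look like, here the car names, mpg values, and cylinder counts are bundled into one list and handed to pmap_chr. The names car_info and sentences are placeholders for this example.

```r
library(purrr)

# pmap functions take a single list of inputs; each element is one argument
car_info <- list(
  name = rownames(mtcars),
  mpg  = mtcars$mpg,
  cyl  = mtcars$cyl
)

# The anonymous function's argument names match the names in the list
sentences <- pmap_chr(car_info, function(name, mpg, cyl) {
  paste(name, "gets", mpg, "miles per gallon with", cyl, "cylinders")
})

head(sentences, 2)
```

Naming the list elements to match the function’s arguments keeps it clear which input ends up where.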
Sometimes, you want to do something with your code, but only if a certain condition is true. There are a couple main ways to do this.
if and else
You can use combinations of if and else to create conditional statements. Here’s a quick example:
for (i in 1:10) {
  if (i < 5) {
    print(paste(i, "is less than 5"))
  } else {
    print(paste(i, "is greater than or equal to 5"))
  }
}
## [1] "1 is less than 5"
## [1] "2 is less than 5"
## [1] "3 is less than 5"
## [1] "4 is less than 5"
## [1] "5 is greater than or equal to 5"
## [1] "6 is greater than or equal to 5"
## [1] "7 is greater than or equal to 5"
## [1] "8 is greater than or equal to 5"
## [1] "9 is greater than or equal to 5"
## [1] "10 is greater than or equal to 5"
Here we’ve combined a couple techniques: we’ve used a for loop to go through a sequence of values, and for each value we’ve printed a statement based on a condition that our value meets.
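If you need more than two branches, you can chain conditions with else if. Here’s a small sketch that stores its messages in a pre-made vector instead of printing them; the names values and messages are just for illustration.

```r
values <- c(2, 5, 8)
messages <- character(length(values))  # empty vector to hold the results

for (i in seq_along(values)) {
  if (values[i] < 5) {
    messages[i] <- paste(values[i], "is less than 5")
  } else if (values[i] == 5) {
    messages[i] <- paste(values[i], "is exactly 5")
  } else {
    messages[i] <- paste(values[i], "is greater than 5")
  }
}

messages
```

The conditions are checked from top to bottom, and only the first branch that matches gets run.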
case_when
Sometimes you might want to do a bunch of conditional statements together, but typing out a ton of nested if-else statements can be unwieldy and prone to typos. A really useful function is the tidyverse’s case_when. You typically use it inside mutate, giving it a series of two-sided formulas where the left-hand side is a condition that picks out rows, and the right-hand side supplies the result. The conditions are checked in order, and each row gets the result of the first one it meets. Here’s an example where we take the mtcars dataframe and add a column called car_size. If the car’s weight is over 3.5 or it has 8 cylinders, we call it “big”. If it’s not big, but its weight is over 2.5, then it’s medium. If neither of these conditions is met (denoted by TRUE), then we call it “small”.
mtcars %>%
  mutate(
    car_size = case_when(
      wt > 3.5 | cyl == 8 ~ "big",
      wt > 2.5 ~ "medium",
      TRUE ~ "small"
    )
  )
## mpg cyl disp hp drat wt qsec vs am gear carb car_size
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 medium
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 medium
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 small
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 medium
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 big
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 medium
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 big
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 medium
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 medium
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 medium
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 big
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 big
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 big
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 big
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 big
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 big
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 small
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 small
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 small
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 small
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 big
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 big
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 big
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 big
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 small
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 small
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 small
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 big
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 medium
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 big
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 medium
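To see why the order of the conditions matters, here is a small sketch that runs case_when on two made-up vectors (wt and cyl here are toy values, not the mtcars columns) and then swaps the first two conditions.

```r
library(dplyr)

# Toy inputs, invented for this example
wt  <- c(1.5, 3.0, 4.0, 2.0)
cyl <- c(4, 6, 6, 8)

# Same conditions and order as the mtcars example
car_size <- case_when(
  wt > 3.5 | cyl == 8 ~ "big",
  wt > 2.5            ~ "medium",
  TRUE                ~ "small"
)
car_size   # "small" "medium" "big" "big"

# Swapping the first two conditions changes the third result,
# because a weight of 4.0 now matches "medium" before "big" is checked
reordered <- case_when(
  wt > 2.5            ~ "medium",
  wt > 3.5 | cyl == 8 ~ "big",
  TRUE                ~ "small"
)
reordered  # "small" "medium" "medium" "big"
```

A reasonable habit is to put the most specific conditions first and the catch-all TRUE last.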
map
Let’s throw it back to the map family for a sec. Sometimes you might only want to map a function to part of a dataframe. map_if allows you to give the data, a condition for the data to meet, and the function you want to apply to the data that meet the condition. Here, we’ll map as.character to the columns of the iris dataset that meet the condition is.factor.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris %>%
  map_if(is.factor, as.character) %>%
  str()
## List of 5
## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
map_at does something similar, but it allows you to directly specify the locations you’d like to map the function to, using either names or positions.
mtcars %>%
  map_at(c("cyl", "am"), as.character) %>%
  str()
## List of 11
## $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr [1:32] "6" "6" "4" "6" ...
## $ disp: num [1:32] 160 160 108 258 360 ...
## $ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : chr [1:32] "1" "1" "1" "0" ...
## $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
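The example above picked columns by name. As a sketch of the position-based version, the code below targets the same two columns by number and then turns the resulting list back into a dataframe with base R’s as.data.frame. The names converted and converted_df are placeholders.

```r
library(purrr)

# Columns 2 and 9 of mtcars are cyl and am
converted <- map_at(mtcars, c(2, 9), as.character)

# The result is a list, so convert it back if you need a dataframe
# (note that the car names stored as row names are not carried over)
converted_df <- as.data.frame(converted, stringsAsFactors = FALSE)

str(converted_df[, c("mpg", "cyl", "am")])
```

Names tend to be safer than positions, since positions silently point at the wrong column if the column order ever changes.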
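Let’s try working through a complete example of how you might iterate a more complex operation across a dataset. This will follow 3 basic steps: write code that does what you want once, generalize that code into a function, and apply that function across your data.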
The first thing we’ll do is figure out if we can do the right thing once! We want to rescale a vector of values to a 0-1 scale. We’ll try it out on the weights in mtcars. Our heaviest vehicle will have a scaled weight of 1, and our lightest will have a scaled weight of 0. We’ll do this by taking our weight, subtracting the minimum car weight from it, and dividing this by the range of the car weights (max minus min). We’ll have to be careful about our order of operations…
(mtcars$wt[1] - min(mtcars$wt, na.rm = TRUE)) /
  (max(mtcars$wt, na.rm = TRUE) - min(mtcars$wt, na.rm = TRUE))
## [1] 0.2830478
Great! We got a scaled value out of the deal. Because arithmetic like - and / is vectorized, and min and max each boil the vector down to a single value, we can give this code the whole weight vector and get a whole scaled vector back.
mtcars$wt_scaled <- (mtcars$wt - min(mtcars$wt, na.rm = TRUE)) /
  diff(range(mtcars$wt, na.rm = TRUE))
mtcars$wt_scaled
## [1] 0.28304781 0.34824853 0.20634109 0.43518282 0.49271286 0.49782664
## [7] 0.52595244 0.42879059 0.41856303 0.49271286 0.49271286 0.65379698
## [13] 0.56686269 0.57964715 0.95551010 1.00000000 0.97980056 0.17565840
## [19] 0.02608029 0.08233188 0.24341601 0.51316799 0.49143442 0.59498849
## [25] 0.59626694 0.10790079 0.16031705 0.00000000 0.42367681 0.32140118
## [31] 0.52595244 0.32395807
Now let’s replace our reference to a specific vector of data with something generic: x. This code won’t run on its own, since x doesn’t have a value, but it’s just showing how we would refer to some generic value.
x_scaled <- (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
Now that we’ve got a generalized bit of code, we can turn it into a function. All we need is a name, the function keyword, and a list of arguments. In this case, we’ve just got one argument: x.
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}
rescale_0_1(mtcars$mpg) # it works on one of our columns
## [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
## [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
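Before iterating a new function over lots of data, it can help to try it on a few tiny inputs where you already know the answer. The sketch below repeats the function definition so it runs on its own; the test vectors are made up for this example.

```r
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

# Evenly spaced values land at 0, 0.5, and 1
simple <- rescale_0_1(c(10, 20, 30))
simple        # 0.0 0.5 1.0

# Missing values stay missing, but don't break the rest
with_na <- rescale_0_1(c(10, NA, 30))
with_na       # 0 NA 1

# A constant vector has a range of 0, so we end up dividing 0 by 0
constant <- rescale_0_1(c(5, 5, 5))
constant      # NaN NaN NaN
```

The last case is worth knowing about: a column where every value is the same will come back as all NaN.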
Now that we’ve got a function that’ll rescale a vector of values, we can use one of the map functions to iterate across all the columns in a dataframe, rescaling each one. We’ll use map_df since it returns a dataframe, and we’re feeding it a dataframe.
map_df(mtcars, rescale_0_1)
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.451 0.5 0.222 0.205 0.525 0.283 0.233 0 1 0.5 0.429
## 2 0.451 0.5 0.222 0.205 0.525 0.348 0.3 0 1 0.5 0.429
## 3 0.528 0 0.0920 0.145 0.502 0.206 0.489 1 1 0.5 0
## 4 0.468 0.5 0.466 0.205 0.147 0.435 0.588 1 0 0 0
## 5 0.353 1 0.721 0.435 0.180 0.493 0.3 0 0 0 0.143
## 6 0.328 0.5 0.384 0.187 0 0.498 0.681 1 0 0 0
## 7 0.166 1 0.721 0.682 0.207 0.526 0.160 0 0 0 0.429
## 8 0.596 0 0.189 0.0353 0.429 0.429 0.655 1 0 0.5 0.143
## 9 0.528 0 0.174 0.152 0.535 0.419 1 1 0 0.5 0.143
## 10 0.374 0.5 0.241 0.251 0.535 0.493 0.452 1 0 0.5 0.429
## # … with 22 more rows, and 1 more variable: wt_scaled <dbl>
There you have it! We went from some code that calculated one value to being able to iterate it across any number of columns in a dataframe. It can be tempting to jump straight to your final iteration code, but it’s often better to start simple and work your way up, verifying that things work at each step, especially if you’re trying to do something even moderately complex.
apply Functions
While we learned the tidyverse map family of functions, it’s worth mentioning that base R has a similar family called the apply functions. They are very similar to map functions, but the syntax is a little different and you have to be a little more careful about the data types you put in and get out.
We’re not going to go into the apply family, but if you want to learn more, here is a good tutorial. You might come across the apply functions in someone else’s code, so it’s good to know they exist.
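So you can recognize them in the wild, here is a brief sketch of three common members of the family doing the same column-means job as the earlier map examples. The comparisons to map and map_dbl in the comments are rough analogies, not exact equivalents.

```r
# lapply() returns a list, much like map()
means_list <- lapply(mtcars, mean)

# sapply() simplifies the result when it can, here to a named numeric vector
means_vec <- sapply(mtcars, mean)

# vapply() makes you declare the output type, similar in spirit to map_dbl()
means_checked <- vapply(mtcars, mean, FUN.VALUE = numeric(1))

# Extra arguments are passed along to the function, just as with map()
means_na <- sapply(mtcars, mean, na.rm = TRUE)
```

The "be careful about data types" warning mostly applies to sapply, whose output type depends on what the function happens to return; vapply avoids that by failing loudly if the result isn’t what you declared.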
This lesson was contributed by Michael Culshaw-Maurer, with ideas from Mike Koontz and Brandon Hurr’s D-RUG presentation.