Using read.csv() to read a CSV file into R
Now that we’ve learned a bit about how R is thinking about data under the hood, using different types of vectors to build more complicated data structures, let’s actually look at some data.
We are studying the species distribution and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
Your current R project should already have a data folder with the surveys data CSV file in it. We can read it into R and assign it to an object by using the read.csv() function. The first argument to read.csv() is the path of the file you want to read, in quotes. This path will be relative to your current working directory, which in our case is the R Project folder. So from there, we want to access the “data” folder, and then the name of the CSV file.
surveys <- read.csv("data/portal_data_joined.csv")
Take a look at your Environment pane and you should see an object called “surveys”. We can print out the object to take a look at it by just running the name of the object. We can also check to see what class it is.
surveys
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL NA NA
## genus species taxa plot_type
## 1 Neotoma albigula Rodent Control
## 2 Neotoma albigula Rodent Control
## 3 Neotoma albigula Rodent Control
## [ reached 'max' / getOption("max.print") -- omitted 34783 rows ]
class(surveys)
## [1] "data.frame"
Wow, printing a data frame gives us quite a bit of output. This is a lot more data than the small vectors we worked with last lesson, but the basic principles remain the same.
Data frames are really just a collection of vectors: every column is a vector with a single data type, and every column is the exact same length. You can make a data frame “by hand”, but they’re usually created when you import some sort of tabular data into R using a function like read.csv().
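As a minimal sketch of building a data frame “by hand”, the example below assembles three equal-length vectors with data.frame(). The object name mini_surveys and its values are made up for illustration; they only mimic a few columns of the surveys data.

```r
# Hypothetical toy data: three vectors of equal length, one per column
mini_surveys <- data.frame(
  record_id = c(1L, 72L, 224L),
  species_id = c("NL", "NL", "NL"),
  weight = c(NA, NA, 218)
)

# Each column is a vector with a single data type
class(mini_surveys$record_id)   # "integer"
class(mini_surveys$species_id)  # "character" in R >= 4.0.0, "factor" in older versions
class(mini_surveys$weight)      # "numeric"

# Columns whose lengths cannot be matched up are rejected with an error
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")      # TRUE
```

Wrapping the failing call in try() only keeps the example running; in normal use you would simply see the error message.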
Inspecting data.frame Objects
When working with a large data frame, it’s usually impractical to try to look at it all at once, so we’ll need to arm ourselves with a series of tools for inspecting it. Here is a non-exhaustive list of some common functions to do this:
* nrow(surveys) - returns the number of rows
* ncol(surveys) - returns the number of columns
* head(surveys) - shows the first 6 rows
* tail(surveys) - shows the last 6 rows
* View(surveys) - opens a new tab in RStudio that shows the entire data frame. Useful at times, but you shouldn’t become overly reliant on checking data frames by eye; it’s easy to make mistakes
* colnames(surveys) - returns the column names
* rownames(surveys) - returns the row names
* str(surveys) - structure of the object and information about the class, length and content of each column
* summary(surveys) - summary statistics for each column

Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
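To try these functions without the full dataset, here is a small sketch on a made-up data frame. The name toy and its values are assumptions for illustration only, and dim() is an extra base R function not listed above.

```r
# Hypothetical stand-in for surveys
toy <- data.frame(
  record_id = 1:8,
  sex = c("M", "M", "", "", "F", "M", "F", ""),
  weight = c(NA, NA, NA, 218, 204, NA, 199, 40)
)

dim(toy)             # rows and columns in one call: 8 3
nrow(toy)            # 8
ncol(toy)            # 3
colnames(toy)        # "record_id" "sex" "weight"
head(toy, n = 3)     # first 3 rows instead of the default 6
tail(toy, n = 2)     # last 2 rows
str(toy)             # compact overview of every column
summary(toy$weight)  # works on a single vector too, and reports the NA count
```

The last line illustrates the “generic” point: summary() accepts a plain vector as well as a whole data frame.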
Based on the output of str(surveys), can you answer the following questions?
* What is the class of the object surveys?
* How many rows and how many columns are in this object?
* How are our character data represented in this data frame?
* How many species have been recorded during these surveys?
ANSWER
str(surveys)
## 'data.frame': 34786 obs. of 13 variables:
## $ record_id : int 1 72 224 266 349 363 435 506 588 661 ...
## $ month : int 7 8 9 10 11 11 12 1 2 3 ...
## $ day : int 16 19 13 16 12 12 10 8 18 11 ...
## $ year : int 1977 1977 1977 1977 1977 1977 1977 1978 1978 1978 ...
## $ plot_id : int 2 2 2 2 2 2 2 2 2 2 ...
## $ species_id : Factor w/ 48 levels "AB","AH","AS",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ sex : Factor w/ 3 levels "","F","M": 3 3 1 1 1 1 1 1 3 1 ...
## $ hindfoot_length: int 32 31 NA NA NA NA NA NA NA NA ...
## $ weight : int NA NA NA NA NA NA NA NA 218 NA ...
## $ genus : Factor w/ 26 levels "Ammodramus","Ammospermophilus",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ species : Factor w/ 40 levels "albigula","audubonii",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ taxa : Factor w/ 4 levels "Bird","Rabbit",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ plot_type : Factor w/ 5 levels "Control","Long-term Krat Exclosure",..: 1 1 1 1 1 1 1 1 1 1 ...
## * class: data frame
## * how many rows: 34786, how many columns: 13
## * the character data are factors
## * how many species: 48
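The same answers can also be computed rather than read off the str() output. A sketch, assuming a small made-up data frame named toy in place of surveys:

```r
# Hypothetical toy data with a factor column, mimicking the output above
toy <- data.frame(
  species_id = c("NL", "NL", "DM", "DM", "PF", "DM"),
  weight = c(NA, 218, 40, 48, 7, 41),
  stringsAsFactors = TRUE
)

class(toy)                      # "data.frame"
dim(toy)                        # 6 2
class(toy$species_id)           # "factor"
nlevels(toy$species_id)         # 3 distinct species codes
length(unique(toy$species_id))  # 3, and this form also works for character columns
```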
When we wanted to extract particular values from a vector, we used square brackets and put index values in them. Since data frames are made out of vectors, we can use the square brackets again, but with one change. Data frames are 2-dimensional, so we need to specify row and column indices. Row numbers come first, then a comma, then column numbers. Leaving the row number blank will return all rows, and the same thing applies to column numbers.
One thing to note is that the different ways you write out these indices can give you back either a data frame or a vector.
# first element in the first column of the data frame (as a vector)
surveys[1, 1]
## [1] 1
# first element in the 6th column (as a vector)
surveys[1, 6]
## [1] NL
## 48 Levels: AB AH AS BA CB CM CQ CS CT CU CV DM DO DS DX NL OL OT OX ... ZL
# first column of the data frame (as a vector)
surveys[, 1]
## [1] 1 72 224 266 349 363 435 506 588 661 748 845 990 1164
## [15] 1261 1374 1453 1756 1818 1882 2133 2184 2406 2728 3000 3002 4667 4859
## [29] 5048 5180 5299 5485 5558 5583 5966 6020 6023 6036 6167 6479 6500 8022
## [43] 8263 8387 8394 8407 8514 8543 8657 8675
## [ reached getOption("max.print") -- omitted 34736 entries ]
# first column of the data frame (as a data.frame)
surveys[1]
## record_id
## 1 1
## 2 72
## 3 224
## 4 266
## 5 349
## 6 363
## 7 435
## 8 506
## 9 588
## 10 661
## 11 748
## 12 845
## 13 990
## 14 1164
## 15 1261
## 16 1374
## 17 1453
## 18 1756
## 19 1818
## 20 1882
## 21 2133
## 22 2184
## 23 2406
## 24 2728
## 25 3000
## 26 3002
## 27 4667
## 28 4859
## 29 5048
## 30 5180
## 31 5299
## 32 5485
## 33 5558
## 34 5583
## 35 5966
## 36 6020
## 37 6023
## 38 6036
## 39 6167
## 40 6479
## 41 6500
## 42 8022
## 43 8263
## 44 8387
## 45 8394
## 46 8407
## 47 8514
## 48 8543
## 49 8657
## 50 8675
## [ reached 'max' / getOption("max.print") -- omitted 34736 rows ]
# first three elements in the 7th column (as a vector)
surveys[1:3, 7]
## [1] M M
## Levels: F M
# the 3rd row of the data frame (as a data.frame)
surveys[3, ]
## record_id month day year plot_id species_id sex hindfoot_length weight
## 3 224 9 13 1977 2 NL NA NA
## genus species taxa plot_type
## 3 Neotoma albigula Rodent Control
# equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ]
: is a special function that creates numeric vectors of integers in increasing or decreasing order; try running 1:10 and 10:1 to check this out.
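A short sketch of how : behaves, including one edge case worth knowing about. The variable n is just an illustrative name.

```r
1:10         # 1 2 3 4 5 6 7 8 9 10
10:1         # 10 9 8 7 6 5 4 3 2 1
class(1:10)  # "integer"

# When n is 0, 1:n counts down (1 0) instead of producing an empty vector
n <- 0
1:n          # 1 0
seq_len(n)   # integer(0), the safer choice for "1 up to n"
```

This matters when the upper bound comes from something like nrow() on a data frame that might be empty.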
You can also exclude certain indices of a data frame using the “-” sign:
surveys[, -1] # The whole data frame, except the first column
## month day year plot_id species_id sex hindfoot_length weight genus
## 1 7 16 1977 2 NL M 32 NA Neotoma
## 2 8 19 1977 2 NL M 31 NA Neotoma
## 3 9 13 1977 2 NL NA NA Neotoma
## 4 10 16 1977 2 NL NA NA Neotoma
## species taxa plot_type
## 1 albigula Rodent Control
## 2 albigula Rodent Control
## 3 albigula Rodent Control
## 4 albigula Rodent Control
## [ reached 'max' / getOption("max.print") -- omitted 34782 rows ]
surveys[-c(7:34786), ] # Equivalent to head(surveys)
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL NA NA
## genus species taxa plot_type
## 1 Neotoma albigula Rodent Control
## 2 Neotoma albigula Rodent Control
## 3 Neotoma albigula Rodent Control
## [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
surveys["species_id"] # Result is a data.frame
surveys[, "species_id"] # Result is a vector
surveys[["species_id"]] # Result is a vector
surveys$species_id # Result is a vector
In general, when you’re working with data frames, make sure you know whether your code returns a data frame or a vector, since different subsetting methods yield different results: sometimes you get a data frame with one column, sometimes you get a single vector.
You will probably end up using the $ subsetting quite a bit. What’s nice about it is that it supports tab-completion! Type out your data frame name, then a dollar sign, then hit tab to get a list of the column names that you can scroll through.
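One way to check what each form returns is is.data.frame(). The sketch below uses a made-up data frame named toy; the drop = FALSE argument is a base R option not covered above that keeps the data frame structure when selecting a single column.

```r
# Hypothetical toy data
toy <- data.frame(
  record_id = c(1L, 72L, 224L),
  species_id = c("NL", "NL", "DM")
)

is.data.frame(toy["species_id"])    # TRUE: single brackets with one index keep the data frame
is.data.frame(toy[, "species_id"])  # FALSE: a single column is simplified to a vector
is.data.frame(toy[["species_id"]])  # FALSE
is.data.frame(toy$species_id)       # FALSE

# drop = FALSE prevents the simplification in row-and-column indexing
is.data.frame(toy[, "species_id", drop = FALSE])  # TRUE

# The vector-returning forms all give the same result
identical(toy[, "species_id"], toy$species_id)    # TRUE
```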
1. Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.
2. Notice how nrow() gave you the number of rows in a data.frame?
 * Use that number to pull out just the last row of the data frame.
 * Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
 * Pull out that last row using nrow() instead of the row number.
 * Create a new data frame (surveys_last) from that last row.
3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.
4. Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset.
ANSWER
## 1.
surveys_200 <- surveys[200, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(surveys)
surveys_last <- surveys[n_rows, ]
## 3.
surveys_middle <- surveys[n_rows / 2, ]
## 4.
surveys_head <- surveys[-(7:n_rows), ]
tidyverse
Almost every time you work in R, you will be using different “packages” to work with data. A package is a collection of functions used for some common purpose; there are packages for manipulating data, plotting, interfacing with other programs, and much much more.
All of the stuff we’ve covered so far has been using R’s “base” functionality, the built-in functions and techniques that come with R by default. There is a new-ish set of packages called the tidyverse which does a lot of the same stuff as base R, plus much much more. The tidyverse is what we will focus on primarily from here on out, as it is a very powerful set of tools with a philosophy that focuses on being readable and intuitive when working with data. There are a few reasons we’ve taught you a bunch of base R stuff so far:
* tidyverse still works with the same building blocks as base R: vectors!
* tidyverse is constantly evolving, which can be good (new features!) and bad (really old tidyverse code may behave differently when you update)

For example, using [] to subset data and using read.csv() are base R ways of doing things, but we’ll show you tidyverse ways of doing them as well.
In R, there are almost always several ways of accomplishing the same task. Showing you every single way of getting a job done seems like a waste of time, but we also don’t want you to feel lost when you come across some base R code, so that’s why there might be a bit of redundancy.
For much of this course, we’ll be working with a series of packages collectively referred to as the tidyverse. They are packages designed to help you work with data, from cleaning and manipulation to plotting. They are all designed to work together nicely, and share a lot of similar principles. They are increasingly popular, have large user bases, and are generally very well-documented. You can install the core set of tidyverse packages with the install.packages() function:
install.packages("tidyverse")
It is usually recommended that you do NOT write this code into a script, or the package will be reinstalled every time you run the script. Instead, just run it once in your console, and it will be permanently installed so you can use it any time.
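If a script does need to confirm that a package is available, one common pattern is to check first instead of installing unconditionally. The sketch below uses requireNamespace() from base R; the helper name is_installed is made up for this example, and it only prints a reminder rather than installing anything.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")  # TRUE, since stats ships with R

# Remind the user instead of reinstalling on every run of the script
if (!is_installed("tidyverse")) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") in the console")
}
```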
Once a package has been installed on your computer, you can load it in order to use it:
library(tidyverse)
Loading the tidyverse package actually loads a whole bunch of commonly used tidyverse packages at once, which is pretty convenient.
A common feature of tidyverse functions is that they use underscores in the name. For example, the tidyverse function for reading a CSV file is read_csv() instead of read.csv(). Let’s try it:
t_surveys <- read_csv("data/portal_data_joined.csv")
## Parsed with column specification:
## cols(
## record_id = col_double(),
## month = col_double(),
## day = col_double(),
## year = col_double(),
## plot_id = col_double(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_double(),
## weight = col_double(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character(),
## plot_type = col_character()
## )
Now let’s take a look at how it prints and check the class:
t_surveys
## # A tibble: 34,786 x 13
## record_id month day year plot_id species_id sex hindfoot_length
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 7 16 1977 2 NL M 32
## 2 72 8 19 1977 2 NL M 31
## 3 224 9 13 1977 2 NL <NA> NA
## 4 266 10 16 1977 2 NL <NA> NA
## 5 349 11 12 1977 2 NL <NA> NA
## 6 363 11 12 1977 2 NL <NA> NA
## 7 435 12 10 1977 2 NL <NA> NA
## 8 506 1 8 1978 2 NL <NA> NA
## 9 588 2 18 1978 2 NL M NA
## 10 661 3 11 1978 2 NL <NA> NA
## # … with 34,776 more rows, and 5 more variables: weight <dbl>,
## # genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
class(t_surveys)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Ooh, doesn’t that print out nicely? It only prints 10 rows by default, NAs are now colored red, and under the name of each column is the type of data! One important thing to notice is that the column types are only double and character, no factors here. By default, read_csv() keeps character data as character columns, which is like setting stringsAsFactors=FALSE in read.csv(). (Since R 4.0.0, stringsAsFactors=FALSE is also the default for read.csv(), so the factor columns shown earlier reflect the older default.)
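To see the effect of stringsAsFactors directly, the sketch below writes a tiny made-up CSV to a temporary file and reads it both ways with base R. The file contents and object names are assumptions for illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("record_id,species_id,sex", "1,NL,M", "72,NL,M", "224,DM,"), tmp)

as_chr <- read.csv(tmp, stringsAsFactors = FALSE)
as_fct <- read.csv(tmp, stringsAsFactors = TRUE)

class(as_chr$species_id)  # "character"
class(as_fct$species_id)  # "factor"
levels(as_fct$sex)        # "" and "M": the empty field becomes its own level

# Treat empty fields as missing values instead
with_na <- read.csv(tmp, stringsAsFactors = TRUE, na.strings = "")
levels(with_na$sex)       # "M"

unlink(tmp)  # clean up the temporary file
```

The empty-string level is the same thing that shows up as the "" level of sex in the str(surveys) output.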
Also, class() returned multiple things! You’ll notice one of them is data.frame, but there are things like tbl_df as well. The tidyverse has a special type of data.frame called a “tibble”. Tibbles work much like data frames, but they print nicely as we just saw, and bracket subsetting on a tibble usually returns another tibble, even for a single column. As always, just be sure to check whether you’re getting a tibble or a vector back.
surveys[,1] # gives a vector back
## [1] 1 72 224 266 349 363 435 506 588 661 748 845 990 1164
## [15] 1261 1374 1453 1756 1818 1882 2133 2184 2406 2728 3000 3002 4667 4859
## [29] 5048 5180 5299 5485 5558 5583 5966 6020 6023 6036 6167 6479 6500 8022
## [43] 8263 8387 8394 8407 8514 8543 8657 8675
## [ reached getOption("max.print") -- omitted 34736 entries ]
t_surveys[,1] # gives a tibble back
## # A tibble: 34,786 x 1
## record_id
## <dbl>
## 1 1
## 2 72
## 3 224
## 4 266
## 5 349
## 6 363
## 7 435
## 8 506
## 9 588
## 10 661
## # … with 34,776 more rows
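When you do want a plain vector out of a tibble, double brackets or $ will extract it. The sketch below uses a made-up data frame and converts it with tibble::as_tibble(); it assumes the tibble package (installed along with the tidyverse) is available and skips that part otherwise.

```r
# Hypothetical toy data
df <- data.frame(record_id = c(1, 72, 224), weight = c(NA, NA, 218))

is.data.frame(df[, 1])  # FALSE: base data frames simplify one column to a vector

# Only run the tibble part if the tibble package is installed
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb[, 1]))    # "tbl_df" "tbl" "data.frame": still a tibble
  print(class(tb[[1]]))    # "numeric": double brackets pull out the vector
  print(class(tb$weight))  # "numeric": so does $
}
```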
This lesson is adapted from the Data Carpentry: R for Data Analysis and Visualization of Ecological Data Starting With Data materials.