{dplyr} and {tidyr}
Session 3
2023-09-13
These slides are adapted from Ch. 4 Data transformation in R for Data Science (2e).
{dplyr}
filter(), which changes which rows are present without changing their order, and
arrange(), which changes the order of the rows without changing which are present.
distinct(), which finds rows with unique values; unlike arrange() and filter(), it can also optionally modify the columns.
mutate() creates new columns that are derived from the existing columns,
select() changes which columns are present,
rename() changes the names of the columns, and
relocate() changes the positions of the columns.
group_by(), summarize(), and the slice_() family of functions.
The base pipe (|>) was added to base R in version 4.1.0, released in 2021. For simple uses, the base pipe behaves identically to the %>% pipe from the magrittr package.
The %>% pipe is part of the magrittr package, which is loaded as part of the tidyverse. This operator has some additional features, but they are less frequently used.
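As a minimal sketch of the two pipes, the calls below count the 2005 observations in the storms data that ships with {dplyr} in three equivalent ways. The choice of year is only an illustration.

```r
library(dplyr)

# Nested call: read from the inside out
nested <- nrow(filter(storms, year == 2005))

# Base pipe (requires R >= 4.1.0): read from top to bottom
base_piped <- storms |>
  filter(year == 2005) |>
  nrow()

# magrittr pipe, re-exported by dplyr
magrittr_piped <- storms %>%
  filter(year == 2005) %>%
  nrow()
```

All three forms return the same count; the piped versions simply put the steps in reading order.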
select()
Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns).
Select the variables to keep:
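A short sketch of keeping columns by name, using the storms data from {dplyr}. The three columns chosen here are an arbitrary illustration.

```r
library(dplyr)

# Keep only the listed columns, in the order given
storm_ids <- select(storms, name, year, status)
names(storm_ids)
```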
Select the variables to drop using - or !:
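Dropping columns could look like the sketch below, which removes the coordinate columns from storms in two equivalent ways. The columns dropped are only an example.

```r
library(dplyr)

# Drop columns one at a time with -
no_position <- select(storms, -lat, -long)

# Drop a set of columns by negating it with !
no_position_alt <- select(storms, !c(lat, long))
```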
Select a range of variables to keep using the : operator:
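A range selection might look like this sketch, which assumes the column order of the storms data in current {dplyr} releases (year, month, day, hour are adjacent).

```r
library(dplyr)

# Keep every column from year through hour, based on position in the data frame
storm_dates <- select(storms, year:hour)
names(storm_dates)
```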
Use helper functions from {tidyselect}:
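The sketch below shows a few {tidyselect} helpers that {dplyr} re-exports, applied to storms. Which helpers the original slide demonstrated is not known, so these are illustrative picks.

```r
library(dplyr)

# Columns whose names end with a given suffix
diameters <- select(storms, ends_with("diameter"))

# Columns whose names contain a given string
force_cols <- select(storms, contains("force"))

# Columns selected by type rather than by name
numeric_cols <- select(storms, where(is.numeric))
```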
select() works with both unquoted and quoted column names:
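As a sketch, the three calls below return the same columns. The all_of() form is an addition here; it is the usual way to select from a character vector stored in a variable.

```r
library(dplyr)

unquoted <- select(storms, name, wind)
quoted <- select(storms, "name", "wind")

# Column names held in a character vector (wanted is a hypothetical name)
wanted <- c("name", "wind")
from_vector <- select(storms, all_of(wanted))
```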
select() returns an error if you use a variable name that does not exist in the data:
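A sketch of the failure, using windspeed as a deliberately wrong (hypothetical) column name. The error is caught with tryCatch() so the script keeps running; any_of() is shown as one way to tolerate missing names.

```r
library(dplyr)

# storms has a wind column but no windspeed column, so this call errors
error_message <- tryCatch(
  select(storms, windspeed),
  error = function(e) conditionMessage(e)
)

# any_of() silently skips names that are not present
tolerant <- select(storms, any_of(c("wind", "windspeed")))
names(tolerant)
```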
Your turn: use select() to subset the variables for wind, pressure, tropical storm force diameter, and hurricane force diameter.
OK. Here is one answer:
Here is another answer:
And here is yet another answer:
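The original answers are not reproduced here. The sketch below shows three possible ways to get the same four columns, assuming the column names and column order of storms in current {dplyr} releases.

```r
library(dplyr)

# 1. Name every column explicitly
answer_1 <- select(
  storms,
  wind, pressure, tropicalstorm_force_diameter, hurricane_force_diameter
)

# 2. Use a range, since the four columns sit next to each other
answer_2 <- select(storms, wind:hurricane_force_diameter)

# 3. Mix names with a tidyselect helper
answer_3 <- select(storms, wind, pressure, ends_with("diameter"))
```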
filter()
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions.
Use a logical condition to keep the rows for which it returns TRUE:
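A minimal sketch with a single condition on storms. The 100-knot threshold is an arbitrary example.

```r
library(dplyr)

# Keep observations with sustained winds of at least 100 knots
strong_winds <- filter(storms, wind >= 100)
```

Rows where the condition evaluates to FALSE or NA are dropped.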
Multiple tests separated by commas are combined so the returned rows pass all tests:
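The sketch below combines two conditions, first with a comma and then with an explicit &. The year and status values are illustrative.

```r
library(dplyr)

# Comma-separated conditions must all be TRUE
hurricanes_2005 <- filter(storms, year == 2005, status == "hurricane")

# Equivalent form with an explicit logical AND
hurricanes_2005_and <- filter(storms, year == 2005 & status == "hurricane")
```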
You can combine tests with a logical “OR” operator (|):
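As a sketch, this keeps observations recorded in either August or September. The months are only an example.

```r
library(dplyr)

# A row is kept if at least one side of | is TRUE
aug_or_sep <- filter(storms, month == 8 | month == 9)
```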
But it may be easier to write the condition using a different operator:
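The slide does not name the operator; assuming it is %in%, a sketch of the shorter form next to the equivalent | version looks like this.

```r
library(dplyr)

# Long form: one comparison per value
with_or <- filter(storms, month == 8 | month == 9)

# Short form: test membership in a set of values
with_in <- filter(storms, month %in% c(8, 9))
```

The %in% form scales better as the list of accepted values grows.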
Your turn: Hurricane Lee has hurricane-force winds extending up to 115 miles from its center and tropical storm-force winds extending for some 240 miles. Can you use filter() to find the name and year of a hurricane with observed wind extents that are the same or greater?
…
OK, here is an answer:
mutate()
mutate() creates new columns that are functions of existing variables.
It can also modify columns (if the name is the same as an existing column) and delete columns (by setting their value to NULL).
Use a function to add a new column based on existing variables:
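A sketch of adding, modifying, and deleting columns in one mutate() call. It assumes the units documented for storms (wind in knots, pressure in millibars); wind_mph is a hypothetical column name.

```r
library(dplyr)

storms_mph <- mutate(
  storms,
  wind_mph = wind * 1.15078,  # add: convert knots to miles per hour
  pressure = pressure / 10,   # modify: convert millibars to kilopascals
  hour = NULL                 # delete: drop the hour column
)
```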
mutate() also has .before and .after parameters, allowing you to place the new variables before or after a selected variable or range of variables:
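A short sketch of both placement parameters; wind_mph is a hypothetical column name.

```r
library(dplyr)

# Place the new column directly after wind
placed_after <- mutate(storms, wind_mph = wind * 1.15078, .after = wind)

# Place the new column at the very front of the data frame
placed_first <- mutate(storms, wind_mph = wind * 1.15078, .before = 1)
```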
Window functions are vectorized functions that take a vector of values and return a vector of the same length, which makes them a good fit for mutate() in {dplyr}.
For example, lag() returns the previous value for a variable (this effectively assumes the observations are arranged in a meaningful order):
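A sketch of lag() on the ungrouped storms data; wind_prev and wind_change are hypothetical column names.

```r
library(dplyr)

with_lag <- mutate(
  storms,
  wind_prev = lag(wind),       # value from the row above
  wind_change = wind - lag(wind)
)
```

The first row has no previous observation, so lag() returns NA there.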
There is actually a problem with this new variable. Can you think what it is?
In this case, lag() needs to be applied to a grouped data frame, or it may return a value from a different storm and a different year:
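One way to sketch the grouped version is below. It assumes that the combination of name and year identifies a storm, which is a simplification: names are reused across years, and a storm that crosses a year boundary would be split in two.

```r
library(dplyr)

grouped_lag <- storms |>
  group_by(name, year) |>
  mutate(wind_change = wind - lag(wind)) |>  # lag() restarts in each group
  ungroup()
```

The first observation of every storm now gets NA instead of a value borrowed from the previous storm.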
mutate()
lag()
cumsum()
ntile()
between()
case_when()
case_when() (one of the best!)
case_when() is an especially useful vector function with varied applications.
For example, we can use it to create new categorical variables based on continuous variables:
# case_when() checks the conditions in order; the first one that is TRUE sets the label
mutate(
  storms,
  beaufort_desc = case_when(
    wind < 1 ~ "Calm",
    wind < 4 ~ "Light Air",
    wind < 8 ~ "Light Breeze",
    wind < 13 ~ "Gentle Breeze",
    wind < 19 ~ "Moderate Breeze",
    wind < 25 ~ "Fresh Breeze",
    wind < 32 ~ "Strong Breeze",
    wind < 39 ~ "Near Gale",
    wind < 47 ~ "Gale",
    wind < 55 ~ "Strong Gale",
    wind < 64 ~ "Whole Gale",
    wind < 75 ~ "Storm Force",
    .default = "Hurricane Force"
  )
)
summarise(), group_by(), and slice_()
summarise()
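summarise() creates a new data frame with one row for each combination of grouping variables, one column for each grouping variable, and one column for each summary statistic you specify.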
summarise() works well with “summary” or analysis functions that take a vector and return a single value:
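A minimal sketch of an ungrouped summary of storms; the output column names are hypothetical.

```r
library(dplyr)

overall <- summarise(
  storms,
  mean_wind = mean(wind),
  max_wind = max(wind),
  n_obs = n()  # number of rows summarised
)
```

With no grouping, the result is a single row.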
summarize()
Most often you will want to use summarise() in combination with group_by():
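A sketch of a grouped summary that returns one row per year; the summary columns are illustrative.

```r
library(dplyr)

by_year <- storms |>
  group_by(year) |>
  summarise(
    max_wind = max(wind),
    n_storms = n_distinct(name)  # distinct storm names observed that year
  )
```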
You can also use the .by parameter to define the groups for summarise():
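A sketch of per-operation grouping. The .by parameter needs {dplyr} 1.1.0 or later, and the grouping by name and year is an illustrative choice.

```r
library(dplyr)

# Groups apply to this call only; the result is never a grouped data frame
by_storm <- summarise(
  storms,
  max_wind = max(wind),
  .by = c(name, year)
)
```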
across() is a helper function that you can use in combination with mutate() or summarise():
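The sketch below applies across() in both settings. The column choices and the unit conversion (nautical miles to statute miles, per the storms documentation) are illustrative.

```r
library(dplyr)

# summarise(): apply several functions to several columns at once
by_status <- storms |>
  group_by(status) |>
  summarise(across(c(wind, pressure), list(mean = mean, max = max)))

# mutate(): transform every column picked by a tidyselect helper
in_miles <- mutate(
  storms,
  across(ends_with("diameter"), function(x) x * 1.15078)
)
```

With a named list of functions, the output columns are named after the column and the function, for example wind_mean.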
sf objects
First convert storms into an sf object:
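A sketch of the conversion, assuming the long and lat columns are longitude and latitude in WGS 84 (EPSG:4326); the slide does not state which coordinate reference system was used.

```r
library(dplyr)
library(sf)

# Build POINT geometry from the coordinate columns (x = long, y = lat)
storms_sf <- st_as_sf(storms, coords = c("long", "lat"), crs = 4326)
```

By default the coordinate columns are replaced by a geometry column.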
sf objects
If you are just working with attributes (variables), sf objects work just like any other data frame:
But you can also use a set of spatial predicate functions that work with sf objects and return logical values that can be used with filter():
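A sketch of a spatial filter. The bounding box (gulf_box) is a hypothetical rectangle roughly covering the Gulf of Mexico, and the coordinate reference system is assumed to be EPSG:4326.

```r
library(dplyr)
library(sf)

storms_sf <- st_as_sf(storms, coords = c("long", "lat"), crs = 4326)

# Hypothetical area of interest built from a bounding box
gulf_box <- st_as_sfc(
  st_bbox(c(xmin = -98, ymin = 18, xmax = -81, ymax = 31), crs = st_crs(4326))
)

# sparse = FALSE gives a logical matrix; its single column is a logical vector
gulf_obs <- filter(
  storms_sf,
  st_intersects(geometry, gulf_box, sparse = FALSE)[, 1]
)
```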
By default, st_intersects() returns a sparse list that holds, for each feature of the first parameter, the indices of the features of the second parameter that it intersects. Set sparse = FALSE to get a logical matrix instead.
For example, this takes each observation and checks if the POINT geometry intersects with each U.S. state:
sf objects
You can use summarise() to combine geometry by grouping variables:
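A sketch of a grouped summary on an sf object. It is limited to a single year to keep it quick, and the coordinate reference system is assumed to be EPSG:4326.

```r
library(dplyr)
library(sf)

storms_sf <- st_as_sf(storms, coords = c("long", "lat"), crs = 4326)

# One row per storm; the points in each group are unioned into one geometry
storm_points <- storms_sf |>
  filter(year == 2005) |>
  group_by(name) |>
  summarise(max_wind = max(wind))
```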
You can also work with the geometry column directly to modify the returned geometry.
For example, we can use st_combine() to turn the POINT geometry into MULTIPOINT geometry and then use st_cast() to transform the MULTIPOINT geometry into lines:
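One way this could be sketched is shown below. It assumes EPSG:4326 coordinates, uses a single year, and drops storms with only one observation because a line needs at least two points.

```r
library(dplyr)
library(sf)

storms_sf <- st_as_sf(storms, coords = c("long", "lat"), crs = 4326)

storm_tracks <- storms_sf |>
  filter(year == 2005) |>
  group_by(name, year) |>
  filter(n() > 1) |>
  # st_combine() keeps the points in row order as one MULTIPOINT per storm
  summarise(geometry = st_combine(geometry), .groups = "drop") |>
  # Connect each storm's points in order to form a track line
  st_cast("LINESTRING")
```

Because the rows of storms are in chronological order within each storm, the resulting lines follow the storm tracks.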