pak::pkg_install("nycflights13")
Exercise 03
ℹ️ See week 3 for related slides and readings
1 Overview
This week’s exercise comes directly from the data transformation chapter of R for Data Science. Our exercises will typically include spatial data, but I wanted to use a tried-and-tested exercise for this week’s material.
2 Setup
If you don’t already have the {nycflights13} package installed, go ahead and install it, then restart your R session before continuing with the exercise.
In addition to {nycflights13}, you will also need {dplyr} and {ggplot2}. Load the tidyverse library to make sure you have everything you need:
library(nycflights13)
library(tidyverse)
3 Exercises
3.1 Working with rows
In a single pipeline for each condition, find all flights that meet the condition:
- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
  ____
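As a reminder of the syntax these conditions rely on, here is a minimal sketch of filter() on a small made-up table. The `trips` tibble and its columns are hypothetical and are not part of nycflights13; the same patterns (comparison operators, `%in%`, and combining conditions with `&`) should carry over to `flights`.

```r
library(dplyr)

# Toy data invented for illustration (not from nycflights13)
trips <- tibble(
  id       = 1:5,
  city     = c("A", "B", "C", "A", "B"),
  late_min = c(5, 130, 45, 0, 200)
)

# Keep rows where a numeric column meets a threshold
very_late <- trips |>
  filter(late_min >= 120)

# Keep rows matching any of several values, combined with a second condition
late_in_a_or_b <- trips |>
  filter(city %in% c("A", "B") & late_min > 0)
```

Each pipeline returns a new tibble and leaves `trips` unchanged, so you can write one pipeline per condition without them interfering with each other.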
Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.
flights |>
  arrange(____)
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
flights |>
  ____
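To illustrate the hint, the sketch below sorts a made-up table by a value computed inside arrange(). The `runs` tibble and its columns are hypothetical; only `arrange()` and `desc()` come from dplyr.

```r
library(dplyr)

# Toy data invented for illustration (not from nycflights13)
runs <- tibble(
  runner  = c("p", "q", "r"),
  km      = c(10, 5, 21),
  minutes = c(60, 25, 84)
)

# Largest value first: wrap the column in desc()
longest_first <- runs |>
  arrange(desc(km))

# Sort by a quantity computed on the fly (minutes per km), smallest first
by_pace <- runs |>
  arrange(minutes / km)
```

The computed expression is used only for ordering; it is not added as a column to the result.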
Was there a flight on every day of 2013?
flights |>
  ____
Which flights traveled the farthest distance? Which traveled the least distance?
flights |>
  ____
Does it matter in what order you use filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
____
3.2 Working with columns
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
____
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, ____)
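As a starting point for the brainstorm, here is a small sketch showing three selection styles on a made-up table: by name, by column range, and with a helper. The `readings` tibble is hypothetical; `select()` and `starts_with()` are dplyr/tidyselect functions.

```r
library(dplyr)

# Toy data invented for illustration (not from nycflights13)
readings <- tibble(
  site     = c("n1", "n2"),
  temp_min = c(1, 3),
  temp_max = c(9, 12),
  rain_mm  = c(0, 4)
)

# Three ways to pick the same two columns
by_name   <- readings |> select(temp_min, temp_max)
by_range  <- readings |> select(temp_min:temp_max)
by_prefix <- readings |> select(starts_with("temp"))
```

The range form depends on the columns being adjacent in the data frame, so it may not transfer directly to the four columns asked for above.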
What happens if you specify the name of the same variable multiple times in a select() call?
select(flights, ____)
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights |> select(contains("TIME"))
Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
flights |>
  rename(____)
Why doesn’t the following work, and what does the error mean?
flights |>
  select(tailnum) |>
  arrange(arr_delay)
3.3 Working with groups
Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
flights |>
  ____
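The general group-then-summarize pattern can be sketched on a made-up table as follows. The `orders` tibble and its columns are hypothetical. The `na.rm = TRUE` argument matters because, as in the delay columns of `flights`, missing values would otherwise make the group mean `NA`.

```r
library(dplyr)

# Toy data invented for illustration (not from nycflights13)
orders <- tibble(
  shop = c("x", "x", "y", "y", "y"),
  wait = c(10, NA, 4, 6, 8)
)

# One row per group: average wait (ignoring NAs) and number of rows
avg_wait <- orders |>
  group_by(shop) |>
  summarize(
    mean_wait = mean(wait, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(mean_wait))
```

Reporting `n` next to the mean is a useful habit: an average based on very few rows deserves less weight than one based on many.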
Find the flights that are most delayed upon departure from each destination.
flights |>
  ____
How do delays vary over the course of the day? Illustrate your answer with a plot.
What happens if you supply a negative n to slice_min() and friends?
slice_min(flights, ____)
Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
count(flights, ____)
count(flights, ____, sort = ____)