pak::pkg_install("nycflights13")
Exercise 03
1 Overview
This week’s exercise comes directly from the data transformation chapter of R for Data Science. Our exercises will typically include spatial data, but I wanted to use a more tried-and-tested exercise for this week’s material.
2 Setup
If you don’t already have the {nycflights13} package installed, go ahead and install it, then restart R before continuing with the exercise.
In addition to {nycflights13}, you will need {dplyr} and {ggplot2}. Load the tidyverse library to make sure you have everything you need:
library(tidyverse)
library(nycflights13)
3 Exercises
3.1 Working with rows
In a single pipeline for each condition, find all flights that meet the condition:
- Had an arrival delay of two or more hours
flights |>
  ____
- Flew to Houston (IAH or HOU)
flights |>
  ____
- Were operated by United, American, or Delta
flights |>
  ____
- Departed in summer (July, August, and September)
flights |>
  ____
- Arrived more than two hours late, but didn’t leave late
flights |>
  ____
- Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
  ____
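As a reminder of the filter() syntax these conditions rely on, here is a minimal sketch on made-up data. The trips tibble and its columns are hypothetical and are not part of {nycflights13}; the sketch shows the operators only, not the answers.

```r
library(dplyr)

# Hypothetical toy data, not part of nycflights13.
trips <- tibble(
  id = 1:5,
  city = c("BOS", "ORD", "BOS", "LAX", "SEA"),
  minutes_late = c(5, 130, -3, 45, 200)
)

# A comparison operator keeps rows at or above a threshold.
late <- trips |>
  filter(minutes_late >= 120)

# %in% matches any value in a set; a comma between conditions means "and".
west_late <- trips |>
  filter(city %in% c("LAX", "SEA"), minutes_late > 0)
```

Conditions separated by commas must all hold for a row to be kept; use | when either one is enough.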
Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.
flights |>
  arrange(____)
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
flights |>
  ____
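If the hint about doing math inside a function is unclear, the following sketch may help. It uses a hypothetical runs tibble (not from {nycflights13}) to show that arrange() accepts desc() and computed expressions.

```r
library(dplyr)

# Hypothetical toy data, not part of nycflights13.
runs <- tibble(
  runner = c("a", "b", "c"),
  km = c(10, 5, 21),
  minutes = c(60, 22, 105)
)

# arrange() sorts ascending by default; desc() reverses the order.
slowest_first <- runs |>
  arrange(desc(minutes))

# An expression can be computed inside arrange() without adding a column.
best_pace_first <- runs |>
  arrange(minutes / km)
```

The computed value is used only for ordering, so the result keeps the original columns.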
Answer the following questions, including code blocks that show the code you used to determine each answer.
Was there a flight on every day of 2013? ____
Which flights traveled the farthest distance? ____
Which traveled the least distance? ____
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
____
Now is a good time to render, commit, and push your changes to GitHub with an informative commit message.
Make sure to commit and push all changed files so that your Git pane is empty afterwards.
3.2 Working with columns
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
____
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, ____)
What happens if you specify the name of the same variable multiple times in a select() call?
select(flights, ____)
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights |> select(contains("TIME"))
Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
flights |>
  rename(____)
Why doesn’t the following work, and what does the error mean?
flights |>
  select(tailnum) |>
  arrange(arr_delay)
Don’t forget to render, commit, and push your changes to GitHub with an informative commit message.
3.3 Working with groups
Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
flights |>
  ____
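The group_by() and summarize() pattern from the hint can be sketched on made-up data. The orders tibble and its columns are hypothetical, not from {nycflights13}; the point is how per-group averages and counts are computed, and why missing values need handling.

```r
library(dplyr)

# Hypothetical toy data, not part of nycflights13.
orders <- tibble(
  shop = c("x", "x", "y", "y", "y"),
  region = c("north", "south", "north", "north", "south"),
  wait = c(10, NA, 30, 50, 10)
)

# One summary row per shop.
by_shop <- orders |>
  group_by(shop) |>
  summarize(
    mean_wait = mean(wait, na.rm = TRUE), # drop missing values before averaging
    n = n()                               # rows per group, including the NA row
  )

# Grouping by two variables shows how many rows each combination contributes.
by_pair <- orders |>
  group_by(shop, region) |>
  summarize(n = n(), .groups = "drop")
```

Small counts in some combinations are worth keeping in mind when comparing group averages.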
Find the flights that are most delayed upon departure from each destination.
flights |>
  ____
How do delays vary over the course of the day? Illustrate your answer with a plot.
What happens if you supply a negative n to slice_min() and friends?
slice_min(flights, ____)
Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
count(flights, ____)
count(flights, ____, sort = ____)
Render, commit, and push your final changes to GitHub with a meaningful commit message.
Make sure to commit and push all changed files so that your Git pane is empty afterwards.