Exercise 03

Modified

September 5, 2024

Exercise due on 2023-09-18

ℹ️ See week 3 for related slides and readings

1 Overview

This week’s exercise comes directly from the data transformation chapter of R for Data Science. Our exercises will typically include spatial data, but I wanted to use a more tried-and-tested exercise for this week’s material.

2 Setup

If you don’t already have the {nycflights13} package installed, install it and then restart R before continuing with the exercise.

pak::pkg_install("nycflights13")
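The install command above assumes that {pak} itself is already available. If it might not be, a minimal sketch along these lines installs it from CRAN first and skips the {nycflights13} install when the package is already present (the guard logic is an assumption of this sketch, not a course requirement):

```r
# Assumption: pak may not be installed yet, so install it from CRAN if missing.
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak")
}

# Only install nycflights13 when it is not already available.
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  pak::pkg_install("nycflights13")
}
```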

In addition to {nycflights13}, you will need {dplyr} and {ggplot2}. Load the tidyverse package to make sure you have everything you need:

library(nycflights13)
library(tidyverse)
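To confirm that the setup worked, one option is to take a quick look at the flights data frame before starting. This is only a sanity-check sketch; it loads {dplyr} directly, which is also attached when you load the tidyverse:

```r
library(nycflights13)
library(dplyr)

# Preview the column names, types, and first few values
glimpse(flights)

# Number of rows and columns
dim(flights)
```

If both calls run without an error, the packages needed for the exercises below are installed and loaded.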

3 Exercises

3.1 Working with rows

In a single pipeline for each condition, find all flights that meet the condition:

  • Had an arrival delay of two or more hours
  • Flew to Houston (IAH or HOU)
  • Were operated by United, American, or Delta
  • Departed in summer (July, August, and September)
  • Arrived more than two hours late, but didn’t leave late
  • Were delayed by at least an hour, but made up over 30 minutes in flight

flights |> 
  ____
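As a reminder of the general pattern, here is a small sketch of how conditions can be written inside filter(). It uses a made-up toy tibble (the names toy, id, city, and delay are hypothetical and not part of nycflights13), so it illustrates the syntax without answering the questions above:

```r
library(dplyr)

# Hypothetical toy data, not taken from nycflights13
toy <- tibble(
  id    = 1:5,
  city  = c("A", "B", "C", "A", "B"),
  delay = c(5, 130, -10, 60, 200)
)

# Comparison: rows with a delay of at least 120
toy |>
  filter(delay >= 120)

# Set membership: rows whose city is "A" or "B"
toy |>
  filter(city %in% c("A", "B"))

# Two conditions that must both hold (separating with a comma means AND)
toy |>
  filter(city == "B", delay > 150)
```

Writing one pipeline per condition, as in this sketch, keeps each result easy to inspect on its own.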

Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

flights |> 
  arrange(____)

Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

flights |> 
  ____

Was there a flight on every day of 2013?

flights |> 
  ____

Which flights traveled the farthest distance? Which traveled the least distance?

flights |> 
  ____

Does it matter in what order you use filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

____

3.2 Working with columns

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

____

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, ____)

What happens if you specify the name of the same variable multiple times in a select() call?

select(flights, ____)

What does the any_of() function do? Why might it be helpful in conjunction with this vector?

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights |> 
  rename(____)

Why doesn’t the following work, and what does the error mean?

flights |> 
  select(tailnum) |> 
  arrange(arr_delay)

3.3 Working with groups

Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

flights |> 
  ____
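The hint above relies on the group_by() and summarize() pattern. As a rough sketch of that pattern on made-up data (the tibble toy and its columns group and value are hypothetical), a grouped summary with a mean and a row count might look like this:

```r
library(dplyr)

# Hypothetical toy data, not taken from nycflights13
toy <- tibble(
  group = c("x", "x", "y", "y", "y"),
  value = c(10, NA, 3, 5, 7)
)

# One row per group: the mean of value (ignoring missing values)
# and the number of rows in the group
toy_summary <- toy |>
  group_by(group) |>
  summarize(
    mean_value = mean(value, na.rm = TRUE),
    n = n()
  )

toy_summary
```

Note that na.rm = TRUE is needed here because mean() returns NA when any value in the group is missing; the delay columns in flights also contain missing values.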

Find the flights that are most delayed upon departure from each destination.

flights |> 
  ____

How do delays vary over the course of the day? Illustrate your answer with a plot.

What happens if you supply a negative n to slice_min() and friends?

slice_min(flights, ____)

Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

count(flights, ____)

count(flights, ____, sort = ____)