Exercise 04

Modified

November 20, 2024

1 Overview

This week’s exercises are excerpted from Ch. 3, Ch. 4, and Ch. 5 in Geocomputation with R.

These exercises build on our last exercise using {dplyr} and revisit some of the same skills, including:

Filtering rows or observations
Grouping and summarizing data by variable

New skills you will practice with this exercise include:

Using non-spatial joins for data frames
Computing geometric measurements
Using spatial filters
Using geometric operations on a simple feature geometry set
Using geometric operations on pairs of simple feature geometries

2 Setup

This exercise uses the sf and tidyverse packages:

library(tidyverse)
library(sf)

We are also going to use the us_states and us_states_df data from the {spData} package:

library(spData)

Note that the us_states loaded for this exercise is different from the us_states we created during class with the tigris::states() function. The bonus exercises (marked Optional) are mixed in with the other questions, but you are welcome to skip them if you do not want to go for the bonus part of the exercise.

3 Exercises

3.1 Filtering data

Find all states that belong to the West region, have an area below 250,000 km², and had a population greater than 5,000,000 residents in 2015 (Hint: you may need to use the function units::set_units() or as.numeric()).

us_states |> 
  ____
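The hint about units matters because an area column stored with units may not compare directly against a bare number. A minimal sketch of both approaches, using a made-up `areas` vector rather than the real us_states data:

```r
library(units)

# Hypothetical areas, stored with units the way an sf area column usually is
areas <- set_units(c(120000, 300000), km^2)

# Option 1: give the threshold the same units before comparing
below_with_units <- areas < set_units(250000, km^2)

# Option 2: drop the units and compare plain numbers
below_as_numeric <- as.numeric(areas) < 250000
```

Either approach can be used inside filter(); the first keeps the units explicit, while the second relies on you remembering which unit the column is stored in.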

Find all states that belong to the South region, had an area larger than 150,000 km² or a total population in 2015 larger than 7,000,000 residents.

us_states |> 
  ____

Render, commit, and push your changes to GitHub with the commit message “Added answers for filtering data questions”.

3.2 Joining and summarizing data

What was the total population in 2015 in the us_states dataset? What were the minimum and maximum total population in 2015?

us_states |> 
  ____

Add variables from us_states_df to us_states, and create a new object called us_states_stats.

What function did you use and why?
Which variable is the key in both datasets?
What is the class of the new object?

Tip: we are covering joins in more detail next week—check out the R for Data Science chapter on Joins for more information.

us_states_stats <- us_states |> 
  ____

us_states_df has two more rows than us_states. How can you find them?

Hint: try to use the dplyr::anti_join() function.

____(us_states_df, us_states, by = ____)
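Keep in mind that dplyr::anti_join(x, y) returns the rows of x that have no match in y, so the order of the two tables matters, and the by argument is needed when the key columns have different names. A small sketch with two hypothetical tables (`places` and `place_stats` are invented for illustration):

```r
library(dplyr)

# Hypothetical tables whose key columns have different names
places <- tibble(NAME = c("A", "B"))
place_stats <- tibble(state = c("A", "B", "C", "D"), value = 1:4)

# Rows of place_stats with no matching NAME in places
unmatched <- anti_join(place_stats, places, by = c("state" = "NAME"))
unmatched
```

Swapping the first two arguments would instead look for rows of `places` that are missing from `place_stats`, which returns no rows here.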

How much has population density changed between 2010 and 2015 in each state?

Calculate the change in percentages and map them with plot() or geom_sf().

Calculate the change in the number of residents living below the poverty level between 2010 and 2015 for each state.

Hint: See ?us_states_df for documentation on the poverty level columns.

Optional: Calculate the change in the percentage of residents living below the poverty level in each state.

For each region, what were the minimum, average, and maximum numbers of people per state living below the poverty line in 2015?

Optional: What is the region with the largest increase in people living below the poverty line?

3.3 Spatial operations

Section 4.2 (in Geocomputation with R) established that Canterbury was the region of New Zealand containing most of the 100 highest points in the country.

How many of these high points does the Canterbury region contain?

canterbury <- nz |>
  filter(Name == "Canterbury")

nz_height |> 
  ____

Optional: plot the result using the ggplot2::geom_sf() function to show all of New Zealand, with the Canterbury region highlighted in yellow, high points in Canterbury represented by red crosses (Hint: try using shape = 7), and high points in other parts of New Zealand represented by blue circles.

See the help page ?ggplot2::shape and run the examples to see an illustration of different shape values.

Which region has the second highest number of nz_height points, and how many does it have?

nz_height |> 
  ____

Generalizing the question to all regions: how many of New Zealand’s 16 regions contain points which belong to the top 100 highest points in the country? Which regions?

Optional: create a table listing these regions in order of the number of points and their name. Hint: use dplyr::slice_max() and gt::gt().

Using st_buffer(), how many points in nz_height are within 100 km of Canterbury?

canterbury_area <- st_buffer(canterbury, dist = ____)

nz_height |> 
  ____
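When choosing a value for dist, note that st_buffer() interprets the distance in the units of the layer's coordinate reference system when the data are projected. The sketch below uses a hypothetical square region and three invented points in a meter-based CRS, not the nz data:

```r
library(sf)

# Hypothetical square region and points in a projected CRS (units: meters)
region <- st_sf(
  name = "square",
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
    crs = 3857
  )
)
pts <- st_as_sf(
  data.frame(id = 1:3, x = c(5, 12, 30), y = c(5, 5, 5)),
  coords = c("x", "y"),
  crs = 3857
)

# Buffer distance is given in CRS units, so 5 here means 5 meters
region_buffer <- st_buffer(region, dist = 5)

# Keep only the points that intersect the buffered region, then count them
pts_near <- st_filter(pts, region_buffer)
nrow(pts_near)
```

Checking st_crs() on your own data before buffering is a quick way to confirm which unit the distance will be read in.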

Render, commit, and push your changes to GitHub again with a second informative commit message.

3.4 Spatial predicates

Test your knowledge of spatial predicates by finding out and plotting how US states relate to each other and other spatial objects.

The starting point of this part of the exercise is to create an object representing the state of Maryland using the filter() function and plot the resulting object in the context of US states.

maryland <- filter(____, ____)

ggplot() +
  geom_sf(data = us_states) +
  geom_sf(data = ____)

Create a new object representing all the states that geographically intersect with Maryland and plot the result (hint: the most concise way to do this is with the subsetting method [ but you can also use sf::st_filter()).

states_intersecting_md <- ____

Create another object representing all the states that touch (have a shared boundary with) Maryland and plot the result.

Hint: remember that the op argument (when subsetting with base R) and the .predicate argument (when using st_filter()) both default to st_intersects, and you can set either one to a different spatial predicate.

states_touching_md <- ____
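The difference between intersecting and touching is easiest to see on simple shapes. A sketch using three hypothetical unit squares (the `unit_square()` helper is invented for this illustration): "a" and "b" share an edge, and "c" is far away.

```r
library(sf)

# Hypothetical helper: a unit square with its lower-left corner at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

squares <- st_sf(
  id = c("a", "b", "c"),
  geometry = st_sfc(unit_square(0), unit_square(1), unit_square(3))
)
target <- squares[squares$id == "a", ]

# Default predicate: "a" intersects itself and its neighbor "b"
intersecting <- squares[target, ]

# st_touches: only boundaries meet, so "a" itself is excluded
touching <- squares[target, , op = st_touches]
touching_filter <- st_filter(squares, target, .predicate = st_touches)
```

Note that a geometry intersects itself but does not touch itself, which is why the two results differ by one row.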

Optional: create a straight line from the centroid of Maryland to the centroid of California near the West coast of the USA (hint: functions st_centroid(), st_union() and st_cast() described in Chapter 5 may help) and identify which states this long East-West line crosses.

How far is the geographic centroid of Maryland from the geographic centroid of Canterbury, New Zealand?

Calculate the perimeter of the boundary lines of US states in meters. Which state has the longest border and which has the shortest?

Hint: st_perimeter is a recent addition to {sf} that works with POLYGON or MULTIPOLYGON geometry. If you use st_length, make sure you convert your data to LINESTRING or MULTILINESTRING geometry.

us_states |> 
  ____
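To illustrate the hint about geometry types, here is a sketch on a hypothetical 10 m by 10 m square. It assumes nothing about your installed {sf} version, so the call to st_perimeter() is guarded by a check that the function exists:

```r
library(sf)

# Hypothetical 10 m x 10 m square in a projected CRS (units: meters)
square <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
  crs = 3857
)

# st_length() measures lines, so cast the polygon boundary to lines first
boundary <- st_cast(square, "MULTILINESTRING")
perimeter_from_length <- st_length(boundary)

# st_perimeter() only exists in newer {sf} releases, so check before calling it
if (exists("st_perimeter", envir = asNamespace("sf"))) {
  perimeter_direct <- sf::st_perimeter(square)
}
```

Both results carry units, so as.numeric() or units::set_units() may be needed before sorting or comparing them with plain numbers.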

3.5 Tidy data

us_states_df has information on median income and poverty in a “wide” format. Pivot the data into a long format using tidyr::pivot_longer().

us_states_df |> 
  pivot_longer(
    cols = ____
  )

Why could it be useful to have this data in a wide format?

Optional: By default, the new “name” column created by pivot_longer() contains the existing column names. Try passing two names to names_to and using the names_pattern argument to split them into separate variable and year columns; names_transform can then convert the year to a number:

us_states_df |> 
  pivot_longer(
    cols = ____,
    ____
  )
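As an illustration of splitting column names while pivoting, the sketch below uses a hypothetical `wide` table (not us_states_df) whose column names combine a variable and a year:

```r
library(tidyr)

# Hypothetical wide table: one column per variable-year combination
wide <- tibble::tibble(
  id = c("a", "b"),
  height_2020 = c(1.1, 1.2),
  height_2021 = c(1.3, 1.4),
  weight_2020 = c(10, 20),
  weight_2021 = c(11, 21)
)

long <- pivot_longer(
  wide,
  cols = -id,
  # Two output columns, one per capture group in names_pattern
  names_to = c("variable", "year"),
  names_pattern = "(.*)_(\\d+)",
  # Convert the captured year from text to integer
  names_transform = list(year = as.integer)
)
```

The regular expression will need adjusting for column names that follow a different scheme.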

The last question is from Ch. 5 Data Tidying from R4DS. For this question, we are using a handful of sample tables included with the {tidyr} package:

table1

table2

table3

Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:

Extract the number of TB cases per country per year.
Extract the matching population per country per year.
Divide cases by population, and multiply by 10000.
Store back in the appropriate place.

You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.
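For comparison, the same rate is straightforward to compute on table1, because cases and population already sit in the same row. This sketch covers only the tidy table1, so the steps needed for table2 and table3 are still left for you to think through:

```r
library(dplyr)
library(tidyr)  # provides the sample table1

# table1 is already tidy: one row per country and year
table1_rate <- table1 |>
  mutate(rate = cases / population * 10000)
```

Thinking about what would have to happen to table2 and table3 before a single mutate() like this could work is a good way to approach the sketch.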

Don’t forget to render, commit, and push your changes to GitHub one last time with an informative commit message.