Exercise 04
1 Overview
This week’s exercises are excerpted from Ch. 3, Ch. 4, and Ch. 5 in Geocomputation with R.
These exercises build on our last exercise using {dplyr} and revisit some of the same skills, including:
- Filtering rows or observations
- Grouping and summarizing data by variable
New skills you will practice with this exercise include:
- Using non-spatial joins for data frames
- Computing geometric measurements
- Using spatial filters
- Using geometric operations on a simple feature geometry set
- Using geometric operations on pairs of simple feature geometries
2 Setup
This exercise uses the sf and tidyverse packages:
library(tidyverse)
library(sf)
We are also going to use the us_states and us_states_df data from the {spData} package:
library(spData)
Note that the us_states object loaded for this exercise is different from the us_states object we created during class with the tigris::states() function.
The bonus (optional) questions are mixed in with the other questions. You are welcome to skip them if you do not want to attempt the bonus part of the exercise.
3 Exercises
3.1 Filtering data
Find all states that belong to the West region, have an area below 250,000 km², and had a population greater than 5,000,000 residents in 2015 (Hint: you may need to use the function units::set_units() or as.numeric()).
us_states |>
  ____
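The hint above refers to the fact that area columns computed by {sf} usually carry measurement units. As a minimal sketch of the two approaches the hint names, using made-up areas rather than the us_states data:

```r
library(units)

# Hypothetical areas for illustration only, not taken from us_states
area <- set_units(c(100000, 300000), km^2)

# Option 1: give the threshold the same units as the values being compared
below_units <- area < set_units(250000, km^2)

# Option 2: drop the units and compare plain numbers
below_numeric <- as.numeric(area) < 250000
```

Both comparisons return a logical vector that can be used inside filter(). Option 2 is shorter, but it silently assumes you already know which units the numbers are in.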
Find all states that belong to the South region, had an area larger than 150,000 km² or a total population in 2015 larger than 7,000,000 residents.
us_states |>
  ____
Render, commit, and push your changes to GitHub with the commit message “Added answers for filtering data questions”.
3.2 Joining and summarizing data
What was the total population in 2015 in the us_states dataset? What were the minimum and maximum total populations in 2015?
us_states |>
  ____
Add variables from us_states_df to us_states, and create a new object called us_states_stats.
- What function did you use and why?
- Which variable is the key in both datasets?
- What is the class of the new object?
Tip: we are covering joins in more detail next week—check out the R for Data Science chapter on Joins for more information.
us_states_stats <- us_states |>
  ____
us_states_df has two more rows than us_states. How can you find them?
Hint: try to use the dplyr::anti_join() function.
____(us_states, us_states_df)
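As a minimal sketch of how a filtering join behaves, the example below uses two made-up tables (x and y are hypothetical names, not the exercise data). anti_join(x, y) keeps the rows of x that have no match in y, so the order of the two arguments matters:

```r
library(dplyr)

# Hypothetical tables for illustration only
x <- tibble(key = c("a", "b", "c", "d"), value = 1:4)
y <- tibble(key = c("a", "b"))

# Rows of x whose key does not appear in y
unmatched <- anti_join(x, y, by = "key")
```

If the key columns have different names in the two tables, the by argument also accepts a named vector such as c("left_name" = "right_name").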
How much has population density changed between 2010 and 2015 in each state?
Calculate the percentage change for each state and map it with plot() or geom_sf().
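A percentage change is the difference between the new and old value divided by the old value, times 100. A minimal sketch with made-up density values (the column names dens_2010 and dens_2015 are hypothetical):

```r
library(dplyr)

# Hypothetical densities for illustration only
density <- tibble(
  name = c("A", "B"),
  dens_2010 = c(100, 50),
  dens_2015 = c(110, 45)
)

# Percentage change relative to the 2010 value
density_change <- density |>
  mutate(pct_change = (dens_2015 - dens_2010) / dens_2010 * 100)
```

In the exercise, the densities themselves have to be computed first from the population and area columns.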
Calculate the change in the number of residents living below the poverty level between 2010 and 2015 for each state.
Hint: See ?us_states_df for documentation on the poverty level columns.
Optional: Calculate the change in the percentage of residents living below the poverty level in each state.
For each region, what were the minimum, average, and maximum numbers of people per state living below the poverty line in 2015?
Optional: What is the region with the largest increase in people living below the poverty line?
3.3 Spatial operations
Section 4.2 (in Geocomputation with R) established that Canterbury was the region of New Zealand containing most of the 100 highest points in the country.
How many of these high points does the Canterbury region contain?
canterbury <- nz |>
  filter(Name == "Canterbury")

nz_height |>
  ____
Optional: plot the result using the ggplot2::geom_sf() function to show all of New Zealand, the canterbury region highlighted in yellow, high points in Canterbury represented by red crosses (Hint: try using shape = 7) and high points in other parts of New Zealand represented by blue circles.
See the help page ?ggplot2::shape and run the examples to see an illustration of different shape values.
Which region has the second highest number of nz_height points, and how many does it have?
nz_height |>
  ____
Generalizing the question to all regions: how many of New Zealand’s 16 regions contain points which belong to the top 100 highest points in the country? Which regions?
Optional: create a table listing these regions in order of the number of points and their name. Hint: use dplyr::slice_max() and gt::gt().
Using st_buffer(), how many points in nz_height are within 100 km of Canterbury?
canterbury_area <- st_buffer(canterbury, dist = ____)

nz_height |>
  ____
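As a minimal sketch of the buffer-then-filter pattern, the example below uses a made-up square and three made-up points instead of the nz data. It assumes a projected CRS (EPSG:2193, whose units are metres), in which case dist is read in metres:

```r
library(sf)

# Hypothetical 1 km square in a projected CRS with metre units
square <- st_sf(
  geometry = st_sfc(
    st_polygon(list(rbind(
      c(0, 0), c(1000, 0), c(1000, 1000), c(0, 1000), c(0, 0)
    ))),
    crs = 2193
  )
)

# Hypothetical points: inside the square, 400 m outside it, and far away
pts <- st_sf(
  id = 1:3,
  geometry = st_sfc(
    st_point(c(500, 500)),
    st_point(c(1400, 500)),
    st_point(c(5000, 5000)),
    crs = 2193
  )
)

# Grow the square by 500 m, then keep the points that intersect the buffer
square_buffer <- st_buffer(square, dist = 500)
pts_near <- st_filter(pts, square_buffer)
n_near <- nrow(pts_near)
```

The same idea applies to the exercise once the distance is converted to the units of the CRS.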
Render, commit, and push your changes to GitHub again with a second informative commit message.
3.4 Spatial predicates
Test your knowledge of spatial predicates by finding out and plotting how US states relate to each other and other spatial objects.
The starting point of this part of the exercise is to create an object representing Maryland state in the USA using the filter() function and plot the resulting object in the context of US states.
maryland <- filter(____, ____)

ggplot() +
  geom_sf(data = us_states) +
  geom_sf(data = ____)
Create a new object representing all the states that geographically intersect with Maryland and plot the result (hint: the most concise way to do this is with the subsetting method [ but you can also use sf::st_filter()).
states_intersecting_md <- ____
Create another object representing all the objects that touch (have a shared boundary with) Maryland and plot the result.
Hint: remember you can change the spatial predicate with the argument op when subsetting with base R, or with .predicate when using st_filter(). Both default to st_intersects.
states_touching_md <- ____
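To see how the choice of predicate changes the result, here is a minimal sketch with four made-up squares rather than US states. The helper make_square() is defined only for this illustration:

```r
library(sf)

# Helper for this sketch only: a square with lower-left corner (x, y)
make_square <- function(x, y, size = 1) {
  st_polygon(list(rbind(
    c(x, y), c(x + size, y), c(x + size, y + size), c(x, y + size), c(x, y)
  )))
}

shapes <- st_sf(
  name = c("a", "b", "c", "d"),
  geometry = st_sfc(
    make_square(0, 0),      # the target
    make_square(1, 0),      # shares an edge with a
    make_square(0.5, 0.5),  # overlaps a
    make_square(3, 3)       # far away from a
  )
)
target <- shapes[shapes$name == "a", ]

# Base R subsetting: the predicate is set with `op`
intersecting <- shapes[target, , op = st_intersects]

# st_filter(): the predicate is set with `.predicate`
touching <- st_filter(shapes, target, .predicate = st_touches)
```

Note that the target intersects itself but does not touch itself, and that an overlapping shape intersects the target without touching it.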
Optional: create a straight line from the centroid of Maryland to the centroid of California near the West coast of the USA (hint: functions st_centroid(), st_union() and st_cast() described in Chapter 5 may help) and identify which states this long East-West line crosses.
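One way the three functions in the hint can fit together is sketched below with two made-up squares (no CRS, so lengths are in plain coordinate units). This is an illustration under those assumptions, not the only possible approach:

```r
library(sf)

# Two hypothetical 2 x 2 squares with centroids at (1, 1) and (4, 5)
polys <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)))),
  st_polygon(list(rbind(c(3, 4), c(5, 4), c(5, 6), c(3, 6), c(3, 4))))
)

centroids <- st_centroid(polys)

# st_union() merges the two points into a single MULTIPOINT;
# st_cast() then reinterprets its coordinates as a LINESTRING
centroid_line <- st_cast(st_union(centroids), "LINESTRING")
line_length <- as.numeric(st_length(centroid_line))
```

The resulting line can then be used like any other geometry in a spatial filter.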
How far is the geographic centroid of Maryland from the geographic centroid of Canterbury, New Zealand?
Calculate the perimeter of the boundary lines of US states in meters. Which state has the longest border and which has the shortest?
Hint: st_perimeter is a recent addition to {sf} that works with POLYGON or MULTIPOLYGON geometry. If you use st_length, make sure you convert your data to LINESTRING or MULTILINESTRING geometry.
us_states |>
  ____
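The st_length route mentioned in the hint can be sketched with a made-up 3 m by 4 m rectangle. The example assumes a projected CRS with metre units (EPSG:3857 is used purely for illustration):

```r
library(sf)

# Hypothetical rectangle in a projected CRS with metre units
rect <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(3, 0), c(3, 4), c(0, 4), c(0, 0)))),
  crs = 3857
)

# st_length() reports zero for polygons, so cast the boundary to lines first
boundary <- st_cast(rect, "MULTILINESTRING")
perimeter <- st_length(boundary)
```

The result carries units, so use as.numeric() or units::set_units() if you need to compare or convert it.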
3.5 Tidy data
us_states_df has information on median income and poverty in a “wide” format. Pivot the data into a long format using tidyr::pivot_longer().
us_states_df |>
  pivot_longer(
    cols = ____
  )
Why could it be useful to have this data in a wide format?
Optional: By default, the new “name” column created by pivot_longer() contains the existing column names. Try using the names_pattern or names_transform arguments to create a separate year and variable column:
us_states_df |>
  pivot_longer(
    cols = ____,
    ____
  )
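As a minimal sketch of how these arguments work together, the example below pivots a made-up wide table whose column names follow a "variable_year" pattern. The table and its column names are hypothetical:

```r
library(tidyr)
library(tibble)

# Hypothetical wide table for illustration only
wide <- tibble(
  state = c("A", "B"),
  income_10 = c(100, 200),
  income_15 = c(110, 190)
)

long <- wide |>
  pivot_longer(
    cols = -state,
    # Each capture group in names_pattern fills one of the names_to columns
    names_to = c("variable", "year"),
    names_pattern = "(.*)_(\\d+)",
    # Convert the extracted year from character to integer
    names_transform = list(year = as.integer),
    values_to = "value"
  )
```

The regular expression has to be adapted to the actual column names in your data.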
The last question is from Ch. 5 Data Tidying from R4DS. For this question, we are using a handful of sample tables included with the {tidyr} package:
table1
table2
table3
Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:
- Extract the number of TB cases per country per year.
- Extract the matching population per country per year.
- Divide cases by population, and multiply by 10000.
- Store back in the appropriate place.
You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.
Don’t forget to render, commit, and push your changes to GitHub one last time with an informative commit message.