Getting started with spatial data using {sf} and the {tidyverse}

Session 1

2024-08-29

1 Overview

  • What is R?
  • How does R work with spatial data?
  • Take a look at {sf} and tidyverse in action

2 What is R?

R is best known as a statistical programming language often used in data science and research.

Like Python or C++, R is an object-oriented, functional programming language where the base set of features can be extended through open-source packages or libraries.

R has been around a while 📜

R turns 30 years old this month 🎂

R is based an S—a language created at Bell Labs in 1976 to support exploratory data analysis.

Digital Equipment Corporation VAX 11/780 mainframe computer. Source: Boston Public Library Copyright Spencer Grant

R doesn’t use a graphical interface 🤖

R is growing in popularity 📈

R is growing in popularity 📈

The average daily downloads for {sf} (the most popular package for working with spatial data in R) has grown from just 1,300 in June 2018 to over 58,000 in June 2023.

R is flexible 🛠️

There can’t be packages for everything—but sometimes it feels that way.

Packages let you work with everything from Microsoft Word documents to 3D renderings of digital elevation data to Google Drive to Google Earth Engine.

Animated loop showing a mountain with lighting from different directions.

Created with {rayshader}

3 How does R work with spatial data?

The {sf} package, first published in 2016, is the most popular R package for spatial data.

Extension packages include:

  • lwgeom for selected liblwgeom/PostGIS functions
  • stars for raster data, and raster or vector data cubes (spatial time series)
  • sfnetworks for geospatial network data

sf package logo

sfnetworks package logo

{sf} is built on open source libraries

Like QGIS, {sf} rests on a foundation of open source libraries:

  • SQLite (a C library that implements a SQL database engine),
  • GDAL (the Geospatial Data Abstraction Library),
  • PROJ (a coordinate transformation software library),
  • and GEOS (a C/C++ library for computational geometry).

GDAL logo

How are sf objects organized

sf objects are an implementation of the simple feature standard.

Simple feature objects are a data frame attached to a “sticky” geometry column (known as a sfc list-column).

Illustration (c) 2018 by Allison Horst

How are sf objects organized

The sf vignette Simple Feature for R explains:

  • sf, the table (data.frame) with feature attributes and “sticky” feature geometries, which contains
  • sfc, the list-column with the geometries for each feature (record), which is composed of
  • sfg, the feature geometry of an individual simple feature.

Reading a sf object from a file

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
Reading layer `nc' from data source 
  `/Users/elipousson/Library/R/arm64/4.4/library/sf/shape/nc.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
nc
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
First 10 features:
    AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
7  0.062     1.547  1834    1834      Camden 37029  37029       15   286     0
8  0.091     1.284  1835    1835       Gates 37073  37073       37   420     0
9  0.118     1.421  1836    1836      Warren 37185  37185       93   968     4
10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612     1
   NWBIR74 BIR79 SID79 NWBIR79                       geometry
1       10  1364     0      19 MULTIPOLYGON (((-81.47276 3...
2       10   542     3      12 MULTIPOLYGON (((-81.23989 3...
3      208  3616     6     260 MULTIPOLYGON (((-80.45634 3...
4      123   830     2     145 MULTIPOLYGON (((-76.00897 3...
5     1066  1606     3    1197 MULTIPOLYGON (((-77.21767 3...
6      954  1838     5    1237 MULTIPOLYGON (((-76.74506 3...
7      115   350     2     139 MULTIPOLYGON (((-76.00897 3...
8      254   594     2     371 MULTIPOLYGON (((-76.56251 3...
9      748  1190     2     844 MULTIPOLYGON (((-78.30876 3...
10     160  2038     5     176 MULTIPOLYGON (((-80.02567 3...

Pulling out the sfc list

You can take a look at the geometry (the sfc list-column) using st_geometry():

st_geometry(nc)
Geometry set for 100 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
First 5 geometries:
MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...

Listing the sfc object attributes

sfc objects have attributes for the coordinate reference system (crs), bounding box (bbox), precision, and number of empty geometries (n_empty).

attributes(st_geometry(nc))
$n_empty
[1] 0

$crs
Coordinate Reference System:
  User input: NAD27 
  wkt:
GEOGCRS["NAD27",
    DATUM["North American Datum 1927",
        ELLIPSOID["Clarke 1866",6378206.4,294.978698213898,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4267]]

$class
[1] "sfc_MULTIPOLYGON" "sfc"             

$precision
[1] 0

$bbox
     xmin      ymin      xmax      ymax 
-84.32385  33.88199 -75.45698  36.58965 

Pulling out a sfg object

sfc objects are composed of sfg objects. Each object hold the feature geometry of an individual simple feature.

st_geometry(nc)[[1]]

Creating sfg objects from scratch

The {sf} package also includes functions that allow you to build feature geometry from scratch using st_point(), st_linestring(), and other functions:

st_point(c(0, 1))

st_linestring(matrix(c(0, 0, 1, 1), , 2))

4 Take a look at {sf} and the tidyverse in action

Installing packages

Packages are simple to install:

install.packages(c("sf", "ggplot2", "dplyr"))

Loading packages

And simple to load:

library(sf)
library(ggplot2)
library(dplyr)

tidyverse packages are used for everyday data analyses

Courtesy R for Data Science

Import spatial data from a file

Using the {sf} package, you can use st_read() to read data into R from a local file, URL, or database:

md <- st_read("files/md_counties.gpkg")
Reading layer `md_counties' from data source 
  `/Users/elipousson/Projects/03_teaching/website/slides/files/md_counties.gpkg' 
  using driver `GPKG'
Simple feature collection with 24 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -8848525 ymin: 4563419 xmax: -8347435 ymax: 4825776
Projected CRS: WGS 84 / Pseudo-Mercator

Dropping geometry

Or, if you drop the geometry, a sf object is just like a spreadsheet:

st_drop_geometry(md)
   STATEFP COUNTYFP COUNTYNS GEOID            NAME               NAMELSAD LSAD
1       24      047 01668802 24047       Worcester       Worcester County   06
2       24      001 01713506 24001        Allegany        Allegany County   06
3       24      510 01702381 24510       Baltimore         Baltimore city   25
4       24      015 00596115 24015           Cecil           Cecil County   06
5       24      005 01695314 24005       Baltimore       Baltimore County   06
6       24      013 01696228 24013         Carroll         Carroll County   06
7       24      009 01676636 24009         Calvert         Calvert County   06
8       24      019 00596495 24019      Dorchester      Dorchester County   06
9       24      003 01710958 24003    Anne Arundel    Anne Arundel County   06
10      24      021 01711211 24021       Frederick       Frederick County   06
11      24      033 01714670 24033 Prince George's Prince George's County   06
12      24      027 01709077 24027          Howard          Howard County   06
13      24      023 01711058 24023         Garrett         Garrett County   06
14      24      031 01712500 24031      Montgomery      Montgomery County   06
15      24      035 00596089 24035    Queen Anne's    Queen Anne's County   06
16      24      041 00592947 24041          Talbot          Talbot County   06
17      24      045 01668606 24045        Wicomico        Wicomico County   06
18      24      039 00596907 24039        Somerset        Somerset County   06
19      24      025 01698178 24025         Harford         Harford County   06
20      24      037 01697853 24037      St. Mary's      St. Mary's County   06
21      24      029 00593907 24029            Kent            Kent County   06
22      24      017 01676992 24017         Charles         Charles County   06
23      24      043 01714220 24043      Washington      Washington County   06
24      24      011 00595737 24011        Caroline        Caroline County   06
   CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND     AWATER
1       H1 G4020   480  41540     <NA>        A 1213156434  586531107
2       H1 G4020  <NA>  19060     <NA>        A 1093489884   14710932
3       C7 G4020   548  12580     <NA>        F  209649327   28758743
4       H1 G4020   428  37980    48864        A  896912533  185281256
5       H1 G4020   548  12580     <NA>        A 1549740652  215957832
6       H1 G4020   548  12580     <NA>        A 1159355859   13112464
7       H1 G4020   548  47900    47894        A  552158542  341580668
8       H1 G4020   480  15700     <NA>        A 1400573746 1145353068
9       H1 G4020   548  12580     <NA>        A 1074353889  448032843
10      H1 G4020   548  47900    23224        A 1710922224   17674121
11      H1 G4020   548  47900    47894        A 1250057213   41922695
12      H1 G4020   548  12580     <NA>        A  649956423    6336170
13      H1 G4020  <NA>   <NA>     <NA>        A 1681102437   22498420
14      H1 G4020   548  47900    23224        A 1277193339   35686502
15      H1 G4020   548  12580     <NA>        A  962673214  360020725
16      H1 G4020   548  20660     <NA>        A  695561637  539363457
17      H1 G4020   480  41540     <NA>        A  969767208   66764613
18      H1 G4020   480  41540     <NA>        A  828145301  752652883
19      H1 G4020   548  12580     <NA>        A 1132152044  231885675
20      H1 G4020   548  15680     <NA>        A  928809940 1050592561
21      H1 G4020  <NA>   <NA>     <NA>        A  717497117  353321619
22      H1 G4020   548  47900    47894        A 1185757843  479438830
23      H1 G4020   548  25180     <NA>        A 1185655248   24820607
24      H1 G4020  <NA>   <NA>     <NA>        A  827350254   16777066
      INTPTLAT     INTPTLON
1  +38.2221332 -075.3099315
2  +39.6123134 -078.7031037
3  +39.3000324 -076.6104761
4  +39.5623537 -075.9415852
5  +39.4431666 -076.6165693
6  +39.5633280 -077.0153297
7  +38.5227191 -076.5297621
8  +38.4291957 -076.0474333
9  +38.9916174 -076.5608941
10 +39.4701773 -077.3976358
11 +38.8292778 -076.8481880
12 +39.2522639 -076.9244057
13 +39.5472985 -079.2746192
14 +39.1373815 -077.2030633
15 +39.0406929 -076.0824053
16 +38.7483486 -076.1784757
17 +38.3673699 -075.6320828
18 +38.0744501 -075.8533228
19 +39.5374292 -076.2997894
20 +38.2230765 -076.5344870
21 +39.2412793 -076.1259867
22 +38.4728532 -077.0154272
23 +39.6036207 -077.8146709
24 +38.8715306 -075.8316620

Ploting sf objects

You can plot data with R’s built-in plot() function:

plot(md[, 2])

Summarizing data frames with summary()

You can summarize data using the summary() function:

summary(md)
   STATEFP            COUNTYFP           COUNTYNS            GEOID          
 Length:24          Length:24          Length:24          Length:24         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     NAME             NAMELSAD             LSAD             CLASSFP         
 Length:24          Length:24          Length:24          Length:24         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    MTFCC              CSAFP              CBSAFP            METDIVFP        
 Length:24          Length:24          Length:24          Length:24         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   FUNCSTAT             ALAND               AWATER            INTPTLAT        
 Length:24          Min.   :2.096e+08   Min.   :6.336e+06   Length:24         
 Class :character   1st Qu.:8.279e+08   1st Qu.:2.424e+07   Class :character  
 Mode  :character   Median :1.084e+09   Median :2.006e+08   Mode  :character  
                    Mean   :1.048e+09   Mean   :2.908e+08                     
                    3rd Qu.:1.222e+09   3rd Qu.:4.559e+08                     
                    Max.   :1.711e+09   Max.   :1.145e+09                     
   INTPTLON                    geom   
 Length:24          MULTIPOLYGON :24  
 Class :character   epsg:3857    : 0  
 Mode  :character   +proj=merc...: 0  
                                      
                                      
                                      

Using {sf} with tidyverse packages

The tidyverse family of R packages developed by Posit (formerly known as RStudio) including:

  • ggplot2 for making graphics and data visualizations from bar charts to box plots to maps
  • dplyr for common data manipulation challenges, such as filtering, re-arranging, or summarizing data
  • readr for reading rectangular data in a fast and friendly way

Tidyverse packages

Transform and visualize

tidyverse packages work well with {sf} for transforming and visualizing spatial data.

Mapping with geom_sf()

For example, we can use geom_sf() from {ggplot2} to make a simple map:

md_map <- ggplot(data = md) +
  geom_sf() +
  theme_minimal()

md_map

Mapping with geom_sf()

Get a glimpse of your data

Take a peek at the values of the data with the glimpse() function from {dplyr}:

glimpse(md)
Rows: 24
Columns: 18
$ STATEFP  <chr> "24", "24", "24", "24", "24", "24", "24", "24", "24", "24", "…
$ COUNTYFP <chr> "047", "001", "510", "015", "005", "013", "009", "019", "003"…
$ COUNTYNS <chr> "01668802", "01713506", "01702381", "00596115", "01695314", "…
$ GEOID    <chr> "24047", "24001", "24510", "24015", "24005", "24013", "24009"…
$ NAME     <chr> "Worcester", "Allegany", "Baltimore", "Cecil", "Baltimore", "…
$ NAMELSAD <chr> "Worcester County", "Allegany County", "Baltimore city", "Cec…
$ LSAD     <chr> "06", "06", "25", "06", "06", "06", "06", "06", "06", "06", "…
$ CLASSFP  <chr> "H1", "H1", "C7", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "…
$ MTFCC    <chr> "G4020", "G4020", "G4020", "G4020", "G4020", "G4020", "G4020"…
$ CSAFP    <chr> "480", NA, "548", "428", "548", "548", "548", "480", "548", "…
$ CBSAFP   <chr> "41540", "19060", "12580", "37980", "12580", "12580", "47900"…
$ METDIVFP <chr> NA, NA, NA, "48864", NA, NA, "47894", NA, NA, "23224", "47894…
$ FUNCSTAT <chr> "A", "A", "F", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
$ ALAND    <dbl> 1213156434, 1093489884, 209649327, 896912533, 1549740652, 115…
$ AWATER   <dbl> 586531107, 14710932, 28758743, 185281256, 215957832, 13112464…
$ INTPTLAT <chr> "+38.2221332", "+39.6123134", "+39.3000324", "+39.5623537", "…
$ INTPTLON <chr> "-075.3099315", "-078.7031037", "-076.6104761", "-075.9415852…
$ geom     <MULTIPOLYGON [m]> MULTIPOLYGON (((-8380505 45..., MULTIPOLYGON (((…

Use {sf} for spatial transformations

{sf} includes a range of spatial transformation functions with names matching the spatial functions of PostGIS.

To show how this works, we can get the center of Baltimore City, buffer by 25 miles, and filter intersecting counties.

ST_intersects relationships with different geometry types

Using {sf} for spatial transformations

balt_center <- st_centroid(md[3, ])

balt_center
Simple feature collection with 1 feature and 17 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -8528259 ymin: 4764882 xmax: -8528259 ymax: 4764882
Projected CRS: WGS 84 / Pseudo-Mercator
  STATEFP COUNTYFP COUNTYNS GEOID      NAME       NAMELSAD LSAD CLASSFP MTFCC
3      24      510 01702381 24510 Baltimore Baltimore city   25      C7 G4020
  CSAFP CBSAFP METDIVFP FUNCSTAT     ALAND   AWATER    INTPTLAT     INTPTLON
3   548  12580     <NA>        F 209649327 28758743 +39.3000324 -076.6104761
                      geom
3 POINT (-8528259 4764882)

Using {sf} for spatial transformations

balt_center <- st_centroid(md[3, ])

balt_area <- st_buffer(balt_center, dist = as_units(25, "mi"))

balt_area
Simple feature collection with 1 feature and 17 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -8568493 ymin: 4724649 xmax: -8488026 ymax: 4805116
Projected CRS: WGS 84 / Pseudo-Mercator
  STATEFP COUNTYFP COUNTYNS GEOID      NAME       NAMELSAD LSAD CLASSFP MTFCC
3      24      510 01702381 24510 Baltimore Baltimore city   25      C7 G4020
  CSAFP CBSAFP METDIVFP FUNCSTAT     ALAND   AWATER    INTPTLAT     INTPTLON
3   548  12580     <NA>        F 209649327 28758743 +39.3000324 -076.6104761
                            geom
3 POLYGON ((-8488026 4764882,...

Using {sf} for spatial transformations

balt_center <- st_centroid(md[3, ])

balt_area <- st_buffer(balt_center, dist = as_units(25, "mi"))

balt_area_counties <- st_filter(md, balt_area, .predicate = st_intersects)

balt_area_counties
Simple feature collection with 9 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -8630342 ymin: 4655334 xmax: -8433120 ymax: 4825554
Projected CRS: WGS 84 / Pseudo-Mercator
  STATEFP COUNTYFP COUNTYNS GEOID            NAME               NAMELSAD LSAD
1      24      510 01702381 24510       Baltimore         Baltimore city   25
2      24      005 01695314 24005       Baltimore       Baltimore County   06
3      24      013 01696228 24013         Carroll         Carroll County   06
4      24      003 01710958 24003    Anne Arundel    Anne Arundel County   06
5      24      033 01714670 24033 Prince George's Prince George's County   06
6      24      027 01709077 24027          Howard          Howard County   06
7      24      031 01712500 24031      Montgomery      Montgomery County   06
8      24      025 01698178 24025         Harford         Harford County   06
9      24      029 00593907 24029            Kent            Kent County   06
  CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND    AWATER    INTPTLAT
1      C7 G4020   548  12580     <NA>        F  209649327  28758743 +39.3000324
2      H1 G4020   548  12580     <NA>        A 1549740652 215957832 +39.4431666
3      H1 G4020   548  12580     <NA>        A 1159355859  13112464 +39.5633280
4      H1 G4020   548  12580     <NA>        A 1074353889 448032843 +38.9916174
5      H1 G4020   548  47900    47894        A 1250057213  41922695 +38.8292778
6      H1 G4020   548  12580     <NA>        A  649956423   6336170 +39.2522639
7      H1 G4020   548  47900    23224        A 1277193339  35686502 +39.1373815
8      H1 G4020   548  12580     <NA>        A 1132152044 231885675 +39.5374292
9      H1 G4020  <NA>   <NA>     <NA>        A  717497117 353321619 +39.2412793
      INTPTLON                           geom
1 -076.6104761 MULTIPOLYGON (((-8539486 47...
2 -076.6165693 MULTIPOLYGON (((-8537944 48...
3 -077.0153297 MULTIPOLYGON (((-8566905 48...
4 -076.5608941 MULTIPOLYGON (((-8524879 47...
5 -076.8481880 MULTIPOLYGON (((-8564237 47...
6 -076.9244057 MULTIPOLYGON (((-8545393 47...
7 -077.2030633 MULTIPOLYGON (((-8577825 47...
8 -076.2997894 MULTIPOLYGON (((-8483226 47...
9 -076.1259867 MULTIPOLYGON (((-8479204 47...

Using {sf} for spatial transformations

ggplot(data = md) +
  geom_sf() +
  geom_sf(data = balt_area_counties, fill = "darkorange", alpha = 0.4) +
  geom_sf(data = balt_area, color = "orange", fill = NA, linewidth = 1) +
  theme_void()

Using {sf} for spatial transformations