Tidying and joining data

Session 6

2024-09-05

Categories of functions in {tidyr}:

  1. Pivoting
  2. Rectangling
  3. Nesting and unnesting
  4. Splitting and combining character columns
  5. Handling missing values

1 What is involved in tidying messy datasets?

  • Column headers are values, not variable names
  • Multiple variables stored in one column
  • Variables are stored in both rows and columns
  • Multiple types in one table
  • One type in multiple tables

2 Pivoting

3 How would this look if we turn it into tidy data?

City 2000 2010 2020
Baltimore city, Maryland 300000 290000 285000
Washington, DC 300000 315000 330000
Richmond, Virginia 200000 202000 210000
City Year Value
Baltimore city, Maryland 2000 300000
Baltimore city, Maryland 2010 290000
Baltimore city, Maryland 2020 285000

pivot_longer(
  cols = !any_of("City"),
  names_to = "Year",
  values_to = "Value"
)

4 Tidying messy datasets

  • Column headers are values, not variable names
  • Multiple variables stored in one column
  • Variables are stored in both rows and columns
  • Multiple types in one table
  • One type in multiple tables

5 How would this look if we turn it into tidy data?

City name 2000 2010 2020
Baltimore city, Maryland 300000 290000 285000
Washington, DC 300000 315000 330000
Richmond, Virginia 200000 202000 210000
City State Year Value
Baltimore city Maryland 2000 300000
Washington DC 2000 300000
Richmond Virginia 2000 200000

6 Data organization in spreadsheets

What are the benefits of careful data organization in spreadsheets?

  • reduce errors
  • ease later analyses

What are the practical approaches for data organization in spreadsheets?

  1. be consistent,
  2. write dates like YYYY-MM-DD,
  3. do not leave any cells empty,
  4. put just one thing in a cell,
  5. organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row),
  6. create a data dictionary,
  7. do not include calculations in the raw data files,
  8. do not use font color or highlighting as data,
  9. choose good names for things,
  10. make backups,
  11. use data validation to avoid data entry errors,
  12. and save the data in plain text files.