Exploratory data analysis

Session 8

2023-10-18

1 Working openly

1.1 Why open your work?

  1. Improve the quality of your work: “be more organized, more accurate, less likely to miss errors”
  2. Broaden reach and impact
  3. Foster data literacy: “others can follow and learn—which can enrich and diversify data ecosystems, practices, and communities”

1.2 How to open your work?

Consider:

  • when is transparency valuable?
  • when is transparency a lower priority?
  • when is transparency potentially harmful?

2 Exploratory data analysis

This next section is taken directly from R for Data Science (2e), “11 Exploratory data analysis”.

2.1 What do you do when you do exploratory data analysis?

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

“More than anything, EDA is a state of mind.”

2.2 Use questions as tools to guide your investigation

When you ask a question…

  • the question focuses your attention on a specific part of your dataset
  • this helps you decide which graphs, models, or transformations to make
  • the key to asking quality questions is to generate a large quantity of questions

2.3 Two useful questions to start

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

2.4 What are you “exploring” when you do exploratory data analysis?

General summary

When you start with a dataset, you might begin with functions that print a general summary of the data.

“These work really well when you’ve got a small amount of data, but when you have more data, you are generally limited by how much you can read.”
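A minimal sketch of what such a general summary might look like. The functions chosen here are an assumption, not necessarily the ones the quote refers to, and the built-in `airquality` data frame stands in for your own data.

```r
library(dplyr)

# airquality ships with base R (datasets package); it stands in for your data
dim(airquality)      # number of rows and columns
head(airquality)     # first six rows
str(airquality)      # column types and a preview of values
summary(airquality)  # per-column summary statistics and NA counts
glimpse(airquality)  # dplyr's transposed, width-aware overview
```

Each call prints text to the console, which is why this approach gets harder to read as the number of rows and columns grows.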

2.5 Variation

Typical values

  • Which values are the most common? Why?
  • Which values are rare? Why? Does that match your expectations?
  • Can you see any unusual patterns? What might explain them?

library(tidyverse)

# Keep diamonds under 3 carats to zoom in on the bulk of the data
smaller <- diamonds |>
  filter(carat < 3)

# A narrow binwidth reveals fine-grained structure in carat
ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

Sub-groups

To understand the subgroups, ask:

  • How are the observations within each subgroup similar to each other?
  • How are the observations in separate clusters different from each other?
  • How can you explain or describe the clusters?
  • Why might the appearance of clusters be misleading?

Unusual values

What makes a value unusual?

ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5)

Handling unusual values can include:

  • Dropping observations with unusual values
  • Replacing unusual values with missing values

2.6 Covariation

A categorical and a numerical variable

ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)

Two categorical variables

ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

Two numerical variables

ggplot(smaller, aes(x = carat, y = price)) +
  geom_point()

2.7 Patterns and models

If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

  • Could this pattern be due to coincidence (i.e. random chance)?
  • How can you describe the relationship implied by the pattern?
  • How strong is the relationship implied by the pattern?
  • What other variables might affect the relationship?
  • Does the relationship change if you look at individual subgroups of the data?
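One way to probe the question about other variables is to model a strong relationship and then look at what is left over. The sketch below uses base R's `lm()` as a stand-in (R for Data Science fits a similar model with tidymodels); the choice of `log(price) ~ log(carat)` and the name `diamonds_resid` are assumptions for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Fit a linear model of log(price) on log(carat) with base R's lm()
fit <- lm(log(price) ~ log(carat), data = diamonds)

# Residuals are on the log scale; exponentiating gives the ratio of
# the actual price to the price predicted from carat alone
diamonds_resid <- diamonds
diamonds_resid$resid <- exp(residuals(fit))

# With carat accounted for, compare what remains across cut
p <- ggplot(diamonds_resid, aes(x = cut, y = resid)) +
  geom_boxplot()
p
```

Plotting the residuals against `cut` shows how price relates to cut quality once the effect of size has been removed.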

2.8 Tools for data exploration

Missing data

naniar

naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data.
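A brief sketch of what this can look like in practice, assuming naniar is installed and using the built-in `airquality` data frame as example data:

```r
library(naniar)

# airquality (base R) has missing values in Ozone and Solar.R
total_missing <- n_miss(airquality)  # total number of missing values
total_missing

miss_var_summary(airquality)  # missing count and percentage per variable
vis_miss(airquality)          # plot showing where values are missing
```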

Distributions

ggdist

ggdist is an R package that provides a flexible set of ggplot2 geoms and stats designed especially for visualizing distributions and uncertainty.
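As a small illustration, assuming ggdist is installed, `stat_halfeye()` draws a density together with a point and interval summary for each group. The `iris` data frame is used here only as convenient example data.

```r
library(ggplot2)
library(ggdist)

# Half-eye plot: a density slab plus a point-and-interval summary per species
p <- ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  stat_halfeye()
p
```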

  • What type of data do you have?
  • How much time do you have?
  • How do you expect to communicate what you learn?

2.9 Communicating

Codebooks

codebookr

The codebookr package is intended to make it easy for users to create codebooks (also called data dictionaries) directly from an R data frame.

gt: opt_interactive() adds interactive elements to an HTML table.
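A minimal sketch of an interactive table, assuming a recent version of gt (one that includes `opt_interactive()`). The `mtcars` data and the option values are placeholders chosen for illustration.

```r
library(gt)

# Render a built-in data frame as an interactive HTML table with a
# search box and paginated, sortable columns
tbl <- head(mtcars, 20) |>
  gt(rownames_to_stub = TRUE) |>
  opt_interactive(use_search = TRUE, page_size_default = 5)

tbl
```

The interactivity only appears in HTML output, such as a rendered Quarto or R Markdown document.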

Additional packages

inspectdf

inspectdf is a collection of utilities for columnwise summary, comparison and visualisation of data frames. Functions are provided to summarise missingness, categorical levels, numeric distribution, correlation, column types and memory usage.
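A short sketch of a few of these summaries, assuming inspectdf and dplyr are installed. The `starwars` data frame from dplyr serves as example data because it mixes column types and contains missing values.

```r
library(inspectdf)

# starwars (from dplyr) mixes column types and has missing values
df <- dplyr::starwars

inspect_types(df)          # column types
na_tbl <- inspect_na(df)   # missingness, one row per column
na_tbl
inspect_cat(df)            # levels of categorical columns
inspect_num(df)            # summaries of numeric columns
```

Each function returns a data frame, so the results can be filtered or plotted like any other table.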

Other tools

Datasette:

Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.