Session 8
2023-10-18
Consider:
This next section is largely excerpted from R for Data Science (2e) - 11 Exploratory data analysis.
“More than anything, EDA is a state of mind.”
When you ask a question…
When you start with a dataset, you might begin by looking at a general summary using:
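For instance, a first look might be sketched with base R as follows. The slide's own code is not shown here, so the function choices (`summary()`, `str()`, `dim()`) and the built-in `mtcars` data frame are assumptions used only for illustration.

```r
# Assumption: the built-in mtcars data frame stands in for "your dataset".
df <- mtcars

# Per-column summaries (min, quartiles, mean, max for numeric columns)
overview <- summary(df)
print(overview)

# Compact view of each column's type and first few values
str(df)

# Number of rows (observations) and columns (variables)
dims <- dim(df)
print(dims)
```

A summary like this is only a starting point: it shows ranges and types, which helps decide which variables deserve a closer look.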
Types of values to explore:
To understand the subgroups, ask:
What makes a value unusual?
Handling unusual values can include:
If unusual values are the result of an issue with the data import or the original data collection, these issues can be addressed by re-coding or cleaning the data.
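As a minimal sketch of re-coding, the example below flags unusual values and replaces them with `NA` instead of dropping the rows. The data frame `heights`, its columns, and the 1.5 × IQR flagging rule are all hypothetical choices for illustration; the right rule depends on the data and on why a value looks unusual.

```r
# Hypothetical measurements; 999 plays the role of a data-entry sentinel.
heights <- data.frame(
  id = 1:8,
  height_cm = c(162, 175, 168, 999, 171, 158, 180, 166)
)

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles.
q <- quantile(heights$height_cm, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
is_unusual <- heights$height_cm < lower | heights$height_cm > upper

# Re-code flagged values to NA in a new column, keeping the original intact.
heights$height_cm_clean <- ifelse(is_unusual, NA, heights$height_cm)

# Count how many values were re-coded.
n_recoded <- sum(is_unusual)
print(heights)
```

Writing the cleaned values to a new column keeps the original measurements available, so the re-coding decision can be reviewed or reversed later.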
If a systematic relationship exists between two variables, it will appear as a pattern in the data. If you spot a pattern, ask yourself:
From The Art of Data Science (2017) by Roger D. Peng and Elizabeth Matsui:
Formulate your question
Read in your data
Check the packaging
Look at the top and the bottom of your data
Check your “n”s
Validate with at least one external data source
Make a plot
Try the easy solution first
Follow up
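The data-checking steps of this checklist can be sketched in base R. This is an illustration under stated assumptions, not the authors' own code: the built-in `airquality` data frame stands in for a file you would read in, the question about ozone by month is a hypothetical example, and plotting, external validation, and follow-up are left out.

```r
# Formulate your question (hypothetical): does ozone differ across months?

# Read in your data. Assumption: the built-in airquality data frame stands in
# for a file you would normally load with read.csv().
aq <- airquality

# Check the packaging: dimensions and column types
n_rows <- nrow(aq)
n_cols <- ncol(aq)
str(aq)

# Look at the top and the bottom of your data
print(head(aq))
print(tail(aq))

# Check your "n"s: observations per month and missing values per column
n_by_month <- table(aq$Month)
n_missing <- colSums(is.na(aq))
print(n_by_month)
print(n_missing)

# Try the easy solution first: mean ozone per month, ignoring missing values
mean_ozone <- tapply(aq$Ozone, aq$Month, mean, na.rm = TRUE)
print(round(mean_ozone, 1))
```

Counting rows and missing values before computing anything guards against quietly summarising an incomplete or mis-read dataset.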
Specialty packages for working with data.
naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data.
Visualizations of Distributions and Uncertainty • ggdist
ggdist is an R package that provides a flexible set of ggplot2 geoms and stats designed especially for visualizing distributions and uncertainty.
Create Codebooks from Data Frames • codebookr
The codebookr package is intended to make it easy for users to create codebooks (also called data dictionaries) directly from an R data frame.
inspectdf is a collection of utilities for columnwise summary, comparison and visualisation of data frames. Functions are provided to summarise missingness, categorical levels, numeric distribution, correlation, column types and memory usage.
Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.