16 Data insights

16.1 Learning Objectives

Use summary functions to explore the structure and completeness of a dataset.
Create simple summaries and grouped summaries using count(), group_by(), and summarise().
Calculate descriptive statistics (mean, SD) across groups.
Use janitor tools (Firke 2024) to quickly tabulate and summarise categorical data.
Recognise how grouping can change the interpretation of summaries and relationships.

16.2 A first glimpse

When starting with a new dataset, we want to get an initial idea:

How many rows and columns are there?
What are the column names?
What types of data are in each column?
What are their possible values or ranges?
These answers are useful to know before jumping into wrangling and cleaning data.

There are several ways to get an overview of your data; they differ in how comprehensively they summarise its structure.
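As a minimal sketch, the four questions above map onto base R functions as shown below. The data frame `toy` and its values are invented for illustration and are not taken from the penguins data.

```r
# Hypothetical toy data frame, invented for illustration only
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  body_mass_g = c(3750, 5000, NA, 3800),
  stringsAsFactors = FALSE
)

dim(toy)            # number of rows, then number of columns
names(toy)          # column names
sapply(toy, class)  # data type of each column
summary(toy)        # range and NA count for numeric columns
```

These calls answer one question each; the functions in the next section report the same information together.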

16.3 The data

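library(tidyverse) # read_csv(), glimpse(), the dplyr verbs and ggplot2
library(here)      # here() builds file paths relative to the project root

penguins <- read_csv(here("data", "penguins_clean.csv"))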

glimpse(penguins)

Rows: 344
Columns: 19
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "Male", "Female", "Female", NA, "Female", "Male", "F…
$ delta_15_n_o_oo   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ delta_13_c_o_oo   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ mass_range        <fct> mid penguin, mid penguin, smol penguin, NA, smol pen…

str(penguins)

tibble [344 × 19] (S3: tbl_df/tbl/data.frame)
 $ study_name       : chr [1:344] "PAL0708" "PAL0708" "PAL0708" "PAL0708" ...
 $ sample_number    : num [1:344] 1 2 3 4 5 6 7 8 9 10 ...
 $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ region           : chr [1:344] "Anvers" "Anvers" "Anvers" "Anvers" ...
 $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ stage            : chr [1:344] "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" ...
 $ individual_id    : chr [1:344] "N1A1" "N1A2" "N2A1" "N2A2" ...
 $ clutch_completion: chr [1:344] "Yes" "Yes" "Yes" "Yes" ...
 $ date_egg         : Date[1:344], format: "2007-11-11" "2007-11-11" ...
 $ culmen_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ culmen_depth_mm  : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : num [1:344] 3750 3800 3250 NA 3450 ...
 $ sex              : chr [1:344] "Male" "Female" "Female" NA ...
 $ delta_15_n_o_oo  : num [1:344] NA 8.95 8.37 NA 8.77 ...
 $ delta_13_c_o_oo  : num [1:344] NA -24.7 -25.3 NA -25.3 ...
 $ comments         : chr [1:344] "Not enough blood for isotopes." NA NA "Adult not sampled." ...
 $ year             : num [1:344] 2007 2007 2007 2007 2007 ...
 $ mass_range       : Factor w/ 3 levels "smol penguin",..: 2 2 1 NA 1 2 2 3 1 2 ...

library(skimr)

skim(penguins)

Data summary
Name	penguins
Number of rows	344
Number of columns	19
_______________________
Column type frequency:
character	9
Date	1
factor	1
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
study_name	0	1.00	7	7	3
species	0	1.00	6	9	3
region	0	1.00	6	6	1
island	0	1.00	5	9	3
stage	0	1.00	18	18	1
individual_id	0	1.00	4	6	190
clutch_completion	0	1.00	2	3	2
sex	11	0.97	4	6	2
comments	290	0.16	18	68	10

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date_egg	0	1	2007-11-09	2009-12-01	2008-11-09	50

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
mass_range	2	0.99	FALSE	3	mid: 146, cho: 118, smo: 78

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
sample_number	0	1.00	63.15	40.43	1.00	29.00	58.00	95.25	152.00	▇▇▆▅▃
culmen_length_mm	2	0.99	43.92	5.46	32.10	39.23	44.45	48.50	59.60	▃▇▇▆▁
culmen_depth_mm	2	0.99	17.15	1.97	13.10	15.60	17.30	18.70	21.50	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.00	190.00	197.00	213.00	231.00	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.00	3550.00	4050.00	4750.00	6300.00	▃▇▆▃▂
delta_15_n_o_oo	14	0.96	8.73	0.55	7.63	8.30	8.65	9.17	10.03	▃▇▆▅▂
delta_13_c_o_oo	13	0.96	-25.69	0.79	-27.02	-26.32	-25.83	-25.06	-23.79	▆▇▅▅▂
year	0	1.00	2008.03	0.82	2007.00	2007.00	2008.00	2009.00	2009.00	▇▁▇▁▇

At this early stage, it’s helpful to assess whether your dataset meets your expectations. Consider if the data appear as anticipated. Are the values in each column reasonable? Are there any noticeable gaps or errors that might need to be corrected, or that could potentially render the data unusable?

Your turn

The dataset has 344 rows (not counting the header row) and 19 columns.

It also provides information on the type of data in each column:

<chr> - means character or text data
<dbl> - means numerical data
<date> - means date data
<fct> - means factor (categorical) data

Q Based on our summary functions are any variables assigned to the wrong data type (should be character when numeric or vice versa)?

No obvious problems. A date is neither strictly character nor strictly numeric, and here date_egg is already stored with its own <date> type; all other columns appear to be correct.

Q Based on our summary functions do we have complete data for all variables?

No, there are 2 missing values for each body measurement (culmen length, culmen depth, flipper length, body mass), 11 missing values for sex, 14 and 13 missing values for the blood isotopes (delta 15 N and delta 13 C respectively), and 290 missing values for comments.

We have just learned some ways to initially inspect our dataset. Keep in mind, we don’t expect everything to be perfect. This initial inspection is a good opportunity to identify where these issues might be and assess their severity.
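The n_missing and complete_rate columns reported by skim() can also be computed directly, which is handy when only the missing-data picture is needed. This is a sketch in base R on a made-up data frame; `toy` and its values are hypothetical.

```r
# Hypothetical toy data frame with some missing values
toy <- data.frame(
  sex = c("Male", NA, "Female", NA),
  body_mass_g = c(3750, 3800, NA, 3450),
  year = c(2007, 2007, 2008, 2008),
  stringsAsFactors = FALSE
)

# is.na() returns a TRUE/FALSE matrix; summing a column counts its TRUEs
n_missing <- colSums(is.na(toy))

# Proportion of non-missing values in each column
complete_rate <- colMeans(!is.na(toy))

n_missing
complete_rate
```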

When you are confident that the dataset is largely as expected, you are ready to start summarising your data.

16.4 Summary counts

In the previous section, we learned how to get an overview of our data’s structure, including the number of rows, the columns present, and any missing data. In this section, we will focus on summarising the data. Summarising data can provide insight into the scope and variation in our dataset, and help in evaluating its suitability for our analysis.

With our data we can count the total number of occurrences for different groups either by:

16.4.1 Filtering

penguins |> 
  filter(species == "Adelie") |> 
  count()

n
152

16.4.2 Grouping

Or by grouping:

penguins |> 
  group_by(species) |> 
  count()

species	n
Adelie	152
Chinstrap	68
Gentoo	124
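The grouped count above can be written more compactly, because dplyr's count() accepts the grouping variables directly. The sketch below uses a made-up tibble (`toy` is hypothetical) to show that both forms give the same counts, and that `sort = TRUE` orders the result by frequency.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

# Long form: group first, then count the rows in each group
by_group <- toy |>
  group_by(species) |>
  count() |>
  ungroup()

# Short form: count() does the grouping itself
shorthand <- toy |>
  count(species)

# Same counts, most frequent group first
sorted <- toy |>
  count(species, sort = TRUE)
```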

At this stage, these counts tell us what is in our dataset, not how variables relate to one another. To understand relationships, we will later need visualisations and models.

16.5 Frequency counts by subgroups

We can group by several variables at the same time, for example to find the frequency of observations by species and sex.

We can do this using dplyr or with functions in the janitor package:

penguins |> 
  group_by(species,sex) |> 
  count() |> 
  arrange(desc(n))

species	sex	n
Adelie	Female	73
Adelie	Male	73
Gentoo	Male	61
Gentoo	Female	58
Chinstrap	Female	34
Chinstrap	Male	34
Adelie	NA	6
Gentoo	NA	5

library(janitor) # provides tabyl() and the adorn_*() helpers

penguins |>
  tabyl(sex, species) |> 
  adorn_percentages("all") |>
  adorn_totals(c("row", "col")) |>
  adorn_pct_formatting(digits = 1)

sex	Adelie	Chinstrap	Gentoo	Total
Female	21.2%	9.9%	16.9%	48.0%
Male	21.2%	9.9%	17.7%	48.8%
NA	1.7%	0.0%	1.5%	3.2%
Total	44.2%	19.8%	36.0%	100.0%
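The percentages in the janitor table are shares of the overall total, which is what the "all" denominator requests. As a rough sketch of the same calculation in dplyr alone, each count can be divided by the sum of all counts. The tibble `toy` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex     = c("Female", "Female", "Male", "Female", "Male", NA)
)

pct_table <- toy |>
  count(species, sex) |>          # one row per species and sex combination
  mutate(pct = 100 * n / sum(n))  # share of all observations, in percent

pct_table
```

Because count() returns an ungrouped result here, sum(n) is the total number of rows, so the pct column sums to 100.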

Grouped summaries implicitly assume that differences between groups may matter. Later, we will see that ignoring important grouping variables can lead to misleading conclusions.

16.6 Visualising Frequencies

Graphs make summaries easier to interpret at a glance.

geom_col() plots the values it is given (stat = "identity"), so the counts must be calculated before plotting:

penguins |> 
  group_by(species,sex) |> 
  count() |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = species,
             y = n,
             fill = sex))+
  geom_col(position=position_dodge2(preserve="single"))+
  coord_flip()+
  labs(x = "")+
  theme(legend.position = "bottom")

penguins |> 
  group_by(species,sex) |> 
  count() |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = species,
             y = n,
             fill = sex))+
  geom_col(position=position_dodge2(preserve="single"))+
  geom_label(aes(label = n),
             position=position_dodge2(preserve="single",
                                      width = .9))+ #<- dodge the text and label bars
  coord_flip()+
  labs(x = "")+
  theme(legend.position = "bottom")

geom_bar() counts the rows in each group itself (stat = "count"), so no pre-calculated counts are needed:

penguins |> 
  ggplot(aes(x = species,
             fill = sex))+
  geom_bar(position=position_dodge2(preserve="single"))+
  coord_flip()+
  labs(x = "")+
  theme(legend.position = "bottom")
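To check that the two approaches draw the same bars, the heights ggplot2 computes for each plot can be compared with layer_data(). This is a sketch on a made-up tibble (`toy` is hypothetical), not on the penguins data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

# geom_col(): counts are calculated first, then plotted as given
p_col <- toy |>
  count(species) |>
  ggplot(aes(x = species, y = n)) +
  geom_col()

# geom_bar(): ggplot2 counts the rows itself
p_bar <- toy |>
  ggplot(aes(x = species)) +
  geom_bar()

# Bar heights as computed by ggplot2 for each plot
heights_col <- layer_data(p_col)$y
heights_bar <- layer_data(p_bar)$y

all(heights_col == heights_bar)
```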

16.7 Exploring relationships between two variables

So far, we have focused on understanding the structure of the dataset and summarising individual variables or groups of observations. In many analyses, however, we are interested in how two variables vary together.

Exploring relationships between variables is an important step before modelling. At this stage, our goal is not to explain or predict outcomes, but to identify patterns, assess whether variables appear related, and consider whether these relationships differ across meaningful groups.

Plots are particularly important here, as numerical summaries alone can obscure important structure in the data.

16.7.1 Scatterplots

When both variables of interest are numeric, a scatterplot is the most informative first tool. Scatterplots allow us to see how values of one variable change in relation to another and whether any patterns are present.

penguins |>
  ggplot(aes(x = culmen_length_mm,
             y = culmen_depth_mm)) +
  geom_point(alpha = 0.7)

This plot shows the relationship between culmen length and culmen depth across all penguins, without distinguishing between species.

Scatterplots such as this allow us to consider questions such as:

Do the variables appear positively or negatively associated?
Is the relationship approximately linear?

At this stage, we focus on visual comparison, rather than numerical summaries.

Your turn

Based on the scatterplot above, can you make a plot that includes species as an important grouping variable?

How does this change your observation?

library(colorspace) # provides scale_colour_discrete_qualitative()

penguins |>
  ggplot(aes(x = culmen_length_mm,
             y = culmen_depth_mm,
             colour = species)) +
  geom_point(alpha = 0.7) +
  scale_colour_discrete_qualitative()

We now see a striking reversal of the apparent association between culmen length and depth, demonstrating the importance of considering groups in our analysis.

16.8 Correlation

In some situations, it can be useful to summarise the relationship between two numeric variables using a single numerical measure. Correlation provides such a summary by describing the strength and direction of a linear association between two variables.

Below, we calculate the overall correlation between culmen length and culmen depth across all observations.

penguins |>
  summarise(
    r = cor(culmen_length_mm,
            culmen_depth_mm,
            use = "complete.obs") # Important to include if there are any missing values
  )

r
-0.2350529

Your turn

How can we generate summary statistics that reflect our important subgrouping (species)?

How does this change your observation?

penguins |>
  group_by(species) |>
  summarise(
    r = cor(culmen_length_mm,
            culmen_depth_mm,
            use = "complete.obs") # Important to include if there are any missing values
  )

species	r
Adelie	0.3914917
Chinstrap	0.6535362
Gentoo	0.6433839
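The change of sign between the pooled and the grouped correlations can be reproduced on a deliberately extreme example. The sketch below uses made-up numbers (`toy`, `group`, `x` and `y` are hypothetical): within each group y rises with x, but the group with larger x values has smaller y values, so the pooled correlation is negative.

```r
library(dplyr)

# Hypothetical toy data: two groups, each with a perfectly increasing trend
toy <- tibble(
  group = rep(c("A", "B"), each = 5),
  x = c(1:5, 11:15),
  y = c(11:15, 1:5)
)

# Ignoring the groups: the overall association is negative
pooled <- toy |>
  summarise(r = cor(x, y))

# Within each group: the association is positive
by_group <- toy |>
  group_by(group) |>
  summarise(r = cor(x, y))

pooled
by_group
```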

16.9 Summary statistics

We can extend our summaries to show not just counts, but also measures of central tendency (mean) and spread (standard deviation).

These are powerful ways to understand variation within groups. These summaries describe average differences, but they do not tell us whether one variable predicts another, nor whether relationships differ across groups.

penguins |>
  group_by(species) |> # Calculate within groups
  summarise(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    n = n(),
    median = median(body_mass_g, na.rm = TRUE),
    iqr = IQR(body_mass_g, na.rm = TRUE)
  )

species	mean_mass	sd_mass	n	median	iqr
Adelie	3700.662	458.5661	152	3700	650.0
Chinstrap	3733.088	384.3351	68	3700	462.5
Gentoo	5076.016	504.1162	124	5000	800.0

Your turn

Add sex to the group_by() function to see how mean and SD of body mass differ by sex within species.

penguins |>
  group_by(species, sex) |>
  drop_na(sex) |> # Optional: remove rows where sex is unknown
  summarise(
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),
    sd_mass_g = sd(body_mass_g, na.rm = TRUE),
    n = n()
  )

species	sex	mean_mass_g	sd_mass_g	n
Adelie	Female	3368.836	269.3801	73
Adelie	Male	4043.493	346.8116	73
Chinstrap	Female	3527.206	285.3339	34
Chinstrap	Male	3938.971	362.1376	34
Gentoo	Female	4679.741	281.5783	58
Gentoo	Male	5484.836	313.1586	61

16.9.1 Summarise multiple variables

These functions allow us to generate summaries for several variables at once.

summarise_at()

Summarise specific selected variables:

penguins |> 
  group_by(species) |> 
  summarise_at(c("flipper_length_mm",
                 "culmen_length_mm",
                 "culmen_depth_mm"),
               mean, na.rm = TRUE) # mean function

species	flipper_length_mm	culmen_length_mm	culmen_depth_mm
Adelie	189.9536	38.79139	18.34636
Chinstrap	195.8235	48.83382	18.42059
Gentoo	217.1870	47.50488	14.98211

summarise_if()

penguins |> 
  group_by(species) |> 
  summarise_if(is.numeric, # selects only numeric columns
               mean, na.rm = TRUE)

species	sample_number	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	delta_15_n_o_oo	delta_13_c_o_oo	year
Adelie	76.5	38.79139	18.34636	189.9536	3700.662	8.859733	-25.80419	2008.013
Chinstrap	34.5	48.83382	18.42059	195.8235	3733.088	9.356155	-24.54654	2007.971
Gentoo	62.5	47.50488	14.98211	217.1870	5076.016	8.245338	-26.18530	2008.081
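In current versions of dplyr, summarise_at() and summarise_if() are superseded by across(), which is used inside an ordinary summarise() call. The sketch below shows what the equivalent calls might look like on a made-up tibble; `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  flipper_length_mm = c(180, 190, 210, NA),
  body_mass_g = c(3500, 3700, 5000, 5200)
)

# Equivalent of summarise_at(): name the columns to summarise
selected <- toy |>
  group_by(species) |>
  summarise(across(c(flipper_length_mm, body_mass_g),
                   ~ mean(.x, na.rm = TRUE)))

# Equivalent of summarise_if(): select columns by a condition
all_numeric <- toy |>
  group_by(species) |>
  summarise(across(where(is.numeric),
                   ~ mean(.x, na.rm = TRUE)))

selected
all_numeric
```

The older functions still work, so the code above is an alternative rather than a required change.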

16.10 Visualising distributions

Numerical summaries such as the mean and standard deviation provide useful information about a variable, but they do not show the full distribution of values.

Visualising distributions allows us to see how values are spread, whether observations are concentrated in particular ranges, and how distributions differ across groups.

Different plots are suited to different questions. In this section, we focus on choosing appropriate plots for comparing distributions across groups.

16.10.1 Boxplots for group comparisons

A boxplot provides a compact summary of the distribution of a numeric variable. It displays the median, the spread of the central values, and potential extreme observations, making it particularly useful for comparing distributions between groups.

Below, we visualise the distribution of body mass across penguin species.

penguins |>
ggplot(aes(x = species,
           y = body_mass_g)) +
geom_boxplot() +
coord_flip()

This plot allows us to compare:

Median body mass across species,
The interquartile range (IQR), i.e. the spread of the central values within each species,
Whether some species show more variability than others.

Boxplots are especially effective when the primary goal is comparison between groups, rather than detailed inspection of the distribution shape.
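The quantities a boxplot draws can be computed by hand, which makes it clearer what the box and the points represent. This is a sketch in base R on a made-up vector (`x` is hypothetical), using the 1.5 x IQR rule that geom_boxplot() applies by default to decide which observations are drawn as separate points.

```r
# Hypothetical values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Lower quartile, median and upper quartile: the box and its centre line
q <- quantile(x, c(0.25, 0.5, 0.75))

# Interquartile range: the height of the box
iqr <- IQR(x)

# Observations beyond these fences are drawn as individual points
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

q
iqr
outliers
```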

When visualising distributions, the choice of plot depends on the question being asked. In some cases, we are interested in comparing groups; in others, we want to explore the shape of the distribution within a single group.

Your turn

You are exploring body mass in the penguin dataset.

If you wanted to explore the shape of the body-mass distribution for each single species, how would you change the plot? (Hint: consider changing the geom rather than the data.)

A histogram could be used by changing the geom to geom_histogram() and giving each species its own facet, as this allows closer inspection of the distribution shape within each group.

penguins |> 
  ggplot(aes(body_mass_g,
             fill = species))+
  geom_histogram()+
  scale_fill_discrete_qualitative()+
  facet_wrap(~species, # make facets by species
             ncol = 1) # stack the plots for easier comparison

16.11 Useful `summary` functions

16.11.0.1 Measure of location:

mean(x): sum of x divided by the length
median(x): 50% of x is above and 50% is below

16.11.0.2 Measure of variation:

sd(x): standard deviation
IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)

16.11.0.3 Measure of rank:

min(x): minimum value of x
max(x): maximum value of x
quantile(x, 0.25): 25% of x is below this value

16.11.0.4 Counts:

n(): the number of rows in the current group (takes no arguments; used inside summarise())
sum(!is.na(x)): count non-missing values
n_distinct(x): count the number of unique values
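As a brief sketch, the functions listed above can be combined in a single summarise() call. The tibble `toy` and its column `x` are made up for illustration; note that most of these functions return NA when missing values are present unless `na.rm = TRUE` is set.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(x = c(2, 4, 4, NA, 10))

stats <- toy |>
  summarise(
    mean_x        = mean(x, na.rm = TRUE),       # location
    median_x      = median(x, na.rm = TRUE),
    sd_x          = sd(x, na.rm = TRUE),         # variation
    iqr_x         = IQR(x, na.rm = TRUE),
    min_x         = min(x, na.rm = TRUE),        # rank
    max_x         = max(x, na.rm = TRUE),
    n_rows        = n(),                         # counts
    n_non_missing = sum(!is.na(x)),
    n_unique      = n_distinct(x, na.rm = TRUE)
  )

stats
```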

16.12 GGally

GGally extends the ggplot2 package. Its ggpairs() function draws a matrix of plots that shows the distribution of each variable and the relationship between each pair of variables, which makes it a quick way to explore distributions and correlations across several variables at once.

library(GGally)
penguins |> 
  select(species, island, culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g, sex) |> 
  ggpairs()

If we want to focus our exploration or include important grouping variables then this is also supported.

penguins |> 
  ggpairs(columns = 10:12, # culmen length, culmen depth and flipper length
          ggplot2::aes(colour = species))

16.13 Summary

In this section we learned to:

Inspect structure and completeness
- Use glimpse(), str(), and skim() to understand column types, missing data, and variable ranges.
- Confirm that variables are stored in appropriate formats (e.g. numeric vs character).

Summarise counts and categories
- Count observations using count() and group_by() to explore dataset composition.
- Use janitor::tabyl() for fast, readable cross-tabulations and percentages.

Calculate descriptive statistics
- Compute group-wise summaries with summarise() such as means, SDs, and counts.

Visualise descriptive ranges and associations
- Use ggplot to produce a range of descriptive plots.
