10 Missing data

Datasets often contain missing values. However, analysis and visualisation software can differ in their default response to missing values. They might drop missing values, provide a warning about them, or return an error. We don’t always know which response we’ll get, which can cause problems for us (possibly without our knowing) during analysis later on. This is why it’s important to clean missing values effectively.

Understanding the nature of missing data can also be useful for informing how you proceed with analysis. For instance, understanding whether missing data are random or not can influence what you choose to do with them (e.g. remove vs. impute), which in turn can affect the results of statistical models based on those datasets. Identifying where missing data occur in a dataset is therefore an important part of exploratory data analysis. This chapter will show you how to detect and remove missing values.

Penguin clean names dataset

penguins_clean_names <- readRDS(url("https://github.com/Philip-Leftwich/Oct-Intro-Analytics/raw/refs/heads/dev/data/penguin_clean_names.RDS"))

Your turn

Run this command in your console:

penguins_clean_names |> 
  group_by(species) |> 
  summarise(mean = mean(body_mass_g))

species	mean
Adelie Penguin (Pygoscelis adeliae)	NA
Chinstrap penguin (Pygoscelis antarctica)	3733.088
Gentoo penguin (Pygoscelis papua)	NA

Question: Why is this happening?

10.1 Find Missing values

There are numerous ways to identify missing values in your data. Here we illustrate a few ways with very different output types.

summary(penguins_clean_names)

  study_name        sample_number      species             region         
 Length:344         Min.   :  1.00   Length:344         Length:344        
 Class :character   1st Qu.: 29.00   Class :character   Class :character  
 Mode  :character   Median : 58.00   Mode  :character   Mode  :character  
                    Mean   : 63.15                                        
                    3rd Qu.: 95.25                                        
                    Max.   :152.00                                        
                                                                          
    island             stage           individual_id      clutch_completion 
 Length:344         Length:344         Length:344         Length:344        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   date_egg         culmen_length_mm culmen_depth_mm flipper_length_mm
 Length:344         Min.   :32.10    Min.   :13.10   Min.   :172.0    
 Class :character   1st Qu.:39.23    1st Qu.:15.60   1st Qu.:190.0    
 Mode  :character   Median :44.45    Median :17.30   Median :197.0    
                    Mean   :43.92    Mean   :17.15   Mean   :200.9    
                    3rd Qu.:48.50    3rd Qu.:18.70   3rd Qu.:213.0    
                    Max.   :59.60    Max.   :21.50   Max.   :231.0    
                    NA's   :2        NA's   :2       NA's   :2        
  body_mass_g       sex              delta_15n        delta_13c     
 Min.   :2700   Length:344         Min.   : 7.632   Min.   :-27.02  
 1st Qu.:3550   Class :character   1st Qu.: 8.300   1st Qu.:-26.32  
 Median :4050   Mode  :character   Median : 8.652   Median :-25.83  
 Mean   :4202                      Mean   : 8.733   Mean   :-25.69  
 3rd Qu.:4750                      3rd Qu.: 9.172   3rd Qu.:-25.06  
 Max.   :6300                      Max.   :10.025   Max.   :-23.79  
 NA's   :2                         NA's   :14       NA's   :13      
   comments        
 Length:344        
 Class :character  
 Mode  :character

skimr::skim(penguins_clean_names)

Data summary
Name	penguins_clean_names
Number of rows	344
Number of columns	17
_______________________
Column type frequency:
character	10
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
study_name	0	1.00	7	7	3
species	0	1.00	33	41	3
region	0	1.00	6	6	1
island	0	1.00	5	9	3
stage	0	1.00	18	18	1
individual_id	0	1.00	4	6	190
clutch_completion	0	1.00	2	3	2
date_egg	0	1.00	10	10	50
sex	11	0.97	4	6	2
comments	290	0.16	18	68	10

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
sample_number	0	1.00	63.15	40.43	1.00	29.00	58.00	95.25	152.00	▇▇▆▅▃
culmen_length_mm	2	0.99	43.92	5.46	32.10	39.23	44.45	48.50	59.60	▃▇▇▆▁
culmen_depth_mm	2	0.99	17.15	1.97	13.10	15.60	17.30	18.70	21.50	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.00	190.00	197.00	213.00	231.00	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.00	3550.00	4050.00	4750.00	6300.00	▃▇▆▃▂
delta_15n	14	0.96	8.73	0.55	7.63	8.30	8.65	9.17	10.03	▃▇▆▅▂
delta_13c	13	0.96	-25.69	0.79	-27.02	-26.32	-25.83	-25.06	-23.79	▆▇▅▅▂

naniar::vis_miss(penguins_clean_names)

We can return all rows with a missing variable.

penguins_clean_names |> 
  filter(if_any(everything(), is.na)) |>
  select(culmen_length_mm, culmen_depth_mm, flipper_length_mm, 
         sex, delta_15n, delta_13c,comments,
         everything()) # reorder columns

Or we can specify we just want to look at a specific columns NA:

penguins_clean_names |> 
  filter(if_any(culmen_length_mm, is.na))  # reorder columns

study_name	sample_number	species	region	island	stage	individual_id	clutch_completion	date_egg	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex	delta_15n	delta_13c	comments
PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	16/11/2007	NA	NA	NA	NA	NA	NA	NA	Adult not sampled.
PAL0910	120	Gentoo penguin (Pygoscelis papua)	Anvers	Biscoe	Adult, 1 Egg Stage	N38A2	No	01/12/2009	NA	NA	NA	NA	NA	NA	NA	Adult not sampled. Nest never observed with full clutch.

10.2 Remove missing values

Three main strategies:

Drop all rows with any NA (drop_na()): simple but can remove lots - the nuclear option
Drop NAs in specific columns only.
Keep rows, use na.rm = TRUE inside summary functions (least destructive).

10.2.1 `drop_na()` on everything before we start.

Using drop_na() to remove rows that contain any missing values across all variables can be useful, but it also runs the risk of losing a large portion of the data.

Pros: It’s a simple, all-in-one solution for datasets where missing values are widespread and problematic.
Cons: This method may remove entire rows of data, even if the missing value is only in a single, unimportant column. This can lead to significant data loss if many rows have missing values in non-critical variables.

penguins_clean_names |> 
  drop_na()

10.2.2 `drop_na()` on a particular variable.

Another approach is to remove rows with missing data in a specific variable, such as body_mass_g. This is less drastic but still needs to be done thoughtfully, especially if you overwrite the original dataset with the modified version. It’s often better to drop NAs temporarily for specific analyses or calculations.

Pros: You retain more data because you only remove rows with missing values in the column that matters for your current analysis.
Cons: You might still lose useful information from other columns. If you overwrite the dataset permanently (e.g., using penguins_clean_names <- penguins_clean_names |> drop_na(body_mass_g)), you can’t recover the dropped rows without reloading the data.

10.2.3 Use arguments inside functions

A more cautious approach is to handle missing data within summary or calculation functions, using the na.rm = TRUE argument. Many summary functions (like mean(), median(), or sum()) include this argument, which, when set to TRUE, excludes NA values from the calculation. This allows you to keep missing values in the dataset but ignore them only when performing calculations.

penguins_clean_names |> 
  group_by(species) |> 
  summarise(
    mean_body_mass = mean(body_mass_g, na.rm = T)
  )

Pros: This is the least destructive option because it allows you to handle missing data without removing any rows from the dataset. You only exclude NA values during specific calculations.
Cons: This doesn’t remove missing data, so you need to ensure that they won’t cause issues later in other analyses (e.g., regressions or visualizations).

10.3 Summary

This chapter showed ways to find and remove different types of missing records from your dataset. It’s important to note that there might not be a single cleaned dataset without any missing values. You might find that in the same workflow, you will clean your dataset in multiple ways to preserve rows that are missing some, but not all, information. For example, some records may have complete taxonomic identification but lack spatial coordinates, while others have spatial coordinates but lack taxonomic information. Depending on the type of analysis you intend to perform, you may need to adjust your data cleaning approach accordingly.

For a more comprehensive resource about missing values, including how to clean, visually represent, and impute them, check out The Missing Book.

10.1 Find Missing values

10.2 Remove missing values

10.2.1 drop_na() on everything before we start.

10.2.2 drop_na() on a particular variable.

10.2.3 Use arguments inside functions

10.3 Summary

10.2.1 `drop_na()` on everything before we start.

10.2.2 `drop_na()` on a particular variable.