9  Duplicates

It is very easy to make mistakes when inputting data: entering the same record twice, for example, or introducing copies when assembling a spreadsheet by copy-pasting (yikes!). We can check for this pretty quickly.

penguins_clean_names <- readRDS(url("https://github.com/Philip-Leftwich/Oct-Intro-Analytics/raw/refs/heads/dev/data/penguin_clean_names.RDS"))

9.1 Duplicated rows

# check for whole duplicate 
# rows in the data
penguins_clean_names |> 
  filter(duplicated(across(everything()))) |> 
  nrow() 
[1] 0
library(janitor)

# get_dupes() from janitor checks every column when
# called with no arguments
penguins_clean_names |> 
  get_dupes()
study_name sample_number species region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments dupe_count
(0 rows — no duplicates found)

Great! Our dataset contains no duplicated rows of data.
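To see what duplicated() actually flags, here is a minimal toy example (the tibble and its values are invented purely for illustration):

```r
library(tibble)
library(dplyr)

# A tiny toy data frame with one repeated row
toy <- tibble(
  id    = c(1, 2, 2, 3),
  value = c("a", "b", "b", "c")
)

# duplicated() marks the *second* occurrence of a
# repeated row, so filtering on it shows the copies
toy |> 
  filter(duplicated(across(everything())))
# -> one row: id 2, value "b"
```

Note that the first occurrence is not flagged, which is exactly why filtering on !duplicated() later keeps one copy of each row.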

9.1.1 Working with duplications

As our dataset is duplication-free, let's quickly add a few duplicate rows to our data:

# Take the first 50 rows, then append copies of
# rows 1, 5, 10, 15 and 30 to create duplicates
penguins_demo <- penguins_clean_names |> 
  slice(1:50) |> 
  bind_rows(slice(penguins_clean_names, c(1, 5, 10, 15, 30)))

Your turn

If I did have duplicated rows, I could remove them with a few commands:

# Keep only unduplicated rows by negating with !
penguins_demo |> 
  filter(!duplicated(across(everything())))

# Or, more simply, with dplyr's distinct()
penguins_demo |> 
  distinct()
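distinct() can also de-duplicate on a subset of columns; with .keep_all = TRUE it keeps the first full row for each unique combination. Here is a sketch using an invented toy tibble (the column names echo the penguins data, but the values are made up):

```r
library(tibble)
library(dplyr)

# Toy data: two measurements recorded for the same individual
toy <- tibble(
  individual_id = c("N1A1", "N1A1", "N1A2"),
  body_mass_g   = c(3750, 3800, 3250)
)

# Keep the first full row per unique individual_id
toy |> 
  distinct(individual_id, .keep_all = TRUE)
# -> two rows: N1A1 (3750) and N1A2 (3250)
```

This is useful when rows are not exact copies but you still want one record per identifier.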

9.1.2 Counting unique entries

Using the n_distinct() function from dplyr, we can count the number of unique values in a column of an R data frame:

penguins_clean_names |> 
  summarise(
    n = n(),
    n_unique = n_distinct(individual_id)
  )
    n n_unique
  344      190
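n_distinct() also works inside grouped summaries, for example counting unique individuals per species. The toy tibble below is invented for illustration (its column names mirror the penguins data):

```r
library(tibble)
library(dplyr)

# Toy data: repeated sightings of a few individuals
toy <- tibble(
  species       = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  individual_id = c("N1A1", "N1A1", "N2A1", "N2A2", "N2A2")
)

# Rows vs unique individuals, per species
toy |> 
  group_by(species) |> 
  summarise(
    n        = n(),
    n_unique = n_distinct(individual_id)
  )
# -> Adelie: n = 2, n_unique = 1; Gentoo: n = 3, n_unique = 2
```

Comparing n to n_unique within each group is a quick way to spot where repeated identifiers are concentrated.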