9  Duplicates

It is very easy to make mistakes when inputting data: entering the same record twice, for example, or introducing copies when assembling a spreadsheet by copy-pasting (yikes!). We can check for this pretty quickly.

penguins_clean_names <- readRDS(url("https://github.com/Philip-Leftwich/Oct-Intro-Analytics/raw/refs/heads/dev/data/penguin_clean_names.RDS"))

9.1 Duplicated rows

# check for whole duplicate 
# rows in the data
penguins_clean_names |> 
  filter(duplicated(across(everything()))) |> 
  nrow() 
[1] 0
library(janitor)

# get_dupes() from janitor checks every column when
# called with no arguments
penguins_clean_names |> 
  get_dupes()
study_name sample_number species region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments dupe_count
(0 rows — no duplicates found)

Great! Our dataset contains no duplicated rows of data.
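To see what duplicated() actually flags, here is a minimal toy example (the tibble and its values are invented purely for illustration):

```r
library(tibble)
library(dplyr)

# A tiny toy data frame with one repeated row
toy <- tibble(
  id    = c(1, 2, 2, 3),
  value = c("a", "b", "b", "c")
)

# duplicated() marks the *second* occurrence of a
# repeated row, so filtering on it shows the copies
toy |> 
  filter(duplicated(across(everything())))
# -> one row: id 2, value "b"
```

Note that the first occurrence is not flagged, which is exactly why filtering on !duplicated() later keeps one copy of each row.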

9.1.1 Working with duplications

As our dataset is duplication-free, let's quickly add a few duplicate rows to our data:

# Take the first 50 rows, then append copies of
# rows 1, 5, 10, 15 and 30 to create duplicates
penguins_demo <- penguins_clean_names |> 
  slice(1:50) |> 
  bind_rows(slice(penguins_clean_names, c(1, 5, 10, 15, 30)))

Your turn

If I did have duplicated rows, I could remove them with a few commands:

# Keep only unduplicated rows by negating with !
penguins_demo |> 
  filter(!duplicated(across(everything())))

# Or, more simply, with dplyr's distinct()
penguins_demo |> 
  distinct()
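distinct() can also de-duplicate on a subset of columns; with .keep_all = TRUE it keeps the first full row for each unique combination. Here is a sketch using an invented toy tibble (the column names echo the penguins data, but the values are made up):

```r
library(tibble)
library(dplyr)

# Toy data: two measurements recorded for the same individual
toy <- tibble(
  individual_id = c("N1A1", "N1A1", "N1A2"),
  body_mass_g   = c(3750, 3800, 3250)
)

# Keep the first full row per unique individual_id
toy |> 
  distinct(individual_id, .keep_all = TRUE)
# -> two rows: N1A1 (3750) and N1A2 (3250)
```

This is useful when rows are not exact copies but you still want one record per identifier.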

9.1.2 Counting unique entries

Using the n_distinct() function from dplyr, we can count the number of unique values in a column of an R data frame:

penguins_clean_names |> 
  summarise(
    n = n(),
    n_unique = n_distinct(individual_id)
  )
    n n_unique
  344      190
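n_distinct() also works inside grouped summaries, for example counting unique individuals per species. The toy tibble below is invented for illustration (its column names mirror the penguins data):

```r
library(tibble)
library(dplyr)

# Toy data: repeated sightings of a few individuals
toy <- tibble(
  species       = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  individual_id = c("N1A1", "N1A1", "N2A1", "N2A2", "N2A2")
)

# Rows vs unique individuals, per species
toy |> 
  group_by(species) |> 
  summarise(
    n        = n(),
    n_unique = n_distinct(individual_id)
  )
# -> Adelie: n = 2, n_unique = 1; Gentoo: n = 3, n_unique = 2
```

Comparing n to n_unique within each group is a quick way to spot where repeated identifiers are concentrated.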