8  Strings

library(tidyverse) # loads stringr, dplyr and tidyr, which are used throughout this chapter
penguins_clean_names <- readRDS(url("https://github.com/Philip-Leftwich/Oct-Intro-Analytics/raw/refs/heads/dev/data/penguin_clean_names.RDS"))

8.1 Learning Outcomes

  • Use stringr functions to tidy and manipulate text data.

  • Detect and fix inconsistencies in categorical variables.

  • Apply conditional replacements with case_when() and if_else().

  • Split, join, and clean up messy strings efficiently.

  • Understand when and how to use regex patterns for flexible text matching.

8.2 Strings

Strings are sequences of characters: the letters, spaces, digits and punctuation that make up abbreviations, words or sentences. The same information can be formatted in many different ways within a single dataset. For example, these are just some of the ways the name of a species could be recorded:

  • “Adelie Penguin (Pygoscelis adeliae)”

  • “adelie penguin (pygoscelis adeliae)”

  • “ Adelie Penguin (Pygoscelis adeliae)”

  • “Adelie Penguin - Pygoscelis adeliae”

Terms might be capitalised (or not), have accidental spaces at the beginning or end of a word or sentence, contain typos, or include punctuation; all of these things can impact your ability to consolidate and analyse data accurately.
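To see why this matters, here is a minimal sketch using a made-up vector, species_raw (not part of the penguin data), that holds the four variants listed above. R treats each one as a different value until case and whitespace are standardised.

```r
library(stringr)

# Hypothetical vector holding the four variants listed above
species_raw <- c(
  "Adelie Penguin (Pygoscelis adeliae)",
  "adelie penguin (pygoscelis adeliae)",
  " Adelie Penguin (Pygoscelis adeliae)",
  "Adelie Penguin - Pygoscelis adeliae"
)

# Every variant counts as its own category
length(unique(species_raw))

# Lower-casing and trimming makes the first three identical
species_std <- str_to_lower(str_trim(species_raw))
length(unique(species_std))
```

The first count is 4 and the second is 2. The fourth variant still differs because its punctuation is different, so it would need a pattern-based fix.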

8.3 Basic string manipulation

The stringr package provides lots of handy tools for cleaning and manipulating text (or “strings”) in R. Let’s look at some examples using the penguins_clean_names dataset.

Note

You can think of string manipulation as tidying text variables — useful when species names, island names, or other labels contain extra spaces or unwanted characters.

8.3.1 Trim

Trim whitespace on either side of a string.

str_trim(" Adelie Penguin (Pygoscelis adeliae) ")
[1] "Adelie Penguin (Pygoscelis adeliae)"

Or just from one side:

str_trim("  Adelie Penguin (Pygoscelis adeliae)  ", side = "left")
[1] "Adelie Penguin (Pygoscelis adeliae)  "

8.3.2 Squish

Sometimes extra spaces sneak into your data. str_squish() removes leading, trailing, and extra internal whitespace, leaving only single spaces between words.

str_squish("  Adelie    Penguin   (Pygoscelis   adeliae)  ")
[1] "Adelie Penguin (Pygoscelis adeliae)"
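In practice you will usually clean a whole column rather than a single string. The sketch below assumes a small made-up tibble called messy (not the penguin data) and applies str_squish() inside mutate().

```r
library(dplyr)
library(stringr)

# Hypothetical messy island labels (not from the penguin data)
messy <- tibble(
  island = c("Torgersen", " Torgersen", "Torgersen  ", "Biscoe", "Dream")
)

# Before cleaning: the stray spaces create extra categories
messy |> distinct(island)

# After cleaning: only the three real islands remain
tidy <- messy |>
  mutate(island = str_squish(island))

tidy |> distinct(island)
```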

8.3.3 Truncate

You can shorten long strings to a specific width — handy when making plots or reports.

str_trunc("Adelie Penguin (Pygoscelis adeliae)", width = 18, side = "right")
[1] "Adelie Penguin ..."
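Note that the width includes the three dots. As a small illustration, the sketch below checks the length of the truncated string and uses the ellipsis argument to cut a string without adding dots.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# The "..." counts towards the width, so the result is 18 characters long
short_name <- str_trunc(species_name, width = 18)
str_length(short_name)

# Use an empty ellipsis to cut the string without adding "..."
no_dots <- str_trunc(species_name, width = 14, ellipsis = "")
no_dots
```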

8.3.4 Split

Split a string into smaller pieces based on a separator. For example, splitting a species name into its individual words at each space:

str_split("Adelie Penguin (Pygoscelis adeliae)", " ")
[[1]]
[1] "Adelie"      "Penguin"     "(Pygoscelis" "adeliae)"   
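The output above begins with [[1]] because str_split() returns a list, with one element per input string. A short sketch of how you might get from that list to a plain character vector:

```r
library(stringr)

pieces <- str_split("Adelie Penguin (Pygoscelis adeliae)", " ")

# str_split() returns a list, so pull out the first element to get a plain vector
words <- pieces[[1]]

# Individual words can then be picked out by position
words[1]
length(words)
```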

8.3.5 Concatenate

Join pieces of text into one string.

str_c("Adelie", "Penguin", sep = "_")
[1] "Adelie_Penguin"
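str_c() also works on whole vectors. The sketch below uses a made-up vector of common names to show the difference between sep (placed between the arguments) and collapse (which joins a vector into one string).

```r
library(stringr)

common_names <- c("Adelie", "Gentoo", "Chinstrap")

# str_c() is vectorised: "Penguin" is recycled across each name
full_names <- str_c(common_names, "Penguin", sep = " ")
full_names

# collapse = joins a whole vector into a single string
one_string <- str_c(common_names, collapse = ", ")
one_string
```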

8.4 Cleaning strings with dplyr

We can look for typos by asking R to produce all of the distinct values in a variable. This is very useful for categorical data, where we expect only a few distinct categories.

# Print only unique character strings in this variable
penguins_clean_names |>  
  distinct(sex)
sex
MALE
FEMALE
NA

Here, if someone had mistyped a value, e.g. ‘FMALE’, it would be obvious. We could do the same thing (and probably should have before we changed the names) for species.
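A related check is count(), which also shows how often each value occurs; a value that appears only once or twice in a categorical column is often a typo. The sketch below uses a made-up tibble, sex_check, with one deliberately mistyped value.

```r
library(dplyr)

# Hypothetical data with one mistyped value
sex_check <- tibble(sex = c("MALE", "FEMALE", "FMALE", "FEMALE", "MALE", NA))

# count() lists each distinct value with the number of rows it appears in
sex_counts <- sex_check |>
  count(sex, sort = TRUE)

sex_counts
```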

8.5 Conditional Changes with case_when() and if_else()

Sometimes you’ll want to change the text or category labels in your data based on certain conditions.
The functions case_when() and if_else() from dplyr are designed for this.

8.5.1 case_when()

Use case_when() when you have multiple conditions to check and different outcomes for each one.
It works a bit like an “if / else if / else” ladder.

# use mutate() and case_when()
# for a statement that conditionally changes
# the names of the values in a variable
penguins <- penguins_clean_names |> 
  mutate(species = case_when(
    species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
    species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
    species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
    .default = as.character(species)
  ))

Tip

Think of case_when() as a tidy way to handle many possible categories.
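Conditions are checked from top to bottom, and the first one that is TRUE decides the result. The sketch below illustrates this with made-up body masses and arbitrary cut-offs; neither comes from the penguin data.

```r
library(dplyr)

# Hypothetical body masses in grams; the cut-offs below are arbitrary
masses <- tibble(body_mass_g = c(3250, 3800, 4700, NA))

sized <- masses |>
  mutate(size = case_when(
    body_mass_g < 3500 ~ "small",          # checked first
    body_mass_g < 4500 ~ "medium",         # only reached if the first test was not TRUE
    is.na(body_mass_g) ~ NA_character_,    # keep missing values missing
    .default = "large"                     # everything else
  ))

sized
```

Without the is.na() line, the missing value would fall through to .default and be labelled "large".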

8.5.2 if_else()

if_else() is simpler — it’s for two-way decisions (something is either true or false).

# use mutate and if_else
# for a statement that conditionally changes 
# the names of the values in a variable
penguins <- penguins_clean_names |> 
  mutate(sex = if_else(
    sex == "MALE", "Male", "Female"
  )
  )

Note

if_else() always requires:

  • A logical test (something that’s TRUE or FALSE),

  • A value if TRUE, and

  • A value if FALSE.
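As a minimal sketch of those three arguments, the example below uses made-up body masses and an arbitrary threshold of 4000 g.

```r
library(dplyr)

# Hypothetical body masses in grams; 4000 is an arbitrary threshold
masses <- tibble(body_mass_g = c(3750, 4250, NA))

flagged <- masses |>
  mutate(heavy = if_else(
    body_mass_g > 4000,  # the logical test
    "yes",               # value where the test is TRUE
    "no"                 # value where the test is FALSE
  ))

flagged
```

Where the test itself is NA, the result is NA unless you supply the optional missing argument.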

Question

When might using if_else() be dangerous for data cleaning?

8.6 Rename text values with stringr

Datasets often contain text, and in R these text values are called “(character) strings”.

Often they aren’t formatted quite how we want them to be, but we can manipulate them as much as we like. The functions in the stringr package are fantastic for this, and the range of possible manipulations is almost endless!

Below we reproduce the outcomes above, this time using stringr functions instead of conditional replacements:

# use mutate() with word() to keep only the first word of each species name
# and str_to_title() to convert sex to title case
penguins_clean_names |> 
  mutate(species = stringr::word(species, 1)
  ) |> 
  mutate(sex = stringr::str_to_title(sex))
study_name sample_number species region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments
PAL0708 1 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/2007 39.1 18.7 181 3750 Male NA NA Not enough blood for isotopes.
PAL0708 2 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/2007 39.5 17.4 186 3800 Female 8.94956 -24.69454 NA
PAL0708 3 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 16/11/2007 40.3 18.0 195 3250 Female 8.36821 -25.33302 NA
PAL0708 4 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 16/11/2007 NA NA NA NA NA NA NA Adult not sampled.
PAL0708 5 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 16/11/2007 36.7 19.3 193 3450 Female 8.76651 -25.32426 NA
PAL0708 6 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 16/11/2007 39.3 20.6 190 3650 Male 8.66496 -25.29805 NA
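To see what the two functions do in isolation, here is a short sketch on single strings taken from the data above.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# word() extracts whole words by position (words are separated by spaces)
first_word <- word(species_name, 1)
first_two <- word(species_name, start = 1, end = 2)

# str_to_title() capitalises the first letter of each word and lower-cases the rest
sex_labels <- str_to_title(c("MALE", "FEMALE", NA))

first_word
first_two
sex_labels
```

Missing values stay missing, so the NA in sex is not turned into text.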

Your turn

Convert all species names to uppercase using str_to_upper().

8.7 Split columns

Alternatively, we might want simpler species names while keeping the Latin name in a separate column. To do this we use regular expressions (regex), a concise and flexible tool for describing patterns in strings.

They are, however, also complex and unintuitive. ChatGPT (or any LLM) is great for building more complex regex snippets.

penguins_clean_names |> 
  separate(
    species,
    into = c("species", "full_latin_name"),
    sep = "(?=\\()" # regex pattern: split before the '('
  )
study_name sample_number species full_latin_name region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments
PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/2007 39.1 18.7 181 3750 MALE NA NA Not enough blood for isotopes.
PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/2007 39.5 17.4 186 3800 FEMALE 8.94956 -24.69454 NA
PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 16/11/2007 40.3 18.0 195 3250 FEMALE 8.36821 -25.33302 NA
PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 16/11/2007 NA NA NA NA NA NA NA Adult not sampled.
PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 16/11/2007 36.7 19.3 193 3450 FEMALE 8.76651 -25.32426 NA
PAL0708 6 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 16/11/2007 39.3 20.6 190 3650 MALE 8.66496 -25.29805 NA
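The pattern "(?=\\()" is a lookahead: it matches the position just before an opening bracket without using up the bracket itself, and "\\(" is how a literal bracket is written inside an R string. The sketch below unpacks the same building blocks on a single string using str_extract() and str_remove(). It is an alternative illustration, not the method used above.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# "\\(" is a literal opening bracket, ".*" is any run of characters,
# and "\\)" is a literal closing bracket
bracketed <- str_extract(species_name, "\\(.*\\)")

# Remove the bracketed part, then tidy the space left behind
common_name <- str_squish(str_remove(species_name, "\\(.*\\)"))

# Lookbehind (?<=...) and lookahead (?=...) match positions,
# so the brackets themselves are left out of the result
latin_name <- str_extract(species_name, "(?<=\\().*(?=\\))")

bracketed
common_name
latin_name
```

Splitting just before the bracket should also leave the space after the common name attached to it, which str_trim() or str_squish() can remove.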

8.8 Matching

8.8.1 Detect a pattern

str_detect("Genus specificus", "Genus")
[1] TRUE

Use str_detect() to filter your data. Here, we filter the species names to only those containing the pattern “papua”.

# 3 possible names in species column
penguins_clean_names |> distinct(species)
species
Adelie Penguin (Pygoscelis adeliae)
Gentoo penguin (Pygoscelis papua)
Chinstrap penguin (Pygoscelis antarctica)

penguins_clean_names |>
  filter(str_detect(species, "papua")) |>
  select(species)
species
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
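Pattern matching is case sensitive by default, which matters here because the Adelie label is written with a capital “P” in “Penguin”. The sketch below copies the three species labels into a plain vector to show the difference that regex(ignore_case = TRUE) makes.

```r
library(stringr)

# The three species labels shown above
species_names <- c(
  "Adelie Penguin (Pygoscelis adeliae)",
  "Gentoo penguin (Pygoscelis papua)",
  "Chinstrap penguin (Pygoscelis antarctica)"
)

# Matching is case sensitive: "Penguin" in the Adelie label is not matched
case_sensitive <- str_detect(species_names, "penguin")

# regex(..., ignore_case = TRUE) matches regardless of capitalisation
case_insensitive <- str_detect(species_names, regex("penguin", ignore_case = TRUE))

case_sensitive
case_insensitive
```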

8.8.2 Remove a pattern

# remove match for Genus (followed by a whitespace)
str_remove("Genus specificus", pattern = "Genus ")
[1] "specificus"

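In the earlier example we split the species column into a common name column and a Latin name column, but this left some ugly brackets around the Latin name. We can use str_remove_all() to strip away those brackets; str_remove() would only remove the first bracket it finds: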

penguins_clean_names |> 
    separate(
        species,
        into = c("species", "full_latin_name"),
        sep = "(?=\\()" # regex pattern: split before the '('
    ) |> 
  mutate(full_latin_name = str_remove_all(full_latin_name, "[\\(\\)]"))
study_name sample_number species full_latin_name region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments
PAL0708 1 Adelie Penguin Pygoscelis adeliae Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/2007 39.1 18.7 181 3750 MALE NA NA Not enough blood for isotopes.
PAL0708 2 Adelie Penguin Pygoscelis adeliae Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/2007 39.5 17.4 186 3800 FEMALE 8.94956 -24.69454 NA
PAL0708 3 Adelie Penguin Pygoscelis adeliae Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 16/11/2007 40.3 18.0 195 3250 FEMALE 8.36821 -25.33302 NA
PAL0708 4 Adelie Penguin Pygoscelis adeliae Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 16/11/2007 NA NA NA NA NA NA NA Adult not sampled.
PAL0708 5 Adelie Penguin Pygoscelis adeliae Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 16/11/2007 36.7 19.3 193 3450 FEMALE 8.76651 -25.32426 NA
PAL0708 6 Adelie Penguin Pygoscelis adeliae Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 16/11/2007 39.3 20.6 190 3650 MALE 8.66496 -25.29805 NA
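The difference between str_remove() and str_remove_all() is easiest to see on a single string. A minimal sketch:

```r
library(stringr)

latin_in_brackets <- "(Pygoscelis adeliae)"

# str_remove() drops only the first match, so the closing bracket survives
first_only <- str_remove(latin_in_brackets, "[\\(\\)]")

# str_remove_all() drops every match
all_removed <- str_remove_all(latin_in_brackets, "[\\(\\)]")

first_only
all_removed
```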