8 Strings

Penguin clean names dataset

library(tidyverse) # provides dplyr, tidyr and stringr, which this chapter uses
penguins_clean_names <- readRDS(url("https://github.com/Philip-Leftwich/Oct-Intro-Analytics/raw/refs/heads/dev/data/penguin_clean_names.RDS"))

8.1 Learning Outcomes

Use stringr functions to tidy and manipulate text data.
Detect and fix inconsistencies in categorical variables.
Apply conditional replacements with case_when() and if_else().
Split, join, and clean up messy strings efficiently.
Understand when and how to use regex patterns for flexible text matching.

8.2 Strings

Strings are sequences of characters (letters, digits, spaces and punctuation) that make up abbreviations, words or sentences. The same information can be formatted in many different ways within a single dataset. For example, these are just some of the ways the name of a species could be recorded:

“Adelie Penguin (Pygoscelis adeliae)”
“adelie penguin (pygoscelis adeliae)”
“ Adelie Penguin (Pygoscelis adeliae)”
“Adelie Penguin - Pygoscelis adeliae”

Terms might be capitalised (or not), have accidental spaces at the beginning or end of a word or sentence, contain typos, or include punctuation; all of these things can impact your ability to consolidate and analyse data accurately.
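To make the problem concrete, here is a minimal sketch using a hypothetical vector called `raw_names` (it is not part of the course dataset). It counts how many distinct values R sees before and after some basic tidying with stringr.

```r
library(stringr)

# Hypothetical vector: the same species recorded four different ways
raw_names <- c(
  "Adelie Penguin (Pygoscelis adeliae)",
  "adelie penguin (pygoscelis adeliae)",
  " Adelie Penguin (Pygoscelis adeliae)",
  "Adelie Penguin - Pygoscelis adeliae"
)

# R treats every variant as a different value
n_before <- length(unique(raw_names))

# Remove the stray whitespace and lower the case, then count again
tidied <- str_to_lower(str_trim(raw_names))
n_after <- length(unique(tidied))

n_before
n_after
```

Trimming and changing case merges three of the four variants. The one that uses a dash instead of brackets still differs, and fixing it needs the pattern-based tools covered later in this chapter.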

8.3 Basic string manipulation

The stringr package provides lots of handy tools for cleaning and manipulating text (or “strings”) in R. Let’s look at some examples using the penguins_clean_names dataset.

Note

You can think of string manipulation as tidying text variables — useful when species names, island names, or other labels contain extra spaces or unwanted characters.

8.3.1 Trim

Trim whitespace on either side of a string.

str_trim(" Adelie Penguin (Pygoscelis adeliae) ")

[1] "Adelie Penguin (Pygoscelis adeliae)"

Or just from one side:

str_trim("  Adelie Penguin (Pygoscelis adeliae)  ", side = "left")

[1] "Adelie Penguin (Pygoscelis adeliae)  "

8.3.2 Squish

Sometimes extra spaces sneak into your data. str_squish() removes leading, trailing, and extra internal whitespace, leaving only single spaces between words.

str_squish("  Adelie    Penguin   (Pygoscelis   adeliae)  ")

[1] "Adelie Penguin (Pygoscelis adeliae)"
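As a quick comparison, the sketch below runs str_trim() and str_squish() on the same input. The string `messy` is made up for illustration.

```r
library(stringr)

# Made-up input with two leading, four internal and two trailing spaces
messy <- "  Adelie    Penguin  "

# str_trim() only cleans the two ends
trimmed <- str_trim(messy)

# str_squish() cleans the ends and collapses internal runs of spaces
squished <- str_squish(messy)

nchar(trimmed)
nchar(squished)
```

The trimmed version still contains the four internal spaces, so it is longer than the squished one.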

8.3.3 Truncate

You can shorten long strings to a specific width — handy when making plots or reports.

str_trunc("Adelie Penguin (Pygoscelis adeliae)", width = 18, side = "right")

[1] "Adelie Penguin ..."
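str_trunc() can also keep a different part of the string. The sketch below tries the other values of the `side` argument and shows what happens when the string is already short enough.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# Keep the end of the string rather than the start
from_left <- str_trunc(species_name, width = 18, side = "left")

# Keep both ends and drop the middle
from_center <- str_trunc(species_name, width = 18, side = "center")

# A string that already fits within the width is returned unchanged
short_name <- str_trunc("Gentoo", width = 18)

from_left
from_center
short_name
```

In each case the ellipsis counts towards the requested width, so the truncated strings are exactly 18 characters long.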

8.3.4 Split

Split a string into smaller pieces based on a separator. For example, splitting a species name at every space:

str_split("Adelie Penguin (Pygoscelis adeliae)", " ")

[[1]]
[1] "Adelie"      "Penguin"     "(Pygoscelis" "adeliae)"
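Splitting at every space breaks the name into four pieces. If the aim is to pull apart the common and scientific names, a different separator may work better. The sketch below is one possible approach, splitting at the literal text " (" and, separately, limiting the number of pieces with `n`.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# fixed() matches the separator literally, so "(" needs no escaping
pieces <- str_split(species_name, fixed(" ("))[[1]]

# n = 2 stops after the first split and leaves the rest of the string intact
first_word_and_rest <- str_split(species_name, " ", n = 2)[[1]]

pieces
first_word_and_rest
```

str_split() returns a list with one element per input string, which is why `[[1]]` is used here to pull out the character vector.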

8.3.5 Concatenate

Join pieces of text into one string.

str_c("Adelie", "Penguin", sep = "_")

[1] "Adelie_Penguin"
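str_c() also works on whole vectors. The short sketch below uses a made-up vector `common` to show element-wise joining, collapsing a vector into a single string, and how missing values behave.

```r
library(stringr)

# Made-up vector of common names
common <- c("Adelie", "Gentoo", "Chinstrap")

# The single string "penguin" is recycled across every element
species_labels <- str_c(common, "penguin", sep = " ")

# collapse = joins all the elements into one string
one_line <- str_c(common, collapse = ", ")

# A missing value in any piece makes the whole result missing
with_na <- str_c("Adelie", NA, sep = "_")

species_labels
one_line
with_na
```

The last result is worth remembering: joining a column that contains NA values gives NA for those rows rather than text containing the letters "NA".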

8.4 Cleaning strings with dplyr

We can look for typos by asking R to produce all of the distinct values in a variable. This is very useful for categorical data, where we expect only a few distinct categories.

# Print only unique character strings in this variable
penguins_clean_names |>  
  distinct(sex)

sex
MALE
FEMALE
NA

Here, if someone had mistyped a value (e.g. ‘FMALE’), it would be obvious. We could do the same thing (and probably should have before we changed the names) for species.
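The penguin data does not contain such a typo, so the sketch below uses a small made-up data frame, `sex_records`, to show what one would look like. It uses count() rather than distinct() so that the number of rows per value is visible too.

```r
library(dplyr)

# Made-up records containing one typo ("FMALE") and one missing value
sex_records <- data.frame(
  sex = c("MALE", "FEMALE", "FMALE", "MALE", NA, "FEMALE")
)

# count() lists each distinct value together with how often it appears
sex_counts <- sex_records |>
  count(sex, sort = TRUE)

sex_counts
```

A category that appears only once or twice next to much larger groups is often a sign of a typo rather than a real category.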

8.5 Conditional Changes with `case_when()` and `if_else()`

Sometimes you’ll want to change the text or category labels in your data based on certain conditions.
The functions case_when() and if_else() from dplyr are designed for this.

8.5.1 `case_when()`

Use case_when() when you have multiple conditions to check and different outcomes for each one.
It works a bit like an “if / else if / else” ladder.

# use mutate and case_when 
# for a statement that conditionally changes 
# the names of the values in a variable
penguins <- penguins_clean_names |> 
  mutate(species = case_when(
    species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
    species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
    species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
    .default = as.character(species)
  ))

Tip

Think of case_when() as a tidy way to handle many possible categories.
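To see what the `.default` argument does, the sketch below applies case_when() to a small made-up vector that includes a species not found in the penguin data. It assumes a version of dplyr recent enough to support `.default`, as the code above already does.

```r
library(dplyr)

# Made-up vector; the Emperor penguin is not in the course dataset
species_raw <- c(
  "Adelie Penguin (Pygoscelis adeliae)",
  "Gentoo penguin (Pygoscelis papua)",
  "Emperor penguin (Aptenodytes forsteri)"
)

# Values that match no condition keep their original text
with_default <- case_when(
  species_raw == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
  species_raw == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
  .default = species_raw
)

# Without .default, values that match no condition become NA
without_default <- case_when(
  species_raw == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
  species_raw == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo"
)

with_default
without_default
```

Keeping the original value as the default means unexpected categories stay visible in the data instead of silently turning into missing values.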

8.5.2 `if_else()`

if_else() is simpler — it’s for two-way decisions (something is either true or false).

# use mutate and if_else
# for a statement that conditionally changes 
# the names of the values in a variable
penguins <- penguins_clean_names |> 
  mutate(sex = if_else(
    sex == "MALE", "Male", "Female"
  )
  )

Note

if_else() always requires:

A logical test (something that’s TRUE or FALSE),

A value if TRUE, and

A value if FALSE.
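The three arguments can be tried out on a small made-up vector, as in the sketch below. It also shows the optional `missing` argument, which controls what is returned when the test is neither TRUE nor FALSE.

```r
library(dplyr)

# Made-up vector with one missing value
sex_raw <- c("MALE", "FEMALE", NA)

# When the test is NA, the result is NA as well
recoded <- if_else(sex_raw == "MALE", "Male", "Female")

# missing = supplies a replacement for those NA cases
recoded_labelled <- if_else(sex_raw == "MALE", "Male", "Female", missing = "Unknown")

recoded
recoded_labelled
```

if_else() is also strict about types: the TRUE and FALSE values need to be compatible (for example, both character).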

Question

When might using if_else be dangerous for data cleaning?

8.6 Rename text values with stringr

Datasets often contain words, and we call these words “(character) strings”.

Often these aren’t quite how we want them to be, but we can manipulate them as much as we like. The functions in the stringr package are well suited to this, and they cover a very wide range of manipulations.

Below we reproduce the results above, but with stringr functions:

# use mutate with stringr::word() to keep only the first word 
# of each species name, and stringr::str_to_title() 
# to change the capitalisation of sex
penguins_clean_names |> 
  mutate(species = stringr::word(species, 1)
  ) |> 
  mutate(sex = stringr::str_to_title(sex))

study_name	sample_number	species	region	island	stage	individual_id	clutch_completion	date_egg	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex	delta_15n	delta_13c	comments
PAL0708	1	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/2007	39.1	18.7	181	3750	Male	NA	NA	Not enough blood for isotopes.
PAL0708	2	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/2007	39.5	17.4	186	3800	Female	8.94956	-24.69454	NA
PAL0708	3	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	16/11/2007	40.3	18.0	195	3250	Female	8.36821	-25.33302	NA
PAL0708	4	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	16/11/2007	NA	NA	NA	NA	NA	NA	NA	Adult not sampled.
PAL0708	5	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	16/11/2007	36.7	19.3	193	3450	Female	8.76651	-25.32426	NA
PAL0708	6	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N3A2	Yes	16/11/2007	39.3	20.6	190	3650	Male	8.66496	-25.29805	NA
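To see what these two functions do on their own, here is a brief sketch applied to single strings rather than to the whole dataset.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# word() picks out words separated by spaces, counting from 1
first_word <- word(species_name, 1)

# A start and an end position return a range of words
first_two <- word(species_name, 1, 2)

# str_to_title() capitalises the first letter of each word and lowers the rest
sex_title <- str_to_title("MALE")

first_word
first_two
sex_title
```

Because word() relies on spaces alone, it only gives a tidy species name when the part we want is always the first word.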

Your turn

Convert all species names to uppercase using str_to_upper().

8.7 Split columns

Alternatively, we might want simpler species names while keeping the Latin name in a separate column. To do this we use a regular expression (regex): a concise and flexible way of describing patterns in strings.

Regular expressions are, however, also complex and unintuitive. ChatGPT (or any LLM) can be helpful for building more complex regex snippets, though the result should always be checked against your own data.

penguins_clean_names |> 
    separate(
        species,
        into = c("species", "full_latin_name"),
        sep = "(?=\\()"
    )

study_name	sample_number	species	full_latin_name	region	island	stage	individual_id	clutch_completion	date_egg	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex	delta_15n	delta_13c	comments
PAL0708	1	Adelie Penguin	(Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/2007	39.1	18.7	181	3750	MALE	NA	NA	Not enough blood for isotopes.
PAL0708	2	Adelie Penguin	(Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/2007	39.5	17.4	186	3800	FEMALE	8.94956	-24.69454	NA
PAL0708	3	Adelie Penguin	(Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	16/11/2007	40.3	18.0	195	3250	FEMALE	8.36821	-25.33302	NA
PAL0708	4	Adelie Penguin	(Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	16/11/2007	NA	NA	NA	NA	NA	NA	NA	Adult not sampled.
PAL0708	5	Adelie Penguin	(Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	16/11/2007	36.7	19.3	193	3450	FEMALE	8.76651	-25.32426	NA
PAL0708	6	Adelie Penguin	(Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A2	Yes	16/11/2007	39.3	20.6	190	3650	MALE	8.66496	-25.29805	NA
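The pattern used here is easier to read in pieces: `\\(` is an escaped bracket (a bare `(` has a special meaning in regex), and `(?= )` is a lookahead, which matches a position without consuming any characters. The sketch below applies the same pattern to a single string with str_split(), as an illustration rather than a description of how separate() works internally.

```r
library(stringr)

species_name <- "Adelie Penguin (Pygoscelis adeliae)"

# Split at the position just before "(" so the bracket stays in the second piece
pieces <- str_split(species_name, "(?=\\()")[[1]]

# The first piece still ends with the space that came before the bracket
common_name <- str_trim(pieces[1])

# Lookbehind plus lookahead: extract only the text between the brackets
latin_name <- str_extract(species_name, "(?<=\\().*(?=\\))")

pieces
common_name
latin_name
```

Note the trailing space left on the first piece. The printed table can hide this kind of whitespace, so it may be worth applying str_trim() to the new species column after splitting.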

8.8 Matching

8.8.1 Detect a pattern

str_detect("Genus specificus", "Genus")

[1] TRUE

Use str_detect() to filter your data. Here, we filter the species names to only those containing the pattern “papua”.

# 3 possible names in species column
penguins_clean_names |> distinct(species)

species
Adelie Penguin (Pygoscelis adeliae)
Gentoo penguin (Pygoscelis papua)
Chinstrap penguin (Pygoscelis antarctica)

penguins_clean_names |>
  filter(str_detect(species, "papua")) |>
  select(species)

species
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
Gentoo penguin (Pygoscelis papua)
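Pattern matching is case sensitive by default, which matters here because the species names are capitalised inconsistently ("Penguin" and "penguin"). The sketch below copies the three distinct species values into a vector and shows two optional extras: ignoring case, and negating the match.

```r
library(stringr)

# The three distinct species values shown above
species_names <- c(
  "Adelie Penguin (Pygoscelis adeliae)",
  "Gentoo penguin (Pygoscelis papua)",
  "Chinstrap penguin (Pygoscelis antarctica)"
)

# Case sensitive: only the Adelie entry has a capital "P"
case_sensitive <- str_detect(species_names, "Penguin")

# regex() with ignore_case = TRUE matches all three
case_insensitive <- str_detect(species_names, regex("penguin", ignore_case = TRUE))

# negate = TRUE flags the entries that do NOT contain the pattern
not_papua <- str_detect(species_names, "papua", negate = TRUE)

case_sensitive
case_insensitive
not_papua
```

Any of these logical vectors could be used inside filter() in the same way as the example above.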

8.8.2 Remove a pattern

# remove match for Genus (followed by a whitespace)
str_remove("Genus specificus", pattern = "Genus ")

[1] "specificus"

In the Split columns example above we separated the species column into a common name column and a Latin name column, but the Latin name kept its brackets. We can use str_remove_all() to strip them away (str_remove() would only remove the first match):

penguins_clean_names |> 
    separate(
        species,
        into = c("species", "full_latin_name"),
        sep = "(?=\\()" # regex pattern: split before the '('
    ) |> 
  mutate(full_latin_name = str_remove_all(full_latin_name, "[\\(\\)]"))

study_name	sample_number	species	full_latin_name	region	island	stage	individual_id	clutch_completion	date_egg	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex	delta_15n	delta_13c	comments
PAL0708	1	Adelie Penguin	Pygoscelis adeliae	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/2007	39.1	18.7	181	3750	MALE	NA	NA	Not enough blood for isotopes.
PAL0708	2	Adelie Penguin	Pygoscelis adeliae	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/2007	39.5	17.4	186	3800	FEMALE	8.94956	-24.69454	NA
PAL0708	3	Adelie Penguin	Pygoscelis adeliae	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	16/11/2007	40.3	18.0	195	3250	FEMALE	8.36821	-25.33302	NA
PAL0708	4	Adelie Penguin	Pygoscelis adeliae	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	16/11/2007	NA	NA	NA	NA	NA	NA	NA	Adult not sampled.
PAL0708	5	Adelie Penguin	Pygoscelis adeliae	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	16/11/2007	36.7	19.3	193	3450	FEMALE	8.76651	-25.32426	NA
PAL0708	6	Adelie Penguin	Pygoscelis adeliae	Anvers	Torgersen	Adult, 1 Egg Stage	N3A2	Yes	16/11/2007	39.3	20.6	190	3650	MALE	8.66496	-25.29805	NA
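The difference between str_remove() and str_remove_all() is easiest to see on a single string, as in this short sketch.

```r
library(stringr)

latin_with_brackets <- "(Pygoscelis adeliae)"

# str_remove() deletes only the first match, so the closing bracket survives
first_only <- str_remove(latin_with_brackets, "[\\(\\)]")

# str_remove_all() deletes every match
all_removed <- str_remove_all(latin_with_brackets, "[\\(\\)]")

first_only
all_removed
```

The square brackets in the pattern define a set of characters, so `[\\(\\)]` matches either an opening or a closing bracket.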
