8 Strings
8.1 Learning Outcomes
Use stringr functions to tidy and manipulate text data.
Detect and fix inconsistencies in categorical variables.
Apply conditional replacements with case_when() and if_else().
Split, join, and clean up messy strings efficiently.
Understand when and how to use regex patterns for flexible text matching.
8.2 Strings
Strings are sequences of characters: letters, spaces, digits and punctuation that together form abbreviations, words or sentences. The same information can be formatted in many different ways within a single dataset. For example, these are just some of the ways the name of a species could be recorded:
“Adelie Penguin (Pygoscelis adeliae)”
“adelie penguin (pygoscelis adeliae)”
“ Adelie Penguin (Pygoscelis adeliae)”
“Adelie Penguin - Pygoscelis adeliae”
Terms might be capitalised (or not), have accidental spaces at the beginning or end of a word or sentence, contain typos, or include punctuation; all of these things can impact your ability to consolidate and analyse data accurately.
8.3 Basic string manipulation
The stringr package provides lots of handy tools for cleaning and manipulating text (or “strings”) in R. Let’s look at some examples using the penguins_clean_names dataset.
You can think of string manipulation as tidying text variables — useful when species names, island names, or other labels contain extra spaces or unwanted characters.
8.3.1 Trim
Trim whitespace on either side of a string.
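As a minimal sketch, assuming stringr is installed and using a made-up string rather than a value taken from penguins_clean_names, trimming both sides might look like this:

```r
library(stringr)

# Hypothetical example string with stray spaces on both sides
messy <- "  Adelie Penguin  "

# side = "both" is the default, so leading and trailing spaces are removed
trimmed <- str_trim(messy)
trimmed
```

Internal spaces are left alone; only the ends of the string are affected.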
Or just from one side:
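A sketch of one-sided trimming, again on a made-up string; the `side` argument of `str_trim()` accepts "left", "right" or "both":

```r
library(stringr)

# Hypothetical example string with stray spaces on both sides
messy <- "  Adelie Penguin  "

# Remove only the leading spaces
left_trimmed <- str_trim(messy, side = "left")

# Remove only the trailing spaces
right_trimmed <- str_trim(messy, side = "right")

left_trimmed
right_trimmed
```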
8.3.2 Squish
Sometimes extra spaces sneak into your data. str_squish() removes leading, trailing, and extra internal whitespace, leaving only single spaces between words.
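For illustration, here is a small sketch on a made-up string (not a value from the dataset) with extra spaces at the ends and between words:

```r
library(stringr)

# Hypothetical string with leading, trailing and repeated internal spaces
messy <- "  Adelie   Penguin  (Pygoscelis   adeliae) "

# Trim the ends and collapse runs of whitespace to a single space
squished <- str_squish(messy)
squished
```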
8.3.3 Truncate
You can shorten long strings to a specific width — handy when making plots or reports.
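A minimal sketch, assuming a width of 17 characters chosen purely for illustration. Note that the width counts the ellipsis that `str_trunc()` appends:

```r
library(stringr)

long_name <- "Adelie Penguin (Pygoscelis adeliae)"

# Keep at most 17 characters in total, including the trailing "..."
short_name <- str_trunc(long_name, width = 17, side = "right")
short_name
```

Strings that are already shorter than the requested width are returned unchanged.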
8.3.4 Split
Split a string into smaller pieces based on a separator. For example, separating the scientific name in parentheses:
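One possible sketch, assuming we split on the literal text " (" that precedes the scientific name; the choice of separator is an assumption, and `fixed()` is used so the bracket is not read as regex syntax:

```r
library(stringr)

full_name <- "Adelie Penguin (Pygoscelis adeliae)"

# str_split() returns a list with one character vector per input string
pieces <- str_split(full_name, pattern = fixed(" ("))

# Extract the pieces for our single input string
pieces[[1]]
```

The closing bracket stays attached to the second piece, so some further cleaning would still be needed.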
8.3.5 Concatenate
Join pieces of text into one string.
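A short sketch using `str_c()`, with made-up pieces of text; `sep` controls what is placed between the pieces:

```r
library(stringr)

# Join two words with a single space between them
joined <- str_c("Adelie", "Penguin", sep = " ")
joined

# collapse = joins the elements of one vector into a single string
islands <- str_c(c("Biscoe", "Dream", "Torgersen"), collapse = ", ")
islands
```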
8.4 Cleaning strings with dplyr
We can look for typos by asking R to produce all of the distinct values in a variable. This is very useful for categorical data, where we expect there to be only a few distinct categories.
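A minimal sketch of that check, using a small made-up tibble (`penguins_toy`, a hypothetical stand-in for penguins_clean_names) whose values are chosen to mirror the three categories in the real data:

```r
library(dplyr)

# Toy stand-in for penguins_clean_names; the values are illustrative only
penguins_toy <- tibble(sex = c("MALE", "FEMALE", NA, "FEMALE", "MALE"))

# Return one row per unique value of sex
sex_levels <- penguins_toy |>
  distinct(sex)

sex_levels
```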
| sex |
|---|
| MALE |
| FEMALE |
| NA |
Here if someone had mistyped e.g. ‘FMALE’ it would be obvious. We could do the same thing (and probably should have before we changed the names) for species.
8.5 Conditional Changes with case_when() and if_else()
Sometimes you’ll want to change the text or category labels in your data based on certain conditions.
The functions case_when() and if_else() from dplyr are designed for this.
8.5.1 case_when()
Use case_when() when you have multiple conditions to check and different outcomes for each one.
It works a bit like an “if / else if / else” ladder.
```r
library(dplyr)

# use mutate and case_when
# for a statement that conditionally changes
# the names of the values in a variable
penguins <- penguins_clean_names |>
  mutate(species = case_when(
    species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
    species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
    species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
    # any value not matched above is kept unchanged
    .default = as.character(species)
  ))
```

Think of case_when() as a tidy way to handle many possible categories.
8.5.2 if_else()
if_else() is simpler — it’s for two-way decisions (something is either true or false).
if_else() always requires:
A logical test (something that’s TRUE or FALSE),
A value if TRUE, and
A value if FALSE.
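The three parts listed above can be sketched as follows. The tibble `penguins_toy` and the new column `island_group` are hypothetical names introduced for illustration, not part of the original dataset:

```r
library(dplyr)

# Toy stand-in for the penguin data; the values are illustrative only
penguins_toy <- tibble(island = c("Torgersen", "Biscoe", "Dream", NA))

penguins_grouped <- penguins_toy |>
  mutate(
    island_group = if_else(
      island == "Torgersen", # logical test
      "Torgersen",           # value if TRUE
      "Other island"         # value if FALSE
    )
  )

penguins_grouped
```

Where the test itself is NA, as in the last row, if_else() returns NA unless a value is supplied through its `missing` argument.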
8.6 Rename text values with stringr
Datasets often contain words, and we call these words “(character) strings”.
Often these aren’t quite how we want them to be, but we can manipulate them as much as we like. The functions in the stringr package are well suited to this, and the range of possible manipulations is almost endless.
Below we repeat the outcomes above, but with string matching:
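One way to get this kind of result is sketched here on a made-up tibble (`penguins_toy`, a hypothetical stand-in for penguins_clean_names). It assumes the short species name is always the first word of the label; this is not necessarily the exact code used to produce the table that follows:

```r
library(dplyr)
library(stringr)

# Toy stand-in for penguins_clean_names; the values are illustrative only
penguins_toy <- tibble(
  species = c(
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
    "Chinstrap penguin (Pygoscelis antarctica)"
  ),
  sex = c("MALE", "FEMALE", NA)
)

penguins_short <- penguins_toy |>
  mutate(
    species = word(species, 1), # keep only the first word
    sex = str_to_title(sex)     # "MALE" becomes "Male"
  )

penguins_short
```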
| study_name | sample_number | species | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | delta_15n | delta_13c | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAL0708 | 1 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/2007 | 39.1 | 18.7 | 181 | 3750 | Male | NA | NA | Not enough blood for isotopes. |
| PAL0708 | 2 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/2007 | 39.5 | 17.4 | 186 | 3800 | Female | 8.94956 | -24.69454 | NA |
| PAL0708 | 3 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 16/11/2007 | 40.3 | 18.0 | 195 | 3250 | Female | 8.36821 | -25.33302 | NA |
| PAL0708 | 4 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 16/11/2007 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled. |
| PAL0708 | 5 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 16/11/2007 | 36.7 | 19.3 | 193 | 3450 | Female | 8.76651 | -25.32426 | NA |
| PAL0708 | 6 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 16/11/2007 | 39.3 | 20.6 | 190 | 3650 | Male | 8.66496 | -25.29805 | NA |
Your turn
Convert all species names to uppercase using str_to_upper().
8.7 Split columns
Alternatively, we might want simpler species names while keeping the Latin name information in a separate column. To do this we use regex. Regular expressions are a concise and flexible tool for describing patterns in strings.
They are, however, also complex and unintuitive. ChatGPT (or any LLM) is great for building more complex regex snippets.
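As a hedged sketch of the idea, the code below uses the regex `\\(.*\\)` (an opening bracket, any characters, a closing bracket) with `str_extract()` and `str_remove()`. This is one possible approach and may differ from the call used to build the table that follows; `penguins_toy` is a hypothetical stand-in for penguins_clean_names:

```r
library(dplyr)
library(stringr)

# Toy stand-in for penguins_clean_names; the values are illustrative only
penguins_toy <- tibble(
  species = c(
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)"
  )
)

penguins_split <- penguins_toy |>
  mutate(
    # Pull out the bracketed Latin name, brackets included
    full_latin_name = str_extract(species, "\\(.*\\)"),
    # Remove the bracketed part and tidy up the leftover space
    species = str_squish(str_remove(species, "\\(.*\\)")),
    # Place the new column straight after species
    .after = species
  )

penguins_split
```

The backslashes are doubled because a bracket has a special meaning in regex and must be escaped, and the backslash itself must be escaped inside an R string.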
| study_name | sample_number | species | full_latin_name | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | delta_15n | delta_13c | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAL0708 | 1 | Adelie Penguin | (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/2007 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes. |
| PAL0708 | 2 | Adelie Penguin | (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/2007 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA |
| PAL0708 | 3 | Adelie Penguin | (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 16/11/2007 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA |
| PAL0708 | 4 | Adelie Penguin | (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 16/11/2007 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled. |
| PAL0708 | 5 | Adelie Penguin | (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 16/11/2007 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA |
| PAL0708 | 6 | Adelie Penguin | (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 16/11/2007 | 39.3 | 20.6 | 190 | 3650 | MALE | 8.66496 | -25.29805 | NA |
8.8 Matching
8.8.1 Detect a pattern
Use str_detect() to filter your data. Here, we filter the species names to only those containing the pattern “papua”.
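A minimal sketch of both steps, listing the distinct species and then filtering, on a made-up tibble (`penguins_toy`, a hypothetical stand-in for penguins_clean_names):

```r
library(dplyr)
library(stringr)

# Toy stand-in for penguins_clean_names; the values are illustrative only
penguins_toy <- tibble(
  species = c(
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)"
  )
)

# All distinct species labels
all_species <- penguins_toy |>
  distinct(species)

# str_detect() returns TRUE or FALSE per row; filter() keeps the TRUE rows
gentoo_only <- penguins_toy |>
  filter(str_detect(species, "papua"))

all_species
gentoo_only
```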
| species |
|---|
| Adelie Penguin (Pygoscelis adeliae) |
| Gentoo penguin (Pygoscelis papua) |
| Chinstrap penguin (Pygoscelis antarctica) |
| species |
|---|
| Gentoo penguin (Pygoscelis papua) |
| Gentoo penguin (Pygoscelis papua) |
| Gentoo penguin (Pygoscelis papua) |
| Gentoo penguin (Pygoscelis papua) |
| Gentoo penguin (Pygoscelis papua) |
| Gentoo penguin (Pygoscelis papua) |
8.8.2 Remove a pattern
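`str_remove()` deletes the first match of a pattern from a string. In the sketch below the input string and pattern are assumptions, chosen so that the result matches the output shown next:

```r
library(stringr)

# Hypothetical input: remove the genus and the space after it
species_only <- str_remove("Genus specificus", pattern = "Genus ")
species_only
```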
[1] "specificus"
In the example above we split the species column into a common name column and a Latin name column, but this left some ugly brackets. We can use str_remove_all() to strip them away (str_remove() on its own removes only the first match):
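A sketch of the bracket removal on a made-up tibble (`penguins_toy`, a hypothetical stand-in for the split data). It uses `str_remove_all()` with the character class `[()]`, which matches either bracket, so that both the opening and the closing bracket are removed:

```r
library(dplyr)
library(stringr)

# Toy stand-in for the split data; the values are illustrative only
penguins_toy <- tibble(
  species = c("Adelie Penguin", "Gentoo penguin"),
  full_latin_name = c("(Pygoscelis adeliae)", "(Pygoscelis papua)")
)

penguins_no_brackets <- penguins_toy |>
  mutate(full_latin_name = str_remove_all(full_latin_name, "[()]"))

penguins_no_brackets
```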
| study_name | sample_number | species | full_latin_name | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | delta_15n | delta_13c | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAL0708 | 1 | Adelie Penguin | Pygoscelis adeliae | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/2007 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes. |
| PAL0708 | 2 | Adelie Penguin | Pygoscelis adeliae | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/2007 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA |
| PAL0708 | 3 | Adelie Penguin | Pygoscelis adeliae | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 16/11/2007 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA |
| PAL0708 | 4 | Adelie Penguin | Pygoscelis adeliae | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 16/11/2007 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled. |
| PAL0708 | 5 | Adelie Penguin | Pygoscelis adeliae | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 16/11/2007 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA |
| PAL0708 | 6 | Adelie Penguin | Pygoscelis adeliae | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 16/11/2007 | 39.3 | 20.6 | 190 | 3650 | MALE | 8.66496 | -25.29805 | NA |