3  Data cleaning

In this chapter you will learn how to use tidyverse functions to data clean and wrangle tidy dataTidy data refers to a specific format for organizing datasets that makes data easier to work with for analysis and visualization in R, especially using the tidyverse. The concept was popularized by Hadley Wickham in his paper “Tidy Data” and is an essential principle for effective data manipulation. introduced in Appendix C

3.1 Introduction to dplyr

In this section we will be introduced to some of the most commonly used data wrangling functions, these come from the dplyr package (part of the tidyverse). These are functions you are likely to become very familiar with.

verb action
select() choose columns by name
filter() select rows based on conditions
arrange() reorder the rows
summarise() reduce raw data to user defined summaries
group_by() group the rows by a specified column
mutate() create a new variable

3.1.1 Select

If we wanted to create a dataset that only includes certain variables, we can use the dplyr::select() function from the dplyr package.

For example I might wish to create a simplified dataset that only contains species, sex, flipper_length_mm and body_mass_g.

Run the below code to select only those columns

select(
   # the data object
  .data = penguins_raw,
   # the variables you want to select
  `Species`, `Sex`, `Flipper Length (mm)`, `Body Mass (g)`)

Alternatively you could tell R the columns you don’t want e.g. 

select(.data = penguins_raw,
       -`studyName`, -`Sample Number`)

Note that select() does not change the original penguins tibble. It spits out the new tibble directly into your console.

If you don’t save this new tibble, it won’t be stored. If you want to keep it, then you must create a new object.

When you run this new code, you will not see anything in your console, but you will see a new object appear in your Environment pane.

new_penguins <- select(.data = penguins_raw, 
       `Species`, `Sex`, `Flipper Length (mm)`, `Body Mass (g)`)

3.1.2 Filter

Having previously used dplyr::select() to select certain variables, we will now use dplyr::filter() to select only certain rows or observations. For example only Adelie penguins.

We can do this with the equivalence operator ==

filter(.data = new_penguins, 
       `Species` == "Adelie Penguin (Pygoscelis adeliae)")

Don’t confuse “=” which assigns a value with “==” which checks if two things are ‘equivalent’. See Section A.5.1 for more.

We can use several different operators to assess the way in which we should filter our data that work the same in tidyverse or base R.

Boolean expressions
Operator Name
A < B less than
A <= B less than or equal to
A > B greater than
A >= B greater than or equal to
A == B equivalence
A != B not equal
A %in% B in

If you wanted to select all the Penguin species except Adelies, you use ‘not equals’.

filter(.data = new_penguins, 
       `Species` != "Adelie Penguin (Pygoscelis adeliae)")

This is the same as

filter(.data = new_penguins, 
       `Species` %in% c("Chinstrap penguin (Pygoscelis antarctica)",
                      "Gentoo penguin (Pygoscelis papua)")
       )

You can include multiple expressions within filter() and it will pull out only those rows that evaluate to TRUE for all of your conditions.

For example the below code will pull out only those observations of Adelie penguins where flipper length was measured as greater than 190mm.

filter(.data = new_penguins, 
       `Species` == "Adelie Penguin (Pygoscelis adeliae)", 
       `Flipper Length (mm)` > 190)

3.1.3 Arrange

The function arrange() sorts the rows in the table according to the columns supplied. For example

arrange(.data = new_penguins, 
        `Sex`)

The data is now arranged in alphabetical order by sex. So all of the observations of female penguins are listed before males.

You can also reverse this with desc()

arrange(.data = new_penguins, 
        desc(`Sex`))

You can also sort by more than one column, what do you think the code below does?

arrange(.data = new_penguins,
        `Sex`,
        desc(`Species`),
        desc(`Flipper Length (mm)`))

3.1.4 Mutate

Sometimes we need to create a new variable that doesn’t exist in our dataset. For example we might want to figure out what the flipper length is when factoring in body mass.

To create new variables we use the function mutate().

Note that as before, if you want to save your new column you must save it as an object. Here we are mutating a new column and attaching it to the new_penguins data oject.

new_penguins <- mutate(.data = new_penguins,
                     body_mass_kg = `Body Mass (g)`/1000)

3.2 Pipes

Pipes make code more human readable

Pipes look like this: |> , a pipeAn operator that allows you to chain multiple functions together in a sequence. allows you to send the output from one function straight into another function. Specifically, they send the result of the function before |> to be the first argument of the function after |>. As usual, it’s easier to show, rather than tell so let’s look at an example.

# this example uses brackets to nest and order functions
arrange(.data = filter(
  .data = select(
  .data = penguins_raw, 
  species, `Sex`, `Flipper Length (mm)`), 
  `Sex` == "MALE"), 
  desc(`Flipper Length (mm)`))
# this example uses sequential R objects 
object_1 <- select(.data = penguins_raw, 
                   `Species`, `Sex`, `Flipper Length (mm)`)
object_2 <- filter(.data = object_1, 
                   `Sex` == "MALE")
arrange(object_2, 
        desc(`Flipper Length (mm)`))
# this example is human readable without intermediate objects
penguins_raw |>  
  select(`Species`, `Sex`, `Flipper Length (mm)`) |>  
  filter(`Sex` == "MALE") |>  
  arrange(`Flipper Length (mm)`))

The reason that this function is called a pipe is because it ‘pipes’ the data through to the next function. When you wrote the code previously, the first argument of each function was the dataset you wanted to work on. When you use pipes it will automatically take the data from the previous line of code so you don’t need to specify it again.

3.2.1 Task

Try and write out as plain English what the |> above is doing? You can read the |> as THEN

Take the penguins data AND THEN Select only the species, sex and flipper length columns AND THEN Filter to keep only those observations labelled as sex equals male AND THEN Arrange the data from HIGHEST to LOWEST flipper lengths.

From R version 4 onwards there is now a “native pipe” |>

This doesn’t require the tidyverse magrittr package and the “old pipe” %>% or any other packages to load and use.

You may be familiar with the magrittr pipe or see it in other tutorials, and website usages. The native pipe works equivalntly in most situations but if you want to read about some of the operational differences, this site does a good job of explaining .

3.3 Clean the Penguin Data

Warning

Re-open your 01_import_penguins_data.R started in Chapter 1 and start to add these commands to your data importing and cleaning script:

3.3.1 Activity 1: Explore data structure

Before working with your data, it’s essential to understand its underlying structure and content. In this section, we’ll use powerful functions like glimpse(), str(), summary(), head(), tail() and the add-on function skimr::skim() to thoroughly examine your dataset. These tools provide insights into data types, variable distributions, and sample records, helping you identify initial issues such as missing values or inconsistent data types. By gaining a clear understanding of your data’s structure, you’ll be better equipped to address any problems and proceed confidently with data cleaning and analysis.

When we run glimpse() we get several lines of output. The number of observations “rows”, the number of variables “columns”. Check this against the csv file you have - they should be the same. In the next lines we see variable names and the type of data.

glimpse(penguins_raw)
Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <chr> "11/11/2007", "11/11/2007", "16/11/2007", "16/11…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

We can see a dataset with 345 rows (including the headers) and 17 variables It also provides information on the type of data in each column

  • <chr> - means character or text data

  • <dbl> - means numerical data

When we run summary() we get similar information, in addition for any numerical values we get summary statistics such as mean, median, min, max, quartile ranges and any missing (NA) values

summary(penguins_raw)
  studyName         Sample Number      Species             Region         
 Length:344         Min.   :  1.00   Length:344         Length:344        
 Class :character   1st Qu.: 29.00   Class :character   Class :character  
 Mode  :character   Median : 58.00   Mode  :character   Mode  :character  
                    Mean   : 63.15                                        
                    3rd Qu.: 95.25                                        
                    Max.   :152.00                                        
                                                                          
    Island             Stage           Individual ID      Clutch Completion 
 Length:344         Length:344         Length:344         Length:344        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Date Egg         Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
 Length:344         Min.   :32.10      Min.   :13.10     Min.   :172.0      
 Class :character   1st Qu.:39.23      1st Qu.:15.60     1st Qu.:190.0      
 Mode  :character   Median :44.45      Median :17.30     Median :197.0      
                    Mean   :43.92      Mean   :17.15     Mean   :200.9      
                    3rd Qu.:48.50      3rd Qu.:18.70     3rd Qu.:213.0      
                    Max.   :59.60      Max.   :21.50     Max.   :231.0      
                    NA's   :2          NA's   :2         NA's   :2          
 Body Mass (g)      Sex            Delta 15 N (o/oo) Delta 13 C (o/oo)
 Min.   :2700   Length:344         Min.   : 7.632    Min.   :-27.02   
 1st Qu.:3550   Class :character   1st Qu.: 8.300    1st Qu.:-26.32   
 Median :4050   Mode  :character   Median : 8.652    Median :-25.83   
 Mean   :4202                      Mean   : 8.733    Mean   :-25.69   
 3rd Qu.:4750                      3rd Qu.: 9.172    3rd Qu.:-25.06   
 Max.   :6300                      Max.   :10.025    Max.   :-23.79   
 NA's   :2                         NA's   :14        NA's   :13       
   Comments        
 Length:344        
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Finally the add-on package skimr provides the function skimr::skim() provides an easy to view set of summaries including column types, completion rate, number of unique variables in each column and similar statistical summaries along with a small histogram for each numeric variable.

library(skimr)
skim(penguins_raw)
Data summary
Name penguins_raw
Number of rows 344
Number of columns 17
_______________________
Column type frequency:
character 10
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
studyName 0 1.00 7 7 0 3 0
Species 0 1.00 33 41 0 3 0
Region 0 1.00 6 6 0 1 0
Island 0 1.00 5 9 0 3 0
Stage 0 1.00 18 18 0 1 0
Individual ID 0 1.00 4 6 0 190 0
Clutch Completion 0 1.00 2 3 0 2 0
Date Egg 0 1.00 10 10 0 50 0
Sex 11 0.97 4 6 0 2 0
Comments 290 0.16 18 68 0 10 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sample Number 0 1.00 63.15 40.43 1.00 29.00 58.00 95.25 152.00 ▇▇▆▅▃
Culmen Length (mm) 2 0.99 43.92 5.46 32.10 39.23 44.45 48.50 59.60 ▃▇▇▆▁
Culmen Depth (mm) 2 0.99 17.15 1.97 13.10 15.60 17.30 18.70 21.50 ▅▅▇▇▂
Flipper Length (mm) 2 0.99 200.92 14.06 172.00 190.00 197.00 213.00 231.00 ▂▇▃▅▂
Body Mass (g) 2 0.99 4201.75 801.95 2700.00 3550.00 4050.00 4750.00 6300.00 ▃▇▆▃▂
Delta 15 N (o/oo) 14 0.96 8.73 0.55 7.63 8.30 8.65 9.17 10.03 ▃▇▆▅▂
Delta 13 C (o/oo) 13 0.96 -25.69 0.79 -27.02 -26.32 -25.83 -25.06 -23.79 ▆▇▅▅▂

Q Based on our summary functions are any variables assigned to the wrong data type (should be character when numeric or vice versa)?

Although some columns like date might not be correctly treated as character variables, they are not strictly numeric either, all other columns appear correct

3.3.1.1 Missing Data

Q Based on our summary functions do we have complete data for all variables?

No, they are 2 missing data points for body measurements (culmen, flipper, body mass), 11 missing data points for sex, 13/14 missing data points for blood isotopes (Delta N/C) and 290 missing data points for comments

Sometimes we want to remove rows where values are missing, this will be covered later - for now we just want to be aware of where we do not have complete records.

3.3.2 Activity 2: Clean column names

# CHECK DATA----
# check the data
colnames(penguins_raw)
#__________________________----
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           

When we run colnames() we get the identities of each column in our dataframe

  • Study name: an identifier for the year in which sets of observations were made

  • Region: the area in which the observation was recorded

  • Island: the specific island where the observation was recorded

  • Stage: Denotes reproductive stage of the penguin

  • Individual ID: the unique ID of the individual

  • Clutch completion: if the study nest observed with a full clutch e.g. 2 eggs

  • Date egg: the date at which the study nest observed with 1 egg

  • Culmen length: length of the dorsal ridge of the bird’s bill (mm)

  • Culmen depth: depth of the dorsal ridge of the bird’s bill (mm)

  • Flipper Length: length of bird’s flipper (mm)

  • Body Mass: Bird’s mass in (g)

  • Sex: Denotes the sex of the bird

  • Delta 15N : the ratio of stable Nitrogen isotopes 15N:14N from blood sample

  • Delta 13C: the ratio of stable Carbon isotopes 13C:12C from blood sample

3.3.2.1 Clean column names

Often we might want to change the names of our variables. They might be non-intuitive, or too long. Our data has a couple of issues:

  • Some of the names contain spaces

  • Some of the names have capitalised letters

  • Some of the names contain brackets

R is case-sensitive and also doesn’t like spaces or brackets in variable names, because of this we have been forced to use backticks `Sample Number` to prevent errors when using these column names

# CLEAN DATA ----

# clean all variable names to snake_case 
# using the clean_names function from the janitor package
# note we are using assign <- 
# to overwrite the old version of penguins 
# with a version that has updated names
# this changes the data in our R workspace 
# but NOT the original csv file

# clean the column names
# assign to new R object
penguins_clean <- janitor::clean_names(penguins_raw) 

# quickly check the new variable names
colnames(penguins_clean) 
 [1] "study_name"        "sample_number"     "species"          
 [4] "region"            "island"            "stage"            
 [7] "individual_id"     "clutch_completion" "date_egg"         
[10] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
[13] "body_mass_g"       "sex"               "delta_15_n_o_oo"  
[16] "delta_13_c_o_oo"   "comments"         

We can combine data import and name repair in a single step if we want to:

penguins_clean <- read_csv ("data/penguins_raw.csv",
                      name_repair = janitor::make_clean_names)

3.3.2.2 Rename columns (manually)

The clean_names function quickly converts all variable names into snake caseSnake case is a naming convention in computing that uses underscores to replace spaces between words, and writes words in lowercase. It’s commonly used for variable names, filenames, and database table and column names.. The N and C blood isotope ratio names are still quite long though, so let’s clean those with dplyr::rename() where “new_name” = “old_name”.

# shorten the variable names for isotope blood samples
# use rename from the dplyr package
penguins_clean <- rename(penguins_clean,
         "delta_15n"="delta_15_n_o_oo",  
         "delta_13c"="delta_13_c_o_oo")

3.3.2.3 Rename text values manually

Sometimes we may want to rename the values in our variables in order to make a shorthand that is easier to follow. This is changing the values in our columns, not the column names.

# use mutate and case_when 
# for a statement that conditionally changes 
# the names of the values in a variable
penguins <- penguins_clean |> 
  mutate(species = case_when(
  species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
  species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
  species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
  .default = as.character(species)
  )
  )
# use mutate and if_else
# for a statement that conditionally changes 
# the names of the values in a variable
penguins <- penguins |> 
  mutate(sex = if_else(
    sex == "MALE", "Male", "Female"
  )
  )
Warning

Notice from here on out I am assigning the output of my code to the R object penguins, this means any new code “overwrites” the old penguins dataframe. This is because I ran out of new names I could think of, its also because my Environment is filling up with lots of data frame variants.

Be aware that when you run code in this way, it can cause errors if you try to run the same code twice e.g. in the example above once you have changed MALE to Male, running the code again could cause errors as MALE is no longer present!

If you make any mistakes running code in this way, re-start your R session and run the code from the start to where you went wrong.

Have you checked that the above code block worked? Inspect your new tibble and check the variables have been renamed as you wanted.

3.3.2.4 Rename text values with stringr

Datasets often contain words, and we call these words “(character) strings”.

Often these aren’t quite how we want them to be, but we can manipulate these as much as we like. Functions in the package stringr, are fantastic. And the number of different types of manipulations are endless!

Below we repeat the outcomes above, but with string matching:

# use mutate and case_when 
# for a statement that conditionally changes 
# the names of the values in a variable
penguins <- penguins_clean |> 
  mutate(species = stringr::word(species, 1)
  ) |> 
  mutate(sex = stringr::str_to_title(sex))

Alternatively we could decide we want simpler species names but that we would like to keep the latin name information, but in a separate column. To do this we are using regex. Regular expressions are a concise and flexible tool for describing patterns in strings

penguins_clean |> 
    separate(
        species,
        into = c("species", "full_latin_name"),
        sep = "(?=\\()"
    ) 
study_name sample_number species full_latin_name region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments
PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/2007 39.1 18.7 181 3750 MALE NA NA Not enough blood for isotopes.
PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/2007 39.5 17.4 186 3800 FEMALE 8.94956 -24.69454 NA
PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 16/11/2007 40.3 18.0 195 3250 FEMALE 8.36821 -25.33302 NA
PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 16/11/2007 NA NA NA NA NA NA NA Adult not sampled.
PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 16/11/2007 36.7 19.3 193 3450 FEMALE 8.76651 -25.32426 NA
PAL0708 6 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 16/11/2007 39.3 20.6 190 3650 MALE 8.66496 -25.29805 NA

3.3.3 Activity 2: Checking for duplications

It is very easy when inputting data to make mistakes, copy something in twice for example, or if someone did a lot of copy-pasting to assemble a spreadsheet (yikes!). We can check this pretty quickly

# check for whole duplicate 
# rows in the data
penguins |> 
  duplicated() |>  
  sum() 
[1] 0

Great!

If I did have duplications I could investigate further and extract these exact rows:

# Inspect duplicated rows
penguins |> 
    filter(duplicated(penguins))
A tibble:0 × 17
0 rows | 1-8 of 17 columns
# Keep only unduplicated data
penguins |> 
    filter(!duplicated(penguins))
study_name sample_number species region island stage individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex delta_15n delta_13c comments
PAL0708 1 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/2007 39.1 18.7 181 3750 Male NA NA Not enough blood for isotopes.
PAL0708 2 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/2007 39.5 17.4 186 3800 Female 8.94956 -24.69454 NA
PAL0708 3 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 16/11/2007 40.3 18.0 195 3250 Female 8.36821 -25.33302 NA
PAL0708 4 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 16/11/2007 NA NA NA NA NA NA NA Adult not sampled.
PAL0708 5 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 16/11/2007 36.7 19.3 193 3450 Female 8.76651 -25.32426 NA
PAL0708 6 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 16/11/2007 39.3 20.6 190 3650 Male 8.66496 -25.29805 NA

3.3.4 Activity 3: Checking for typos

We can also look for typos by asking R to produce all of the distinct values in a variable. This is more useful for categorical data, where we expect there to be only a few distinct categories

# Print only unique character strings in this variable
penguins |>  
  distinct(sex)
sex
Male
Female
NA

Here if someone had mistyped e.g. ‘FMALE’ it would be obvious. We could do the same thing (and probably should have before we changed the names) for species.

We can also trim leading or trailing empty spaces with stringr::str_trim. These are often problematic and difficult to spot e.g.

df2 <- tibble(label=c("penguin", " penguin", "penguin ")) 
df2 # make a test dataframe
label
penguin
penguin
penguin

We can easily imagine a scenario where data is manually input, and trailing or leading spaces are left in. These are difficult to spot by eye - but problematic because as far as R is concerned these are different values. We can use the function distinct to return the names of all the different levels it can find in this dataframe.

df2 |> 
  distinct()
label
penguin
penguin
penguin

If we pipe the data throught the str_trim function to remove any gaps, then pipe this on to distinct again - by removing the whitespace, R now recognises just one level to this data.

df2 |> 
  mutate(label=str_trim(label, side="both")) |> 
  distinct()
label
penguin

3.4 Working with dates

Working with dates can be tricky, treating date as strictly numeric is problematic, it won’t account for number of days in months or number of months in a year.

Additionally there’s a lot of different ways to write the same date:

  • 13-10-2019

  • 10-13-2019

  • 13-10-19

  • 13th Oct 2019

  • 2019-10-13

This variability makes it difficult to tell our software how to read the information, luckily we can use the functions in the lubridate package.

If you get a warning that some dates could not be parsed, then you might find the date has been inconsistently entered into the dataset.

Pay attention to warning and error messages

Depending on how we interpret the date ordering in a file, we can use ymd(), ydm(), mdy(), dmy()

  • Question What is the appropriate function from the above to use on the date_egg variable?
penguins <- penguins |>
  mutate(date_egg = lubridate::dmy(date_egg))
# Dates have been provided in day/month/year

Here we use the mutate function from dplyr to create a new variable called date_egg_proper based on the output of converting the characters in date_egg to date format. The original variable is left intact, if we had specified the “new” variable was also called date_egg then it would have overwritten the original variable.

Once we have established our date data, we are able to perform calculations or extract information. Such as the date range across which our data was collected.

3.4.1 Calculations with dates

penguins |> 
  summarise(min_date=min(date_egg),
            max_date=max(date_egg))
ImportantCheck your outputs

Check that this code is producing sensible outputs - it will not work if you have skipped making the necessary column for dates date_egg

We can also extract and make new columns from our date column - such as a simple column of the year when each observation was made:

penguins <- penguins |> 
  mutate(year = lubridate::year(date_egg))

3.5 Factors

In R, factors are a class of data that allow for ordered categories with a fixed set of acceptable values.

Typically, you would convert a column from character to a factor if you want to set an intrinsic order to the values (“levels”) so they can be displayed non-alphabetically in plots and tables, or for use in linear model analyses (more on this later).

Factors provide order to categorical data - you should not assign factors to numeric data - the exception might be if you wanted to bin data into groups (see below)

Working with factors is easy with the forcats package:

Using dplyr::across() - we can apply functions to columns based on selected criteria - here within mutate we are changing each column in the .cols argument and applying the function forcats::as_factor()

penguins |> 
  mutate(
    across(.cols = c("species", "region", "island", "stage", "sex"),
           .fns = forcats::as_factor)
  ) |> 
  select(where(is.factor)) |> 
  glimpse()
Rows: 344
Columns: 5
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie…
$ region  <fct> Anvers, Anvers, Anvers, Anvers, Anvers, Anvers, Anvers, Anvers…
$ island  <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
$ stage   <fct> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stag…
$ sex     <fct> Male, Female, Female, NA, Female, Male, Female, Male, NA, NA, …
Important

Unless we assign the output of this code to an R object it will just print into the console, in the above I am demonstrating how to change variables to factors but we aren’t “saving” this change.

3.5.1 Setting factor levels

If we want to specify the correct order for a factor we can use forcats::fct_relevel.

In this example we have decided to categorise our penguins into groups - “smol”, “mid” and “chonk” based on mass (g), first we need to create a new column (mass range).

penguins <- penguins |> 
  mutate(mass_range = case_when(
    body_mass_g <= 3500 ~ "smol penguin",
    body_mass_g >3500 & body_mass_g < 4500 ~ "mid penguin",
    body_mass_g >= 4500 ~ "chonk penguin",
    .default = NA)
  )

If we make a barplot, the order of the values on the x axis will typically be in alphabetical order for any character data

penguins |> 
  drop_na(mass_range) |> 
  ggplot(aes(x = mass_range))+
  geom_bar()

To convert a character or numeric column to class factor, you can use any function from the forcats package. They will convert to class factor and then also perform or allow certain ordering of the levels - for example using forcats::fct_relevel() lets you manually specify the level order.

The function as_factor() simply converts the class without any further capabilities.

penguins <- penguins |> 
  mutate(mass_range = as_factor(mass_range))
levels(penguins$mass_range)
[1] "mid penguin"   "smol penguin"  "chonk penguin"

Below we use mutate() and as_factor() to convert the column flipper_range from class character to class factor.

# Correct the code in your script with this version
penguins <- penguins |> 
  mutate(mass_range = fct_relevel(mass_range, 
                                  "smol penguin", 
                                  "mid penguin", 
                                  "chonk penguin")
         )

levels(penguins$mass_range)
[1] "smol penguin"  "mid penguin"   "chonk penguin"

Now when we call a plot, we can see that the x axis categories match the intrinsic order we have specified with our factor levels.

penguins |> 
  drop_na(mass_range) |>  
  ggplot(aes(x = mass_range))+
  geom_bar()

Factors will also be important when we build linear models a bit later. The reference or intercept for a categorical predictor variable when it is read as a <chr> is set by R as the first one when ordered alphabetically. This may not always be the most appropriate choice, and by changing this to an ordered <fct> we can manually set the intercept.

3.6 Summary

In this chapter we have successfully imported and checked our data for typos and small errors, we have also been introduce to some of the key functions in the dplyr package for data wrangling. Now that we have confidence in the format and integrity of our data, next time we will start to make insights and understand patterns.

3.6.1 Save scripts & data

  • Having produced a handy cleaned penguins dataset, we could choose to save it - this will reduce computation in the future as we have a tidied dataset we can work with directly. To do this we can use the readr::write_csv() function:
# Write our cleaned dataframe to the processed data subfolder
write_csv(penguins,
          "data/processed/penguins_clean.csv")
  • Make sure you have saved your script 💾 and given it the filename 01_import_penguins_data.R it should be saved in your scripts folder.

Don’t just check the content of your script is the same as the example below. Check that layouts are clear and comments have been included.

#___________________________----
# SET UP ----
# An analysis of the bill dimensions of male and female 
# Adelie, Gentoo and Chinstrap penguins

# Data first published in  Gorman, KB, TD Williams, and WR Fraser. 
# 2014. 
# “Ecological Sexual Dimorphism and Environmental Variability 
# Within a Community of Antarctic Penguins (Genus Pygoscelis).” 
# PLos One 9 (3): e90081.
# https://doi.org/10.1371/journal.pone.0090081. 
#__________________________----

# PACKAGES ----
library(tidyverse) # tidy data packages
library(here) # organised file paths
library(janitor) # cleans variable names
#__________________________----

# IMPORT DATA ----
penguins_raw <- read_csv(here("data", "raw", "penguins_raw.csv"))

# check the data has loaded, prints first 10 rows of dataframe
penguins_raw
#__________________________----
# CLEAN DATA ----

# clean all variable names to snake_case
# using the clean_names function from the janitor package
# note we are using assign <-
# to overwrite the old version of penguins
# with a version that has updated names
# this changes the data in our R workspace
# but NOT the original csv file

# clean the column names
# assign to new R object
penguins_clean <- janitor::clean_names(penguins_raw)

# quickly check the new variable names
colnames(penguins_clean)

# shorten the variable names for N and C isotope blood samples

penguins_clean <- rename(penguins_clean,
  "delta_15n" = "delta_15_n_o_oo", # use rename from the dplyr package
  "delta_13c" = "delta_13_c_o_oo"
)

# use mutate and case_when for a statement that conditionally changes the names of the values in a variable
penguins <- penguins_clean |>
  mutate(species = case_when(
    species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
    species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
    species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap"
  ))

# use mutate and if_else
# for a statement that conditionally changes
# the names of the values in a variable
penguins <- penguins |>
  mutate(sex = if_else(
    sex == "MALE", "Male", "Female"
  ))

# use lubridate to format date and extract the year
penguins <- penguins |>
  mutate(date_egg = lubridate::dmy(date_egg))

penguins <- penguins |>
  mutate(year = lubridate::year(date_egg))

# Set body mass ranges
penguins <- penguins |>
  mutate(mass_range = case_when(
    body_mass_g <= 3500 ~ "smol penguin",
    body_mass_g > 3500 & body_mass_g < 4500 ~ "mid penguin",
    body_mass_g >= 4500 ~ "chonk penguin",
    .default = NA
  ))

# Assign these to an ordered factor

penguins <- penguins |>
  mutate(mass_range = fct_relevel(
    mass_range,
    "smol penguin",
    "mid penguin",
    "chonk penguin"
  ))

# Write our cleaned dataframe to the processed data subfolder
write_csv(
  penguins,
  "data/processed/penguins_clean.csv"
)
  • Does your workspace look like the below?

My neat project layout

My scripts and file subdirectory

The best way to make sure your script contains no errors is to run it! - First restart R and make sure that your Environment is empty

  • Then in your console type the following command
source("scripts/01_import_penguins_data.R")
  • Check your console for errors

3.6.2 Checklist for data checking

3.7 Activity: Test yourself

Question 1. In order to subset a data by rows I should use the function

Question 2. In order to subset a data by columns I should use the function

Question 3. In order to make a new column I should use the function

Question 4. Which operator should I use to send the output from line of code into the next line?

Question 5. What will be the outcome of the following line of code?

penguins |> 
  filter(species == "Adelie")

Unless the output of a series of functions is “assigned” to an object using <- it will not be saved, the results will be immediately printed. This code would have to be modified to the below in order to create a new filtered object penguins_filtered

penguins_filtered <- penguins |> 
  filter(species == "Adelie")


Question 6. What is the main point of a data “pipe”?

Question 7. The naming convention outputted by the function `janitor::clean_names() is

Question 8. Which package provides useful functions for manipulating character strings?

Question 9. Which package provides useful functions for manipulating dates?

Question 10. If we do not specify a character variable as a factor, then ordering will default to what?

3.8 Glossary

term definition
blank slate R projects are set not to store their .Rhistory file, which means everything required to recreate your analysis is contained in your scripts.
console The R console is the interactive interface within the R environment where users can type and execute R code. It is the place where you can directly enter commands, see their output, and interact with the R programming language in real-time.
pipe An operator that allows you to chain multiple functions together in a sequence.
snake case Snake case is a naming convention in computing that uses underscores to replace spaces between words, and writes words in lowercase. It's commonly used for variable names, filenames, and database table and column names.
tidy data Tidy data refers to a specific format for organizing datasets that makes data easier to work with for analysis and visualization in R, especially using the tidyverse. The concept was popularized by Hadley Wickham in his paper "Tidy Data" and is an essential principle for effective data manipulation.

3.9 Reading