3  Data Management

3.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand the limitations with messy/inconsistent data

  • Apply a set of rules to maintain a well organised spreadsheet

  • Write a simple data dictionary

Question

  • What has been your biggest spreadsheet headache?

  • Have you ever struggled to understand a spreadsheet because it has:

    • Inconsistent names
    • Unclear codings
    • Notes or annotations without clear meaning

Why we should all be data quality detectives

3.2 What makes a well organised spreadsheet?

Question

What are the issues with this dataset?

3.2.1 Example

  • Even though are data is extremely messy, and has a lot of inconsistencies - we can clean and reformat our data with clear code.

  • By keeping our data cleaning and tidying to the commmand line we ensure that our work is completely reproducible, with a set of documented steps available at all times.

library(readxl)

data_ranges <- c("A3:F15", "I3:N15", "Q3:V9") # gather each table as a data range

lines <- c("A","B","C") # set names for each experiment

clean_data <-  map2_dfr(data_ranges, lines, ~ read_excel("data/raw/Data.xlsx", # combined all three tables in one tidy dataframe
                                  range = .x,
                                  na = c("ND", "-"), # choose characters that match NA
                                  .name_repair = janitor::make_clean_names) |>  # standardise names
                          mutate(line = .y) # add line information to data
)
  
clean_data |> 
  mutate(genotype = str_remove(genotype, "A-")) |> # fix genotype names
  fill(genotype) # fill in blanks for genotype in line C
genotype tray sex larvae pupae adults line
WT 6 male 200 101 100 A
WT 6 female 200 93 92 A
WT 5 male 200 108 108 A
WT 5 female 200 85 84 A
HET 4 male 200 92 91 A
HET 4 female 200 95 95 A

3.3 Data Organisation

Based on Broman & Woo (2018), Data Organization in Spreadsheets,

Reproducibility in R begins long before you load a file. Poorly structured data silently corrupts analysis, whereas disciplined organization lets you import, wrangle, and model without ad hoc cleaning

3.3.1 Consistency

Principle: Every categorical code, variable name, and missing-value marker must be used consistently. Inconsistency leads to type coercion, ambiguous factors, and failed joins.

  • Inconsistent
library(dplyr)

data <- tibble(
  id = 1:5,
  sex = c("male", "Male", "M", "female", "F"),
  glucose = c(NA, 5.2, 6.1, 5.7, "")
)

data
id sex glucose
1 male NA
2 Male 5.2
3 M 6.1
4 female 5.7
5 F
  • Consistent
data_clean <- data  |> 
  mutate(
    sex = recode(tolower(sex),
                 "m" = "male",
                 "f" = "female"),
    glucose = na_if(glucose, "")
  )

data_clean
id sex glucose
1 male NA
2 male 5.2
3 male 6.1
4 female 5.7
5 female NA

3.3.2 Variable and file naming

Principle: Names are part of your analytical grammar. They must be readable, regular, and machine-safe.

  • Do:

    • Keep your filenames to a reasonable length

    • Use underscores and hyphens for readability

    • Start or end your filename with a letter or number

    • Use a standard date format when applicable; example: YYYY-MM-DD

    • Use filenames for related files that work well with default ordering; example: in chronological order, or logical order using numbers first

2025_04_10_lineA_fecundity_rep2.csv

2025_04_10_lineA_fecundity_analysis.R
  • Don’t:

    • Use unnecessary additional characters in filenames

    • Use spaces or “illegal” characters; examples: &, %, #, <, or >

    • Start or end your filename with a symbol

    • Use incomplete or inconsistent date formats; example: M-D-YY

    • Use filenames for related files that do not work well with default ordering; examples: a random system of numbers or date formats, or using letters first

4102020marchfecundity<workinprogress>.R

 _20210320*newlineinfo.csv

 firstfile_for_analysis.R

See Chapter 7

3.3.3 Dates and Times

Principle: Always store dates in a consistent format e.g. ddmmyyyy

See Chapter 11

3.3.4 One variable per column

Principle: Each cell holds one atomic value; never embed multiple attributes in one field.

See Appendix C

  • Not this:
library(tidyr)

data_messy <- tibble(id = 1:3, combined = c("adult_male", "juvenile_female", "adult_female"))
data_messy
id combined
1 adult_male
2 juvenile_female
3 adult_female
  • But This:
data_clean <- separate(data_messy, combined, into = c("age_group", "sex"), sep = "_")
data_clean
id age_group sex
1 adult male
2 juvenile female
3 adult female

3.3.5 Rectangular layout

Principle: Keep your data rectangular: one header row, variables in columns, observations in rows.

tidy data overview

3.3.6 Preserve Raw Data

  • Never alter the raw data collection files, every time this occurs errors can be introduced and they may not be trackable.

  • Make backups!

3.3.7 No Excel formatting as Data

Principle: Encoded information via colour blocks or fonts are invisible to R and ambiguous to other users.

excel colour blocks

please don’t do this

3.3.8 Metadata and Data Dictionaries

Principle: Every dataset requires a companion dictionary documenting meaning, units, and type.

Examples of data dictionaries

  • Example data:
  • A simple data dictionary:

3.3.9 Validation Checks

  • Every import should be followed by automated sanity checks

    • Are there fewer eggs hatching than eggs laid?

    • Are there negative valuse for body weight?

see ?sec-validate

3.3.10 File formats and version control

  • Where possible use plain text (csv/tsv) and version control for transparency

3.4 Summary

Disciplined data organization eliminates ambiguity, allows seamless import into R, and ensures that analysis code represents all transformations transparently.

Having a set of company values around data quality and standardisation where data quality is everyones responsibility helps prevent errors, and reduce the risk of information being lost when researchers transition across to new roles.

We should all be data detectives!