6 Task Three - Data Cleaning

6.1 Objective:

In biological research, we often summarise data to understand patterns, make comparisons, and draw conclusions. However, data are rarely perfect: samples may be biased, measurements may contain errors or outliers, and important variables may be missing. Understanding these issues is critical for interpreting results accurately.

In this worksheet, you will explore three realistic biological datasets, each containing a different type of bias. For each dataset you should:

  • Compute descriptive statistics (mean, median, standard deviation, counts).

  • Make one appropriate plot (for example a boxplot, histogram, or scatterplot).

  • Detect and comment on any potential biases in the data, and reason about their effect on conclusions.
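
As a rough illustration of the descriptive statistics listed above, the following minimal sketch uses Python's standard `statistics` module. The values in `masses` are hypothetical and are not taken from any of the worksheet datasets; the variable names are assumptions made for this example only.

```python
import statistics

# Hypothetical body-mass measurements in grams (NOT worksheet data).
masses = [12.0, 14.0, 15.0, 15.0, 19.0]

summary = {
    "count": len(masses),
    "mean": statistics.mean(masses),
    "median": statistics.median(masses),
    # stdev() is the sample standard deviation (divides by n - 1).
    "sd": statistics.stdev(masses),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting the count next to the mean and standard deviation matters here, because a summary based on very few observations is exactly where sampling problems tend to hide.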

Important

You will not be fitting statistical models yet. The goal is to summarise, visualise, and interpret the data while thinking carefully about bias.

6.2 The datasets:

Dataset              Biological Context
Bird Body Mass       Birds in different habitats
Glucose & Ketones    Human blood measurements
Cytokine Response    Immune response study

6.2.1 Dataset 1: Bird Body Mass Across Habitats

Bias Type: Sampling Bias | Complexity: Easy

Birds from three different habitats (Forest, Wetland, and Grassland) were weighed to study differences in body mass. However, due to field constraints, some habitats may be under-sampled, so their birds are underrepresented in the data. Because a pooled mean weights every bird equally, this under-sampling can bias the overall mean body mass towards the better-sampled habitats. Your task is to explore how habitat representation affects the summary statistics and to visualise the differences across habitats.
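
One possible way to see the effect of uneven sampling is to compare the pooled mean (every bird counts equally) with the mean of the habitat means (every habitat counts equally). The sketch below assumes made-up masses in which Grassland is deliberately under-sampled; the numbers and the names `birds`, `pooled_mean`, and `balanced_mean` are illustrative assumptions, not the worksheet data or a prescribed method.

```python
import statistics

# Hypothetical masses in grams (NOT worksheet data).
# Grassland is deliberately under-sampled to mimic the field constraint.
birds = {
    "Forest": [20.0, 22.0, 24.0, 22.0, 21.0, 23.0],
    "Wetland": [30.0, 32.0, 34.0, 32.0],
    "Grassland": [50.0, 54.0],
}

# Per-habitat counts and means expose the uneven representation.
habitat_n = {habitat: len(m) for habitat, m in birds.items()}
habitat_mean = {habitat: statistics.mean(m) for habitat, m in birds.items()}

# Pooled mean: every bird counts equally, so well-sampled habitats dominate.
all_masses = [x for m in birds.values() for x in m]
pooled_mean = statistics.mean(all_masses)

# Mean of habitat means: every habitat counts equally.
balanced_mean = statistics.mean(list(habitat_mean.values()))

print("n per habitat:   ", habitat_n)
print("mean per habitat:", habitat_mean)
print("pooled mean:     ", round(pooled_mean, 2))
print("balanced mean:   ", round(balanced_mean, 2))
```

In this toy example the pooled mean sits below the balanced mean because the heaviest habitat contributes the fewest birds. Neither number is "the" correct answer; the gap between them is simply a quick indicator of how much habitat representation is driving the overall summary.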

6.2.2 Dataset 2: Blood Glucose and Ketone Levels

You have measurements of fasting blood glucose and blood ketone levels from a human study. The blood test devices are not always accurate, so glucose levels above 30 should be treated as measurement errors and discarded. In addition, some ketone values are recorded as zero because of machine detection limits, so a zero does not necessarily mean that no ketones were present. Your task is to summarise these two variables, explore the relationship between them, and assess how outliers and detection limits affect your results.
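
The cleaning steps described above could be sketched roughly as follows. The readings are hypothetical, the worksheet does not state units, and the names `readings` and `GLUCOSE_MAX` are assumptions for this example. Only the cut-off of 30 comes from the text.

```python
import statistics

# Hypothetical (glucose, ketone) pairs (NOT worksheet data; units unspecified).
readings = [
    (5.0, 0.2), (6.0, 0.0), (5.5, 0.4),
    (45.0, 0.3), (7.5, 0.0), (6.0, 0.6),
]

GLUCOSE_MAX = 30  # cut-off stated in the worksheet

# Drop whole rows with implausible glucose so the pairs stay aligned.
kept = [(g, k) for g, k in readings if g <= GLUCOSE_MAX]

glucose_mean_raw = statistics.mean([g for g, _ in readings])
glucose_mean_clean = statistics.mean([g for g, _ in kept])

# Ketone zeros may be values below the detection limit, not true zeros.
ketones = [k for _, k in kept]
n_zero = sum(1 for k in ketones if k == 0)
ketone_mean_with_zeros = statistics.mean(ketones)
ketone_mean_without_zeros = statistics.mean([k for k in ketones if k > 0])

print("glucose mean, raw vs cleaned:", glucose_mean_raw, glucose_mean_clean)
print("ketone zeros:", n_zero, "of", len(ketones))
print("ketone mean with zeros:   ", round(ketone_mean_with_zeros, 3))
print("ketone mean without zeros:", round(ketone_mean_without_zeros, 3))
```

A single implausible glucose reading is enough to shift the mean substantially, which is why comparing the mean with the median is useful here. For the ketones, keeping the zeros pulls the mean down and dropping them pulls it up, so it is worth reporting how many zeros there are rather than silently choosing one treatment.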

6.2.3 Dataset 3: Cytokine Response to a Treatment

Participants underwent a treatment designed to increase cytokine levels, which were measured before and after treatment. Your task is to explore how cytokine levels change from before to after treatment, record any patterns in the missing data, and consider how omitted factors and participant dropouts could bias interpretation.
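
A minimal sketch of the before/after comparison with missing data is given below. It assumes hypothetical paired measurements in which `None` marks a missing post-treatment value; the data layout and all variable names are assumptions for illustration and may not match the structure of the actual dataset.

```python
import statistics

# Hypothetical (before, after) cytokine levels (NOT worksheet data).
# None marks a missing post-treatment measurement, i.e. a dropout.
participants = [
    (10.0, 14.0), (12.0, 15.0), (8.0, None),
    (11.0, 16.0), (9.0, None), (13.0, 15.0),
]

complete = [(b, a) for b, a in participants if a is not None]
dropouts = [(b, a) for b, a in participants if a is None]
n_missing = len(dropouts)

# Within-participant change, computed on complete pairs only.
changes = [a - b for b, a in complete]
mean_change = statistics.mean(changes)

# Compare baselines: if dropouts started at different levels,
# the completers no longer represent the whole group.
baseline_completers = statistics.mean([b for b, _ in complete])
baseline_dropouts = statistics.mean([b for b, _ in dropouts])

print("missing post-treatment values:", n_missing)
print("mean change (complete pairs): ", mean_change)
print("baseline, completers vs dropouts:",
      baseline_completers, baseline_dropouts)
```

In this toy example the dropouts have lower baseline levels than the completers, so the mean change is estimated from a subgroup that differs from the original sample. Checking for this kind of pattern is a descriptive step only and does not by itself show why participants dropped out.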