4  Task Two - Data Cleaning

4.1 Objective: Practise cleaning a messy dataset

This week’s assignment is to use simple dplyr commands to take a messy dataset, understand it and clean it.

Important: Using AI
  • For this exercise we are asking you not to use LLMs such as Copilot or ChatGPT.

  • This is your opportunity to build familiarity with the process of thinking about your data and to learn basic R functions.

  • It is tempting to dump code into AI and ask for a fix, but please hold off. It will benefit you to try to understand the problem for yourself.

4.1.1 Step 1: Download the Scripts

  • Access Blackboard → Week 2 → “Cleaning Data”

  • Download the R scripts to your local machine.

4.1.2 Step 2: Upload to Posit Cloud

  • Create a NEW project.

  • Upload the downloaded scripts into the project and open them there.

Think organised project!

4.1.3 Step 3: Checklist

4.1.3.1 Set up:

    • Apply snake_case: janitor::clean_names().
    • Manually rename awkward or long columns: dplyr::rename().
    • Ensure no spaces, punctuation, or case inconsistencies that force backticks.
    • Use sensible, clear names.
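
As a minimal sketch of the naming steps above: the data frame `messy` and its column names are invented for illustration and are not taken from the assignment scripts. It assumes the dplyr and janitor packages are installed.

```r
library(dplyr)
library(janitor)

# Hypothetical messy data: names contain spaces, punctuation and mixed case
messy <- data.frame(
  `Study ID`      = c(1, 2, 3),
  `Body Mass (g)` = c(3750, 3800, 3250),
  `Species.Name`  = c("Adelie", "Gentoo", "Chinstrap"),
  check.names = FALSE
)

tidy <- messy %>%
  janitor::clean_names() %>%            # convert every name to snake_case
  dplyr::rename(mass_g = body_mass_g)   # new_name = old_name

names(tidy)
```

After this, no column needs backticks, and `rename()` is only used where the automatic name is still long or awkward.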

4.1.3.2 Inspect structure & quality:
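
One possible way to approach this step is sketched below. The data frame `df` is hypothetical, and the choice of functions is a suggestion rather than the required method.

```r
library(dplyr)

# Hypothetical data with a missing value in each column
df <- data.frame(
  species = c("Adelie", "adelie ", "Gentoo", NA),
  mass_g  = c(3750, NA, 5000, 4200)
)

dplyr::glimpse(df)                # column types and the first few values
summary(df)                       # ranges for numeric columns, NA counts

na_counts <- colSums(is.na(df))   # number of missing values per column
na_counts

dplyr::count(df, species)         # listing distinct values exposes inconsistent spellings
```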

4.1.3.3 Standardise values (within columns)
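
A minimal sketch of standardising text values, assuming a hypothetical `sex` column with inconsistent case, stray spaces and abbreviations. The recoding rules are illustrative only; the right rules depend on what you find in your own data.

```r
library(dplyr)

# Hypothetical column with several spellings of the same two categories
df <- data.frame(sex = c("Male", " male", "FEMALE", "female ", "M"))

clean <- df %>%
  mutate(
    sex = tolower(trimws(sex)),          # remove outer whitespace, lower-case
    sex = case_when(
      sex %in% c("m", "male")   ~ "male",
      sex %in% c("f", "female") ~ "female",
      TRUE                      ~ NA_character_   # anything unexpected becomes NA
    )
  )

dplyr::count(clean, sex)
```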

4.1.3.4 Check duplicates & integrity
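
A short sketch of a duplicate check on a hypothetical data frame. It assumes `id` is meant to identify each row uniquely, which may not hold for your dataset.

```r
library(dplyr)

# Hypothetical data: the row for id 2 appears twice
df <- data.frame(
  id     = c(1, 2, 2, 3),
  mass_g = c(3750, 3800, 3800, 3250)
)

n_dupes <- sum(duplicated(df))   # count of fully duplicated rows
deduped <- distinct(df)          # keep one copy of each row

# Integrity check: after de-duplication, each id should appear exactly once
repeated_ids <- deduped %>%
  count(id) %>%
  filter(n > 1)
```

If `repeated_ids` has any rows, the same id carries conflicting values and needs a closer look before you drop anything.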

4.1.3.5 Factors & ordering
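
As an illustration, a categorical column can be given an explicit level order with base R's `factor()`. The `size` column and its levels are hypothetical.

```r
library(dplyr)

# Hypothetical ordered categories stored as plain text
df <- data.frame(size = c("small", "large", "medium", "small"))

df <- df %>%
  mutate(size = factor(size, levels = c("small", "medium", "large")))

levels(df$size)   # levels follow the stated order, not alphabetical order
```

Setting the levels explicitly controls the order used later in tables and plots.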

4.1.3.6 Summarise & validate
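
A minimal sketch of a grouped summary followed by a simple sanity check. The data and the rule that mass must be positive are assumptions made for the example.

```r
library(dplyr)

# Hypothetical cleaned data
df <- data.frame(
  species = c("adelie", "adelie", "gentoo"),
  mass_g  = c(3750, 3800, 5000)
)

summary_tbl <- df %>%
  group_by(species) %>%
  summarise(
    n           = n(),
    mean_mass_g = mean(mass_g, na.rm = TRUE),
    .groups     = "drop"
  )

summary_tbl

# Validation: stop with an error if any recorded mass is not positive
stopifnot(all(df$mass_g > 0, na.rm = TRUE))
```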

4.1.3.7 Readable code

4.1.4 Step 4: Submission / Discussion

  1. Submit your fixed scripts to Blackboard.