4 Task Two - Data Cleaning
4.1 Objective: Practice cleaning a messy dataset
This week’s assignment is to use simple dplyr commands to take a messy dataset, understand it and clean it.
Important: Using AI
For this exercise we are asking you not to use LLMs such as Copilot or ChatGPT.
This is your opportunity to build familiarity with the process of thinking about your data and learning basic R functions.
It is tempting to dump code into an AI and ask for a fix, but please hold off: it will benefit you to try to understand it for yourself.
4.1.1 Step 1: Download the Scripts
Access Blackboard → Week 2 → “Cleaning Data”
Download the R scripts to your local machine.
4.1.2 Step 2: Upload to Posit Cloud
Open your script in a NEW project.
Upload the downloaded scripts into the project.
Think organised project!
4.1.3 Step 3: Checklist
4.1.3.1 Set up:
- Apply snake_case: janitor::clean_names().
- Manually rename awkward or long columns: dplyr::rename().
- Ensure no spaces, punctuation, or case inconsistencies that force backticks.
- Use sensible and clear names.
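As a rough illustration of the column-name items above, here is a minimal sketch on a made-up data frame. The data, the column names, and the object names `messy` and `tidy` are all hypothetical and are not taken from the assignment scripts; the sketch only assumes that the dplyr and janitor packages are installed.

```r
library(dplyr)
library(janitor)

# Hypothetical toy data with awkward column names (NOT the assignment dataset).
# check.names = FALSE stops data.frame() from altering the names itself.
messy <- data.frame(
  `Participant ID` = c(1, 2, 3),
  `Body Mass (g)`  = c(3750, 3800, 3250),
  `SPECIES`        = c("Adelie", "Adelie", "Gentoo"),
  check.names = FALSE
)

names(messy)  # names with spaces or brackets would need backticks in code

tidy <- messy %>%
  janitor::clean_names() %>%              # convert every name to snake_case
  dplyr::rename(body_mass = body_mass_g)  # rename() takes new_name = old_name

names(tidy)   # "participant_id" "body_mass" "species"
```

Running `names()` before and after is a quick way to confirm that no column still needs backticks. Which columns are worth renaming by hand in your own dataset is a judgement call for you to make.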
4.1.3.2 Inspect structure & quality:
4.1.3.3 Standardise values (within columns)
4.1.3.4 Check duplicates & integrity
4.1.3.5 Factors & ordering
4.1.3.6 Summarise & validate
4.1.3.7 Readable code
4.1.4 Step 4: Submission / Discussion
- Submit your fixed scripts to Blackboard.