12  Factors

library(tidyverse)
penguins_clean_names <- readRDS(url("https://github.com/Philip-Leftwich/Oct-Intro-Analytics/raw/refs/heads/dev/data/penguin_clean_names.RDS"))

In R, factors are a class of data that allow for ordered categories with a fixed set of acceptable values.

Typically, you would convert a column from character or numeric class to a factor if you want to set an intrinsic order to the values (“levels”) so they can be displayed non-alphabetically in plots and tables, or for use in linear model analyses (more on this later).

levels(penguins_clean_names$species)
NULL
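
The NULL here is expected: species is not yet a factor (it is stored as text), so it has no levels. As a small illustration of what a factor adds, the sketch below uses a made-up vector (sizes is hypothetical, not part of the penguin data) and base R's factor() to fix both the set of acceptable values and their order.

```r
# Toy data, invented for illustration only
sizes <- c("small", "large", "medium", "small")

# A character vector has no levels
levels(sizes)

# Fix the acceptable values and their order explicitly
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_fct)

# Values outside the fixed set of levels become NA
odd_fct <- factor(c("small", "huge"), levels = c("small", "medium", "large"))
is.na(odd_fct)
```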

Working with factors is easy with the forcats package:

penguin_factors <- penguins_clean_names |> 
  mutate(species = as_factor(species))

levels(penguin_factors$species)
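
One detail worth knowing is how as_factor() chooses the level order for text. A minimal sketch, using an invented vector x rather than the penguin data: forcats::as_factor() keeps the levels in the order in which they first appear, whereas base R's factor() sorts them.

```r
library(forcats)

# Toy data, invented for illustration only
x <- c("b", "c", "a", "b")

# forcats: levels in order of first appearance
levels(as_factor(x))

# base R: levels sorted alphabetically
levels(factor(x))
```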

12.0.1 Applying changes across columns

With across() we can apply a function to several columns at once. Here, inside mutate(), we convert every column listed in the .cols argument to a factor by applying forcats::as_factor().

penguins_clean_names |> 
  mutate(
    across(.cols = c("species", "region", "island", "stage", "sex"),
           .fns = forcats::as_factor)
  ) |> 
  select(where(is.factor)) |> 
  glimpse()
Rows: 344
Columns: 5
$ species <fct> Adelie Penguin (Pygoscelis adeliae), Adelie Penguin (Pygosceli…
$ region  <fct> Anvers, Anvers, Anvers, Anvers, Anvers, Anvers, Anvers, Anvers…
$ island  <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
$ stage   <fct> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, 1 Egg Stag…
$ sex     <fct> MALE, FEMALE, FEMALE, NA, FEMALE, MALE, FEMALE, MALE, NA, NA, …
Important

Unless we assign the output of this code to an R object, it will only print to the console. The code above demonstrates how to change variables to factors, but it does not “save” the change.
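
As a sketch of what saving the change could look like, the example below works on a small invented tibble (toy and toy_factors are hypothetical names). It also selects columns with where(is.character) instead of listing them by name, which is one possible alternative when every text column should become a factor.

```r
library(dplyr)
library(forcats)

# Toy data, invented for illustration only
toy <- tibble(
  site = c("north", "south", "north"),
  sex  = c("F", "M", "F"),
  mass = c(1.2, 3.4, 2.2)
)

# Assigning the result to a new object keeps the converted columns
toy_factors <- toy |>
  mutate(across(.cols = where(is.character), .fns = as_factor))

glimpse(toy_factors)
```

The original toy object is unchanged; only toy_factors holds the factor columns.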

12.0.2 Ordering factors manually

With the forcats::fct_relevel() function we can convert a variable to a factor and specify the order of its levels at the same time:

penguins_clean_names |> 
  mutate(species = fct_relevel(species, 
                               "Adelie Penguin (Pygoscelis adeliae)", 
                               "Chinstrap penguin (Pygoscelis antarctica)", 
                               "Gentoo penguin (Pygoscelis papua)")) |> 
  ggplot(aes(x = species))+
  geom_bar()+
  coord_flip()
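
It is not necessary to list every level. A minimal sketch on an invented factor f: any levels named in fct_relevel() are moved (to the front by default, or to a position set with the after argument) and the remaining levels keep their existing order.

```r
library(forcats)

# Toy factor, invented for illustration only
f <- factor(c("a", "b", "c"))

# Move "c" to the front; "a" and "b" keep their relative order
front <- fct_relevel(f, "c")
levels(front)

# Move "a" to the end
end <- fct_relevel(f, "a", after = Inf)
levels(end)
```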

12.0.3 Ordering factors by another variable

With the function forcats::fct_infreq() we can order the levels by how frequently each one occurs, most common first:

penguins_clean_names |> 
   mutate(species = fct_infreq(species)) |>  
  ggplot(aes(x = species))+
  geom_bar()+
  coord_flip()
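
To see what fct_infreq() does to the levels without drawing a plot, here is a small sketch on an invented factor (colours is hypothetical): the counts are checked with table(), and the reordered levels run from the most to the least frequent value.

```r
library(forcats)

# Toy data, invented for illustration only
colours <- factor(c("red", "blue", "blue", "green", "green", "green"))

# Counts per level, in the default (alphabetical) level order
table(colours)

# Levels reordered from most to least frequent
by_freq <- fct_infreq(colours)
levels(by_freq)
```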

The forcats::fct_rev() function reverses the order of the levels in a factor. It is particularly useful when you want to display categories in the opposite order, for example descending rather than ascending.

penguins_clean_names |> 
  mutate(species = fct_rev(as_factor(species))) |> 
  ggplot(aes(x = species))+
  geom_bar()+
  coord_flip()
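
fct_rev() is often chained with another ordering function. The sketch below, again on an invented factor, reverses a frequency ordering so that the rarest level comes first. This can be handy with coord_flip(), where the first level is drawn at the bottom of the chart.

```r
library(forcats)

# Toy data, invented for illustration only
colours <- factor(c("red", "blue", "blue", "green", "green", "green"))

# Most frequent first, then reversed: least frequent first
rarest_first <- fct_rev(fct_infreq(colours))
levels(rarest_first)
```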

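With forcats::fct_reorder() we can order the levels by a summary of another variable, here the mean body mass of each species:

penguins_clean_names |> 
  mutate(species = as_factor(species) |> 
           # by default the levels are ordered by the median of the continuous variable;
           # other summary functions such as mean, min or max can be supplied via .fun
           fct_reorder(body_mass_g,
                       .fun = mean,
                       .na_rm = TRUE)) |> 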
  ggplot(aes(x = species,
             y = body_mass_g,
             colour = species))+
  geom_boxplot(width = .2,
               outlier.shape = NA)+
  coord_flip()+
  theme(legend.position ="none")
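
The effect of fct_reorder() on the levels can be checked on a tiny invented example (group and value are hypothetical). The sketch relies on the default summary, the median, and on the .desc argument to flip the direction.

```r
library(forcats)

# Toy data, invented for illustration only
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(5, 7, 1, 3, 9, 11)

# Group medians are a = 6, b = 2, c = 10, so ascending order is b, a, c
ascending <- fct_reorder(group, value)
levels(ascending)

# Largest median first
descending <- fct_reorder(group, value, .desc = TRUE)
levels(descending)
```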

12.0.4 Factor bins

If we want to, we can also bin continuous data into useful chunks:

penguins_clean_names <- penguins_clean_names |> 
  mutate(mass_range = case_when(
    body_mass_g <= 3500 ~ "smol penguin",
    body_mass_g > 3500 & body_mass_g < 4500 ~ "mid penguin",
    body_mass_g >= 4500 ~ "chonk penguin",
    .default = NA)
  )
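
When writing bins like these it is worth checking how values on the boundaries and missing values are classified. A minimal sketch, applying the same conditions to an invented vector of masses (mass and bins are hypothetical names):

```r
library(dplyr)

# Toy values chosen to sit on and around the cut-offs, plus a missing value
mass <- c(3000, 3500, 4000, 4500, 5000, NA)

bins <- case_when(
  mass <= 3500 ~ "smol penguin",
  mass > 3500 & mass < 4500 ~ "mid penguin",
  mass >= 4500 ~ "chonk penguin",
  .default = NA
)

# Pair each input with its bin; the result is a character vector, not a factor
data.frame(mass, bins)
class(bins)
```

Here 3500 falls in the smallest bin, 4500 in the largest, and the missing mass stays NA.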

If we make a bar plot, character data will typically appear on the x axis in alphabetical order:

penguins_clean_names |> 
  drop_na(mass_range) |> 
  ggplot(aes(x = mass_range))+
  geom_bar()
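
The axis order can be previewed without plotting. As a rough sketch on an invented vector (bins is hypothetical), converting text to a factor with base R's factor() gives the same alphabetical ordering of categories:

```r
# Toy data, invented for illustration only
bins <- c("smol penguin", "mid penguin", "chonk penguin", "mid penguin")

# Default levels are sorted alphabetically, not by size
default_order <- levels(factor(bins))
default_order
```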

Your turn

To convert a character or numeric column to class factor, you can use any function from the forcats package. They will convert to class factor and then also perform or allow certain ordering of the levels - for example using forcats::fct_relevel() lets you manually specify the level order.

The function as_factor() simply converts the class without any further capabilities.

penguins_clean_names <- penguins_clean_names |> 
  mutate(mass_range = as_factor(mass_range))
levels(penguins_clean_names$mass_range)
[1] "mid penguin"   "smol penguin"  "chonk penguin"

Solution

Below we use mutate() and fct_relevel() to convert the column mass_range to class factor and set its levels in a meaningful order.

# Correct the code in your script with this version
penguins_clean_names <- penguins_clean_names |> 
  mutate(mass_range = fct_relevel(mass_range, 
                                  "smol penguin", 
                                  "mid penguin", 
                                  "chonk penguin")
         )

levels(penguins_clean_names$mass_range)
[1] "smol penguin"  "mid penguin"   "chonk penguin"

Now when we call a plot, we can see that the x axis categories match the intrinsic order we have specified with our factor levels.

penguins_clean_names |> 
  drop_na(mass_range) |>  
  ggplot(aes(x = mass_range))+
  geom_bar()