18  Visualising models

When you gather raw data, you have lots of numbers and information, but it can be hard to see the big picture. That’s where statistical models come in. Unlike descriptive statistics, which summarize your data using tools like boxplots to show medians and variability, models provide inferential statistics that help you make predictions and understand relationships beyond your immediate data.

By fitting visuals from models onto your figures, you can see these deeper patterns clearly. For example, adding a regression line to a scatter plot not only summarizes the data but also shows the trend and predicts future values. This transforms your data from simple summaries into stories that explain how variables are connected.

Models are important because they allow you to generalize findings to larger populations, test hypotheses, and make informed decisions based on evidence. While descriptive statistics tell you what is happening in your dataset, inferential statistics from models help you estimate how variables are related, how uncertain those estimates are, and what might happen next.

18.0.1 Visuals vs tables

While model summary tables provide important numbers like coefficients and p-values, visuals bring your data and models to life in several key ways:

  • Easy Understanding: Graphs and charts make complex information simple to grasp at a glance. For example, a regression line on a scatter plot clearly shows the relationship between variables.

  • Better Communication: Visuals help you share your findings with others more effectively. Pictures can convey your results quickly, even to those who aren’t familiar with the detailed statistics.

  • Highlight Patterns: Visuals reveal trends, outliers, and relationships that might be hidden in tables. This helps you see the bigger picture and important details simultaneously.

  • Check Model Fit: Graphs like residual plots let you see if your model is accurately capturing the data patterns, making it easier to spot issues.

  • Make It Memorable: People remember images better than numbers. Using visuals ensures your key insights stick with your audience.

In summary, while summary tables are essential for detailed analysis, visual representations make your models easier to understand, communicate, and interpret. In this class, you’ll learn to use both tables and visuals to effectively present your statistical findings.

18.0.2 Simple models

geom_smooth() in ggplot2 is a convenient way to fit simple models to your data and visualize trends in scatterplots. By default (method = "auto"), geom_smooth() fits a loess (local regression) curve when there are fewer than 1,000 observations; loess is a flexible, non-parametric method that captures local patterns in the data. For larger datasets the default switches to a generalized additive model (GAM) fitted with mgcv::gam(). A linear model is only used when you ask for it with method = "lm", which fits a straight line through the data, representing the best linear relationship between the x and y variables.

In addition to the trend line, geom_smooth() can display a confidence interval around the fitted line, showing the range in which the true trend likely lies. The interval is switched on or off with the se argument (TRUE by default), and its coverage is set with the level argument (0.95 by default):

penguins |> 
  ggplot(aes(x=flipper_length_mm, 
             y = body_mass_g,
             colour = species))+ # this can be set at the ggplot layer
  geom_point(alpha = .4)+
  geom_smooth(method = "lm")
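To see how the method, se and level arguments change the result, here is a small sketch using the same penguins data. It assumes the ggplot2 and palmerpenguins packages are installed; the object names (p, p_default, p_lm, p_no_se) are only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Base scatterplot shared by the three versions below
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.4, na.rm = TRUE)

# Default smoother: loess here, because there are fewer than 1,000 rows
p_default <- p + geom_smooth(na.rm = TRUE)

# Straight line with a wider (99%) confidence band
p_lm <- p + geom_smooth(method = "lm", level = 0.99, na.rm = TRUE)

# Straight line with the confidence band switched off
p_no_se <- p + geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)
```

Print each object in turn (for example p_lm) to compare the three versions.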

18.0.3 Complex models

geom_smooth() can automatically fit separate regression lines for each level of a categorical variable by grouping with the color or fill aesthetic. For example, if you group by species, geom_smooth() will produce independent regressions for each species, fitting separate trend lines without considering any relationships or dependencies between groups.

However, this approach can be misleading if it does not match the model you actually fitted. Separate per-group fits are equivalent to giving every group its own intercept and its own slope, and they ignore any other predictors in your model. In an additive model (for example y ~ x + group), the groups share a common slope and differ only in their intercepts, so the fitted lines should be parallel; geom_smooth() will still draw lines with different slopes, suggesting an interaction that your model does not contain. Even when your model does include an interaction, the per-group lines are not adjusted for the other terms in the model, and their confidence bands come from separate fits rather than from your model. To represent additive or interaction models faithfully, it is better to generate predictions from the fitted model and plot those.
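One way to do this is to ask the fitted model for its predictions and draw them with geom_line() and geom_ribbon(). The sketch below uses the penguins data and an additive model chosen purely for illustration (it is not a model from this chapter); the names peng, additive and peng_pred are made up for the example.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only the variables we need and drop incomplete rows
peng <- as.data.frame(penguins)[, c("flipper_length_mm", "body_mass_g", "species")]
peng <- na.omit(peng)

# Additive model: one shared slope, a different intercept per species
additive <- lm(body_mass_g ~ flipper_length_mm + species, data = peng)

# Model predictions with 95% confidence limits (columns fit, lwr, upr)
ci <- predict(additive, newdata = peng, interval = "confidence")
peng_pred <- cbind(peng, ci)

additive_plot <- ggplot(peng_pred,
                        aes(x = flipper_length_mm, colour = species, fill = species)) +
  geom_point(aes(y = body_mass_g), alpha = 0.4) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, colour = NA) +
  geom_line(aes(y = fit), linewidth = 1)

additive_plot
```

Because the lines come from the model itself, they are parallel, which is exactly what an additive model claims.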

18.0.3.1 Butterfly data

  • Year specimen collected

  • Forewing length (mm)

  • Sex of butterfly

  • Average June temp (celsius)

  • Average June rainfall (mm)

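# Standardise column names (lower case, snake_case)
butterfly <- butterfly |> 
  janitor::clean_names()

butterfly_correct <- butterfly |> 
  # Replace June rainfall values above 100 mm with the column mean
  mutate(rain_jun = if_else(rain_jun > 100, mean(rain_jun), rain_jun)) |> 
  # Fix typos and inconsistent labels in sex
  mutate(sex = case_when(sex == "Maes" ~ "Males",
                         sex == "Female" ~ "Females",
                         .default = as.character(sex)))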
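If you want to check what these cleaning steps do, it can help to run them on a few rows where you already know the answer. The tibble below is made up for illustration only (it is not the real butterfly data), and the names toy and toy_clean are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only; NOT the real butterfly data
toy <- tibble(
  sex      = c("Female", "Maes", "Males", "Females"),
  rain_jun = c(40, 55, 577, 62)
)

toy_clean <- toy |>
  mutate(rain_jun = if_else(rain_jun > 100, mean(rain_jun), rain_jun),
         sex = case_when(sex == "Maes" ~ "Males",
                         sex == "Female" ~ "Females",
                         .default = sex))

# Only two labels should remain
count(toy_clean, sex)

# The third value is replaced by the mean of all four values (183.5)
toy_clean$rain_jun
```

Note that mean(rain_jun) is calculated over every row, including the value being replaced, so the replacement is pulled upwards by the extreme value itself. Whether that matters depends on your data, so it is worth checking with summary() after cleaning.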

18.0.3.2 Analysis

My initial hypotheses are:

  • Average June temperature causes a change in forewing length

  • Average June rainfall causes a change in forewing length

  • The average forewing length of male and female butterflies is different

  • Male and female butterflies respond differently to average June temperature

Can you produce a single linear model to test these hypotheses?

model <- lm(forewing_length ~ jun_mean + rain_jun + sex + jun_mean:sex, data = butterfly_correct)

summary(model)

Call:
lm(formula = forewing_length ~ jun_mean + rain_jun + sex + jun_mean:sex, 
    data = butterfly_correct)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04984 -0.30796 -0.00631  0.31447  0.77349 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       12.329062   1.127727  10.933  3.4e-15 ***
jun_mean           0.132524   0.078110   1.697   0.0956 .  
rain_jun           0.001985   0.002733   0.726   0.4708    
sexMales          -3.551029   1.552919  -2.287   0.0262 *  
jun_mean:sexMales  0.158844   0.110003   1.444   0.1546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4423 on 53 degrees of freedom
Multiple R-squared:  0.7359,    Adjusted R-squared:  0.716 
F-statistic: 36.93 on 4 and 53 DF,  p-value: 9.698e-15
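When a model contains an interaction, the coefficients for each group have to be assembled by hand. As a rough worked example, the snippet below copies the estimates from the summary above and adds them up to get the intercept and temperature slope for each sex (the object names are only illustrative).

```r
# Estimates copied from summary(model) above
intercept        <- 12.329062   # females (reference level)
slope_jun_mean   <- 0.132524    # temperature slope for females
diff_intercept_m <- -3.551029   # sexMales
diff_slope_m     <- 0.158844    # jun_mean:sexMales

# Females are the reference level, so their values are read off directly
slope_females <- slope_jun_mean

# Males: reference value plus the male-specific difference
intercept_males <- intercept + diff_intercept_m
slope_males     <- slope_jun_mean + diff_slope_m

slope_females    # 0.132524
slope_males      # 0.291368
intercept_males  # 8.778033
```

The sexMales estimate is the difference between the sexes when jun_mean (and rain_jun) equal zero, which is far outside the observed temperatures. This is why it looks so different from the sex difference you would read off a plot of the data.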

18.0.3.3 Check model fit

Once the initial model has been fitted, we need to check its assumptions. Try to evaluate this model:

performance::check_model(model, detrend = FALSE)

If in doubt, try running formal normality and heteroscedasticity tests. The collinearity plot shows some high multicollinearity, but only between sex and jun_mean:sex. Multicollinearity between a term and its own interaction is expected.
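The performance package also provides individual checks that return test results rather than plots. The sketch below shows how they might be called; it uses a stand-in model fitted to the penguins data (an assumption made so the example runs on its own), and you would swap in your own model object.

```r
library(performance)
library(palmerpenguins)

# Stand-in model with an interaction, for illustration only
peng <- na.omit(as.data.frame(penguins))
stand_in <- lm(body_mass_g ~ flipper_length_mm * sex, data = peng)

# Shapiro-Wilk test on the residuals
norm_check <- check_normality(stand_in)

# Breusch-Pagan test for non-constant error variance
het_check <- check_heteroscedasticity(stand_in)

# Variance inflation factors for each term
vif_check <- check_collinearity(stand_in)

norm_check
het_check
vif_check
```

Treat these tests as a supplement to the diagnostic plots rather than a replacement: with large samples they can flag trivial departures, and with small samples they can miss real ones.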

18.0.3.4 Test interaction

Can you test whether the interaction term should be kept or removed?

model_2 <- lm(forewing_length ~ jun_mean + rain_jun + sex, data = butterfly_correct)

anova(model_2, model)

# Alternatively
# drop1(model, test = "F")

summary(model_2)
Res.Df       RSS   Df   Sum of Sq          F      Pr(>F)
    54  10.77483   NA          NA         NA          NA
    53  10.36697    1   0.4078532   2.085104   0.1546277

Call:
lm(formula = forewing_length ~ jun_mean + rain_jun + sex, data = butterfly_correct)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.05660 -0.28880 -0.03925  0.29866  0.76613 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.236813   0.844789  13.301  < 2e-16 ***
jun_mean     0.211121   0.056582   3.731 0.000459 ***
rain_jun     0.001658   0.002751   0.603 0.549352    
sexMales    -1.314933   0.117503 -11.191  1.1e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 54 degrees of freedom
Multiple R-squared:  0.7256,    Adjusted R-squared:  0.7103 
F-statistic: 47.59 on 3 and 54 DF,  p-value: 3.512e-15

18.0.3.5 Figures

Great! We have now fitted and evaluated a model that shows temperature and sex both affect adult forewing length. Crucially, there is no evidence of an interaction effect (a differential response to temperature by sex).

If we make a figure as follows:

butterfly_correct |> 
  ggplot(aes(x = jun_mean,
             y = forewing_length,
             colour = sex))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_classic(base_size = 18)

Q. What is the issue with the above figure?

19 Refining Plots for Publication Standard

19.1 Background

In scientific publications, a plot must be both aesthetically pleasing and scientifically informative. Clear, readable visuals are essential to ensure that the message in your data is conveyed effectively to readers. Refining a plot for publication involves careful attention to scaling, labeling, and annotation, all of which help make data more accessible and the figure easier to interpret. Good design elements—like clean axes, appropriate scaling, minimal but helpful gridlines, and informative captions—allow readers to focus on the data rather than unnecessary elements.

This section covers the core principles of refining plots for publication standards, ensuring they are clean, well-labeled, and capable of standing alone within a scientific paper.

19.1.1 Best Practices for Publication-Ready Plots

  • Clarity and Simplicity:

    Focus on simplicity and remove unnecessary elements that don’t convey useful information. Every component of the graph should serve a purpose, supporting the main message.

    Avoid decorative effects (such as 3D elements) that detract from readability and clarity.

  • Effective Use of Color:

    Use color thoughtfully, as it can quickly communicate information but also mislead or confuse if overused. Limit color to highlight key distinctions.

    Ensure accessibility by choosing colorblind-friendly palettes and testing for readability in grayscale if necessary.

  • Choosing the Right Graph Type:

    Select the graph type that best represents your data and message. Scatterplots, line graphs, bar charts, and box plots each have strengths for specific data types and questions.

    Avoid using overly complex or unusual graph types that may be difficult for readers to interpret without adding value.

  • Clear and Descriptive Labels:

    Use self-explanatory axis labels that include units (e.g., “Body Size (cm)”).

    Avoid abbreviations where possible, unless they are universally understood in your field.

  • Use of axes:

    Ensure that axes use the correct scale (e.g., linear, log, or custom) to represent the data accurately.

    Avoid unnecessary distortion—choose scales that highlight the data patterns without exaggeration.

    Ensure that axis limits and intervals are set to show data clearly, without truncating or misrepresenting information.

    Order categories along an axis to aid readability.

  • Minimalist Themes:

    Use minimal gridlines to guide the eye without overwhelming the data.

    Use both vertical and horizontal gridlines for scatterplots.

    Use only horizontal gridlines (or none) when you have a categorical x-axis.

    Avoid distracting backgrounds or decorative elements.

    Opt for theme_minimal() or theme_classic() in ggplot2, which offer a clean, professional look.

  • Annotations:

    Add annotations to highlight important points, trends, or outliers in the data.

    Use concise text and arrows or markers to draw attention to specific aspects without cluttering the plot.

  • Informative Captions:

    Write a detailed caption that summarizes the plot’s main insights and context.

    The caption should make the figure understandable without requiring additional context from the text.

  • Consistency Across Figures:

    Maintain a consistent style across figures in terms of font, size, color, and design to create a cohesive look throughout a publication.

    This consistency makes figures easier to compare and keeps readers focused on the data.
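As a starting point, several of these tips can be combined in a few lines of ggplot2. The sketch below uses the penguins data rather than the butterfly data, and the palette, text sizes and caption wording are illustrative choices, not requirements.

```r
library(ggplot2)
library(palmerpenguins)

pub_plot <- ggplot(penguins,
                   aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point(alpha = 0.6, na.rm = TRUE) +
  # Colourblind-friendly discrete palette
  scale_colour_viridis_d(end = 0.85) +
  # Self-explanatory labels with units, plus a caption
  labs(x = "Flipper length (mm)",
       y = "Body mass (g)",
       colour = "Species",
       caption = "Body mass against flipper length for three penguin species.") +
  # Clean theme with a readable base font size
  theme_classic(base_size = 14) +
  theme(legend.position = "bottom")

pub_plot
```

If you keep the theme and palette in one place like this, it is easier to reuse them and keep a consistent style across all the figures in a report.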

19.2 Task: Take the univoltine butterfly figure and make it publication ready by following the tips above

  • When you have made your best effort figure save it and submit it as this week’s assignment.

19.3 Saving

One of the easiest ways to save a figure you have made is with the ggsave() function. By default it will save the last plot you displayed.

If you have assigned the plot to an R object, you can pass it to the plot argument of ggsave() in order to save that particular plot. The first argument of ggsave() is the file name, so the plot object must be named or given second.

You should specify the output path to your figures folder, then provide a file name. Here I have decided to call my plot plot (imaginative!) and I want to save it as a .PNG image file. I can also specify the resolution (300 dpi is a common standard for print, and more than enough for computer screens).

# OUTPUT FIGURE TO FILE

# Saves the last plot displayed
ggsave("outputs/YYYYMMDD_ggplot_workshop_final_plot.png", dpi = 300)

# Saves a specific plot object (here called plot)
# ggsave("outputs/YYYYMMDD_ggplot_workshop_final_plot.png", plot = plot, dpi = 300)
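Journals usually ask for figures at a specific physical size, so it is often worth setting width, height and units as well as dpi. The sketch below is one way to do this; the plot, the file name and the dimensions are placeholders chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# A placeholder plot to save
final_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Make sure the output folder exists before saving
dir.create("outputs", showWarnings = FALSE)

out_file <- file.path("outputs", "example_plot.png")

# Save at a fixed physical size so text scales predictably
ggsave(out_file, plot = final_plot,
       width = 16, height = 10, units = "cm", dpi = 300)
```

If you do not set width and height, ggsave() uses the size of the current graphics device, so the same code can give different results on different screens.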

19.4 Further Reading, Guides and tips on data visualisation

Check out Chapter 19 for extra data visualisation resources

  • Why Scientists Need to be Better at Data Visualisation