Introduction to Generalised Linear Models

ESA and Physalia Collaboration
4th March 2024

Dr Philip Leftwich

Welcome!

physalia logo

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.

Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.

Teach Genetics, Programming, and Statistics

UEA logo

About me

Norwich - “A Fine City!”

Introduce yourselves

Outline of the course

Why be normal?
GLM with binary data
GLM with count data

What to expect during this workshop

These workshop will run for 4 hours each.

We will have a 30 min break halfway through

Combines slides, live coding examples, and exercises for you to participate in.
Ask questions in the chat throughout!

What to expect during this workshop

I hope you end up with more questions than answers after this workshop!

Source: giphy.com

What you need for this workshop

You are comfortable with simple operations in R.
You know how to perform linear regression.
You want to learn some Generalised Linear Models

Workshop resources

Suggested R packages:

library(arm)
library(car)
library(DHARMa)
library(emmeans)
library(equatiomatic)
library(fitdistrplus)
library(gtsummary)
library(lmtest)
library(MASS)
library(MuMIn)
library(performance)
library(pscl)
library(tidyverse)
library(sjPlot)

Workshop Resources

Slides available here
Data available here
Scripts available here
Posit Cloud workspace available here

General linear models

Recapping general linear models

Also known at the Ordinary Least Squares (OLS) model
It aims to fit a regression line which minimises the squared differences between observed and predicted values

Equation of the line

The equation of a linear model (lm) is given by:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where:

$y_i$ is the predicted value of the response variable.
$\beta_0$ is the intercept.
$\beta_1$ is the slope coefficient, representing the change in the response variable for a one-unit change in the explanatory variable.
$x_i$ is the value of the explanatory variable.
$\epsilon_i$ represents the residuals of the model

Limitations of general linear models

Q. What are the assumptions of a general linear model?

Assumes a linear relationship
Assumes normal distribution of the residuals
Assumes homogeneity of the residuals
Independence - assumes observations are independent of each other

Normal distribution of the residuals

The linear regression line is the most likely line given your data if we assume each data point comes from a hypothetical bell curve centered on the regression line

Normal distribution

Normal distribution of the residuals

When using our residuals to calculate standard error, confidence intervals and statistical significance these are assumed to be drawn from:

a normal distribution with mean zero
with a constant variance.

This implies that the residuals, or the distances between the observed values and the values predicted by the linear model, can be modeled by drawing random values from a normal distribution.

The linear model equation

Another way to write the lm equation is:

\[ y_i \sim N(\mu = \beta_0 + \beta_1 X_i, \sigma^2) \]

Which literally means that $y_i$ is drawn from a normal distribution with parameters:

$\mu$ (which depends on $x_i$)
$\sigma$ (which has the same value for all measures of $Y$).

Real data

How often do these assumptions really hold true?

A good fit:

Real data

How often do these assumptions really hold true?

An ok fit:

Real data

How often do these assumptions really hold true?

A poor fit:

Assumption testing

Residuals vs Fitted

Purpose: To check for linearity and homoscedasticity.
Interpretation: Look for a random scatter of points around the horizontal line at y = 0. A funnel-shaped pattern suggests heteroscedasticity.

QQ-plot

Purpose: To assess the normality of the residuals
Interpretation: Points should fall along the diagonal line. Deviation from the line indicates non-normality of residuals.

Scale-location

Purpose: To check homoscedasticity and identify outliers
Interpretation: Uses standardised residuals. Constant spread indicates homoscedasticity

Residuals vs. Leverage

Purpose: To identify influential data points(outliers)
Interpretation: Loking for points with high leverage that might affect the regression line. Investigate values > 0.5

Performance

The performance package from easystats can produce (among other things) similar plots.

Formal tests

For linear models we can also run some formal tests on the residuals

# lmtest::bptest(model)

performance::check_normality(model)

OK: residuals appear as normally distributed (p = 0.825).

# shapiro.test(residuals(model))

performance::check_heteroscedasticity(model)

OK: Error variance appears to be homoscedastic (p = 0.428).

This becomes more difficult with generalised models
Being able to interpret residual plots is important

Generalised Linear Models

What are Generalised linear models?

The Generalised Linear Model (GLM) is an extension of the ordinary linear regression model that accommodates a broader range of response variable types and error distributions.
Key components:
Linear Predictor: Combines predictor variables linearly.
Link Function: Links the mean of the response variable to the linear predictor.
Error family: Drawn from the exponential family it determines the shape of the response variable’s distribution.

Linear Predictor

Linear models:

$y_i = \beta0 + \beta_1x1 + \beta_2x2 + ... +\beta_nxn +\epsilon_i$

Then generalised linear models…

$\eta_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_n x_{ni}$

$\eta_i$ the linear prediction for observation $i$

The link function

Let’s start with linear models…

$y_i = \beta0 + \beta_1x1 + \beta_2x2 + ... +\beta_nxn +\epsilon_i$

Then generalised linear models…

$g(\mu_i) = \eta_i$

$\eta_i$ linear prediction for observation $i$
$g$ the link function relating the linear predictor to the expected value (mean) of the response variable
$\mu_i$ is the expected value of the response variable for observation $i$

Common link functions: identity, sqrt, log, logit

Error family

Let’s start with linear models…

$y_i = \beta0 + \beta_1x1 + \beta_2x2 + ... +\beta_nxn +\epsilon_i$

$\epsilon_i = \mathcal{N}(0, \sigma^2)$

Then generalised linear models…

$\mu_i = f(\eta_i)$

$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

$\epsilon_i$ is the error term for observation $i$

Examples: Gaussian (normal), binomial, Poisson, gamma, etc.

Generalised Linear Model

Linear Predictor ($\eta$):

$\eta_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots$

Link Function ($g$):

$g(\mu_i) = \eta_i$

Probability Distribution:

$\mu_i = f(\eta_i)$

$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

$\eta_i$ linear prediction for observation $i$
$g$ the link function relating the linear predictor to the expected value (mean) of the response variable
$\mu_i$ is the expected value of the response variable for observation $i$
$f$ the inverse of the link function, transforming the linear predictor back to the response variable’s expected value
$\epsilon_i$ is the error term for observation $i$

Changing model structures

There are a range of model structures available.

The choice of error family and link influence how well the model fits

Exponential Error Family	Link Function
Gaussian	Identity, sqrt, log, inverse
Binomial	Logit, probit, cloglog
Poisson	Log, identity, sqrt
Gamma	Inverse, identity, log, sqrt
Negative Binomial	Log, identity, sqrt

Questions?

A Linear Model Example

Janka timber hardness

This regression analysis uses data from the Australian forestry industry recording wood density and timber hardness from 36 different samples

Timber hardness is quantified on the “Janka” scale
The ‘amount of force required to embed a 0.444” steel ball into the wood to half of its diameter’.

Fitting the general linear model

janka_ls <- lm(hardness ~ dens, data = janka)

summary(janka_ls)
janka |> 
  ggplot(aes( x = dens, y = hardness))+
  geom_point()+
  geom_smooth(method = "lm")


Call:
lm(formula = hardness ~ dens, data = janka)

Residuals:
    Min      1Q  Median      3Q     Max 
-338.40  -96.98  -15.71   92.71  625.06 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
dens           57.507      2.279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 183.1 on 34 degrees of freedom
Multiple R-squared:  0.9493,    Adjusted R-squared:  0.9478 
F-statistic:   637 on 1 and 34 DF,  p-value: < 2.2e-16

Challenge: Fit a Linear Model

Fit the Linear Model lm(hardness ~ dens, data = janka) to the janka data.
Evaluate the model fit

10:00

Model diagnostics

plot(janka_ls, which=2)
plot(janka_ls, which=3)

Violations of assumptions

library(lmtest)

# Breusch-Pagan Test of Homoscedasticity
lmtest::bptest(janka_ls)


    studentized Breusch-Pagan test

data:  janka_ls
BP = 4.4668, df = 1, p-value = 0.03456

# Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(janka_ls))


    Shapiro-Wilk normality test

data:  residuals(janka_ls)
W = 0.93641, p-value = 0.03933

Challenge

What can we try to improve the fit of our lm(hardness ~ dens, data = janka) ?

Let’s make suggestions then try them out:

15:00

BoxCox

The R output for the MASS::boxcox() function plots a maximum likelihood curve (with a 95% confidence interval - drops down as dotted lines) for the best transformation for fitting the dependent variable to the model.

lambda value	transformation
0	log(Y)
0.5	sqrt(Y)
1	Y
2	Y^1

MASS::boxcox(janka_ls)

Square root transformation

janka_sqrt <- lm(sqrt(hardness) ~ dens, data = janka)

plot(janka_sqrt, which=2)
plot(janka_sqrt, which=3)

Model fit

ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = y ~ sqrt(x)) +  # regression line
  labs(title = "Sqrt Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Log transformation

janka_log <- lm(log(hardness) ~ dens, data = janka)

plot(janka_log, which=2)
plot(janka_log, which=3)

Model fit

ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = (y ~ log(x))) +  # regression line
  labs(title = "Log Linear Regression",
       x = "X", y = "Y")  # axis labels and title

Polynomials

janka_poly <- lm(log(hardness) ~ poly(dens, 2), data = janka)

plot(janka_poly, which=2)
plot(janka_poly, which=3)

Polynomial fit

summary(janka_poly)
ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = (y ~ poly(log(x), 2))) +  # regression line
  labs(title = "Quadratic Log Linear Regression",
       x = "X", y = "Y")  # axis labels and title


Call:
lm(formula = log(hardness) ~ poly(dens, 2), data = janka)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22331 -0.05708 -0.01104  0.07500  0.18871 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.1362     0.0168 424.896  < 2e-16 ***
poly(dens, 2)1   3.3862     0.1008  33.603  < 2e-16 ***
poly(dens, 2)2  -0.5395     0.1008  -5.354 6.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1008 on 33 degrees of freedom
Multiple R-squared:  0.9723,    Adjusted R-squared:  0.9706 
F-statistic: 578.9 on 2 and 33 DF,  p-value: < 2.2e-16

Weighted least squares

janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka)

Weights are specified as $\frac{1}{\sqrt{\text{hardness}}}$, indicating that observations with higher hardness values will have lower weights, and vice versa.
This weighting scheme can assign more importance to certain observations in the model fitting process.
This approach is known as weighted least squares (WLS) regression, where observations are weighted differently based on certain criteria, such as the square root of hardness.

Weighted least squares

janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka)

plot(janka_wls, which=2)
plot(janka_wls, which=3)

Model fit


Call:
lm(formula = sqrt(hardness) ~ dens, data = janka, weights = 1/sqrt(hardness))

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-0.55072 -0.19009 -0.03393  0.19082  0.63845 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.01819    0.96647   2.088   0.0443 *  
dens         0.76142    0.02201  34.601   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.294 on 34 degrees of freedom
Multiple R-squared:  0.9724,    Adjusted R-squared:  0.9716 
F-statistic:  1197 on 1 and 34 DF,  p-value: < 2.2e-16

Troublesome transformations

Changes interpretability
Can introduce NEW violations of assumptions
Error and Mean change together
We don’t understand the true structure of our data

A Generalised Linear Model Example

A Generalised linear model approach

Fit a glm(gaussian(link = "identity")) to the data
Check that the model is identical to the lm()

05:00

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "identity"))

summary(janka_glm)


Call:
glm(formula = hardness ~ dens, family = gaussian(link = "identity"), 
    data = janka)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
dens           57.507      2.279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 33510.78)

    Null deviance: 22485041  on 35  degrees of freedom
Residual deviance:  1139366  on 34  degrees of freedom
AIC: 481.21

Number of Fisher Scoring iterations: 2

A Generalised linear model approach

We previously identified that the square root of the dependent variable might have been a better fit with our linear model:

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

summary(janka_glm)


Call:
glm(formula = hardness ~ dens, family = gaussian(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.60869    1.47660   1.767   0.0863 .  
dens         0.75194    0.02722  27.622   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25498.81)

    Null deviance: 22485041  on 35  degrees of freedom
Residual deviance:   866960  on 34  degrees of freedom
AIC: 471.38

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(gaussian(link = "sqrt"))) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Residuals

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

plot(janka_glm, which=2)
plot(janka_glm, which=3)

Choosing an Error Family

library(fitdistrplus)

descdist(janka$hardness, boot = 500)

summary statistics
------
min:  413   max:  3260 
median:  1195 
mean:  1469.472 
estimated sd:  801.5172 
estimated skewness:  0.6579162 
estimated kurtosis:  2.571426

A Cullen and Frey plot (skewness-kurtosis)
A graphical view of the distribution of the dependent variable
Reference lines represent expected skew and kurtosis for different distributions
Beta family sits outside of GLM exponential families

Choosing an Error family

library(fitdistrplus)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
fg <- fitdist(janka$hardness, "gamma")
fn <- fitdist(janka$hardness, "norm")
plot.legend <- c("normal", "gamma")
denscomp(list(fn, fg), legendtext = plot.legend)
qqcomp(list(fn, fg), legendtext = plot.legend)

Choosing based on AIC

summary(fg)

Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters : 
         estimate   Std. Error
shape 3.351431971 0.4076228722
rate  0.002280106 0.0002166559
Loglikelihood:  -287.9889   AIC:  579.9778   BIC:  583.1448 
Correlation matrix:
          shape      rate
shape 1.0000000 0.7201126
rate  0.7201126 1.0000000

summary(fn)

Fitting of the distribution ' norm ' by maximum likelihood 
Parameters : 
      estimate Std. Error
mean 1469.4722  131.71673
sd    790.3066   93.13779
Loglikelihood:  -291.2889   AIC:  586.5779   BIC:  589.7449 
Correlation matrix:
     mean sd
mean    1  0
sd      0  1

Gamma

$f(x;\alpha,\beta) = \frac{\beta^\alpha x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}$

Shape (α): Determines the shape or curve of the distribution. Higher values of α result in distributions that are more peaked and skewed to the left.
Rate (β): Influences the spread of the distribution. Smaller values of β result in distributions that are more spread out, larger values of β lead to distributions that are less spread out.
Useful for modelling positively skewed continuous non-negative data

Gamma GLM

janka_gamma <- glm(hardness ~ dens, data = janka, family = Gamma(link = "sqrt"))

summary(janka_gamma)


Call:
glm(formula = hardness ~ dens, family = Gamma(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.86729    0.90357   2.067   0.0465 *  
dens         0.76780    0.02243  34.224   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.009716208)

    Null deviance: 11.26788  on 35  degrees of freedom
Residual deviance:  0.32876  on 34  degrees of freedom
AIC: 452.97

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(Gamma(link = "sqrt"))) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Exercise: Gamma

Fit the Gamma distribution linear model to the data.
glm(hardness ~ dens, data = janka, family = Gamma(link = "sqrt")
Evaluate the model fit against other glm fits.

10:00

Questions?

Likelihood

Log-Likelihood

Likelihood measures the probability of observing the given data under a specific statistical model, as a function of the model’s parameters.
It represents how well the model explains the observed data.
The log likelihood function in maximum likelihood estimations is usually computationally simpler as it is additive.

Log-Likelihood

Likelihood Function:

$L(\theta | y) = f(y_1 | \theta) \times f(y_2 | \theta) \times \ldots \times f(y_n | \theta)$

Log-Likelihood Function:

$\ell(\theta | y) = \log(f(y_1 | \theta)) + \log(f(y_2 | \theta)) + \ldots + \log(f(y_n | \theta))$

Although log-likelihood functions are mathematically easier than their multiplicative counterparts, they can be challenging to calculate by hand. They are usually calculated with software.

Maximum Likelihood Estimation

For a specified error family distribution and link function, the glm finds parameter values that have the highest likelihood score.

Which estimates are most probable for producing the observed data
This is an iterative process, without a single defined equation
For gaussian and identity link it will produce the same outcome as OLS.

Log-Likelihood Function: \[\ell(\theta | \mathbf{y}) = \sum_{i=1}^{n} \log(f(y_i | \theta))\]

Maximum Likelihood

A contour plot visualizing the log-likelihood surface for a normal distribution fitted to simulated data.
The contour lines on the plot represent regions of equal log-likelihood values. Darker regions indicate lower log-likelihood values, while lighter regions indicate higher log-likelihood values

Exercise: log-likelihood

Use the log-likelihood function with varying values of the mean and sd

05:00

Deviance

The deviance is a key concept in Generalised linear models.
It measures the deviance of the fitted Generalised linear model with respect to a perfect/saturated model for the sample.

Pseudo R-squared

Formula:

$GLM~R^2 = 1 - \frac{D}{D_0}$

$linear\_model~R^2= 1 - \frac{SSE}{SST}$

$D$ is the deviance of the fitted model.
$D_0$ is the deviance of the null model (linear model with only an intercept).

The GLM $R^2$ is not the percentage of variance explained by the model, but rather a ratio indicating how close the fit is to being perfect.

Pseudo R-squared in R

We can fit the deviance-based pseudo $R^2$ with the MuMIn package
r.squaredLR() is a function used to compute the likelihood ratio (LR) based R-squared for a given model

library(MuMIn)

r.squaredLR(janka_gamma)

[1] 0.9722591
attr(,"adj.r.squared")
[1] 0.9722593

AIC

The Akaike Information Criterion (AIC) balances the trade-off between model fit and complexity by penalizing models with more parameters.
It aims to find the model that best explains the data while avoiding overfitting.
AIC is calculated as follows: $\text{AIC} = 2k - 2\ln(\hat{L})$,
where: log-likelihood is the maximized value of the likelihood function for the fitted model. $k$ is the number of parameters in the model.
In general a difference in AIC of <2 suggests equivalent goodness-of-fit

AIC

AIC can:

Compare non-nested models fitted to the same dataset and response variable
Compare across different GLM families and link functions

AIC cannot:

Compare models where the response variable is transformed
Compare models where there are differences in the dataset

Exercise: Compare models

Revisit the janka models
Compare R-squared fits and AIC values
r.squaredLR() and AIC()

10:00

Likelihood ratio test

The likelihood ratio test (LRT) “unifies” frequentist statistical tests. Brand-name tests like t-test, F-test, chi-squared-test, are specific cases (or even approximations) of the LRT.

The likelihood ratio test (LRT) begins with a comparison of the likelihood scores of the two models:

$\text{LR} = -2(\ln L_1 - \ln L_2)$

Where:

$\ln L_1$ and $\ln L_2$ are the log-likelihoods of the two models being compared.

Likelihood ratio test

This LRT statistic approximately follows a chi-square distribution.
To determine if the difference in likelihood scores among the two models is statistically significant, we next must consider the degrees of freedom.
In the LRT, degrees of freedom is equal to the number of additional parameters in the more complex model.

LR <- -2*(null_likelihood[1]-fitted_likelihood[1])

LR

[1] 129.0546

Likelihood ratio test

The likelihood ratio test lrtest() is typically calculated as twice the difference in deviance between the two models. (Can also be carried out with the anova() function)
Scaling by -2 ensures the test statistic follows a chi-square distribution under certain conditions
We can calculate the likelihood ratio test statistic, compare it to the chi-square distribution with the appropriate degrees of freedom, and determine whether the improvement in fit between the two models is statistically significant.

Likelihood ratio test

Model 1: hardness ~ 1
Model 2: hardness ~ dens
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   2 -288.01                         
2   3 -223.48  1 129.05  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F distribution

What if the variance is unknown e.g. Gaussian and Gamma distributions?

Then the null model estimates one parameter while the alternative model estimates two, so the difference in df is still 1.
In this case, we know this more familiarly as the ANOVA or F-test, which in this example is equivalent to the t-test on the intercept.


Call:
glm(formula = hardness ~ dens, family = Gamma(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.86729    0.90357   2.067   0.0465 *  
dens         0.76780    0.02243  34.224   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.009716208)

    Null deviance: 11.26788  on 35  degrees of freedom
Residual deviance:  0.32876  on 34  degrees of freedom
AIC: 452.97

Number of Fisher Scoring iterations: 4

F distribution

What if the variance is unknown e.g. Gaussian and Gamma distributions?

anova(null_model, fitted_model, test = "F")

Analysis of Deviance Table

Model 1: hardness ~ 1
Model 2: hardness ~ dens
  Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
1        35    11.2679                                 
2        34     0.3288  1   10.939 1125.9 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(fitted_model, test = "F")

Single term deletions

Model:
hardness ~ dens
       Df Deviance     AIC F value    Pr(>F)    
<none>      0.3288  452.97                      
dens    1  11.2679 1576.83  1131.3 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Challenge: Fruitfly

Using the fruitfly.csv dataset
Build a model to investigate the effect of sleep, mating status and body size on longevity
Check the model fits and report findings

30:00

Confidence intervals

# Assuming 'model' is your fitted GLM model

# Extract coefficients and confidence intervals
coef_ci <- confint(fly_gamma)

# Extract estimates
estimates <- coef(fly_gamma)

# Combine estimates with confidence intervals
estimates_df <- data.frame(
  estimates = estimates,
  lower_ci = coef_ci[,1],
  upper_ci = coef_ci[,2]
)

# Display dataframe
estimates_df

                  estimates     lower_ci    upper_ci
(Intercept)     -62.3343118 -79.00681086 -44.6346129
typeInseminated   3.3775829  -2.55299124   9.0437595
typeVirgin      -14.1795176 -19.65196164  -9.1009370
thorax          150.6246221 129.55928920 170.8932705
sleep             0.0238678  -0.08213297   0.1373446

Confidence intervals

Broom is a tidyverse package for with helpers to extract and convert models into tibbles:

# Alternatively, using package-specific functions
# Example with 'broom' package
library(broom)
tidy(fly_gamma, conf.int = T)

# A tibble: 5 × 7
  term            estimate std.error statistic  p.value conf.low conf.high
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     -62.3       8.55      -7.29  3.65e-11 -79.0      -44.6  
2 typeInseminated   3.38      2.97       1.14  2.58e- 1  -2.55       9.04 
3 typeVirgin      -14.2       2.70      -5.26  6.47e- 7 -19.7       -9.10 
4 thorax          151.       10.3       14.6   2.52e-28 130.       171.   
5 sleep             0.0239    0.0576     0.414 6.79e- 1  -0.0821     0.137

Extracting main effects

Estimates and confidence intervals are set agains the intercept
Hypothesis testing of factors with more than one level requires lrt:

drop1(fly_gamma, test = "F")

Single term deletions

Model:
longevity ~ type + thorax + sleep
       Df Deviance     AIC  F value    Pr(>F)    
<none>      4.8907  961.62                       
type    2   8.0073 1034.12  38.2348 1.423e-13 ***
thorax  1  10.9946 1109.45 149.7658 < 2.2e-16 ***
sleep   1   4.8982  959.81   0.1848     0.668    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimates and post-hoc

The emmeans package in R computes estimated marginal means (EMMs) it can also provides post-hoc comparisons and contrasts for fitted models.

By default the package computes marginal means with 95% confidence intervals for the average of each variable

emmeans::emmeans(fly_gamma, 
                 specs = pairwise ~ type + thorax + sleep,
                 type = "response")

$emmeans
 type        thorax sleep emmean   SE  df lower.CL upper.CL
 Control      0.821  23.5   61.9 2.42 120     57.1     66.7
 Inseminated  0.821  23.5   65.3 1.78 120     61.7     68.8
 Virgin       0.821  23.5   47.7 1.32 120     45.1     50.3

Confidence level used: 0.95 

$contrasts
 contrast                                                                 
 Control thorax0.82096 sleep23.464 - Inseminated thorax0.82096 sleep23.464
 Control thorax0.82096 sleep23.464 - Virgin thorax0.82096 sleep23.464     
 Inseminated thorax0.82096 sleep23.464 - Virgin thorax0.82096 sleep23.464 
 estimate   SE  df t.ratio p.value
    -3.38 2.97 120  -1.136  0.4940
    14.18 2.70 120   5.257  <.0001
    17.56 2.11 120   8.326  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates

Tables

There are multiple packages available that provide ways to present publication ready model summaries

library(gtsummary)

tbl_regression(fly_gamma)

library(sjPlot)

tab_model(fly_gamma)

Summary table & Figures

Characteristic	Beta	95% CI¹	p-value
type
Control	—	—
Inseminated	3.4	-2.6, 9.0	0.3
Virgin	-14	-20, -9.1	<0.001
thorax	151	130, 171	<0.001
sleep	0.02	-0.08, 0.14	0.7
¹ CI = Confidence Interval

GLMs for binary data

GLMS for binary data

Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”

Presence/Absence of a species
Presence/Absence of a disease
Success/Failure to observe behaviour
Survival/Death of organisms

Wish to determine if $P/A∼ Variable$

Called a logistic regression or logit model

Binary variables

Investigating mayfly abundance in relation to pollution gradient measured in Cumulative Criterion Units (CCU)
Mayflies serve as indicators of stream health due to sensitivity to metal pollution
Higher CCU values indicate increased metal pollution, potentially impacting mayfly presence or absence in the stream

Example: Mayflies

Fitting a linear regression vs a logit-link regression:

Bernoulli Distribution

The Bernoulli distribution models the probability of success or failure in a single trial of a binary experiment:

where success occurs with probability $p$
and failure with probability ${1-p}$

Probability distribution

$logit(p) = \log \frac{p}{1 - p}$

$y_i = Bernoulli(p) = \frac{p}{1 - p}$

Mean of distribution Probability (p) of observing an outcome

Variance of observed responses As observed responses approach 0 or 1, the variance of the distribution decreases

The link function

If $μ = xβ$ is only true for normally distributed data, then, if this is not the case, we must use a transformation on the expected values:

\[g(μ) = xβ\]

where $g(μ)$ is the link function.

This allows us to relax the normality assumption.

The logit link

For binary data, the link function is called the logit:

\[ logit(p) = \log \frac{p}{1 - p} \]

The log of the odds $(p / (1 - p))$

The logit link

\[ logit(p) = \log \frac{p}{1 - p} \]

The odds put our expected values on a 0 to +Inf scale.
The log transformation puts our expected values on a -Inf to +Inf scale.
Now, the expected values can be linearly related to the linear predictor.

Probability, odds, logit-odds

The odds put our expected values on a 0 to +Inf scale.
The log transformation puts our expected values on a -Inf to +Inf scale.
Now, the expected values can be linearly related to the linear predictor.

Example

Can you recreate the binomial glm for mayfly~ccu data?

15:00

mayfly_glm <- glm(Occupancy ~ CCU, family = binomial(link = "logit"), data = Mayflies)

summary(mayfly_glm)


Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

Interpret the model

Each coefficient corresponds to a predictor variable, indicating the change in the log odds of the response variable associated with a one-unit change in that predictor, holding other predictors constant.

Interpretation of coefficients involves exponentiating them to obtain odds ratios.
An odds ratio greater than 1 implies higher odds of the event occurring with an increase in the predictor.
An odds ratio less than 1 implies lower odds of the event occurring with an increase in the predictor.

Intepret the model

Probability	Odds	Log Odds	Verbiage
$p=.95$	$\frac{95}{5} = \frac{19}{1} = 19$	2.94	19 to 1 odds for
$p=.75$	$\frac{75}{25} = \frac{3}{1} = 3$	1.09	3 to 1 odds for
$p=.50$	$\frac{50}{50} = \frac{1}{1} = 1$	0	1 to 1 odds
$p=.25$	$\frac{25}{75} = \frac{1}{3} = 0.33$	-1.11	3 to 1 odds against
$p=.05$	$\frac{95}{5} = \frac{1}{19} = 0.0526$	-2.94	19 to 1 odds against

Logit odds

When we use a binomial model, we produce the ‘log-odds’, this produces a fully unconstrained linear regression as anything less than 1, can now occupy an infinite negative value -∞ to ∞.

\[\log\left(\frac{p}{1-p}\right) = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}\]

\[\frac{p}{1-p} = e^{\beta_{0}}e^{\beta_{1}x_{1}}e^{\beta_{2}x_{2}}\]

We can interpret $\beta_{1}$ and $\beta_{2}$ as the increase in the log odds for every unit increase in $x_{1}$ and $x_{2}$.
We could alternatively interpret $\beta_{1}$ and $\beta_{2}$ using the notion that a one unit change in $x_{1}$ as a percent change of $e^{\beta_{1}}$ in the odds.

For the intercept:

The estimate as log odds is 5.102.
Therefore the odds are 164
Over 99% probability of observing mayfly when CCU = 0


Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

For the predictor variable CCU:

The estimate is -3.051.
This means that for every one unit increase in CCU, the log odds of the response variable decreases by 3.051, holding all other predictors constant.


Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

Now, let’s interpret the coefficient for CCU using the odds ratio interpretation:

The odds ratio associated with CCU is calculated as $e^{\beta_{\text{CCU}}} = e^{-3.051}$.
Therefore, $e^{\beta_{\text{CCU}}} \approx 0.048$.
This means that for every one unit increase in CCU, the odds of the response variable decrease by a multiple of 0.048

broom::tidy(mayfly_glm, exponentiate = T)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) 164.          2.37      2.15  0.0313
2 CCU           0.0473      1.21     -2.52  0.0117

Challenge

Work with the odds and probability calculators
Check how comfortable you are with
additive log-odds
multiplicative odds
changing probability
Can you work out the probability of mayfly occupancy at each CCU level?

10:00

Model fit

For a simple Bernoulli/Binary GLM there are only a few checks that apply:

# Calculate and plot "deviance" residuals
dresid <- resid(mayfly_glm, type = "deviance")
hist(dresid)
# Plot leverage
plot(mayfly_glm, which = 5)
# Plot Cook's distance
plot(mayfly_glm, which = 4)

Model fit

Residuals in binary glms will never be normally distributed - so a “residuals vs fitted plot” provides little insight. We can try “binned residuals”:

DHARMa fit

DHARMa works by simulating residuals from the fitted model, comparing them with observed residuals, and providing diagnostic plots to assess model adequacy and identify potential issues like heteroscedasticity or influential data points.

library(DHARMa)
plot(simulateResiduals(mayfly_glm))

Prediction

predict() function for GLMs computes predicted values based on the fitted model.
For existing data, it predicts values for each observation in the original dataset.
For new data, it predicts values for a specified new dataset, allowing for predictions on unseen data.

# Predictions for existing data
predict(mayfly_glm, type = "response")

# Predictions for new data
new_CCU<- data.frame(CCU = c(0,1,2,3,4,5))
predict(mayfly_glm, newdata = new_CCU, type = "response")

Emmeans

The emmeans package in R computes estimated marginal means (EMMs) it can also provides post-hoc comparisons and contrasts for fitted models.

This specific function call below computes EMMs for the predictor variable CCU at specific values (0 to 5) and provides them on the response scale.

CCU	prob	SE	df	asymp.LCL	asymp.UCL
0	0.99	0.014	Inf	0.61	1.0
1.0	0.89	0.13	Inf	0.39	0.99
2.0	0.27	0.15	Inf	0.078	0.62
3.0	0.017	0.026	Inf	0.00082	0.27
4.0	0.00082	0.0022	Inf	0.0000042	0.14
5.0	0.000039	0.00015	Inf	0.000000020	0.071

Prediction plots with emmeans

We can use emmeans package to compute estimated marginal means (EMMs) from a fitted GLM, convert them into a tibble format for visualization with ggplot2, overlay the means with uncertainty ribbons and add them on a scatter plot of the original data.

means <- emmeans::emmeans(mayfly_glm, 
                 specs = ~ CCU, 
                 at=list(CCU=c(0:5)), 
                 type='response') |> 
  as_tibble()

ggplot(Mayflies, aes(x=CCU, y=Occupancy)) + geom_point()+
    geom_ribbon(data = means,
              aes(x = CCU,
                  y = prob,
                  ymin = asymp.LCL,
                  ymax = asymp.UCL),
              alpha = .2)+
  geom_line(data = means,
            aes(x = CCU,
                y = prob)) +  # regression line
  labs(title = "Logit-link Binomial Regression",
       x = "CCU", y = "Occupancy")+
  theme_classic(base_size = 14)

Practice - Bacteria

10:00

Using the ‘bacteria’ dataset, model the presence of H. influenzae as a function of treatment and week of test.

Start with a full model between these two variables (include an interaction term) and use the likelihood ratio test to see if this should be retained

Rows: 220
Columns: 6
$ y    <fct> y, y, y, y, y, y, n, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y,…
$ ap   <fct> p, p, p, p, a, a, a, a, a, a, a, a, a, p, p, p, p, p, p, p, p, p,…
$ hilo <fct> hi, hi, hi, hi, hi, hi, hi, hi, lo, lo, lo, lo, lo, lo, lo, lo, l…
$ week <int> 0, 2, 4, 11, 0, 2, 6, 11, 0, 2, 4, 6, 11, 0, 2, 4, 6, 11, 0, 2, 4…
$ ID   <fct> X01, X01, X01, X01, X02, X02, X02, X02, X03, X03, X03, X03, X03, …
$ trt  <fct> placebo, placebo, placebo, placebo, drug+, drug+, drug+, drug+, d…

Solution - Bacteria

Fit the model and check whether residuals are ok?
Full model:

Single term deletions

Model:
y ~ trt * week
         Df Deviance    AIC     LRT Pr(>Chi)
<none>        203.12 215.12                 
trt:week  2   203.81 211.81 0.68544   0.7098

Solution - Bacteria

We found no significant difference in deviance when the interaction term was dropped from the full model $\chi^2_1$ = 0.68, p = 0.7098. So this was removed from the model and estimates were calculated from an additive model containing treatment and week

Single term deletions

Model:
y ~ trt + week
       Df Deviance    AIC    LRT Pr(>Chi)   
<none>      203.81 211.81                   
trt     2   210.91 214.91 7.1026 0.028688 * 
week    1   210.72 216.72 6.9140 0.008552 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Call:
glm(formula = y ~ trt + week, family = binomial, data = bacteria)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.54629    0.40555   6.279 3.42e-10 ***
trtdrug     -1.10667    0.42519  -2.603  0.00925 ** 
trtdrug+    -0.65166    0.44615  -1.461  0.14412    
week        -0.11577    0.04414  -2.623  0.00872 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 217.38  on 219  degrees of freedom
Residual deviance: 203.81  on 216  degrees of freedom
AIC: 211.81

Number of Fisher Scoring iterations: 4

Challenge

40:00

Using the malaria dataset fit a binomial glm to look at the predictors of malaria in the Seychelles warbler
Check the model fit
Make predictions
Produce a suitable figure

GLMs for Binomial data

Proportion data and GLM

Sometimes, count data aligns more closely with logistic regression methodologies than it initially appears.

We’re not looking at typical count data when measuring the number of occurrences with a known total sample size.

Imagine, for example, we’re studying the prevalence of a native underbrush species across various forest plots. We assess 15 different plots, each meticulously surveyed for the presence of this underbrush, counting the number of square meters where the underbrush is present versus absent within a standard area:

\[\frac{M^2~\text{with native underbrush (successes)}}{\text{Total surveyed area in}~M^2~(\text{trials})}\]

Bound between zero and one

Binomial distribution

$logit(p) = \log \frac{p}{1 - p}$

$y_i = Binomial(n,p)$

It is the collection of Bernoulli trials for the same event
It is represented by the number of Bernoulli trials $n$
It is also the probability of an event in each trial $p$

Binomial: Example

This small dataset is from an experiment looking at mortality in batches of flour exposed to different doses of pesticides.

The question of interest is:

How does pesticide dose affect mortality in the flour beetle Tribolium castaneum

The data

# A tibble: 8 × 4
   Dose Number_tested Number_killed Mortality_rate
  <dbl>         <dbl>         <dbl>          <dbl>
1  49.1            49             6          0.122
2  53.0            60            13          0.217
3  56.9            62            18          0.290
4  60.8            56            28          0.5  
5  64.8            63            52          0.825
6  68.7            59            53          0.898
7  72.6            62            61          0.984
8  76.5            60            60          1

Preparing the data

As well as the numbers killed, we need the numbers that remain alive

beetles <- beetles |> 
  rename("dead" = Number_killed,
         "trials" = Number_tested) |> 
  mutate("alive" = trials-dead)

beetles

# A tibble: 8 × 5
   Dose trials  dead Mortality_rate alive
  <dbl>  <dbl> <dbl>          <dbl> <dbl>
1  49.1     49     6          0.122    43
2  53.0     60    13          0.217    47
3  56.9     62    18          0.290    44
4  60.8     56    28          0.5      28
5  64.8     63    52          0.825    11
6  68.7     59    53          0.898     6
7  72.6     62    61          0.984     1
8  76.5     60    60          1         0

Binomial VS Bernoulli Keypoints!

Bernoulli deals with the outcome of the single trial of the event, whereas Binomial deals with the outcome of the multiple trials of the single event.

Bernoulli is used when the outcome of an event is required for only one time,

whereas the Binomial is used when the outcome of an event is required multiple times.

Binomial glm

The GLM of the binomial counts will analyse the number of each batch of beetles killed, while taking into account the size of each group

#cbind() creates a matrix of successes and failures for each batch

beetle_glm <- glm(cbind(dead, alive) ~ Dose, family = binomial(link = "logit"), data = beetles)

summary(beetle_glm)


Call:
glm(formula = cbind(dead, alive) ~ Dose, family = binomial(link = "logit"), 
    data = beetles)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4

Weights

An alternative method is to analyse the proportion of beetles killed and including the number of trials with the weight argument

beetle_glm_weights <- glm(Mortality_rate ~ Dose, weights = trials, family = binomial(link = "logit"), data = beetles)

summary(beetle_glm_weights)


Call:
glm(formula = Mortality_rate ~ Dose, family = binomial(link = "logit"), 
    data = beetles, weights = trials)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4

Checking Binomial model fit

For a Binomial model standard residual checks do not apply

# Calculate and plot "deviance" residuals

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

pred <- predict(beetle_glm_weights) 
dresid <- resid(beetle_glm_weights, type = "deviance")

# Check for overdispersion
hist(dresid)

# Binned residuals
arm::binnedplot(dresid, pred)

# Plot leverage
plot(beetle_glm_weights, which = 5)

plot(beetle_glm_weights, which = 2)

Performance residual checks

performance::check_model(beetle_glm_weights)

performance::binned_residuals(beetle_glm_weights)

Warning: Probably bad model fit. Only about 67% of the residuals are inside the error bounds.

Checking with DHARMa

DHARMa works by simulating residuals from the fitted model, comparing them with observed residuals, and providing diagnostic plots to assess model adequacy and identify potential issues like heteroscedasticity or influential data points.

library(DHARMa)
plot(simulateResiduals(beetle_glm_weights))

Estimates of significance

Confidence intervals

library(broom)
tidy(beetle_glm, conf.int = T)

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  -14.6      1.30       -11.2 3.00e-29  -17.3     -12.2  
2 Dose           0.246    0.0215      11.4 3.18e-30    0.206     0.290

drop1(beetle_glm, test = "Chi")

Single term deletions

Model:
cbind(dead, alive) ~ Dose
       Df Deviance     AIC    LRT  Pr(>Chi)    
<none>       8.438  38.613                     
Dose    1  267.662 295.837 259.23 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table

This function has automatically applied the exponentiate to produce the odds and odds ratio:

library(sjPlot)

tab_model(beetle_glm)

	cbind(dead, alive)
Predictors	Odds Ratios	CI	p
(Intercept)	0.00	0.00 – 0.00	<0.001
Dose	1.28	1.23 – 1.34	<0.001
Observations	8

Predictions

emmeans::emmeans(beetle_glm, 
                 specs = ~ Dose, 
                 at=list(Dose=c(40, 50, 60, 70, 80)), 
                 type='response')

 Dose    prob      SE  df asymp.LCL asymp.UCL
   40 0.00852 0.00382 Inf   0.00354    0.0204
   50 0.09103 0.02099 Inf   0.05742    0.1414
   60 0.53851 0.03260 Inf   0.47432    0.6014
   70 0.93149 0.01595 Inf   0.89282    0.9569
   80 0.99373 0.00279 Inf   0.98506    0.9974

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale

Predictions plot

Overdispersion

If the residual deviance of the model is > the residual degrees of freedom - we consider our binomial models to have overdispersion.

This means the variance is greater than predicted by the binomial distribution

Overdispersion > 1.5 can be accounted for with the quasi-binomial glm


Call:
glm(formula = Mortality_rate ~ Dose, family = binomial(link = "logit"), 
    data = beetles, weights = trials)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4

Overdispersion

Binomial $P(X = s) = C(n, s) \cdot p^s \cdot (1 - p)^{n - s}$

Quasibinomial $P(X = s) = C(n, s) \cdot p(p+k\theta)^{s-1} \cdot (1 - pk\theta)^{n - s}$

Where:

$n$ is the total number of trials of an event.
$s$ corresponds to the number of times an event should occur.
$p$ is the probability that the event will occur.
$(1 - p)$ is the probability that the event will not occur.
$C$ term represents combinations, calculated as $C(n, s) = \frac{n!}{s!(n - s)!}$, representing the number of ways to choose $s$ successes out of $n$ trials.
$\theta$ term describes additional variance outside of the Binomial distribution

Fitting the quasibinomial model


Call:
glm(formula = cbind(dead, alive) ~ Dose, family = binomial(link = "logit"), 
    data = beetles)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4


Call:
glm(formula = cbind(dead, alive) ~ Dose, family = quasibinomial(link = "logit"), 
    data = beetles)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -14.57806    1.46611  -9.943 5.98e-05 ***
Dose          0.24554    0.02427  10.118 5.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.274895)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Fitting the quasibinomial model

Standard Error, Confidence intervals and p-values WILL change.

Estimates remain the same

Likelihood ratio tests should now be applied with an F-statistics

drop1(beetle_quasiglm, test = "F")

Single term deletions

Model:
cbind(dead, alive) ~ Dose
       Df Deviance F value    Pr(>F)    
<none>       8.438                      
Dose    1  267.662  184.33 9.908e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Challenge: Binomial model

On the afternoon of January 28th, 1986 the space shuttle Challenger experienced a critical failure and broke apart in mid air, resulting in the deaths of all seven crewmembers.

An investigation discovered critical failures in all six of the O-rings in the solid rocket booster.

How did temperature affect the probability of o-ring failure?

What was the probability of all o-rings failing when the temperature was 36 degrees F

40:00

GLMs for count data

Poisson

Count or rate data are ubiquitous in the life sciences (e.g number of parasites per microlitre of blood, number of species counted in a particular area). These type of data are discrete and non-negative.

A useful distribution to model abundance data is the Poisson distribution:

a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution.

GLM

Recall the simple linear regression case (i.e a GLM with a Gaussian error structure and identity link). For the sake of clarity let’s consider a single explanatory variable where $\mu$ is the mean for Y:

$\mu_i = \beta0 + \beta_1x1 + \beta_2x2 + ... +\beta_nxn$

$y_i \sim \mathcal{N}(\mu_i, \sigma^2)$

The mean function is unconstrained, i.e the value of $\beta_0 + \beta_1X$ can range from $-\infty$ to $+\infty$.

Poisson

If we want to model count data we therefore want to constrain this mean to be positive only.

Mathematically we can do this by taking the logarithm of the mean (the log is the default link for the Poisson distribution).
We then assume our count data variance to be Poisson distributed (a discrete, non-negative distribution), to obtain our Poisson regression model (to be consistent with the statistics literature we will rename $\mu$ to $\lambda$):

GLM Poisson

Poisson limitations

The Poisson distribution has the following characteristics:

Discrete variable (integer), defined on the range $0, 1, \dots, \infty$.
A single rate parameter $\lambda$, where $\lambda > 0$.
Mean = $\lambda$
Variance = $\lambda$

Poisson: Case study

In a study by Kilner et al. (1999), the authors studied the begging rate of nestlings in relation to total mass of the brood of reed warbler chicks and cuckoo chicks. The question of interest is:

How does nestling mass affect begging rates between the different species?

This model is inadequate. It is predicting negative begging calls within the range of the observed data, which clearly does not make any sense.

cuckoo_lm <- lm(Beg ~ Mass + Species + Mass:Species, data = cuckoo)

Diagnostics

Let us display the model diagnostics plots for this linear model.

Curvature
Funnelling effect

Biological data and distributions

We can look at the kurtosis and skewness of our univariate variable - Begging:

descdist(cuckoo$Beg, boot = 500, discrete = T)

summary statistics
------
min:  0   max:  55 
median:  9 
mean:  16.13725 
estimated sd:  17.03176 
estimated skewness:  0.8270038 
estimated kurtosis:  2.367674

Univariate distributions

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
fp <- fitdist(cuckoo$Beg, "pois")
fn <- fitdist(cuckoo$Beg, "norm")
fnb<- fitdist(cuckoo$Beg, "nbinom")
plot.legend <- c("normal", "poisson", "negative binomial")
denscomp(list(fn, fp, fnb), legendtext = plot.legend)
qqcomp(list(fn, fp, fnb), legendtext = plot.legend)

Poisson model

We should therefore try a different model structure.

The response variable in this case is a classic count data: discrete and bounded below by zero (i.e we cannot have negative counts). We will therefore try a Poisson model using the canonical log link function for the mean:

λ varies with x (mass) which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!
Predicted values will now be integers instead of fractions
The model will never predict negative values (Poisson is strictly positive)

\[ \log{\lambda} = \beta_0 + \beta_1 M_i + \beta_2 S_i + \beta_3 M_i S_i \]

Poisson model

\[ \log{\lambda} = \beta_0 + \beta_1 M_i + \beta_2 S_i + \beta_3 M_i S_i \]

where $M_i$ is nestling mass and $S_i$ a dummy variable

$S_i = \left\{\begin{array}{ll}1 & \text{if } i \text{ is warbler},\\0 & \text{otherwise}.\end{array}\right.$

The term $M_iS_i$ is an interaction term. Think of this as an additional explanatory variable in our model. Effectively it lets us have different slopes for different species (without an interaction term we assume that both species have the same slope for the relationship between begging rate and mass, and only the intercept differ).

Regression lines

The mean regression lines for the two species look like this:

Cuckoo ($S_i=0$)
$\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 0) + (\beta_3 \times M_i \times 0)\\\log{\lambda} & = \beta_0 + \beta_1 M_i\end{aligned}$
Warbler ($S_i=1$)
$\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 1) + (\beta_3 \times M_i \times 1)\\\log{\lambda} & = \beta_0 + \beta_1 M_i + \beta_2 + \beta_3M_i\\\end{aligned}$

Fit the Poisson model

Fit the model with the interaction term in R:

cuckoo_glm1 <- glm(Beg ~ Mass + Species + Mass:Species, data=cuckoo, family=poisson(link="log"))

summary(cuckoo_glm1)


Call:
glm(formula = Beg ~ Mass + Species + Mass:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.334475   0.227143   1.473  0.14088    
Mass                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler       0.674820   0.259217   2.603  0.00923 ** 
Mass:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Note there appears to be a negative interaction effect for Species:Mass, indicating that Begging calls do not increase with mass as much as you would expect for Warblers as compared to Cuckoos.

Exercise

Fit a Poisson model to the cuckoo data
Look at the residual plots - how have they changed?

15:00

Model summary

summary(cuckoo_glm1)


Call:
glm(formula = Beg ~ Mass + Species + Mass:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.334475   0.227143   1.473  0.14088    
Mass                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler       0.674820   0.259217   2.603  0.00923 ** 
Mass:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Poisson model check

VIF

VIF, or Variance Inflation Factor, measures the extent of multicollinearity among independent variables in regression analysis.

High multicollinearity, indicated by elevated VIF values, can distort regression coefficients and inflate standard errors, undermining the reliability of the model’s predictions

car::vif(cuckoo_glm1)

        Mass      Species Mass:Species 
    4.050091    12.967267    14.496767

check_model(cuckoo_glm1, check = "vif")

Mean-centering

True VIF is highly problematic

Mean centering can remove apparent VIF as means become zero
This often decreases potential correlation

cuckoo$mass.c <- cuckoo$Mass - mean(cuckoo$Mass, na.rm =T)

cuckoo_glm2 <- glm(Beg ~ mass.c + Species + mass.c:Species, data=cuckoo, family=poisson(link="log"))

summary(cuckoo_glm2)


Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.263284   0.091946  24.615  < 2e-16 ***
mass.c                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler         0.234068   0.106710   2.194  0.02827 *  
mass.c:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Exercise

cuckoo$mass.c <- cuckoo$Mass - mean(cuckoo$Mass, na.rm =T)
Mean center the mass variable and recheck VIF. car::vif()

10:00

Model selection

For a term with only one degree of freedom $z^2$ = the chi-square value

The likelihood ratio test allows us to determine whether we can remove an interaction term

# For a fixed  mean-variance model we use a Chisquare distribution
drop1(cuckoo_glm2, test = "Chisq")

Single term deletions

Model:
Beg ~ mass.c + Species + mass.c:Species
               Df Deviance    AIC   LRT Pr(>Chi)   
<none>              436.05 615.83                  
mass.c:Species  1   442.81 620.59 6.766 0.009291 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Offset - for rates

In Poisson GLMs, using an offset allows us to model rates rather than counts, by including the logarithm of the exposure variable as an offset, ensuring that the model accounts for the differing exposure levels across observations.

cuckoo_glm3 <- glm(Call_Total ~ mass.c + Species + mass.c:Species, data=cuckoo, offset = log(Mins), family=poisson(link="log"))

summary(cuckoo_glm3)


Call:
glm(formula = Call_Total ~ mass.c + Species + mass.c:Species, 
    family = poisson(link = "log"), data = cuckoo, offset = log(Mins))

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.249441   0.030831  72.961   <2e-16 ***
mass.c                 0.129261   0.002502  51.671   <2e-16 ***
SpeciesWarbler         0.300202   0.035330   8.497   <2e-16 ***
mass.c:SpeciesWarbler -0.057893   0.002829 -20.462   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9597.9  on 50  degrees of freedom
Residual deviance: 3071.8  on 47  degrees of freedom
AIC: 3334.6

Number of Fisher Scoring iterations: 6

Poisson limitations

So for the Poisson regression case we assume that the mean and variance are the same. Hence, as the mean increases, the variance increases also (heteroscedascity). This may or may not be a sensible assumption so watch out! Just because a Poisson distribution usually fits well for count data, doesn’t mean that a Gaussian distribution can’t always work.

Recall the link function between the predictors and the mean and the rules of logarithms (if $\log{\lambda} = k$(value of predictors), then $\lambda = e^k$):

Overdispersion

When the residual deviance is higher than the residual degrees of freedom, we say that the model is overdispersed. This situation is mathematically represented as:

\[ \phi = \frac{\text{Residual deviance}}{\text{Residual degrees of freedom}} \]

Overdispersion occurs when the variance in the data is even higher than the mean, indicating that the Poisson distribution might not be the best choice. This can be due to many reasons, such as the presence of many zeros, many very high values, or missing covariates in the model.

Overdispersion and what to do about it

Overdispersion statistic values > 1
Causes of over-dispersion	What to do about it
Model mis-specification (missing covariates or interactions)	Add more covariates or interaction terms
Too many zeros (“zero inflation”)	Use a zero-inflated model
Non-independence of observations	Use a generalised mixed model with random effects to take non-independence into account
Variance is larger than the mean	Use a quasipoisson GLM if overdispersion = 2-15. Use a negative binomial GLM if > 15-20

Choosing a model

Overdispersion and what to do about it

It may be true that you are initially unsure whether overdispersion is coming from zero-inflation, larger variance than the mean or both.

You can fit models with the same dependent and independent variables and different error structures and use AIC to help determine the best model for your data

Overdispersion in the cuckoo model

$\phi = \frac{\text{Residual deviance}}{\text{Residual degrees of freedom}}$ = $\frac{436}{47} = 9.2$

summary(cuckoo_glm2)


Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.263284   0.091946  24.615  < 2e-16 ***
mass.c                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler         0.234068   0.106710   2.194  0.02827 *  
mass.c:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Quasi-Poisson GLM

The systematic part and the link function remain the same

\[ \log{\lambda} = \beta_0 + \beta_1 x_i \]

phi$\phi$ is a dispersion parameter, it will not affect estimates but will affect significance. Standard errors are multiplied by $\sqrt\phi$

\[ Y_i = Poisson(\phi \lambda_i) \]

Where:

$Yi$ is the response variable.
$\phi$ is the dispersion parameter, which adjusts the variance of the distribution

Confidence intervals and p-values WILL change.

Exercise

10:00

Fit a quasipoisson model to the cuckoo data

cuckoo_glmq <- glm(Beg ~ mass.c + Species + mass.c:Species, data=cuckoo, family=quasipoisson(link="log"))

Look at the residual plots - how have they changed?
Look at the SE and p-values - how have they changed
Calculate new 95% confidence intervals

Compare models

summary(cuckoo_glm2)
summary(cuckoo_glmq)


Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.263284   0.091946  24.615  < 2e-16 ***
mass.c                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler         0.234068   0.106710   2.194  0.02827 *  
mass.c:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6


Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = quasipoisson(link = "log"), 
    data = cuckoo)

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.26328    0.25554   8.857 1.38e-11 ***
mass.c                 0.09485    0.02018   4.700 2.30e-05 ***
SpeciesWarbler         0.23407    0.29657   0.789    0.434    
mass.c:SpeciesWarbler -0.02167    0.02332  -0.930    0.357    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 7.7242)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

F distributions

The presence of overdispersion suggested the use of the F-test for nested models. We will test if the interaction term can be dropped from the model.

Though here for a term with only one degree of freedom $t^2$ = the F distribution with denominator degree of freedom

drop1(cuckoo_glmq, test = "F")

Single term deletions

Model:
Beg ~ mass.c + Species + mass.c:Species
               Df Deviance F value Pr(>F)
<none>              436.05               
mass.c:Species  1   442.81  0.7293 0.3974

Challenge: Monarch butterflies

This dataset contains standardised counts of Monarch butterflies across three different gardens.

Is there a difference in observance of the butterflies between these three sites?

Check the poisson model is a good fit for this data

Get the monarchs.csv dataset

15:00

Negative Binomial

Negative Binomial

Negative binomial GLMs are favored when overdispersion is high.

It has two parameters, $μ$ and $\theta$. $\theta$ controls for the dispersion parameter (smaller $\theta$ indicates higher dispersion). It corresponds to a combination of two distributions (Poisson and gamma).

It assumes that the $Y_i$ are Poisson distributed with the mean $μ_i$ assumed to follow a gamma distribution:

\[ E(Y_i) = μ_i \\ \text{Var}(Y_i) = μ_i + μ_i^2/k \]

Fitting a negative binomial in R

Negative binomial is not in the glm() function. We need the MASS package:

Suitable for strongly over-dispersed zero-bounded integer data.

library(MASS)

model <- glm.nb(y ~ x1 + x2, link = "log", data = dataframe)

Negbin: Example

10:00

Open the parasite dataset and script

Determine whether the poisson model is suitable for this dataset?

The example is an experiment measuring the effect of the parasitic tapeworm Schistocephalus solidus infection on the susceptibility of infection from a second parasite, the trematode Diplostomum pseudospathaceum

The question of interest is:

How does Treatment/Infection status affect trematode counts?

Univariate distribution

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
fp <- fitdist(parasite$Diplo_intensity, "pois")
fn <- fitdist(parasite$Diplo_intensity, "norm")
fnb<- fitdist(parasite$Diplo_intensity, "nbinom")
plot.legend <- c("normal", "poisson", "negative binomial")
denscomp(list(fn, fp, fnb), legendtext = plot.legend)
qqcomp(list(fn, fp, fnb), legendtext = plot.legend)

Check Poisson model

summary(stick_poisson)


Call:
glm(formula = Diplo_intensity ~ Treatment, family = "poisson", 
    data = parasite)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.82141    0.04740  38.423  < 2e-16 ***
TreatmentInfected HG  0.53969    0.05879   9.180  < 2e-16 ***
TreatmentInfected LG -0.19726    0.09771  -2.019 0.043492 *  
TreatmentUninfected  -0.31733    0.08543  -3.715 0.000203 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 887.73  on 220  degrees of freedom
Residual deviance: 697.25  on 217  degrees of freedom
AIC: 1481.5

Number of Fisher Scoring iterations: 5

Check zero-inflation

The model is underfitting zeros (but not by much).

check_zeroinflation(stick_poisson)

# Check for zero-inflation

   Observed zeros: 6
  Predicted zeros: 1
            Ratio: 0.17

100*sum(parasite$Diplo_intensity == 0)/nrow(parasite)

[1] 2.714932

# Only 2% of our data has zeros

Compare models

# Quasilikelihood
stick_quasi <- glm(Diplo_intensity ~ Treatment, family = quasipoisson, data = parasite)

# Neg bin
stick_nb <- glm.nb(Diplo_intensity ~ Treatment, link = "log", data = parasite)

# Zero-inflated model
stick_zip <- zeroinfl(Diplo_intensity ~ Treatment| 
                   Treatment,
                dist = "poisson",
                link = "logit",
                data = parasite)

Compare models

AIC(stick_poisson)

[1] 1481.456

AIC(stick_quasi)

[1] NA

AIC(stick_nb)

[1] 1259.893

AIC(stick_zip)

[1] 1466.362

Validating Negative Binomial models

library(DHARMa)
plot(simulateResiduals(stick_nb))
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

plot(stick_nb)

Comparison of model predictions

Luckily in this instance we can be assured that each of these models makes very similar predictions:

Zero-inflation

“Zero-inflation” refers to the problem of “too many zeros”.

The dependent variable contains more zeros than would be expected under the standard statistical distribution for zero-bounded count data (Poisson or negative binomial).

Zero-inflation is a common feature of count datasets. Large numbers of zeros can arise because

There was nothing there to count (true zeros)
Things were there but missed (false zeros)

Zero-inflation

Model	Acronym/Abbr	Source of Overdispersion
Zero-inflated Poisson	ZIP	Zeros
Negative Binomial	NegBin	Positive integers
Zero-inflated Negative Binomial	ZINB	Zeros + Positive integers

ZIP in R

In a ZIP model we define two halves of the model; the first half models the count process (i.e. a Poisson GLM explaining non-zero integers) and the second half models the binomial (i.e. explains the zeros).

install.packages("pscl")

library(pscl)
zip1 <- zeroinf(y ~ x1 + x2 | x1 + x2,
                dist = "poisson",
                link = "logit",
                data = dataframe)

Alternative model design

You can change the zero-inflation model’s link function and use AIC to decide on the best model

# zeros modelled with a constant
y ~ x1 + x2

# zeros modelled with the same variables
y ~ x1 + x2 | x1 + x2

# zeros modelled with different variables
y ~ x1 + x2 | z1 +z2

Checking for zero-inflation

hist(cuckoo$Beg)

100*sum(cuckoo$Beg== 0)/nrow(cuckoo)

[1] 25.4902

# 25% of data is zeros

performance::check_zeroinflation(cuckoo_glm2)

# Check for zero-inflation

   Observed zeros: 13
  Predicted zeros: 0
            Ratio: 0.00

Fit the model

cuckoo_zip <- zeroinfl(Beg ~ mass.c + Species + mass.c:Species| 
                   mass.c + Species + mass.c:Species,
                dist = "poisson",
                link = "logit",
                data = cuckoo)

summary(cuckoo_zip)


Call:
zeroinfl(formula = Beg ~ mass.c + Species + mass.c:Species | mass.c + 
    Species + mass.c:Species, data = cuckoo, dist = "poisson", link = "logit")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-3.2611 -0.9164 -0.1979  0.8878  3.7917 

Count model coefficients (poisson with log link):
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            3.395981   0.104443  32.515  < 2e-16 ***
mass.c                 0.027481   0.008805   3.121 0.001802 ** 
SpeciesWarbler        -0.800911   0.117996  -6.788 1.14e-11 ***
mass.c:SpeciesWarbler  0.038046   0.009819   3.875 0.000107 ***

Zero-inflation model coefficients (binomial with logit link):
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)             0.8893     0.7431   1.197   0.2314  
mass.c                 -0.2154     0.1078  -1.998   0.0457 *
SpeciesWarbler         -8.6156     4.1849  -2.059   0.0395 *
mass.c:SpeciesWarbler  -0.3357     0.3483  -0.964   0.3352  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 18 
Log-likelihood: -155.6 on 8 Df

Check for zero-inflated negbin

cuckoo_zinb <- zeroinfl(Beg ~ mass.c + Species + mass.c:Species| 
                   mass.c + Species + mass.c:Species,
                dist = "negbin",
                link = "logit",
                data = cuckoo)

summary(cuckoo_zinb)


Call:
zeroinfl(formula = Beg ~ mass.c + Species + mass.c:Species | mass.c + 
    Species + mass.c:Species, data = cuckoo, dist = "negbin", link = "logit")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.9866 -0.5772 -0.1885  0.6021  2.2198 

Count model coefficients (negbin with log link):
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            3.38815    0.22489  15.066  < 2e-16 ***
mass.c                 0.02831    0.02051   1.381 0.167410    
SpeciesWarbler        -0.79777    0.23905  -3.337 0.000846 ***
mass.c:SpeciesWarbler  0.03786    0.02175   1.740 0.081793 .  
Log(theta)             2.27765    0.43523   5.233 1.67e-07 ***

Zero-inflation model coefficients (binomial with logit link):
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)             0.8893     0.7431   1.197   0.2314  
mass.c                 -0.2154     0.1078  -1.998   0.0457 *
SpeciesWarbler         -8.6653     4.2549  -2.037   0.0417 *
mass.c:SpeciesWarbler  -0.3374     0.3530  -0.956   0.3391  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 9.7537 
Number of iterations in BFGS optimization: 33 
Log-likelihood: -143.9 on 9 Df

Model validation

Model validation is not straightforward for zero-inflated models

Neither the performance package or DHARMa will produce outputs at this time.

We have to create manual checks of the residuals:

sresid <- residuals(cuckoo_zip, type = "pearson")

pred <- predict(cuckoo_zip)

sresidnb <- residuals(cuckoo_zinb, type = "pearson")

prednb <- predict(cuckoo_zinb)

Zero-inflated poisson

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

hist(sresid)
plot(sresid ~ pred)
qqnorm(sresid)
qqline(sresid, col = "red")

Zero-inflated negative binomial

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

hist(sresidnb)
plot(sresidnb ~ prednb)
qqnorm(sresidnb)
qqline(sresidnb, col = "red")

Model comparisons

How do we know which of is the best model of our data?

Do I really NEED zero-inflation? Check against negative binomial

Here I am comparing

AIC(cuckoo_glm1, 
    cuckoo_nb, 
    cuckoo_zip, 
    cuckoo_zinb)

            df      AIC
cuckoo_glm1  4 615.8282
cuckoo_nb    5 357.2955
cuckoo_zip   8 327.2635
cuckoo_zinb  9 305.8751

Choosing Zero-inflation models

Check for overdispersion
Is your model underfitting zeros?
AND do zeros make up a substantial part of your data?
Do you think there might be a different process controlling 0/1 from count?
You probably want to compare zip and zinb for best fit.

Challenge - Zeroinflation

20:00

Using the revegetation dataset
Check the model fit
Determine if zero-inflation is required
Validate the model

Decision making helper

What type of variable is the outcome?
Fit models: check residuals and AICs

physalia logo

Simulate

I think simulating data is a fantastic idea

You will get better insights into your models and various error distributions
You can check the stability of your models
You can run proper power analyses before you start your data collection

This is Part One of an excellent series of blog posts on simulating data

Wrap-up

Where next?

(Generalised) Linear Mixed Models

Generalised Additive Models

Additional resources

GLMs resource list:
Discovering Statistics - Andy Field
An Introduction to Generalized Linear Models - Dobson & Barnett
An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani
Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.
Ecological Statistics with contemporary theory and application
The Big Book of R (https://www.bigbookofr.com/)

philip-leftwich

@ecoevo.social@PhilipLeftwich

PhilipLeftwich

philip.leftwich.github.io

Probability	Odds	Log Odds	Verbiage
\(p=.95\)	\(\frac{95}{5} = \frac{19}{1} = 19\)	2.94	19 to 1 odds for
\(p=.75\)	\(\frac{75}{25} = \frac{3}{1} = 3\)	1.09	3 to 1 odds for
\(p=.50\)	\(\frac{50}{50} = \frac{1}{1} = 1\)	0	1 to 1 odds
\(p=.25\)	\(\frac{25}{75} = \frac{1}{3} = 0.33\)	-1.11	3 to 1 odds against
\(p=.05\)	\(\frac{95}{5} = \frac{1}{19} = 0.0526\)	-2.94	19 to 1 odds against