25  Generalised Linear Models

25.1 Motivation

In the previous workshop we saw that linear models are a powerful modelling tool. However, to use them we have to satisfy the following assumptions:

  1. A linear relationship between predictors and the mean response value.
  2. Equal variance across all predicted values of the response (homoscedasticity).
  3. Normally distributed errors.
  4. Samples collected at random.
  5. No important omitted variables.

If assumptions 1-3 are violated we can often transform our response variable to try and fix this (e.g. with a Box-Cox transformation).

However, in many other cases this is either not possible (e.g. binary outcomes) or we want to explicitly model the underlying distribution (e.g. count data). Instead, we can use Generalised Linear Models (GLMs), which let us change the error structure (assumption 3) to something other than a normal distribution.

25.2 Generalised Linear Models (GLMs)

Generalised Linear Models (GLMs) have:

  1. A linear predictor.

  2. An error/variance structure.

  3. A link function (like an ‘internal’ transformation).

The first (1) should be familiar: it's everything that comes after the ~ in a linear model formula, or as an equation \(\beta_0 + \beta_1 X\). The second (2) should also be familiar: the variance describes the error structure of the model \(\epsilon\). An ordinary least squares model uses the normal distribution, but GLMs can use a wider range of distributions, including Poisson, binomial and Gamma. The third component (3) is less familiar: the link function is the equivalent of a transformation in an ordinary least squares model. However, rather than transforming the data, we transform the predictions made by the linear predictor. Common link functions are log and square root.
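As a small sketch of how the link function works (using simulated count data, not part of the workshop dataset), a Poisson GLM with a log link makes predictions on the log scale, which can then be back-transformed to the scale of the data:

```r
# Sketch: the log link in a Poisson GLM (simulated data, illustrative only)
set.seed(42)
x <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))  # true mean is log-linear in x

mod <- glm(y ~ x, family = poisson(link = "log"))

# predict() returns values on the link (log) scale by default...
eta <- predict(mod, type = "link")
# ...and type = "response" back-transforms to the scale of the data
mu  <- predict(mod, type = "response")

all.equal(exp(eta), mu)  # the link function connects the two scales
```

So the data themselves are never transformed; only the model's predictions pass through the link.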

Note

Maximum Likelihood - Generalised Linear Models fit a regression line by finding the parameter values that are most likely to have produced the data. This is very similar to the way that ordinary least squares finds the line of best fit by minimising the sum of squared errors. In fact, for data with normally distributed residuals, maximum likelihood and least squares give identical estimates.

However, the normal (Gaussian) distribution will not be a good model for many other types of data; binary data is a good example, and one we will investigate in this workshop.

Maximum likelihood provides a more generalised approach to model fitting that includes, but is broader than, least squares.

An advantage of the least squares method we have been using is that we can derive exact equations for the line of best fit. In contrast, the calculations for GLMs (which are beyond the scope of this course) are iterative approximations: essentially, multiple candidate best-fit lines are generated and compared against each other.

You will see two main differences in a GLM output:

If the model is one where the mean and variance are estimated separately (e.g. the normal distribution), uncertainty estimates use the t distribution, and when we compare complex to simplified models (using anova() or drop1()) we use the F-test.

However, for distributions where the mean and variance are expected to change together (e.g. Poisson and binomial), uncertainty estimates use the z distribution, and we compare models with the chi-square distribution.

The simple linear regression model we have used so far is a special case of a GLM:

lm(y ~ x)

is equivalent to

glm(y ~ x, family = gaussian(link = "identity"))

Compared to lm(), the glm() function takes an additional argument called family, which specifies the error structure and link function.

The default link function for the normal (Gaussian) distribution is the identity, where no transformation is needed, i.e. for mean \(\mu\) we have:

\[ \mu = \beta_0 + \beta_1 X \]

Defaults are usually good choices (listed first for each family below):

Family             Link functions (default first)
gaussian           identity
binomial           logit, probit or cloglog
poisson            log, identity or sqrt
Gamma              inverse, identity or log
inverse.gaussian   1/mu^2
quasibinomial      logit
quasipoisson       log
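To override a default, pass the link name to the family function. As a sketch (with simulated binary data, not from the workshop):

```r
# Sketch: default vs. non-default link for the binomial family
set.seed(7)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.3 + 0.8 * x))  # binary response

m_logit  <- glm(y ~ x, family = binomial)                   # default logit link
m_probit <- glm(y ~ x, family = binomial(link = "probit"))  # same family, different link
```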

Fit a GLM

Using the fruitfly data introduced last week, fit a linear model with longevity (lifespan) as the response variable and sleep, type and thorax as explanatory variables.

Compare this to a GLM fitted with a Gaussian error distribution and identity link for the mean.

Solution

flyls <- lm(longevity ~ type + thorax + sleep, data = fruitfly)
summary(flyls)

Call:
lm(formula = longevity ~ type + thorax + sleep, data = fruitfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.153  -6.836  -2.191   7.196  29.046 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -56.04502   11.17882  -5.013 1.87e-06 ***
typeInseminated   3.62796    2.77122   1.309    0.193    
typeVirgin      -13.24603    2.76198  -4.796 4.70e-06 ***
thorax          144.43008   13.11616  11.012  < 2e-16 ***
sleep            -0.05281    0.06383  -0.827    0.410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.23 on 120 degrees of freedom
Multiple R-squared:  0.6046,    Adjusted R-squared:  0.5914 
F-statistic: 45.88 on 4 and 120 DF,  p-value: < 2.2e-16
flyglm <- glm(longevity ~ type + thorax + sleep, 
             family = gaussian(link = "identity"),
             data = fruitfly)
summary(flyglm)

Call:
glm(formula = longevity ~ type + thorax + sleep, family = gaussian(link = "identity"), 
    data = fruitfly)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -56.04502   11.17882  -5.013 1.87e-06 ***
typeInseminated   3.62796    2.77122   1.309    0.193    
typeVirgin      -13.24603    2.76198  -4.796 4.70e-06 ***
thorax          144.43008   13.11616  11.012  < 2e-16 ***
sleep            -0.05281    0.06383  -0.827    0.410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 126.0381)

    Null deviance: 38253  on 124  degrees of freedom
Residual deviance: 15125  on 120  degrees of freedom
AIC: 966.2

Number of Fisher Scoring iterations: 2

They are exactly the same. This is not surprising: with a Gaussian error structure and identity link, maximum likelihood estimation is equivalent to ordinary least squares.

25.3 Workflow for fitting a GLM

Note

When you transform your data, e.g. with a log or square root, this changes the mean and variance at the same time (everything gets squished down). This might be fine, but it can make model fitting difficult when a transformation chosen to reduce unequal variance also changes (often curves) the relationship between the predictors and the response variable.

GLMs model the mean and the variability independently: the link function transforms the relationship between the predictors and the mean, while the relationship between the mean and the spread of the data points is modelled separately by the error distribution.
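The contrast between the two approaches can be sketched with simulated count data (illustrative only, not the workshop dataset):

```r
# Sketch: transforming the response vs. using a link function
set.seed(99)
x <- runif(100, 0, 3)
y <- rpois(100, lambda = exp(0.3 + 0.9 * x))  # counts with log-linear mean

# Transforming the response changes the mean and variance together
m_trans <- lm(log(y + 1) ~ x)

# A GLM transforms only the fitted mean via the link; the Poisson family
# models the mean-variance relationship separately
m_glm <- glm(y ~ x, family = poisson(link = "log"))
```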