```
Call:
lm(formula = longevity ~ type + thorax + sleep, data = fruitfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.153  -6.836  -2.191   7.196  29.046 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -56.04502   11.17882  -5.013 1.87e-06 ***
typeInseminated   3.62796    2.77122   1.309    0.193    
typeVirgin      -13.24603    2.76198  -4.796 4.70e-06 ***
thorax          144.43008   13.11616  11.012  < 2e-16 ***
sleep            -0.05281    0.06383  -0.827    0.410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.23 on 120 degrees of freedom
Multiple R-squared:  0.6046,  Adjusted R-squared:  0.5914 
F-statistic: 45.88 on 4 and 120 DF,  p-value: < 2.2e-16
```
25 Generalised Linear Models
25.1 Motivation
In the previous workshop we saw that linear models are a powerful modelling tool. However, we have to satisfy the following assumptions:
1. A linear relationship between predictors and the mean response value.
2. Variances are equal across all predicted values of the response (homoscedastic).
3. Errors are normally distributed.
4. Samples collected at random.
5. No omitted variables of importance.
If assumptions 1-3 are violated we can often transform our response variable to try to fix this (e.g. with a Box-Cox transformation).
However, in a lot of other cases this is either not possible (e.g. a binary outcome) or we want to explicitly model the underlying distribution (e.g. count data). Instead, we can use Generalised Linear Models (GLMs), which let us change the error structure (assumption 3) to something other than a normal distribution.
25.2 Generalised Linear Models (GLMs)
Generalised Linear Models (GLMs) have:
1. A linear predictor.
2. An error/variance structure.
3. A link function (like an 'internal' transformation).
The first (1) should be familiar: it's everything that comes after the ~ in a linear model formula, or as an equation \(\beta_0 + \beta_1X\).

The second (2) should also be familiar: the variance describes the error structure of the model, \(\epsilon\). An ordinary least squares model uses the normal distribution, but GLMs are able to use a wider range of distributions, including Poisson, binomial and Gamma.

The third component (3) is less familiar: the link function is the equivalent of a transformation in an ordinary least squares model. However, rather than transforming the data, we transform the predictions made by the linear predictor. Common link functions are the log and square root.
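In R, the error structure (2) and link function (3) travel together as a family object. A quick illustrative look in the console (the `fam` name is just for this sketch):

```r
fam <- poisson()    # Poisson errors with the default log link
fam$family          # "poisson"
fam$link            # "log"
fam$linkfun(10)     # the link transforms the mean scale: log(10) ~ 2.303
fam$linkinv(2.303)  # the inverse link maps back to the response scale: ~ 10
```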
Maximum likelihood: Generalised Linear Models fit a regression line by finding the parameter values that best fit the model to the data. This is very similar to the way that ordinary least squares finds the line of best fit by minimising the sum of squared errors. In fact, for data with normally distributed residuals, maximum likelihood takes the particular form of least squares.
However, the normal (Gaussian) distribution will not be a good model for lots of other types of data; binary data is a good example, and one we will investigate in this workshop.
Maximum likelihood provides a more general approach to model fitting that includes, but is broader than, least squares.
An advantage of the least squares method we have been using is that there is an exact equation for the line of best fit. In contrast, the calculations for GLMs (which are beyond the scope of this course) are iterative and approximate: essentially, multiple candidate best-fit lines are generated and compared against each other.
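As an illustrative sketch (not part of the workshop, with simulated data), we can see what maximum likelihood is doing by minimising a negative log-likelihood directly with optim() and checking that, for normal errors, it recovers the least squares estimates:

```r
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)

# Negative log-likelihood for y ~ Normal(b0 + b1 * x, sigma)
# (sigma is parameterised on the log scale to keep it positive)
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

ml <- optim(c(0, 0, 0), negloglik)
ml$par[1:2]      # ML estimates of intercept and slope...
coef(lm(y ~ x))  # ...agree (to optimiser tolerance) with least squares
```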
You will see two main differences in a GLM output:
- If the model is one where the mean and variance are estimated separately (e.g. for the normal distribution), uncertainty estimates use the t distribution, and when we compare complex to simplified models (using anova() or drop1()) we use the F-test.
- When we use distributions where the mean and variance are expected to change together (e.g. Poisson and binomial), uncertainty estimates use the z distribution, and we compare models with the chi-square distribution.
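A minimal sketch of the two cases, using simulated data (the variable names are placeholders):

```r
set.seed(1)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40)           # continuous response
n <- rpois(40, lambda = exp(1 + x))  # count response

# Dispersion estimated separately (Gaussian): t values in summary(), F tests
drop1(glm(y ~ x, family = gaussian), test = "F")

# Dispersion tied to the mean (Poisson): z values in summary(), chi-square tests
drop1(glm(n ~ x, family = poisson), test = "Chisq")
```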
The simple linear regression model we have used so far is a special case of a GLM:
```r
lm(y ~ x)
```
is equivalent to
```r
glm(y ~ x, family = gaussian(link = identity))
```
Compared to lm(), the glm() function takes an additional argument called family, which specifies the error structure and link function.
The default link function for the normal (Gaussian) distribution is the identity, where no transformation is needed, i.e. for the mean \(\mu\) we have:
\[ \mu = \beta_0 + \beta_1 X \]
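By contrast, a link function such as the log (the Poisson default) transforms the prediction rather than the data:
\[ \log(\mu) = \beta_0 + \beta_1 X \]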
Defaults are usually good choices (shown in bold below):
| Family | Link |
|---|---|
| gaussian | **identity** |
| binomial | **logit**, probit or cloglog |
| poisson | **log**, identity or sqrt |
| Gamma | **inverse**, identity or log |
| inverse.gaussian | **1/mu^2** |
| quasibinomial | **logit** |
| quasipoisson | **log** |
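To use something other than the default, name the link explicitly. A hypothetical example (y and x are placeholders):

```r
# Binomial errors with a probit link instead of the default logit
glm(y ~ x, family = binomial(link = "probit"))
```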
Fit a GLM
Using the fruitfly data introduced last week, fit a linear model with lifespan (longevity) as the response variable and sleep, type and thorax as explanatory variables.
Compare this to a GLM fitted with a Gaussian error distribution and an identity link for the mean.
Solution:
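A sketch of the two fits, assuming the fruitfly data frame from last week is loaded:

```r
# Ordinary least squares fit
lm_fit <- lm(longevity ~ type + thorax + sleep, data = fruitfly)

# The same model as a GLM: normal errors, identity link
glm_fit <- glm(longevity ~ type + thorax + sleep,
               family = gaussian(link = "identity"),
               data = fruitfly)

summary(glm_fit)
```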
```
Call:
glm(formula = longevity ~ type + thorax + sleep, family = gaussian(link = "identity"), 
    data = fruitfly)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -56.04502   11.17882  -5.013 1.87e-06 ***
typeInseminated   3.62796    2.77122   1.309    0.193    
typeVirgin      -13.24603    2.76198  -4.796 4.70e-06 ***
thorax          144.43008   13.11616  11.012  < 2e-16 ***
sleep            -0.05281    0.06383  -0.827    0.410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 126.0381)

    Null deviance: 38253  on 124  degrees of freedom
Residual deviance: 15125  on 120  degrees of freedom
AIC: 966.2

Number of Fisher Scoring iterations: 2
```
The estimates are exactly the same. This is not surprising: for a Gaussian GLM with an identity link, the maximum likelihood fit is the same as an ordinary least squares fit.
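A quick check, using the lm_fit and glm_fit objects from the sketch above:

```r
# The two sets of coefficient estimates agree exactly
all.equal(coef(lm_fit), coef(glm_fit))
```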