25 Power analysis

The statistical power of a hypothesis test is the probability of detecting an effect, if there is a true effect present to detect.

Power can be calculated and reported for a completed experiment to comment on the confidence one might have in the conclusions drawn from the results of the study. It can also be used as a tool to estimate the number of observations or sample size required in order to detect an effect in an experiment.

In this tutorial, you will discover the importance of the statistical power of a hypothesis test and how to calculate power analyses and power curves as part of experimental design.

After completing this tutorial, you will know:

  • Statistical power is the probability that a hypothesis test will find an effect if there is an effect to be found.

  • A power analysis can be used to estimate the minimum sample size required for an experiment, given a desired significance level, effect size, and statistical power.

  • How to calculate and plot a power analysis for the Student’s t-test in R in order to effectively design an experiment.

25.1 Why do we need statistical power?

When interpreting statistical power, we seek experimental setups that have high statistical power.

  • Low Statistical Power: Large risk of committing Type II errors, e.g. a false negative.

  • High Statistical Power: Small risk of committing Type II errors.

It is common to design experiments with a statistical power of 80% or better, e.g. 0.80. This means a 20% probability of encountering a Type II error (if the alternative hypothesis is true).

25.2 What is statistical power?

Statistical power is one piece in a puzzle that has four related parts; they are:

  • Effect Size. The quantified magnitude of a result present in the population. Effect size is calculated using a specific statistical measure, such as Pearson’s correlation coefficient for the relationship between variables or Cohen’s d for the difference between groups.

  • Sample Size. The number of observations in the sample.

  • Significance. The significance level used in the statistical test, e.g. \(\alpha\). Often set to 5% or 0.05.

  • Statistical Power. The probability of rejecting the null hypothesis when the alternative hypothesis is true, i.e. \(1-\beta\).
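
These four pieces are linked: fix any three of them and the fourth is determined. As a quick preview of the tool we will use later in this chapter, the pwr.t.test() function from the pwr package (assuming it is installed) solves for whichever of these quantities you leave out:

library(pwr)

# Supply effect size, significance level and power; it returns the sample size needed
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)

# Supply sample size, effect size and significance level; it returns the power
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05)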

25.3 Standardised effect size

25.3.1 Cohen's D

Cohen's d is a standardized effect size for measuring the difference between two group means, so it is the effect size we use when carrying out t-tests.

Table 25.1: Cohen's D

Effect size    d
very small     0.01
small          0.20
medium         0.50
large          0.80
very large     1.20
huge           2.00

For the unpaired t-test, see if you can use the following equation to manually calculate d.

We will be using the models from Chapter 14.

lm(height ~ type, data = darwin)

Unpaired t-test - the difference in means divided by the pooled standard deviation:

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{SD_{pooled}}, \qquad SD_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}} \]

Let's try and work this out by hand first, then we will investigate functions that can make this process faster.

Task
Use data wrangling to summarise the means, standard deviations and sample size of each cross.
summary_darwin <- darwin %>% 
  group_by(type) %>% 
  summarise(mean = mean(height),
            sd = sd(height),
            n = n())
summary_darwin
type mean sd n
Cross 20.19167 3.616945 15
Self 17.57500 2.051676 15
Task

Can you calculate the mean difference and the pooled standard deviation from this?

You need to turn standard deviation into variance, and follow the formula above
summary_darwin %>% 
  mutate(variance = sd^2) %>% 
  mutate(per_sample_var = variance * (n-1)) %>% 
  summarise(sd_pooled = sqrt(sum(per_sample_var)/(sum(n)-2)),
            mean_diff = diff(mean))
sd_pooled mean_diff
 2.940380 -2.616667
Task
What do you calculate as Cohen's D for this analysis?
2.62/2.94
## [1] 0.8911565

25.3.2 Cohen's D from a linear model

Most of the time, we do not have to calculate these values by hand. Instead, there are a range of functions that will allow us to do this:

library(effectsize)

# simple t-test: t value and df taken from the summary of lsmodel1
lsmodel1 <- lm(height ~ type, data = darwin)

t_to_d(2.437, df_error = 28, paired = FALSE)

# paired t-test: t value and df taken from the summary of lsmodel2
lsmodel2 <- lm(height ~ type + factor(pair), data = darwin)

t_to_d(2.456, df_error = 27, paired = TRUE)

Unpaired t-test:
d           CI     CI_low      CI_high
0.9210994   0.95   0.1349189   1.692417

Paired t-test:
d           CI     CI_low      CI_high
0.4726574   0.95   0.0712072   0.8662385
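
If you have the raw data to hand, the effectsize package can also calculate Cohen's d directly from the data rather than converting from a t statistic. A minimal sketch for the unpaired comparison (assuming the darwin data frame from Chapter 14 is loaded):

# Cohen's d straight from the data: difference in group means over the pooled SD
cohens_d(height ~ type, data = darwin, pooled_sd = TRUE)

The result should agree closely with the value we calculated by hand above.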

25.4 Power

You don't necessarily need to report effect sizes in a write-up; it is fairly rare to see them included in a paper, and it will usually be more useful to report confidence intervals on the original measurement scale. But the effect size does provide two important pieces of information.

  1. All experiments/statistical analyses will become statistically significant if you make the sample size large enough. In this respect it shows how misleading a significant result can be. It is not that interesting if a result is statistically significant, but the effect size is tiny. Even without reporting d you can start thinking about whether confidence intervals indicate your result is interesting.

  2. Type 2 errors. Statistical tests provide you with the probability of making a Type 1 error (incorrectly rejecting the null hypothesis) in the form of P. But what about Type 2 errors: retaining the null hypothesis when we should be rejecting it, i.e. failing to find an effect that is really there?

The probability of making a Type 2 error is known as \(\beta\), and statistical 'power' is \(1-\beta\): the probability of detecting an effect when there really is one. Working out statistical power is very straightforward for simple tests, and then becomes rapidly more difficult as the complexity of your analysis increases... but it is an important concept to understand.

There are two potential uses for power analysis:

  1. Working out what the statistical power of your analysis was post hoc, to determine how likely it was you missed an effect (but see this useful paper).

  2. Working out what sample size you need before an experiment to make sure you reach the desired power. A common target for power (\(1-\beta\)) is 0.8, in much the same way that a common \(\alpha\) is 0.05; it is an arbitrary target, but it means we are willing to tolerate a 20% risk of failing to reject the null hypothesis when we should have (a Type 2 error).

Task

For n, d, sig.level and power - put in values for three of the arguments and it will return the value of the fourth, e.g. can you work out the power of our two-sample t-test above?

pwr.t.test(n = NULL, d = NULL, sig.level = 0.05, power = NULL, type = c("two.sample"), alternative = c("two.sided"))

n = number of observations per sample

d = effect size

sig.level = the agreed alpha (significance level)

power = desired power of the test (\(1-\beta\))
library(pwr)
# Calculate the power of our model 
test.power <- pwr.t.test(n = 15, d = 0.92, sig.level = 0.05)

test.power

# Calculate the sample size we need to get power of 0.8 for a medium effect size (d = 0.5)

samp.size <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)

samp.size

From this we can see that even though our test did find a significant difference in heights between groups, the power of the test is not optimal.

We can also see that if we wanted to detect smaller effects of inbreeding reliably, we need a much larger sample size.
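
pwr.t.test() returns its results as a power.htest object, so if you only want the recommended sample size you can extract it and round up to a whole number of observations; a small sketch using the samp.size object created above:

samp.size$n          # required n per group, returned as a fraction
ceiling(samp.size$n) # round up to whole observations per group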

Now you know how the pwr.t.test() function works - can you make a simple iteration to check the power for a lower effect size (d) of 0.2 for a two-sided t-test, at sample sizes from 10 to 1000, increasing by 10 each time?

sample_size <- seq(10, 1000, by = 10)

# an empty list to store the result for each sample size
output <- vector("list", length(sample_size))


for (i in 1:length(sample_size)) { 
  
  sample <- pwr.t.test(n=sample_size[i], d=0.2, sig.level=0.05)
  output[[i]] <- sample
  
 #     if(i %% 1==0){    # The %% operator is the remainder, this handy if line prints a number every time it completes a loop
 #   print(i)
  #  }
}

sample_list <- as.list(sample_size)

names(output) <- sample_size

#  now you should be able to call any sample size and check statistical power!

# output$`30`
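
Because each element of output stores a full pwr.t.test() result, we can also pull out the power at every sample size and draw a power curve. Here is a sketch of one way to do this, assuming the tidyverse is loaded (map_dbl() comes from purrr, which is part of the tidyverse):

# Extract the power from each stored result and pair it with its sample size
power_curve <- tibble(
  n = sample_size,
  power = map_dbl(output, ~ .x$power)
)

# Plot power against sample size, marking the conventional 0.8 target
power_curve %>% 
  ggplot(aes(x = n, y = power)) +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = "dashed") +
  labs(x = "Sample size per group",
       y = "Power")

For a small effect size of d = 0.2, the curve should cross the 0.8 line at roughly 400 observations per group.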