The Philosophy of
Statistical Power Analysis

Ian C. McKay, July 1999


*In Praise of Power Analysis*

The arguments for using some kind of power analysis are based on very practical considerations and sometimes ethical considerations too. It is clearly not desirable to invest a lot of time, effort and expense on a scientific study that has no reasonable prospect of yielding any conclusions.

A double-blind clinical trial of a polio vaccine comes to mind. The outcome was measured by comparing the incidence of polio among the vaccinated and control groups. None of the vaccinated volunteers caught polio in the course of the study, but neither did any of the control group. No conclusion could be drawn about the efficacy of the vaccine, and it became evident that a lot of volunteers had been needlessly inconvenienced and possibly put at some risk of side-effects. Particularly damning was the fact that an inconclusive outcome could easily have been predicted from a knowledge of the current incidence of polio, and so the costs and risks could have been avoided.

Another mistake that can be prevented by power analysis is the wasteful collection of more experimental data than are needed. If you have good prospects of being able convincingly to demonstrate the effectiveness of a drug using 100 volunteers, then it is arguably wasteful and unethical to use 200.

The above arguments are clear enough and will probably convince most people. But there are other aspects of power analysis that are much more debatable.

*Snatching numbers out of the air*

At the start of a power analysis we have to make three rather arbitrary decisions. The first is the easiest and most familiar: we have to choose a confidence criterion for rejecting the null hypothesis. In algebraic terms, we choose a value of α, which is often, by custom and practice, set at 0.05. α is the probability of rejecting a null hypothesis, given that it is true. α is the proportion of type I errors that we deem to be tolerable. Much has been said about this, but I mention it here because it and the other arbitrary decisions are interdependent and ought rationally to be considered together.

Secondly, we usually decide what statistical power is required. In other words, how certain do we need to be that our study will yield a statistically significant conclusion, given that there is a true effect waiting to be discovered? In algebraic terms, we choose a value of β, which may for example be 0.2. β is the probability of making a type II error, i.e. failing to reject a particular, specified null hypothesis when it is false. β is the frequency of type II errors that we judge to be tolerable. The power of the test is (1 − β). Why choose β = 0.2? Well, why not? The rationale is often not much better than this. But we know from experience that setting β equal to 0.01, for example, will lead us to collect a very large set of data, and this will be expensive. We also know that if we set β equal to 0.7 then our study will probably yield no firm conclusion at all.

The third and most difficult decision is to specify a rather specific alternative hypothesis. We need to make some kind of estimate or guess about the size of the effect that we are trying to demonstrate in the population. There is something faintly illogical about doing this. If we knew the size of the effect then we would not need to do the experiment anyway.

If the data are to be analysed by a two-sample t-test then the putative size of the effect can be measured by a hypothetical difference, δ, between two means, measured relative to the standard deviation of the measured variable. Sometimes there may be previous studies that indicate the size of the effect in some similar but not identical population, or in some other species. This might provide some rational basis for decision number three. Some workers consider δ to be small, medium or large when its value is about 20, 50 or 80 per cent of the standard deviation.
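For the two-sample case just described, the required group size can be computed from α, β and δ by the usual normal approximation to the t-test. The sketch below is illustrative only; the function name and the default values are my own choices, not part of the original argument.

```python
from statistics import NormalDist

def sample_size_per_group(delta, alpha=0.05, beta=0.20):
    """Approximate observations per group for a two-sided, two-sample
    comparison of means, where delta is the hypothesised difference
    between the means measured in standard deviations."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # criterion for significance
    z_beta = nd.inv_cdf(1 - beta)        # margin needed for power 1 - beta
    return 2 * (z_alpha + z_beta) ** 2 / delta ** 2

n = sample_size_per_group(0.5)  # a half-standard-deviation effect: about 63
```

Notice how quickly the answer grows as δ shrinks: halving the effect size quadruples the number of observations needed.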

Usually the outcome of the power calculation is an estimate
of the number (*n*) of observations
that will be needed to achieve the required power. But sometimes the
calculation is done the other way round. Sometimes the number of observations
is chosen by the researchers or is imposed upon them by the scarcity of
experimental subjects. Then the power is not something chosen by the
researchers: it is the outcome of the calculation. The researchers then decide
whether this power is sufficient to justify the experimental study.

In most experimental studies that I have taken part in, the authors can choose either *n* or β and calculate the other. But it seems to me that the most rational approach, if it were feasible, would be to choose *n*, β, α and δ all at the same time, attempting to balance the conflicting wishes for small *n*, small α, small β and small δ, while allowing for the fact that these four parameters are inter-related. The feasibility of doing this in a formal, objective way is discussed later.

*An argument for 50 per cent power*

There are times when the size of the effect (δ) inserted into the calculation is dictated not by previous evidence but by the fact that a very small effect, even if statistically significant, will not be of much interest. For example, suppose a drug has a very small beneficial effect on a particular clinical condition. Then even if this effect is very convincingly demonstrated, the information will not have any influence on therapeutic practices and may not be of any value to the scientific or medical world.

The results of a study may therefore be negative for two possible reasons. Either the observed effect may be too small to be of interest, or it may be too small to be statistically significant. There are obviously some merits in a policy of ensuring that the sample size is chosen so that the test passes or fails both of these criteria at the same time. It would be wasteful (and tantalising) to finish up with an observed effect that is large enough to be important but not large enough to be statistically significant, for then the whole experimental effort has yielded nothing of much value. But it would also be wasteful to finish up with an observed effect that is highly significant in the statistical sense but too small to be important, for then we shall realise what could easily have been predicted, namely that a smaller study would have been sufficient.

In circumstances where the previous two paragraphs adequately summarise the main considerations, then the usual kind of power analysis, aimed at 70 or 80 or 90 per cent power, is not in my opinion rationally justified. The calculation should be based on 50 per cent power, not just because this makes the calculations particularly simple, but because this will ensure that the two criteria for a positive result will both be satisfied or both unsatisfied, and so wasted effort is minimised.

To be more specific, if the smallest effect that would be worth demonstrating, measured in standard deviations, is δ and the critical *t* value is about 2, then the number of observations needed per group to give 50 per cent power will be about 8/δ², though a few more observations may prudently be added to allow for the occasional experimental subject who will withdraw from the study, or the occasional blood sample that may be lost.
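The 8/δ² rule can be checked numerically. Assuming a two-sided test at α = 0.05 and using the normal approximation in place of the exact t distribution (my simplification, for the sake of a short sketch), a group size of 8/δ² does give close to 50 per cent power:

```python
from statistics import NormalDist

def approx_power(n, delta, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with n
    observations per group and true standardised difference delta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta * (n / 2) ** 0.5       # expectation of the test statistic
    return 1 - nd.cdf(z_crit - shift)    # the far rejection tail is negligible

for delta in (0.25, 0.5, 1.0):
    print(delta, round(approx_power(8 / delta ** 2, delta), 2))  # about 0.52
```

The slight excess over 0.50 comes from rounding the critical value to 2 in the rule of thumb; 2 × 1.96²/δ² ≈ 7.7/δ² would give 50 per cent exactly.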

In precisely what circumstances then is 50 per cent power the rational choice? The commonest occasion when it would be an irrational choice is when there are reasons to expect that the size of the effect in the population is substantially larger than the minimal size that would be of scientific or medical interest, for then the use of a large δ and a mid-range β in the calculation will lead us to a design of experiment that will probably show an effect that is large enough to be important but perhaps not large enough to be statistically significant. There are clear grounds for using 50 per cent power only when we have no information about the size of the effect, and our value of δ is a measure of the minimal size that would be of interest. Even then it may sometimes be prudent to conduct a smaller, pilot study first, and conduct a larger one later only if the smaller one fails to reach statistical significance. The smaller one may not be wasted: it may be possible to test the null hypothesis in the light of both studies considered together, using one of the methods of meta-analysis.

*Towards a rational balance of competing interests*

We want to minimise type I errors, type II errors, the detectable size of effects and the total cost of an experiment. But reducing any one of these four variables can only be achieved at the expense of increasing at least one of the others.

The optimal balance of these competing aims depends on how much we value each of them. Can this be decided rationally? And if it can be decided rationally, can we then devise a calculation, based on our decisions, that will tell us what size of experiment will be optimal?

Inevitably, our decisions about how costly or damaging a type I error is, or a type II error, or how valuable it is to know that the effect of moderate drinking on the incidence of cardiovascular disease is less than 5 per cent, as distinct from knowing that it is less than 10 per cent, are all rather arbitrary decisions. But perhaps, in various ways, they can be made less arbitrary than the three arbitrary decisions described above.

For example, rather than deciding the absolute cost or harm arising from type I and type II errors, which might be outrageously arbitrary, it may suffice to decide on the relative cost of one compared with the other. And likewise the relative value of narrowing a range of uncertainty to a greater or a lesser extent may reasonably be modelled with the help of information theory, even when the absolute value is very difficult to assess objectively. Some simplifying assumptions and some approximations will no doubt be needed, but let us have a go and see how far we get.

Suppose that a fraction *f* of our null hypotheses are false. Then the expected value of an experiment to test a null hypothesis is

*f* × (1 − β) × value of rejecting a false H₀

− (1 − *f*) × α × cost of rejecting a true H₀ (type I error)

+ (1 − *f*) × (1 − α) × value of not rejecting a true H₀

− *f* × β × cost of failing to reject a false H₀ (type II error)

− *n* × experimental expenses per observation.

If we denote these five values or costs more compactly by v₁, v₂, v₃, v₄ and v₅ then the expected value of an experiment may be written

f(1 − β)v₁ − (1 − f)αv₂ + (1 − f)(1 − α)v₃ − fβv₄ − nv₅.
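The five-term expression above can be written directly as a function. The numerical values in the example call are purely hypothetical, chosen only to show the shape of the calculation (all values and costs in the same arbitrary units).

```python
def expected_value(f, alpha, beta, n, v1, v2, v3, v4, v5):
    """Expected value of an experiment testing a null hypothesis, where a
    fraction f of our null hypotheses are false and v1..v5 are the values
    and costs of the five possible contributions."""
    return (f * (1 - beta) * v1            # rejecting a false H0
            - (1 - f) * alpha * v2         # type I error
            + (1 - f) * (1 - alpha) * v3   # keeping a true H0
            - f * beta * v4                # type II error
            - n * v5)                      # expense of n observations

value = expected_value(0.5, 0.05, 0.2, 100, v1=10, v2=40, v3=1, v4=4, v5=0.01)
```

A function like this makes it easy to see how the expected value shifts as α, β or *n* is varied while the costs are held fixed.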

But usually (1 − α) and (1 − β) are only capable of being increased by a few per cent by adjusting α and β. As we adjust α and β the main influence on the expected value lies in the negative terms containing v₂ and v₄. So the only way to increase the expected value much is to decrease the value of (1 − f)αv₂ + fβv₄. One message that arises from this formula is that in seeking an optimal balance between type I and type II errors we need to consider not only how damaging or costly are type I and type II errors, but also what proportion of our null hypotheses are false.

Cohen (1977) in *Statistical Power Analysis for the Behavioural Sciences*, Revised Edition, page 5, says “with α = 0.05 . . . and β = 0.20, the relative seriousness of Type I to Type II error is β/α = 0.20/0.05 = 4 to 1; thus mistaken rejection of the null hypothesis is considered four times as serious as mistaken acceptance.”

But Cohen’s interpretation neglects two possibly important effects. It neglects the fact that we may be exposed to the risk of Type I and Type II errors with unequal frequency, i.e. *f* may not be equal to 0.5. And it also ignores the fact that we could achieve quite a large decrease in β at the expense of a much smaller increase in α.

To what extent can we reduce α at the expense of β, or vice versa, while keeping *n* and δ constant?

If *n* and δ are constant then A + B = −δ√(n/2) = constant, where A = invcdf(α/2), B = invcdf(β), and invcdf denotes the inverse of the cumulative standard normal distribution function.

Differentiating this equation with respect to β, allowing α to vary, and solving for the differential coefficient gives

dα/dβ = −2φ(A)²/φ(B)²,

where φ is the standard normal density function. If our values of α and β are, for the sake of argument, 0.05 and 0.2, which are popular choices, then this formula gives dα/dβ ≈ −0.087. This means, for example, that in order to reduce β marginally from 0.2 to 0.19 we should only have to increase α from 0.05 to 0.05087. But would this be an advantageous change?

One of our goals is to minimise the value of the cost (*C*) of errors plus the cost of the experiment given by

C = (1 − f)αv₂ + fβv₄ + nv₅.

Let us see how far we can do this by adjusting α and β while keeping *n* and δ constant. Differentiation with respect to β gives us

dC/dβ = (1 − f)v₂(dα/dβ) + fv₄

and for fairly typical values of α = 0.05 and β = 0.20 this gives us dC/dβ = fv₄ − 0.087(1 − f)v₂.

When we have minimised the cost of errors by adjusting α and β with constant δ and *n*, this differential coefficient will be zero, and so v₂/v₄ = f/(0.087(1 − f)).

If we make the rather rash assumption that about half of our null hypotheses are false then we can see that the choice of α = 0.05 and β = 0.20 does not imply that Type I errors are merely 4 times as harmful as Type II errors: it implies that they are about 11 or 12 times as harmful.
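Taking the marginal trade-off of about 0.087 between α and β implied in the text (at α = 0.05, β = 0.2) as given, the implied relative harm of the two kinds of error follows from setting dC/dβ to zero. This sketch simply evaluates that balance condition; the function name is mine.

```python
def implied_harm_ratio(f, trade=0.087):
    """Ratio v2/v4 (harm of a type I error relative to a type II error)
    implied by an optimal alpha-beta balance, when the magnitude of
    d(alpha)/d(beta) at the chosen design point is `trade`."""
    return f / (trade * (1 - f))

print(round(implied_harm_ratio(0.5), 1))    # 11.5: about 11 or 12 times
print(round(implied_harm_ratio(0.258), 1))  # 4.0: Cohen's four-fold ratio
```

The two calls reproduce the figures discussed in this section: with half of all null hypotheses false the implied ratio is about 11.5, and with f = 0.258 it falls to 4.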

Fortunately, Cohen’s neglect of this asymmetrical relationship between α and β is largely mitigated by his neglect of the influence of *f*. In the case where a proportion *f* = 0.258 of our null hypotheses are false, which is reasonably compatible with my experience, then the popular assumption of α = 0.05 and β = 0.20 reduces the above formula to v₂/v₄ ≈ 4, which does imply that Type I errors are 4 times as harmful as Type II errors. Perhaps Cohen’s intuitive judgement, born of long experience, led him to the right answer despite a weak rationale.

A four-fold ratio of damage does seem an acceptable assumption. If a false conclusion is reached by making a type I error, then it is quite likely to take about four independent studies before the scientific community agrees that the original conclusion was wrong. On the other hand, if an experimenter fails to reach a conclusion by reason of a type II error, then it will carry rather little weight and will probably not even be published, and so only one further study may be needed to correct the error.

Let us turn now to the trade-off between *n* and α. If we keep β and δ constant then the only way to reduce α is to increase *n*. Let us go back to the cost equation to see what simultaneous adjustment of *n* and α will give a minimum cost, including the cost of type I errors.

C = (1 − f)αv₂ + fβv₄ + nv₅.

Partial differentiation with respect to α while keeping β and δ constant gives

dC/dα = (1 − f)v₂ + v₅(dn/dα).

The value of dn/dα to put into this formula needs to be obtained by differentiating the equation

n = 2(A + B)²/δ²

with respect to α, which gives us dn/dα = 2(A + B)/(δ²φ(A)), where A = invcdf(α/2). Substituting this into the previous equation gives

dC/dα = (1 − f)v₂ + 2v₅(A + B)/(δ²φ(A)).

Setting this equal to zero to find a necessary condition for minimum C, we get (1 − f)v₂δ²φ(A) = −2v₅(A + B), and solving this for A gives us

A = −B − (1 − f)v₂δ²φ(A)/(2v₅),

and from this value of A, a unique value of α can be calculated. This provides one more non-differential equation that must be satisfied in order to ensure a rational choice of the parameters α, β, δ and n. If we consider the choice of α = 0.05 to be imposed by custom and practice, then we can substitute α = 0.05, which makes A = −1.96, and the equation can then be rewritten in the form

B = 1.96 − 0.0292(1 − f)v₂δ²/v₅.
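Rather than solving the stationarity condition algebraically, the cost-minimising α can be found by direct search. This sketch assumes the usual normal-approximation relation n = 2(A + B)²/δ² between the design parameters, and the cost figures in the example are entirely hypothetical.

```python
from statistics import NormalDist

nd = NormalDist()

def total_cost(alpha, beta, delta, f, v2, v4, v5):
    """Cost of errors plus cost of the experiment, C = (1-f)*alpha*v2
    + f*beta*v4 + n*v5, with n the per-group size that attains power
    1 - beta at significance level alpha (normal approximation)."""
    A = nd.inv_cdf(alpha / 2)
    B = nd.inv_cdf(beta)
    n = 2 * (A + B) ** 2 / delta ** 2
    return (1 - f) * alpha * v2 + f * beta * v4 + n * v5

# Scan candidate significance levels for one hypothetical set of costs;
# 'best' is the cost-minimising alpha for these particular numbers.
alphas = [i / 1000 for i in range(1, 300)]
best = min(alphas, key=lambda a: total_cost(a, 0.2, 0.5, 0.5, 40, 4, 0.01))
```

With these particular costs the optimum lies well below the customary 0.05, which illustrates the point of the section: the conventional α is rational only for some configurations of costs and frequencies.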

Let us turn now to the trade-off between *n* and δ, i.e. between the size of the experimental groups and the size of the effect that can be demonstrated convincingly. How do we decide rationally whether the extra expense of increasing *n* will be worth the extra information that we may gain?

Suppose that before we conduct our experiment the size of the effect we are looking for (e.g. the difference between mean male and female scores in a diagnostic test) is very uncertain, having an approximately normal probability distribution function with standard deviation *E*. If in the population the standard deviation of test scores is *S*, then after conducting tests on *n* males and *n* females, where *n* is large enough to justify a normal approximation, our mean difference will now have an uncertainty, or standard error, of only S√(2/n). The amount *I* of information gained, measured using a formula borrowed from information theory, will then be

I = ln(E/S) + ½ ln(n/2).

The first term is usually rather small and does not change with *n*. The second term is the main component and increases more and more slowly as *n* increases. Thus a law of diminishing returns applies. The marginal extra information gained per unit increase in *n* is equal to 1/(2n), which becomes smaller and smaller as extra data are added. On this scale of information, used by information theorists, the maximum information that can be expressed in one binary digit is 0.69, and the maximum amount that can be conveyed in two binary digits is 1.39, and so on. If we have conducted an experiment using *n* subjects per group, then expanding the group sizes to 4n per group will add the equivalent of one binary digit to our information. Receiving one binary digit is like starting with a hypothesis that is equally likely to be true or false, and getting enough information to tell you clearly which it is.
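Assuming, as above, that both the prior uncertainty and the final standard error are roughly normal, the information gain (in natural units, nats) can be sketched in a few lines; E and S are as defined in the text.

```python
from math import log

def information_gained(n, E, S):
    """Nats of information gained about the difference between two means:
    the prior standard deviation E shrinks to a standard error S*sqrt(2/n)
    after testing n subjects per group."""
    return log(E / S) + 0.5 * log(n / 2)

def marginal_information(n):
    """Extra information per additional observation, d/dn of the above."""
    return 1 / (2 * n)

# Quadrupling the group size adds ln 2 = 0.69, i.e. one binary digit:
extra = information_gained(400, 1.0, 1.0) - information_gained(100, 1.0, 1.0)
```

The constant increment for each quadrupling, independent of the starting *n*, is the law of diminishing returns in its starkest form.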


*Another use for Power Analysis*

One of the weaknesses of classical statistical methods is that in some circumstances they appear to condone an irrational rejection of H₀. There are two circumstances in which this happens. One occurs when the probability of H₀ *a priori* is close to 1, when there is the danger of what I call the Hanover error (named after a miscarriage of justice that occurred in the city of Hanover). The other is when the experimental evidence is highly improbable not only in the light of the null hypothesis but also in the light of the alternative hypothesis. This creates the danger of what I call the Guildford error (named after a miscarriage of justice in the case of the Guildford Four in England).

Both these errors can in principle be avoided by applying Bayes’ Theorem, but the most common difficulty in doing so lies in the need for an estimate of the probability *a priori* of the null hypothesis being rejected, an estimate that is often very difficult to obtain. This difficulty, however, does not arise if we use Bayes’ theorem merely to recalculate the relative probabilities of two specific hypotheses. Suppose H₀ is our null hypothesis that the difference between two means is zero, and suppose Hₐ is an alternative hypothesis that the difference between the two means is δ. If E represents the evidence, namely that our test criterion has been satisfied by analysis of the experimental data, then with the help of Bayes’ theorem we can write a formula in which the troublesome variable P(E) does not occur, i.e.

P(H₀|E)/P(Hₐ|E) = [P(E|H₀)/P(E|Hₐ)] × [P(H₀)/P(Hₐ)] = [α/(1 − β)] × [P(H₀)/P(Hₐ)].

To put this in words, the probability of the null hypothesis relative to the probability of the alternative hypothesis should be updated in the light of the new evidence E simply by multiplying by α/(1 − β). This provides a very simple but logically rigorous basis for making a scientific inference from experimental evidence, in a way that says something about the relative probabilities of hypotheses and yet guards against the two errors of logic described at the start of this section. To apply this kind of Bayesian logic, power analysis is essential: we need it in order to calculate the value of β.
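As a final sketch, this updating rule can be written in a few lines. P(E|H₀) = α under the null and P(E|Hₐ) = 1 − β (the power) under the specific alternative, so a significant result multiplies the prior odds on H₀ by α/(1 − β). The numbers below are illustrative only.

```python
def updated_odds_of_H0(prior_odds, alpha, beta):
    """Posterior odds of H0 against a specific alternative H_A, after the
    test criterion has been satisfied: the prior odds are multiplied by
    the likelihood ratio P(E|H0)/P(E|H_A) = alpha/(1 - beta)."""
    return prior_odds * alpha / (1 - beta)

# Even odds on H0, alpha = 0.05, power = 0.8: a significant result
# leaves odds of 0.0625, i.e. 16 to 1 against the null.
odds = updated_odds_of_H0(1.0, 0.05, 0.20)
```

Note how the strength of the evidence depends on β: a significant result from an underpowered study (large β) moves the odds much less, which is exactly the point of this section.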