Plugging the Gap in the Logic of Classical Statistical Inference
McKay, I. C.,
University of Glasgow,
Department of Immunology and Bacteriology,
Glasgow, G11 6NT
Tel. 0141-211 2591
Fax: 0141-337 3217
McKay, P. K.,
Department of Philosophy,
King’s College London,
London, WC2R 2LS
The interpretation of statistical tests of a hypothesis depends on a modified form of the modus tollens, which sometimes gives seriously misleading conclusions. We show that the errors that can arise from this argument can be classified into two types, which we call the Hanover error and the Guildford error. We demonstrate that the validity of the argument can be adequately safeguarded by avoiding just these two errors, and there are no other ways in which the modified modus tollens can be invalid. Methods of preventing, recognising and eliminating these errors are discussed.
Karl Popper (1963) more than anyone else has promoted the use of the modus tollens as a model of the most crucial stage in the logical process by which scientific knowledge is derived from scientific data. The argument is sometimes expressed as follows.
Premise 1: If hypothesis H is true then experimental evidence E will not be observed.
Premise 2: Evidence E has been observed.
Conclusion: Hypothesis H is false.
Whether this is a useful model of scientific inference has often been debated, but in the above form the validity of the argument appears to be unassailable. Its logical strength, however, begins to crumble when the argument is applied to real-life science, because the conclusion of the modus tollens depends on its premises being known with total certainty, and total certainty in science is a rare commodity. When there is a slight doubt about premise 1 then the prevailing custom is to make do with a modified argument, which we shall call the fuzzy modus tollens and which may be written as follows.
Premise 1: If H is true then E is very improbable (e.g. P < 0.01).
Premise 2: Evidence E has been observed.
Conclusion: We have grounds to reject H.
This is almost identical to the argument proposed by Fisher (1925) as the logical basis of statistical hypothesis testing. The conclusion in the words of Fisher (1951) is sometimes stated quite strongly: “the hypothesis under consideration must be deemed to be contradicted by the data, and must be abandoned.” The logical gap referred to in the title of the present paper is the gap between these premises and the conclusion. These premises, by themselves, do not in fact provide sufficient rational grounds for rejecting H.
Many statistical authors, e.g. Siegel and Castellan (1988), wary of the logical weakness of the fuzzy modus tollens, express the outcome of the argument not as a conclusion about fact but as a prescription for action. Instead of saying, “We have grounds to reject H” they say, “We reject H” or “We decide to reject H.” This wording, in fact, does little to alleviate the logical weakness: its main effect is merely to make the logical weakness less conspicuous. If in our professional work we profess to be rational people making rational decisions, then in saying that we decide to reject H we imply that we have rational grounds for doing so. If the above form of the argument is not always stated, it is in fact what is implied when we use classical tests of a hypothesis.
Many scientific workers, by contrast, being less aware of the logical pitfalls on which they tread, adopt a form of the fuzzy modus tollens that is particularly pernicious. Its conclusion takes the form, “Hypothesis H is probably false.” In this form it is not sufficient to describe the argument as weak: it is better described as invalid, as numerous counterexamples, some of them rather tragic, can demonstrate.
The best counterexamples to the fuzzy modus tollens are circumstances in which the evidence E, rationally interpreted, implies that H is almost certainly true whilst the fuzzy modus tollens, on the basis of the same evidence, prescribes that H be rejected. By almost any rational standards it is very difficult to justify rejecting a hypothesis when the available evidence implies that it is probably true. Each counterexample, when interpreted by the classical fuzzy modus tollens, leads to a conclusion that is misleading, and its derivation is an error of logic.
In this paper we classify these errors of logic into two taxonomic groups, with the intention of making them more easily recognisable when they occur. We call these types of error the Hanover error and the Guildford error, named after criminal trials in which the errors led to miscarriages of justice. Later we demonstrate that these are the only serious avoidable errors that can arise from the logical weakness inherent in the fuzzy modus tollens, and we conclude that the utility and validity of the fuzzy modus tollens can adequately be protected if we can find ways to avoid both of these errors.
The Hanover Error
In 1990 a State Court in Hanover convicted a man of rape. The evidence was starkly simple. He lived in the city where several cases of rape had been committed by one person, and his DNA matched that of the rapist rather closely, as judged by studying restriction fragment polymorphism, which was popularly described at the time as genetic fingerprinting. There was no other admissible evidence that carried much weight.
An expert forensic witness told the court that if the man were innocent, then the probability of his DNA showing so close a match or closer was only 0.00024. The prosecution said this meant that the man was almost certainly guilty. The jury were convinced, and the prisoner was given a long prison sentence. This was a classic example of fuzzy modus tollens, in which H is the prisoner’s innocence and E is the evidence of the DNA match.
A more rational interpretation of the same evidence may be made as follows. There are about 250,000 men who live in or frequently visit Hanover. Any one of them chosen at random would have a probability of 0.00024 of having DNA that matched that of the rapist as closely as or more closely than did the prisoner’s DNA. The number of men in the city with equally strong or stronger evidence against them is therefore about 250,000 × 0.00024 = 60. Even if we assume that one of them is guilty, the probability that the prisoner is the rapist cannot be more than about 1/60 = 0.017. In other words, the prisoner, in the light of the available evidence, is almost certainly innocent.
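The arithmetic of this interpretation can be checked with a short sketch. The figures are those quoted in the text; the variable names are ours, and the bound assumes, as the text does, that one of the matching men is guilty and that nothing else distinguishes them.

```python
# Figures quoted in the text; variable names are illustrative only.
population = 250_000            # men who live in or frequently visit the city
p_match_if_innocent = 0.00024   # chance that an innocent man matches this closely

# Expected number of men whose DNA matches at least as closely as the prisoner's.
expected_matches = population * p_match_if_innocent

# Upper bound on P(guilty | match), assuming one of the matching men is guilty
# and no other evidence distinguishes them.
p_guilty_upper_bound = 1 / expected_matches

print(f"expected matching men: {expected_matches:.0f}")
print(f"P(guilty | match) is at most about {p_guilty_upper_bound:.3f}")
```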
The defining feature of errors of the Hanover type is that the hypothesis being tested and rejected (namely that he is innocent) is, in the absence of the crucial evidence (in this case the DNA evidence) very probably true. Any given man picked upon merely because he lives in the city is almost certainly innocent.
Scientific examples of this type of error are likely to arise when a medical treatment is undergoing a clinical trial despite having no previous empirical or theoretical reasons for believing that it works, and no plausible mechanism known by which it might work. Examples could occur in experimental tests of extrasensory perception, astrology, crystal therapy, precognition and a variety of dubious sciences and therapies. In all these cases the null hypothesis, namely that the effect being tested does not exist, is very probably true, as far as we can judge from previous evidence, experience and theory.
Another common scientific example of the Hanover error occurs in the interpretation of tests designed to help in the diagnosis of rather rare diseases. The diagnostic VDRL test is positive in 99.9 per cent of patients with secondary-stage syphilis, and is negative in 95 per cent of people without secondary-stage syphilis. This gives the impression that it is a highly sensitive and specific test. But if it is applied as a screening test, we must consider that only about 1 in 10,000 people in the UK suffer from syphilis, and so any attempt to use the modus tollens argument as follows will amount to a gross example of the Hanover error.
Premise 1: If he did not have syphilis the test would very probably be negative.
Premise 2: The test is positive.
Conclusion: We have grounds to believe that he has syphilis.
In fact, a positive VDRL test obtained in the course of screening a population is much more likely to be a false positive than to indicate a genuine case of syphilis, despite the quite impressive figures for the sensitivity and specificity of the test.
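A minimal sketch of the calculation behind this claim, using Bayes' theorem with the figures quoted above. It assumes that the quoted sensitivity, specificity and prevalence all refer to the same condition in the screened population; the function name is ours.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), computed by Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Figures quoted in the text for the VDRL test used as a screening test.
ppv = positive_predictive_value(
    sensitivity=0.999,       # positive in 99.9 per cent of patients
    specificity=0.95,        # negative in 95 per cent of people without the disease
    prevalence=1 / 10_000,   # about 1 in 10,000 people
)
print(f"P(syphilis | positive VDRL) = {ppv:.4f}")
```

On these figures only about 2 in every 1,000 positive screening results are genuine, which is why the conclusion of the argument above does not follow.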
It is sometimes said that such errors are tolerable because they arise only with a frequency that we control by setting our statistical confidence level. In some circumstances this is true, but it is not always so. It depends on how the evidence is obtained. Suppose, for the sake of argument, that the Hanover Prosecutor did not improve his logic, the Hanover forensic scientists did not improve the specificity of their technique, and the Hanover police built up a DNA database through which they could trawl every time there was a rape. Then the Court, even if it set itself a rather stringent test criterion, such as P < 0.001, would falsely convict the accused much more frequently than this criterion implies. Despite the notional limit on the frequency of type 1 errors, there could be a miscarriage of justice in almost every case brought before it.
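A rough sketch of why trawling a database defeats the nominal error rate. It assumes, purely for illustration, that the profiles in the database are statistically independent and all belong to innocent people; the database sizes are hypothetical and are not taken from the case.

```python
def p_at_least_one_false_match(p_match, n_innocent):
    """Chance that at least one of n independent innocent profiles passes the criterion."""
    return 1 - (1 - p_match) ** n_innocent

criterion = 0.001  # the "rather stringent" test criterion mentioned in the text

# Hypothetical database sizes, chosen only to show the trend.
for n in (1, 100, 10_000):
    p = p_at_least_one_false_match(criterion, n)
    print(f"database of {n:>6} innocent profiles: P(some false match) = {p:.4f}")
```

With a single suspect the chance of a false match equals the criterion, but it approaches certainty as the trawl widens.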
The Hanover error, when it occurs in science, is fairly easy to recognise, and many reviewers and editors of scholarly journals will compensate for it by demanding that the data satisfy a much more stringent test criterion than usual. It is in criminal trials (Regina v. Deen, 1993; Regina v. Doheny/Adams 1996) and in diagnostic tests (see The Economist, 4th July 1992, page 96) that the error seems to occur repeatedly. It has also been called the prosecutor’s fallacy (Balding and Donnelly, 1994) but we use a different name for it here because the Guildford error is also a prosecutor’s fallacy.
The Guildford Error
In 1974 two bombs exploded in pubs in Guildford and Woolwich. The two were manufactured in the same way and were thought to have been planted by the same terrorist gang. A total of seven people were killed.
Gerard Conlon, Paul Hill, Patrick Armstrong and Carole Richardson were arrested on the basis of vague hearsay evidence, interrogated, charged and brought to trial in Guildford in 1975.
Gerard Conlon, the alleged ringleader, was supposed to have an alibi, but the alibi had wandered off and no one knew where to find him. There was only one substantial piece of evidence that survived scrutiny, namely that the four accused had all confessed to the crime and later retracted their confessions.
The fuzzy modus tollens argument applied to this case can be written in the following form.
Premise 1: If a person is innocent of a terrorist crime it is very improbable that he will confess.
Premise 2: The accused did confess.
Conclusion: We have grounds to reject the hypothesis of innocence.
This was enough to satisfy the jury, and at first sight it will be enough to satisfy many reasonable people. The flaw in the argument, however, immediately becomes apparent if we point out that premise 1 remains true when the word innocent is replaced by the word guilty. Indeed, statistical evidence from 20th century terrorist crimes in the UK shows that guilty people have never confessed to a terrorist crime before their conviction but innocent people have sometimes done so. The confessions might arguably be used therefore as evidence of innocence.
On 9th November 1999 at Chester Crown Court Sally Clark was convicted of murdering her two infant children. The defence said that they were tragic cases of death from unidentified natural causes, or cot deaths. With conflicting forensic evidence, the prosecution’s case was bolstered by an eminent paediatrician testifying that the chances of two cot deaths happening in one family were vanishingly small: 1 in 73 million. He got this figure by estimating the probability of one cot death in a comparable family chosen at random and squaring it.
There were several gross errors of logic in this testimony. First, the family had been selected for suspicion because of the deaths, so they were not a family chosen for investigation at random or chosen for a separate reason. Because of this, the probability should not have been squared. This error is rather like being over-surprised to find someone who has the same birthday as yourself, since by this sort of erroneous reckoning the probability of this happening would be 1/365². Further, the squaring is based on the assumption that cot deaths occur independently of one another, and this would be a reasonable assumption only if there were no environmental or genetic causes of cot death that could have been shared by the two children. But in addition to these errors, this was an example of the Guildford error at work, for not only is it rare for there to be two cot deaths in one family: it is also rare for there to be two murders in one family. The statistical evidence weighs not only against the one hypothesis but also against the alternative hypothesis.
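The birthday analogy can be made concrete with a small sketch. It ignores leap years and assumes birthdays are uniformly distributed; the variable names are ours.

```python
from fractions import Fraction

DAYS = 365  # leap years ignored, birthdays assumed uniform

# Erroneous reckoning: treat both birthdays as unknown and square the
# probability of one particular day.
p_erroneous = Fraction(1, DAYS) ** 2

# Correct reckoning: your own birthday is already given, so only the other
# person's birthday is uncertain. Enumerate the possibilities to confirm.
my_birthday = 0  # any fixed day gives the same answer
matching_days = sum(1 for other in range(DAYS) if other == my_birthday)
p_correct = Fraction(matching_days, DAYS)

print("erroneous:", p_erroneous)
print("correct:  ", p_correct)
print("the erroneous figure is too small by a factor of", p_correct / p_erroneous)
```

The same structure applies whenever a case is selected for attention because the first event has already happened: that event is given, and should not be counted again as a coincidence.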
The Guildford error is sometimes not so easy to recognise because it arises in a variety of different forms.
In the 1980s a pharmaceutical company conducted a double-blind controlled clinical trial of a drug intended to treat migraine. About 30 per cent of the test group and only about 10 per cent of the placebo group reported that their migraine attacks had been alleviated. The difference between the two groups was large enough to satisfy the statistical criterion agreed in advance. The fuzzy modus tollens could have been written as follows.
Premise 1: If the drug is no better than the placebo, these experimental results would be very improbable.
Premise 2: These experimental results were in fact observed.
Conclusion: We reject the hypothesis that the drug is no better than the placebo.
But the Food and Drug Administration were not convinced and did not grant a licence for the drug. They pointed out that even if the drug were better than the placebo, these experimental results would still be very surprising, in that it is most unusual for the placebo effect to be so small in a clinical trial of a migraine treatment. Usually at least 20 per cent of the placebo group report an improvement, and often as much as 62 per cent (Couch, 1987).
The Guildford error will occasionally occur when the designers of a clinical trial depend entirely on randomization to eliminate the possible confounding effects of biological variables such as age, sex, general health or medical history. Suppose that in a clinical trial of another migraine treatment the placebo effect had been in the usual range, and the difference between the test and placebo group had been significant as judged by an appropriate statistical test. Suppose also that the volunteers had been allocated to test and control groups on a strictly random basis. And suppose that just by the luck of the draw, the test group contained a disproportionate number of women, giving test and control sex ratios that also differed significantly as judged by an appropriate statistical test. Then the experimental evidence, considered as a whole and including the sex ratios, is very surprising whether the drug works or not. This undermines the validity of the fuzzy modus tollens argument. Suppose also that previous experiments had shown that women are more likely than men to say that a treatment has helped them. Then sex is clearly a potentially confounding variable, and might arguably provide the whole explanation for the difference between the test and control groups.
We do not intend to imply here that the outcome of the statistical test, in the form of a P value, is inaccurate. The risk of an uneven distribution of a potentially confounding variable is adequately reflected in the usual statistical calculations. The experimenters’ risk of making a type 1 error is still what they choose it to be (e.g. 0.01). What we are saying is that even when the statistical test delivers a strong prescription to reject the null hypothesis, it is sometimes irrational and dishonest to do so, and that this circumstance is more likely to arise when an experimental design includes randomization.
The fact that the Guildford error can arise from the process of randomization lends weight to the view of Urbach (1985) that randomization should not be used. However, this view is not shared by many classical statistical authors and some philosophers of science, e.g. Papineau (1994). There appear to be two contentious questions that are both crucial to the job of safeguarding the validity of scientific logic. One is the question of whether a failure to use randomization would violate the assumptions on which classical statistical tests are based, making the tests inapplicable and the conclusions invalid. The other is whether the use of randomization reduces the power of experimental tests of a hypothesis, leading to a needless risk of making a type 2 error. The job of resolving these questions is confounded by the fact that the conceptual experiment that classical statisticians had in mind when they developed the theory of hypothesis testing is quite different from the actual experiments that scientists usually do, particularly in respect of the use of random numbers.
If, for example, we wish to know whether taking a daily dose of aspirin reduces blood-clotting time, then the conceptual experiment visualised, for example, by Student depends on there being two notional populations, one taking aspirin and one not. The notional experiment consists of drawing a random sample from each notional population and measuring each subject’s blood-clotting time. In this scenario the only use of random numbers lies in ensuring that the samples are drawn randomly from the two populations. But the actual experiment that would be done in practice is very different. We would recruit a self-selecting set of volunteers who are not a random sample of any definable population and then, if we believe in randomization, we would divide them on a strictly random basis into a test and placebo group. If we don’t believe in randomization we would divide them into two groups matched by age and sex and any other vital statistic that we think might have an influence on clotting time.
The implicit assumption that prevails in clinical trials is that the difference between the conceptual and actual experiments does not invalidate the logic of the test. The notional populations of the theoretician comprise all those people that we could have recruited but didn’t. And if the volunteers are divided into test and placebo groups on a strictly random basis then the resulting groups, while not strictly random samples from two populations, have the same statistical properties as they would have if they were in fact random samples. Therefore the classical statistical tests are applicable.
One consequence, however, that arises from the non-random sample of volunteers is that when we have analysed our experimental results and drawn a conclusion, we cannot say exactly what population this conclusion is true of. If the conclusion were only true of our sample then it would be of little interest to clinicians, for example, whose patients are not part of our sample. The value of the conclusion depends on its being true of a large population, and so it is somewhat unfortunate that we cannot say exactly what population it is true of.
But suppose we follow Urbach’s advice and use random numbers at no stage whatever in the experimental procedure. Suppose we divide the volunteers into male and female sets and arrange each in order of age. We can then consciously construct test and placebo groups that have almost identical distributions of sex and age. Are our P values still valid, and are our conclusions all equally safe? Does the decision to use matched groups alter our risks of making type 1 and type 2 errors?
There are several distinct dangers inherent in this procedure, related to the fact that randomization serves several distinct purposes.
One benefit of randomization is that it simplifies the statistical theory used in analysing the experimental results, and thereby helps to ensure that the statistical theory is valid. This is true whether classical or Bayesian methods are used. Fisher (1925) showed how randomization allows experimental data to be analysed with only very weak assumptions, and Rubin (1978) showed how a Bayesian analysis is considerably simplified if randomization is used. Simplicity and weak assumptions obviously help to justify confidence in our statistical theories. It would be reassuring to find a formal proof that the statistical theories are equally valid for certain non-random experimental designs, which depend on more complex theories with stronger assumptions, but such a proof appears to be lacking, as Yates (1964) seems to imply.
Another benefit of randomization is that it sometimes allows us to draw the conclusion that one factor is a cause of another, rather than the weaker conclusion that the two factors are statistically associated. For example, if we carried out our experiment on aspirin by finding pre-existing populations that took and did not take aspirin daily, we could in principle draw strictly random samples from these populations and analyse their blood-clotting times by a t-test. This would conform admirably with the hypothetical experimental design that is often used in explaining the theory underlying the t-test, but it would not allow us to draw the conclusion that aspirin causes a delay in clotting time. A significant result might arise because there is in the population some causal factor that makes people take aspirin and also, by some independent mechanism, makes their blood slow to clot. Our conclusion would be limited to saying that slow clotting and the consumption of aspirin are statistically associated. In allocating subjects to test and placebo groups, possibly the most important function of randomization is to insulate the drug treatment from the influence of unknown causal factors that might also influence clotting, and thereby permit us to draw causal conclusions.
A third major benefit of randomization lies in its ability to protect the validity of our conclusions not only from the influence of confounding factors that we know about, but also from confounding factors that we don’t know about.
A fourth benefit of randomization is that it insulates our experiment from conscious and subconscious bias on the part of the experimenter. For example, a compassionate clinician may tend consciously or unconsciously to include in the test group those patients who seem most in need of the therapy.
In summary, randomization may expose us to the risk of a type of Guildford error arising. We can guard against this by controlling the potential confounding factors that we know about rather than randomizing them. At the same time we can still gain all the benefits of randomization by randomizing the subjects for all the other factors that we don’t know about. For example, we could first stratify the subjects according to age and sex and then randomly assign the members of each stratum to test and control groups.
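The stratify-then-randomize procedure described above might be sketched as follows. This is only an illustration under stated assumptions: the subject records, the ten-year age bands and the function name `stratified_assignment` are all hypothetical, and a real trial would use a pre-registered allocation scheme.

```python
import random
from collections import defaultdict

def stratified_assignment(subjects, stratum_of, seed=None):
    """Randomly split each stratum as evenly as possible into test and control."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject in subjects:
        strata[stratum_of(subject)].append(subject)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)  # random order within the stratum
        n_test = len(members) // 2
        # In an odd-sized stratum the extra subject goes to a randomly chosen arm.
        if len(members) % 2 and rng.random() < 0.5:
            n_test += 1
        for position, subject in enumerate(members):
            assignment[subject["id"]] = "test" if position < n_test else "control"
    return assignment

# Hypothetical volunteers: known confounders (sex, age) are controlled by
# stratification; everything else is left to randomization.
subjects = [
    {"id": i, "sex": sex, "age": age}
    for i, (sex, age) in enumerate(
        [("F", 23), ("F", 27), ("F", 45), ("F", 48),
         ("M", 31), ("M", 36), ("M", 52), ("M", 58)]
    )
]
groups = stratified_assignment(
    subjects, stratum_of=lambda s: (s["sex"], s["age"] // 10), seed=1
)
print(groups)
```

Within each stratum the two arms differ in size by at most one, so sex and age band cannot become badly unbalanced by the luck of the draw.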
Whether we do this or not, it may still be found retrospectively that our test and control groups, purely as a result of random sampling error, were worryingly dissimilar in respect of some potential confounding factor that we did not think about at the planning stage. In a sense, this is only a problem if we notice it. If we don’t notice it then at least we know that the risk of this happening is adequately reflected in the usual type of statistical calculations that we use in hypothesis testing. Our risk of making a type 1 error is still what we have chosen it to be. But if we do happen to notice that the groups are substantially dissimilar then this information poses a problem for us, because it is relevant to the scientific conclusions that we draw. Rationality demands that we use the new information, and honesty demands that we make it known to our readers. The best remedy in such an event is probably to build the potentially confounding factor into the statistical model that we use in our analysis, thereby compensating for its effect.
Complementarity of the Hanover and Guildford Errors
In this section we show that when the fuzzy modus tollens leads to seriously misleading conclusions, it does so only through the Hanover and Guildford errors. There are no other ways in which it can go badly wrong, giving an irrational conclusion. In other words, we can adequately safeguard the validity of the classical argument by avoiding just these two errors.
Let H be the hypothesis that is being tested by experimental evidence E, and let us use P(H|E) to denote the probability of H in the light of evidence E, and P(H) to denote the probability of H before E is known. Then the defining characteristic of the Hanover error is that P(¬H) is small, and the defining characteristic of the Guildford error is that P(E|¬H) is small. Later we shall make this derivation quantitative, and examine what exactly we mean by small.
If we avoid both the Hanover and Guildford errors, then P(¬H) and P(E|¬H) are both large and so their product P(¬H) × P(E|¬H) is also large. It follows that P(¬H) × P(E|¬H) + P(H) × P(E|H) is also large. Another way of writing this is to say that P(E) is large. The implications of this can be seen by writing Bayes’ theorem as follows:
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
It can be seen by inspection of the above equation that when P(E) is large, then a small P value, i.e. a small value of P(E|H), necessarily implies that P(H|E) is also small. In other words, the premises of the fuzzy modus tollens give us sufficient grounds not only to reject H but also to say that H is probably false.
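The relationship can be explored numerically with a small sketch of Bayes' theorem, P(H|E) = P(E|H) P(H) / P(E), with P(E) expanded as in the preceding paragraph. The three sets of input probabilities are hypothetical, chosen only to contrast the safe case with the two errors.

```python
def posterior_h(p_not_h, p_e_given_not_h, p_e_given_h):
    """P(H | E) by Bayes' theorem, with P(E) expanded over H and not-H."""
    p_h = 1 - p_not_h
    p_e = p_not_h * p_e_given_not_h + p_h * p_e_given_h
    return p_e_given_h * p_h / p_e

# All three cases use the same small P value, P(E|H) = 0.01.
no_error = posterior_h(p_not_h=0.5, p_e_given_not_h=0.8, p_e_given_h=0.01)
hanover = posterior_h(p_not_h=0.0001, p_e_given_not_h=0.8, p_e_given_h=0.01)
guildford = posterior_h(p_not_h=0.5, p_e_given_not_h=0.001, p_e_given_h=0.01)

print(f"neither error present: P(H|E) = {no_error:.3f}")
print(f"Hanover-type case:     P(H|E) = {hanover:.3f}")
print(f"Guildford-type case:   P(H|E) = {guildford:.3f}")
```

Only in the first case does the small P value translate into a small probability for H.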
Let us now try to be more precise in our use of the terms small and large. If we consider all the grades of severity of the Hanover error, represented by different values of P(¬H) between 0 and 1, and the grades of severity of the Guildford error, represented by different values of P(E|¬H) between 0 and 1, we can construct a two-dimensional diagram using these two variables as Cartesian coordinates (see Fig. 1).
In this diagram there is an area labelled “Argument is strong” in which the fuzzy modus tollens argument is entirely valid, in the sense that P(H|E) is always smaller than or equal to P(E|H). There is another area in which a small P value provides some evidence against the null hypothesis, in the sense that the evidence implies P(H|E) < 0.5. And there is an area in which the fuzzy modus tollens is seriously misleading, since it recommends rejection of H despite the fact that the evidence actually weighs in favour of H. The formal derivation of these boundaries is given in the Appendix. This version of the diagram is based on a confidence level given by α = 0.05.
Fig. 1. The validity of the fuzzy modus tollens argument vanishes towards the left of the diagram (Hanover error) and towards the bottom of the diagram (Guildford error). The two types of error merge into each other at the bottom left, where they reinforce each other. Below and to the left of the hyperbola shown as a broken line, the argument is grossly misleading because the hypotheses that are rejected are probably true. To the right and above the solid hyperbola the argument is entirely valid: a hypothesis rejected at confidence level α (which is 0.05 in this diagram) will have a probability even smaller than α. Between the solid and broken lines the argument is usable but weak: it leads to rejection of H only when H is less probable than ¬H.
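The three regions of the diagram can be reproduced with a short sketch. It assumes the worst case for a rejected hypothesis, namely that P(E|H) sits exactly at the criterion α, and it uses the boundary definitions given in the Appendix: a posterior probability of ½ for the broken line and of α for the solid line. The function name and the sample points are ours, and the coordinates are assumed to lie strictly between 0 and 1.

```python
def classify(p_not_h, p_e_given_not_h, alpha=0.05):
    """Locate a point (x, y) = (P(not-H), P(E|not-H)) in the regions of Fig. 1."""
    x, y = p_not_h, p_e_given_not_h
    # Posterior P(H|E) when H is only just rejected, i.e. P(E|H) = alpha.
    posterior = alpha * (1 - x) / (alpha * (1 - x) + x * y)
    if posterior > 0.5:
        return "misleading"   # below and left of the broken hyperbola
    if posterior > alpha:
        return "weak"         # between the broken and solid hyperbolas
    return "strong"           # above and right of the solid hyperbola

# Hypothetical sample points.
for point in [(0.9, 0.9), (0.5, 0.5), (0.01, 0.5), (0.5, 0.01)]:
    print(point, classify(*point))
```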
If, during the scrutiny of experimental findings, it is found that some features of the Hanover or Guildford errors are present, then an explicit Bayesian analysis may be undertaken. But often the information necessary to do this is not all available. It is not necessary, however, to suspend all judgment. The boundary of irrationality in Fig. 1 is defined by equation 1 of the Appendix, which shows that the boundary can be shifted by adjusting the confidence criterion α. The amount of adjustment is usually chosen subjectively, but equation 1 provides a basis for doing it more systematically. It can be seen that when a risk of the Hanover error is present, i.e. the value of P(¬H) is rather small for comfort, then the problem can be remedied by adjusting α in proportion to P(¬H)/P(H). For example, if a criterion of α = 0.05 is enough to satisfy a journal editor when the hypothesis a priori has probability 0.5, then in order to reject a hypothesis that is thought a priori to have probability 0.99 a rational editor should look for a much more stringent criterion, namely α = 0.05 × (0.01/0.99) ≈ 0.0005.
A similar, slightly simpler adjustment can be made if there is a risk of the Guildford error. If P(E|¬H) is worryingly small then the remedy is to adjust α exactly in proportion to it. For example, if a criterion of α = 0.05 is satisfactory when P(E|¬H) is thought to be about 0.5, then a criterion of α = 0.005 should be adequate when P(E|¬H) is thought to be about 0.05.
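Both adjustments can be combined in one small function. This is a sketch, not a prescription: it assumes that the Hanover adjustment scales α with the prior odds P(¬H)/P(H), which is what the boundary condition P(H|E) = ½ implies, and that the Guildford adjustment scales α with P(E|¬H). The function name and the reference values of 0.5 are ours.

```python
def adjusted_alpha(alpha_ref, p_not_h, p_e_given_not_h,
                   ref_p_not_h=0.5, ref_p_e_given_not_h=0.5):
    """Rescale a criterion that is acceptable at the reference point."""
    def odds(p):
        return p / (1 - p)

    hanover_factor = odds(p_not_h) / odds(ref_p_not_h)       # prior odds of not-H
    guildford_factor = p_e_given_not_h / ref_p_e_given_not_h  # likelihood under not-H
    return alpha_ref * hanover_factor * guildford_factor

# Hanover example from the text: prior P(H) = 0.99, so P(not-H) = 0.01.
print(f"{adjusted_alpha(0.05, p_not_h=0.01, p_e_given_not_h=0.5):.5f}")

# Guildford example from the text: P(E|not-H) falls from 0.5 to 0.05.
print(f"{adjusted_alpha(0.05, p_not_h=0.5, p_e_given_not_h=0.05):.5f}")
```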
1. In designing an experiment that is intended to demonstrate the effect of some treatment upon some observed variable, make a list of potentially confounding variables. Where possible, control these in preference to randomizing them. Then use randomization to eliminate bias caused by other confounding variables that you may not have foreseen.
2. Before rejecting a null hypothesis on the basis of a classical statistical test, examine the evidence to see if there are signs of the Guildford error or the Hanover error. When looking for the Guildford error, examine all aspects of the experimental results, not only the main measure of experimental outcome. Has the control group given results within the expected range? Are there any potentially confounding variables such as age or sex or general health that appear to be unevenly distributed between test and control groups? These are all part of the evidence whose likelihood P(E|¬H) should be judged in the light of the assumption that the null hypothesis is false. If P(¬H) or P(E|¬H) seems by subjective assessment to be less than about 0.2, then consider carrying out an explicit Bayesian analysis, using subjective estimates of probability where necessary, or make adjustments to the confidence criterion as described in the previous section.
References
Balding, D. J. and Donnelly, P. (1994). Criminal Law Review, October 1994.
Couch, J. R. Jr. (1987). Placebo effect and clinical trials in migraine therapy. Neuroepidemiology, 6 (4): 178-85. [“a review of several studies has shown 62% of 188 subjects improving by 75% after 4 weeks on placebo with a continued effect after this.”]
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
Fisher, R. A. (1951). Statistics, in Scientific Thought in the Twentieth Century, ed. A. E. Heath, London: Watts. Page 36.
Popper, K. R. (1963). Conjectures and Refutations: the Growth of Scientific Knowledge. London: Routledge & Kegan Paul.
Siegel, S. and Castellan, N. J. Jr. (1988). Nonparametric Statistics for the Behavioural Sciences, 2nd Edition, page 8. McGraw-Hill.
Yates, F. (1964). Sir Ronald Fisher and the Design of Experiments. Biometrics 20: 307-321.
Appendix 1: Derivation of the formulae defining the lines in Fig. 1.
Let us use x and y to denote the horizontal and vertical coordinates of Fig. 1; i.e. x = P(¬H) and y = P(E|¬H). Let the chosen confidence criterion be α. Let us define the area of irrationality as that area of the diagram in which H is rejected even when its probability is greater than ½. Then the values of x and y on the boundary of irrationality will satisfy the boundary definition that P(H|E) = ½, with P(E|H) equal to the criterion α. If we begin with Bayes’ theorem and substitute these values we get
$$\frac{1}{2} = \frac{\alpha\,(1 - x)}{\alpha\,(1 - x) + x y}$$
which simplifies to give
$$\alpha = \frac{x y}{1 - x} \qquad (1)$$
This is the equation of the hyperbola shown as a broken line in Fig. 1. It also provides a fairly precise basis for adjusting α to compensate for the Hanover and Guildford errors.
The area of weakness is defined as the area in which H is rejected when its probability in the light of the evidence lies between α and ½. The line defining its upper boundary is the locus of points whose coordinates satisfy P(H|E) = α. Bayes’ theorem then gives us
$$\alpha = \frac{\alpha\,(1 - x)}{\alpha\,(1 - x) + x y}$$
which simplifies to give
$$x y = (1 - \alpha)(1 - x)$$
This is the equation of the hyperbola shown as a solid line in Fig. 1.