The reproducibility of results in both clinical and preclinical science is lower than we had supposed. This is serious, because every false positive result that is published provides ammunition for those who oppose science. There are many reasons for irreproducibility, not least the perverse incentives that push scientists to publish too much. But one of the less appreciated reasons lies in the misinterpretation of tests of statistical significance.

Tests of significance give you a p value. If you ask what a p value means, the most common answer you’ll get is that it is “the probability that your results occurred by chance” (1). This is plain wrong (2). I take it that what we want to know is the probability that we will claim there is an effect when in fact the observations arose by chance alone. This quantity may appropriately be called the false positive risk (FPR) (3-5), and it is not the same thing as the p value. Here’s why. It would obviously be a mistake to confuse the probability that you have four legs given that you are a cow with the probability that you are a cow given that you have four legs. The latter is much smaller than the former, because there are many non-bovine creatures that have four legs; the former is large because there aren’t many three-legged cows. To confuse the FPR with the p value is to make an exactly analogous mistake. The p value is the probability of making your observations (or more extreme ones) given that the null hypothesis is true. The FPR is the probability that the null hypothesis is true given your observations. These two quantities are quite different and, under almost all circumstances, the FPR is bigger than the p value.

Suppose that you observe p = 0.047. For a well-powered experiment, that implies an FPR of at least 20%: there are various ways of calculating this number, but most of them give a value between 20% and 30% (5). Your chance of making a fool of yourself is not 5%, as most people still seem to think; it is at least 26%. That means that the term “statistically significant” tells you remarkably little about the truth of a hypothesis.

I propose that journals should require not only a p value and confidence interval, but also one more number that gives a realistic idea about what really matters: the FPR (3-5). The simplest way to do this would be to give the FPR calculated on the assumption that there was a 50:50 chance of there being a real effect before the experiment was done, which may be dubbed the minimum FPR. This is the easiest solution to understand, because it tells you much the same thing as people mistakenly think the p value tells you. It would be over-optimistic for implausible hypotheses, but it would be a great improvement on the present convention. The term “statistically significant” should never be used. Just give the numbers.
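As a concrete illustration of the kind of calculation involved (a rough sketch, not necessarily the exact t-test-based method of refs 3-5), the Python snippet below computes a minimum FPR from an observed p value in two simple ways, both assuming a 50:50 prior chance of a real effect: the Sellke-Berger-Bayarri bound based on the minimum Bayes factor -e·p·ln(p), and a normal-approximation version of a “p-equals” calculation for an experiment of stated power. The power (0.8) and the significance threshold used in the power calculation (0.05) are illustrative assumptions, so the numbers differ slightly from the 26% quoted above.

```python
# Minimal sketch: two rough ways to turn an observed p value into a
# "minimum" false positive risk (FPR), assuming a 50:50 prior probability
# that a real effect exists. Not the exact calculation of refs 3-5.

import math
from scipy import stats  # used only for normal densities and quantiles


def fpr_sellke_bound(p, prior=0.5):
    """Lower bound on the FPR from the -e*p*ln(p) minimum Bayes factor."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    bf_null = -math.e * p * math.log(p)       # minimum Bayes factor for H0
    prior_odds_null = (1 - prior) / prior     # odds on H0 before the data
    posterior_odds_null = bf_null * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)


def fpr_p_equals_normal(p, power=0.8, prior=0.5, alpha=0.05):
    """Rough 'p-equals' FPR, using a two-sided z test as an approximation."""
    z_obs = stats.norm.isf(p / 2)             # observed z for a two-sided p
    # Effect size (in SE units) that would give the stated power at alpha:
    delta = stats.norm.isf(alpha / 2) - stats.norm.isf(power)
    # Density of observing exactly +/- z_obs under H0 and under H1:
    lik_h0 = 2 * stats.norm.pdf(z_obs)
    lik_h1 = stats.norm.pdf(z_obs - delta) + stats.norm.pdf(z_obs + delta)
    num = lik_h0 * (1 - prior)
    return num / (num + lik_h1 * prior)


if __name__ == "__main__":
    for p in (0.047, 0.05, 0.01, 0.001):
        print(f"p = {p:5.3f}  FPR (Sellke bound) = {fpr_sellke_bound(p):.2f}  "
              f"FPR (p-equals approx.) = {fpr_p_equals_normal(p):.2f}")
```

For p = 0.047 both versions give an FPR of roughly 0.28, not 0.047, in line with the 20-30% range quoted above; smaller p values (0.01, 0.001) give correspondingly smaller risks.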
Physiology 2019 (Aberdeen, UK) (2019) Proc Physiol Soc 43, C031
Oral Communications: A proposal concerning what to do about p values
D. Colquhoun¹
1. NPP, UCL, Kings Langley, United Kingdom.