
Physiology News Magazine
Trust me – I’m a scientist
Trust is fundamental in science. More often than not, we design our experiments at least in part based on data published by others. We trust them to be true, unless proven otherwise.
Features
Martin Michel
Dept. of Pharmacology, Johannes Gutenberg University, Mainz, Germany
https://doi.org/10.36866/pn.101.33
The public tends to trust scientists, at least more than many other professions. That’s why an advertisement more often states ‘as shown by scientists’ than ‘as shown by used car dealers/politicians/lawyers’ (pick whoever you dislike most). However, this basic trust received a major blow when scientists from Bayer reported that they were unable to reproduce two thirds of published major findings (Prinz et al., 2011). Thereafter, others reported similar proportions of irreproducible studies in various fields of biomedical science. This is not only a problem of obscure journals with weak referee systems; it affects ‘big’ journals such as Nature, Science or Cell at least as much. The shock has not been limited to the academic community and the pharmaceutical industry: funders such as the Wellcome Trust and the NIH (National Institutes of Health) have become concerned, the latter particularly because the US Congress may reduce its funding. Even the general public has noticed, as evidenced by a cover story in The Economist in October 2013 (Anonymous, 2013).

In God we trust – all others must bring data
Those who have looked into the root causes of this lack of reproducibility agree that fraud is only a very minor part of the problem. Rather, poor standards in experimental design, data analysis and transparent reporting of what has actually been done appear to be at the core of the problem. In my own experience, the vast majority of published studies lack too many important details to permit a key experiment to be repeated. Examples include the strain of animals, the identity of antibodies or the incubation volume in a biochemical assay; the list goes on and could fill a book. Some high-profile journals may be partly to blame, as they have asked authors to produce only short methods sections, printed them in smaller font, or relegated them to online supplements. All of this has sent young scientists the message that details of methods might not be so important. Boy, is that message wrong! However, this is probably easy to address by more comprehensive and transparent reporting of experimental methods and data analysis approaches, and several journals have meanwhile adapted their Instructions to Authors accordingly.
The smell of a t-shirt can affect the outcome of an experiment
We expect that key findings can be confirmed not only when every detail of a reported experiment has been adhered to (reproducibility in the specific sense of the word) but also when apparently minor details have been altered (often referred to as robustness, although this is a vaguely defined term). There are two reasons for this. First, a finding that can only be obtained under extremely standardized conditions is less likely to be relevant to the overall progress of science. Second, on purely pragmatic grounds, we simply cannot monitor every detail of an experiment. Did you ever consider that the smell of your t-shirt might affect the outcome of your experiment? It apparently can (Sorge et al., 2014). There are probably many such unknown factors. Moreover, it has been shown that investigators expecting groups to be different are likely to find such differences – even when they do not exist. That is why it is important that interventions are randomised and, as far as possible, investigators are blinded to group allocation in experimental studies. Clinical medicine realised this decades ago and made the double-blind, randomised study the gold standard of investigation. Shouldn’t we be embarrassed that clinicians have developed more sophisticated approaches to research than scientists? (Yes!!)
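To make the workflow concrete, here is a minimal sketch in Python (my illustration, not part of the original article) of how group allocation could be randomised and coded so that the experimenter stays blinded; the function name and group labels are purely illustrative.

```python
import random

def randomise_and_blind(animal_ids, treatments, seed=None):
    """Randomly allocate animals to treatments and return a blinded
    allocation plus the key that decodes it after analysis."""
    rng = random.Random(seed)
    ids = list(animal_ids)
    rng.shuffle(ids)
    # Round-robin assignment over the shuffled list gives (near-)equal group sizes
    allocation = {animal: treatments[i % len(treatments)] for i, animal in enumerate(ids)}
    # Replace treatment names with neutral codes ('group A', 'group B', ...)
    codes = {t: f"group {chr(65 + j)}" for j, t in enumerate(treatments)}
    blinded = {animal: codes[t] for animal, t in allocation.items()}
    key = {code: t for t, code in codes.items()}
    return blinded, key

blinded, key = randomise_and_blind(range(1, 13), ["vehicle", "drug"], seed=42)
print(blinded)  # what the experimenter works with during the study
print(key)      # held by a colleague and revealed only after analysis
```

The point is not the code but the workflow: allocation is fixed by a random process before the first measurement, and the decoding key stays with someone not involved in data collection until the analysis is complete.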
Don’t be a P-hacker
Beyond lack of robustness, there may be an even bigger problem leading to lack of reproducibility, and that relates to inappropriate use and interpretation of statistical tests. Many hold the misconception that the asterisk on top of a data point, by indicating statistical significance, also implies that a finding is true and even relevant. Rather, a P-value tells us the probability that a group difference at least as large as the one observed would have occurred by chance if the samples had been drawn randomly from the same population. Thus, a P-value is only meaningful if all factors other than the primary variable we are investigating are the same in all groups, or at least randomly distributed, be it animal strain, presence of disease or treatment with a drug. Several investigator-induced violations of this randomness principle have been summarised under the term P-hacking (Motulsky, 2014). If you change the number of experiments, the parameters to be analysed or the method of analysis after having seen initial results, you deviate from the path of randomness. For example, you may have done an experiment six times, analysed the data and obtained a P-value of 0.06. You feel uncomfortable with this, as you can hardly submit your manuscript that way. Thus, you add two more experiments in the hope that with a total of eight you will reach the magic significance threshold. Other examples include a post-hoc decision to normalise the data or the choice of a different statistical test. All of this introduces a major bias towards finding a difference even when it is not there, and towards exaggerated effect sizes. The asterisk you have gained may look like a trophy, but it actually increases the risk that the observed difference is not robust. Thus, any modification of sample size or analysis technique that is decided upon after the experiments have started precludes meaningful statistical analysis, unless specific precautions have been taken. A key conclusion from the above is that P-hacking may make results look nice but actually makes them less meaningful, or even invalid. Generally, one should focus less on P-values and more on effect sizes.
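The ‘add two more experiments’ scenario is easy to simulate. The following Python sketch (an illustration I have added, not an analysis from the article) draws both groups from the same population, so any ‘significant’ result is a false positive by construction, and compares a fixed-n t-test with one where a near-miss triggers extra experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rates(n_sim=20_000, n_start=6, n_extra=2, alpha=0.05):
    """Both groups come from the same population, so every 'significant'
    result is a false positive. Compare a fixed design with one where a
    near-miss (alpha <= P < 0.10) triggers extra experiments per group."""
    fixed = hacked = 0
    for _ in range(n_sim):
        a = rng.normal(0, 1, n_start)
        b = rng.normal(0, 1, n_start)
        p = stats.ttest_ind(a, b).pvalue
        fixed += p < alpha
        if alpha <= p < 0.10:  # 'not quite significant': add more experiments
            a = np.concatenate([a, rng.normal(0, 1, n_extra)])
            b = np.concatenate([b, rng.normal(0, 1, n_extra)])
            p = stats.ttest_ind(a, b).pvalue
        hacked += p < alpha
    return fixed / n_sim, hacked / n_sim

print(false_positive_rates())  # the second rate exceeds the nominal 5%
```

With these particular numbers the nominal 5% error rate rises only modestly, but the exact figure depends on the stopping rule; whatever the rule, unplanned additions after peeking at the data can only push it upwards.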
Most reported results must be wrong
But even if everything has been done ‘by the book’, findings may have poor robustness, as was predicted (Ioannidis, 2005) years before the Bayer paper (Prinz et al., 2011) was published. John Ioannidis emphasised that finding a group difference to be statistically significant may not necessarily carry a large positive predictive value – even if no P-hacking occurred. David Colquhoun expanded this idea and highlighted the problem of the ‘false discovery rate’ (Colquhoun, 2014). A P-value is the probability of seeing a difference as large as the one you observed, or larger, even if the two samples came from populations with the same mean. However, and in contrast to a common perception, it does not tell us the probability that an observed finding is true. Simulations show that a P-value < 0.05 in correctly designed and executed experiments may nonetheless be associated with a false discovery rate of up to a quarter (Colquhoun, 2014). The actual false discovery rate at a P-value < 0.05 depends on several factors, but a poor positive predictive value/high false discovery rate is particularly likely when sample sizes (number of experiments) or effect sizes (magnitude of difference between groups) are small (Ioannidis, 2005). In reaction to this, journals have started to require minimum sample sizes to allow for statistical analysis (Curtis et al., 2015).
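The arithmetic behind this can be sketched in a few lines of Python (my illustration, with made-up input numbers in the spirit of Ioannidis, 2005 and Colquhoun, 2014): the false discovery rate depends on how likely the tested hypotheses are to be true in the first place and on the power of the experiments, neither of which the P-value knows anything about.

```python
def false_discovery_rate(prior, power, alpha=0.05):
    """Of all experiments declared 'significant' at the given alpha,
    what fraction are false positives?"""
    true_positives = prior * power          # real effects that are detected
    false_positives = (1 - prior) * alpha   # null effects that slip through
    return false_positives / (true_positives + false_positives)

# If only 10% of tested hypotheses are true and experiments have 80% power:
print(false_discovery_rate(prior=0.10, power=0.80))  # about 0.36
# Underpowered experiments make it considerably worse:
print(false_discovery_rate(prior=0.10, power=0.20))  # about 0.69
```

The input numbers are illustrative, but the lesson is general: the smaller and less well-powered the experiments, the larger the fraction of ‘significant’ findings that are simply wrong.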
50 million Elvis fans can’t be wrong
Confronted with the above, I have heard more than once ‘we have always done it this way’ or ‘everyone else is doing this’. I have confessed that in 10 out of my last 10 original papers my own work failed in one or more ways to live up to the standards I have meanwhile recognised to be appropriate (Michel, 2014). I can therefore understand the emotional reaction to such a fundamental critique of the way research has largely been done in the past, but after having cooled down a bit I trust that you will acknowledge that this argument is entirely non-scientific. ‘50,000,000 Elvis Fans Can’t Be Wrong’ was the title of a greatest hits album released in November 1959, when Elvis was still considered ‘controversial’. While it made for a well-selling album, it is as much of an argument as ‘eat shit – 50 trillion flies can’t be wrong’. We do have a major trust crisis at hand, and issues with study design, comprehensive and transparent reporting, and proper data analysis are likely to be major root causes. Those who have looked carefully into the matter agree on this. As scientists, we should look at this evidence and act accordingly to modify our own behaviour.

The pharmacology journal Editors’ initiative
Realising that improving our practice of study design, data analysis and transparent reporting will take nothing less than a cultural revolution, the editors of several major journals in pharmacology have come together to develop shared editorial policies. These include Biochemical Pharmacology, the British Journal of Pharmacology, the Journal of Pharmacology and Experimental Therapeutics, Naunyn-Schmiedeberg’s Archives of Pharmacology and Pharmacology Research and Perspectives. As an initial step, they have developed shared criteria for transparent reporting, which will become part of their Instructions to Authors. Major journal publishers, including Elsevier, Springer-MacMillan and Wiley, have endorsed the initiative. We are currently inviting all major pharmacology journals to join this initiative and will be happy to cooperate with the physiology community to implement the same within their discipline.
Acknowledgment
Thanks to Dr Harvey Motulsky for helpful comments on this manuscript.
References
Anonymous (2013). Trouble at the lab. The Economist 409, 23-27
Colquhoun D (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci 1, 140216
Curtis MJ, Bond RA, et al. (2015). Experimental design and analysis and their reporting: new guidance for publication in BJP. Br J Pharmacol 172, 3461-3471
Ioannidis JPA (2005). Why most published research findings are false. PLoS Med 2, e124
Michel MC (2014). How significant are your data? The need for a culture shift. Naunyn Schmiedeberg’s Arch Pharmacol 387, 1015-1016
Motulsky HJ (2014). Common misconceptions about data analysis and statistics. Naunyn Schmiedeberg’s Arch Pharmacol 387, 1017-1023
Prinz F, Schlange T & Asadullah K (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10, 712-713
Sorge RE, Martin LJ, et al. (2014). Olfactory exposure to males, including men, causes stress and related analgesia in rodents. Nat Methods 11, 629-632