
Physiology News Magazine
Experimental design and irreproducibility in pre-clinical research
News and Views
Michael FW Festing, Medical Research Council (MRC) Toxicology Unit, University of Leicester, UK
https://doi.org/10.36866/pn.118.14
Too many pre-clinical experiments, mostly involving mice and rats, are producing results that cannot be repeated. This is probably because the scientists are not using statistically valid experimental designs. As a result, effects due to the research environment may be mistaken for treatment effects, leading to bias and irreproducible results. However, two designs, the completely randomised and the randomised block designs (likely familiar to physiologists working with farm animals), avoid such bias and have been used successfully for nearly a century in agricultural and industrial research and in clinical trials. In both designs, subjects receiving the different treatments are randomly intermingled in the research environment, thereby avoiding environmental bias. Scientists engaged in pre-clinical research should be using these experimental designs.
Most scientists will now know that too many pre-clinical experiments produce results that are irreproducible.1 This wastes scientific resources2 and subjects excessive numbers of animals to unjustified pain and distress.
This note explains how scientists can design their pre-clinical experiments to maximise reproducibility by using these two designs. Other, rarer, statistically valid designs are described in textbooks on experimental design3,4 but are not discussed here.
The origin of randomised, controlled experiments
Randomised, controlled experiments were invented by RA Fisher when he was appointed as the statistician at Rothamsted Agricultural Experimental Station in the UK in the 1920s. His aim was to detect small, but important, differences in the yield of different varieties of crops or following different fertiliser regimes.5 He noted that there are two sources of such variation, which need to be controlled if unbiased and reproducible treatment effects are to be detected. First, there is inter-individual variability, controllable by choosing uniform subjects. In pre-clinical research this usually presents few problems because large numbers of high-quality, uniform animals, such as inbred and pathogen-free strains of rats and mice, are readily available.
Second, there is the variability caused by the research environment. In an animal house, such variation could be associated with cage location within a room or position in a cage rack, variation in lighting levels, cage cleaning and the introduction of new bedding material. In shorter-term experiments, circadian and other rhythms may add further variation. Noise, including ultrasound inaudible to staff, may affect the animals, and the skill of those handling the animals and measuring the outcomes may also vary over time. All these factors can cause extra inter-individual variation and need to be taken into account when planning powerful and unbiased experiments.
Two designs to use, one to avoid
Fisher developed two designs that provide some control of these sources of variation.
In the completely randomised (CR) design, the “experimental units” (either a single animal, or two animals in a cage counted as a single experimental unit) are numbered 1 to N (as in Fig. 1A). Then one of the treatments, chosen at random, is assigned to each subject (different shades in Fig. 1A). This should be done in the office before the experiment is due to begin. Usually, equal sample sizes are used, although this is not essential.
Such randomisation is easily done using spreadsheet software such as Microsoft Excel. For example, if there are three treatments, Lo, Hi and Ctrl, and a sample size of six, enter six “Lo”, six “Hi” and six “Ctrl” into column A. Then type =RAND() into cell B1 and drag the fill handle at the bottom-right corner of the cell down to B18, generating 18 random numbers. Select columns A and B and sort them on column B. The row number then becomes the subject ID, and column A shows the treatments Lo, Hi and Ctrl in random order.
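For those who prefer scripting, the same allocation can be produced in a few lines of Python; the following is a minimal sketch, assuming the three treatments and the sample size of six from the example above:

```python
import random

random.seed(42)  # fix the seed so the allocation can be documented and audited

treatments = ["Lo", "Hi", "Ctrl"]  # the three treatments in the example
n_per_group = 6                    # sample size per treatment

# Six copies of each treatment label, shuffled into a random order
labels = treatments * n_per_group
random.shuffle(labels)

# Subject i (numbered 1 to N) receives labels[i - 1]
for subject_id, treatment in enumerate(labels, start=1):
    print(subject_id, treatment)
```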
The resulting list of subjects, each with one of the treatments assigned to it, is then taken down to the animal house and the subjects are given the appropriate treatment.
The result is a single set of subjects (experimental units), each receiving one of the treatments, determined at random. The subjects receiving the different treatments are randomly intermingled within the research environment, as shown in Fig. 1A. This is the design used in clinical trials because it can accept both the accumulation of patients over a period of time and unequal sample sizes. In pre-clinical studies it will normally be analysed using a one-way analysis of variance (ANOVA).
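A minimal sketch of that analysis, using SciPy and purely illustrative outcome values (the numbers below are assumptions, not data from any study):

```python
from scipy import stats

# Outcome measurements for the three treatment groups (illustrative values)
lo   = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9]
hi   = [5.2, 5.6, 4.9, 5.4, 5.1, 5.3]
ctrl = [3.5, 3.7, 3.4, 3.9, 3.6, 3.3]

# One-way ANOVA: do the treatment means differ?
f_stat, p_value = stats.f_oneway(lo, hi, ctrl)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```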
A randomised block (RB) design is shown in Fig. 1B, again assuming that the experimental unit is a cage (housing either one individual, or two individuals with their results averaged for the cage). In this design, the experiment is split into N independent groups or “blocks”, each of which is a “mini-experiment” with a sample size of one per treatment. For example, if there are three treatments, each block will consist of three cages, each receiving a different treatment, assigned at random.
The whole experiment will be made up of N blocks. So if the sample size is, say, six, there will be six blocks of three cages, one cage per treatment, for a total of 18 cages.
Treatments need to be assigned at random to each subject within each block. This can easily be done when setting up the individual blocks by writing the treatments on cards, shuffling them and displaying the order.
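Scripting this step is equally straightforward; a minimal sketch, again assuming three treatments and six blocks as in the example above:

```python
import random

random.seed(7)  # fix the seed so the block layout can be documented

treatments = ["Lo", "Hi", "Ctrl"]
n_blocks = 6

# Within each block, assign the treatments to the cages in a fresh random order
for block in range(1, n_blocks + 1):
    order = random.sample(treatments, k=len(treatments))
    for cage, treatment in enumerate(order, start=1):
        print(f"block {block}, cage {cage}: {treatment}")
```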
The individual blocks can be set up over any time period, to suit the investigator. For example, one block could be set up per day for N days. They don’t need to be equally spaced. Spreading the work over a period of time by using the RB design could be useful if measuring the outcome is time-consuming or needs special apparatus.
The results from all the blocks are combined in the statistical analysis, which is a two-way ANOVA without interaction. The treatment means are averages of each treatment across all the blocks. The statistical analysis will indicate whether there are statistically significant treatment effects after removing the variation due to differences between the blocks. Such a two-way ANOVA should be readily available in all statistical packages.
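A minimal sketch of that analysis, using the statsmodels package and assuming the data are in long format with one row per cage (only three blocks are shown, and all values are purely illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per cage: its block, its treatment and the measured outcome
df = pd.DataFrame({
    "block":     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "treatment": ["Lo", "Hi", "Ctrl"] * 3,
    "y":         [4.1, 5.2, 3.5, 3.8, 5.6, 3.7, 4.5, 4.9, 3.4],
})

# Two-way ANOVA without interaction: tests the treatment effect
# after removing the variation between blocks
model = ols("y ~ C(treatment) + C(block)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```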
The RB design cannot generally be used in clinical trials because it requires the ready availability of matched individuals. One exception might be if identical twins were available: each twin would be assigned a different treatment, and the pair of twins would represent a “block”. This RB design, with just two treatments, is sometimes known as a “matched pairs” design.
The RB design is the most widely used design in agricultural and industrial research because it provides better control of environmental variation and is therefore more powerful than the CR design.6 One estimate, based on five RB experiments, was that a comparable CR experiment would need about 40% more animals to achieve the same power.7
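This trade-off can be illustrated with a rough power calculation using statsmodels. The effect sizes below are assumptions chosen purely for illustration (the 40% figure above comes from the cited study, not from this calculation): removing block-to-block variation shrinks the error standard deviation, which raises Cohen’s f for the same treatment effect and so lowers the required sample size.

```python
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()

# Cohen's f = (between-group SD) / (error SD). Removing block-to-block
# variation reduces the error SD, so f rises for the same treatment effect.
# Both values below are assumed purely for illustration.
scenarios = {"CR (larger error SD)": 0.40,
             "RB (block variation removed)": 0.50}

for label, f in scenarios.items():
    # Total N for 80% power at alpha = 0.05 with three treatment groups
    n_total = analysis.solve_power(effect_size=f, k_groups=3,
                                   alpha=0.05, power=0.8)
    print(label, "- total N:", round(n_total))
```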
Another advantage of the RB design is that it allows the work to be spread over any time period (hours, days, weeks or months), to suit the investigator.8 In this way, repeatability can be built into the experiment itself.
The RB design has already been used several thousand times, apparently without difficulty, in studying animal models of development. No litter of mice or rats is large enough to make up a whole experiment. So, each litter is regarded as a “block”, with pups within the litter being the experimental units. Each pup receives one of the treatments. The results from the N blocks (litters) are combined in the statistical analysis, which is a two-way ANOVA without interaction.9
A third design, which might be called “randomisation to treatment group”, is statistically invalid; it is shown in Fig. 1C. Scientists can often buy a group of animals that are all virtually identical, so they may see little point in randomising them.
However, as already noted, the research environment is not uniform. For example, if the scientist becomes more skilful as he or she measures the outcomes, this could lead to differences between groups that are only a reflection of this change in skill. So, in this design, variation in the research environment is confounded or mixed with any treatment effect, possibly leading to bias and false conclusions.
Combatting irreproducibility
The completely randomised and randomised block designs are the only experimental designs suitable for widespread use in pre-clinical research. Both have “intermingled randomisation”, in which subjects receiving different treatments are housed in randomised order within the research environment. This avoids bias in which effects of the environment are mistaken for treatment effects. Further details on how to set up and use these designs are given elsewhere.8
The “randomisation to treatment group” design (Fig. 1C) is widely used in pre-clinical research, but it is not statistically valid because it does not randomise the order in which the experiment is done. As a result, it is susceptible to bias because environmental effects can be mistaken for treatment effects, as already explained. This can lead to false-positive, irreproducible results; it may even be the main cause of irreproducibility in pre-clinical research.
The CR and RB designs have been used successfully for more than 70 years in agricultural and industrial experiments, and in clinical trials, without excessive levels of irreproducibility. They are the only statistically valid experimental designs suitable for widespread use in pre-clinical research. Scientists, funding organisations, ethical review committees and journal editors should take note, and act accordingly.
References
1. Begley CG, Ellis LM (2012). Drug development: Raise standards for preclinical cancer research. Nature 483, 531–533. DOI: 10.1038/483531a
2. Freedman LP et al. (2015). The economics of reproducibility in preclinical research. PLoS Biology 13, e1002165. DOI: 10.1371/journal.pbio.1002165
3. Cox DR (1958). Planning of Experiments. New York: John Wiley and Sons.
4. Snedecor GW, Cochran WG (1980). Statistical Methods. Ames, Iowa: Iowa State University Press.
5. Fisher RA (1960). The Design of Experiments. New York: Hafner Publishing Company.
6. Montgomery DC (1984). Design and Analysis of Experiments. New York: John Wiley & Sons.
7. Festing MFW (1992). The scope for improving the design of laboratory animal experiments. Laboratory Animals 26, 256–267. DOI: 10.1258/002367792780745788
8. Festing MFW et al. (2016). The Design of Animal Experiments, 2nd ed. SAGE.
9. Festing MFW (2006). Design and statistical methods in studies using animal models of development. ILAR Journal 47, 5–14. DOI: 10.1093/ilar.47.1.5