The hidden costs of bad statistics in clinical research
Author: Georgi Z. Georgiev, Published: Aug 29, 2018
What if I told you that irrelevant statistics are routinely used to estimate risks of tested treatments and pharmaceutical formulas in many clinical trials? That we fail time and time again to correctly identify good treatments or harmful effects of drugs due to a single practice that most researchers apply without realizing its inadequacy? What if I added that this poor practice continues unquestioned as it is enshrined in countless research papers, textbooks and courses on statistical methods, and to an extent perpetuated and encouraged in regulatory guidelines?
Here I will lay out the issue in as simple terms as possible, but I will provide references to more detailed and technical explanations as I go along.
How we do clinical trials
When a new drug or medical intervention is proposed, it has to be tested before being recommended for general use. We try to establish both its efficacy and any potential harmful effect by subjecting it to a rigorous experiment.
Usually, patients with the condition to be treated are randomly assigned to a control group and one or more treatment groups. The control group receives standard care while the treatment group receives the new treatment, or standard care plus the new treatment, depending on the case at hand.
Such an experiment allows us to statistically model the effects of unknown factors and isolate a causal link between the tested treatment and patient outcomes.
Since any scientific measurement is prone to errors, a very important quality of clinical trials is that they allow us to estimate error probabilities for what we measure. For example, they allow us to say that “had the treatment had no true positive effect, we would rarely see such an extreme improvement in recovery rate after treatment X”.
Researchers and regulatory bodies agree on a certain level of acceptable risk before conducting the trial, ideally trying to balance the risk of falsely accepting a treatment that has little to no beneficial effect against the risk of falsely rejecting a beneficial treatment simply because the trial didn’t have the sensitivity to demonstrate the effect. There is an inherent trade-off: requiring a lower risk of false acceptance leads to a higher risk of false rejection or, alternatively, to longer trials (longer time to market / general use) and to experimenting on more patients, which has both ethical and economic disadvantages.
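To make the trade-off concrete, here is a minimal sketch (not from the article; the recovery rates and error levels are hypothetical) using the usual normal-approximation sample-size formula for comparing two proportions: holding the risk of false rejection fixed, every reduction in the acceptable risk of false acceptance demands more patients per group.

```python
# Rough sketch of the trade-off: approximate patients per group needed to
# compare two recovery rates (hypothetical 30% vs 40%), using the standard
# normal-approximation formula for two proportions.
from scipy.stats import norm

def patients_per_group(p_control, p_treatment, alpha, beta, one_sided=True):
    """Approximate sample size per group for a test of two proportions."""
    z_alpha = norm.ppf(1 - alpha) if one_sided else norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)  # beta = risk of falsely rejecting a beneficial treatment
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

# Demanding a lower risk of false acceptance (alpha) while keeping the risk
# of false rejection (beta) at 20% requires recruiting more patients:
for alpha in (0.05, 0.01, 0.001):
    n = patients_per_group(0.30, 0.40, alpha=alpha, beta=0.20)
    print(f"alpha = {alpha}: ~{n:.0f} patients per group")
```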
While the process is good overall, it has some issues and the one I will focus on here is the widespread use of two-sided statistical tests instead of the correct one-sided ones.
Failure to match research claims with risk estimates
As mentioned, before any given trial, the researchers and relevant regulatory bodies decide on a threshold of acceptable risk of falsely declaring a treatment effective, for example: “we would not want to approve this treatment unless the measurable risk of it being ineffective compared to current standard care is 5% or less”. So far, so good.
However, what happens in most clinical trials is that the measurement error is reported not based on the risk threshold as defined above, but based on the risk of “the treatment effect being exactly zero”. So, the researcher might claim “treatment improves outcomes with error probability equal to 1%”, but in fact the 1% probability they report is for the claim “treatment either improves or harms outcomes”, not for “treatment improves outcomes”. In most cases the error probability that should be reported is half of the reported one, or in this case 0.5% instead of 1% (2 times less measurable risk!).
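As a minimal sketch of that halving (the numbers are hypothetical and not from any trial discussed here), consider a z-statistic, whose null distribution is symmetric: the two-sided p-value is exactly double the one-sided one.

```python
# Minimal sketch (hypothetical numbers) of the doubling described above,
# using a z-statistic with a symmetric null distribution.
from scipy.stats import norm

z = 2.576  # observed standardized treatment effect (hypothetical)

p_two_sided = 2 * norm.sf(abs(z))  # risk for "effect is not exactly zero"
p_one_sided = norm.sf(z)           # risk for "treatment improves outcomes"

print(f"two-sided p = {p_two_sided:.4f}")  # ~0.0100
print(f"one-sided p = {p_one_sided:.4f}")  # ~0.0050
```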
Researchers fail to use the appropriate statistical test, therefore the statistical hypothesis does not match their research hypothesis. In statistical terms, researchers report two-sided p-values and two-sided confidence intervals, instead of one-sided p-values and one-sided confidence intervals, the latter of which would actually correspond to their claims.
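The same applies to interval estimates. Here is a sketch, assuming an approximately normal effect estimate with a hypothetical value and standard error, of how a one-sided 95% confidence interval matching the claim “the treatment improves outcomes” differs from the conventional two-sided one.

```python
# Sketch of the corresponding intervals (hypothetical estimate and standard
# error), assuming an approximately normal effect estimate.
from scipy.stats import norm

estimate, se, alpha = 0.10, 0.04, 0.05  # e.g. a 10% improvement with an SE of 4%

# Two-sided 95% CI: bounds at +/- z_{1-alpha/2} standard errors.
two_sided = (estimate - norm.ppf(1 - alpha / 2) * se,
             estimate + norm.ppf(1 - alpha / 2) * se)

# One-sided 95% CI matching the claim "the treatment improves outcomes":
# only a lower bound is needed; the interval extends to +infinity.
one_sided_lower = estimate - norm.ppf(1 - alpha) * se

print(f"two-sided 95% CI: ({two_sided[0]:.3f}, {two_sided[1]:.3f})")  # (0.022, 0.178)
print(f"one-sided 95% CI: ({one_sided_lower:.3f}, +inf)")             # (0.034, +inf)
```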
This confusion is not limited to medicine and clinical trials, but is present in many behavioral sciences like psychology, psychiatry, economics, business risk management and possibly many others. However, I’ll keep to examples from medical trials for the sake of brevity.
The profound effects of this simple error
You might be thinking: what is the big deal? After all, we are exposed to less risk, not more, so where is the harm? However, the cost is very real, and it is expressed in several ways.
Firstly, we see beneficial treatments being rejected because the apparent risk does not meet the requirement. For example, the observed risk, using an irrelevant (two-sided) estimate, is 6% against a 5% requirement. However, using the correct (one-sided) risk assessment we can see that the actual risk is 3%, which passes the regulatory requirement for demonstrating effectiveness.
Many similar examples can be found in scientific research, including a big Phase III breast cancer trial (8381 patients recruited) which demonstrated a probable effect of up to a 45% reduction in the hazard ratio; however, the treatment was declared ineffective, at least partly due to the application of an irrelevant risk estimate. Had the correct risk estimate been applied, the treatment could have been accepted as standard practice, provided the side-effects (of which there was a noted increase) were deemed acceptable.
I’ve discussed this and several other examples, including some from other research areas, in my article “Examples of improper use of two-sided hypotheses”.
Secondly, we have underappreciation of risk for harmful side-effects. Like measurements of beneficial effects, measurements of harm are also prone to error, and a drug or intervention will not be declared harmful unless the risk of such an error is deemed low enough. After all, we do not want to incorrectly reject a beneficial treatment due to what can be attributed to expected measurement errors.
However, if we use an incorrect error estimate we will fail to take note of harmful effects that meet the regulatory risk standard, and which should have stopped the drug or intervention from being approved. Using a two-sided statistic, we might believe that the observed harm is merely a measurement artefact while the proper one-sided statistic will show us that it exceeds the acceptable risk threshold and should be considered seriously.
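A sketch of this scenario, with hypothetical adverse-event counts: the two-sided p-value stays just above a 5% threshold, suggesting the excess harm could be a measurement artefact, while the one-sided p-value for “the treatment increases harm” falls below it.

```python
# Sketch with hypothetical adverse-event counts and a pooled two-proportion
# z-test: the two-sided p-value misses a 5% threshold while the one-sided
# p-value for "the treatment increases harm" crosses it.
from math import sqrt
from scipy.stats import norm

harmed_treat, n_treat = 58, 500  # hypothetical counts, treatment group
harmed_ctrl, n_ctrl = 40, 500    # hypothetical counts, control group

p_treat, p_ctrl = harmed_treat / n_treat, harmed_ctrl / n_ctrl
p_pool = (harmed_treat + harmed_ctrl) / (n_treat + n_ctrl)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))
z = (p_treat - p_ctrl) / se

print(f"two-sided p = {2 * norm.sf(abs(z)):.3f}")  # ~0.056: harm not flagged at 5%
print(f"one-sided p = {norm.sf(z):.3f}")           # ~0.028: harm flagged at 5%
```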
Finally, reporting irrelevant risk estimates robs us of the ability to correctly appraise risk when making decisions about therapeutic interventions. Not only are researchers and regulatory bodies led to wrong conclusions, but your physician and you are being provided with inflated risk estimates, which may preclude you from making an informed choice about the treatment route that is most suitable for your condition and risk tolerance.
The last point is especially painful for me, since I’m a firm proponent of making personal calculations of risk versus potential benefit, in medicine and beyond. No two people are the same, no two personal situations are the same, and where one sees unacceptable risk another sees a good chance to improve their situation. Being provided with doubled error probabilities can have a profound effect on any such calculation.
How is this possible and why does it happen?
This is a question which I found fascinating, since I’m an autodidact in statistics and even for me it was apparent that when you make a claim of a positive or a negative effect, the relevant statistic should be one-sided. I was especially stumped after discovering that the fathers of modern statistics (R. Fisher, J. Neyman and E. Pearson) all embraced and used one-sided tests to a significant extent.
So, somewhere along the road a part of the research community and, apparently, some statisticians became convinced that one-sided tests somehow amount to cheating, to making unwarranted predictions and assumptions, to being less reliable or less accurate, and so on. As a result, two-sided tests are recommended for most if not all situations, contrary to sound logic.
I have reasons to believe this is partly due to the apparent paradox of one-sided vs. two-sided tests, which is indeed a hard one to wrap your head around. Another possible issue is one I traced to the graphical presentation of statistical tables from the early 20th century, which manifests in a different form in the statistical software of today. Finally, mistakes in teaching statistics, which lead to conflating the “null hypothesis” with the “nil hypothesis” or to interpreting p-values as probability statements about the research hypothesis, are surely taking their toll as well.
These reasons are too complex to cover in depth here, but I have done so in “Reasons for misunderstanding and misapplication of one-sided tests”, if you fancy a deeper dive into the matter.
Whatever the reason, it is a fact that one-sided tests are currently portrayed incorrectly in books, textbooks and university courses on statistical methods. The bad press follows them on Wikipedia and in multiple blogs and other online statistical resources. Given the large-scale negative portrayal of one-sided tests, some of which I have documented here, it is no wonder that researchers do not use them. In fact, I am pretty sure some have not even heard of the possibility of constructing a one-sided confidence interval.
Another reason is unclear regulatory guidelines, some of which (e.g. those of the U.S. Food and Drug Administration and the European Medicines Agency) are either not explicit in their requirements or specifically include language suggesting one-sided statistics are controversial. Some guidelines recommend justifying their use, which is not something requested for two-sided statistics.
This naturally leads most researchers to take what appears as a safe road and so they end up reporting two-sided risk estimates, perhaps sometimes against their own judgement and understanding. Peer pressure and seeing two-sided p-values and confidence intervals in most published research in their field of study probably takes care of any remaining doubt about the practice.
How to improve this situation
My personal attempt to combat this costly error is to educate researchers and statisticians by starting Onesided.org. It is a simple site with articles where I explain one-sided tests of significance and confidence intervals as best as I can, correcting misconceptions, explaining paradoxes, and so on. It also contains some simple simulations and references to literature on the topic as I am by no means the first one to tackle the problem.
My major proposal is to adopt a standard for scientific reporting in which p-values are always accompanied by the null hypothesis under which they were computed, for example “p = 0.01 against the null hypothesis of a true effect less than or equal to zero” rather than a bare “p = 0.01”. This will both help ensure that the hypothesis corresponds to the claim and deal with several other issues of misinterpretation of error probabilities.
Of course, it would be great if regulatory bodies could improve their guidelines; my brief proposal for doing so can be found here. However, this is usually a slow and involved process, and it mostly reflects what is already happening in practice.
In conclusion
I think the important point is to make use of error probabilities to measure risk where possible, and to use the right risk measurement for the task. Failure to do so, while under the delusion that we are in fact doing things correctly, costs us lives, health, and wealth, as briefly demonstrated above. Whether it is a government-sponsored study or a privately-sponsored one, I know that in the end the money is deducted from the wealth we acquire with blood, sweat and tears, and I see no reason not to get the best value for it that we can.
Furthermore, this poor statistical practice denies us the ability to correctly apply our own judgement to data, thus hindering our personal decision-making and that of any expert we may choose to recruit.
I’m optimistic that bringing light to the issue will have a positive effect on educating researchers and statisticians about it. I have no doubt most of them will be quick to improve their practice if it has indeed been in error for one reason or another.