## Publication

## The hidden costs of a poor statistical practice in clinical research

*Author: Georgi Z. Georgiev, Bulgaria*

*Georgi Z. Georgiev currently runs the OneSided project single-handedly.*

*Georgi Z. Georgiev is an applied statistician with a background in web analytics and online controlled experiments, building statistical software and writing articles and papers on statistical inference. He has experience in providing statistical tools for the online analytics and A/B testing (online controlled experiments) community and has also authored industry white papers and dozens of in-depth articles on topics concerning statistical analysis and design of experiments.*

As a person dealing with statistics in applied research, I was quite shocked to discover recently that irrelevant statistics are routinely used to estimate risks of tested treatments and pharmaceutical formulas in many clinical trials. As a result of a single bad practice applied by most clinical researchers, we fail time and time again to correctly identify good treatments or harmful effects of drugs. Even more astonishing was that this poor practice continues mostly unquestioned, as it is enshrined in countless research papers, textbooks and courses on statistical methods, and to an extent perpetuated and encouraged in regulatory guidelines.

Here I will share my findings in as simple terms as possible, but I will provide references to more detailed and technical explanations as I go along.

## Quantifying risk through clinical trials

When a new drug or medical intervention is proposed, it has to be tested before being recommended for general use. We try to establish both its efficacy and any potential harmful effect by subjecting it to a rigorous experiment that allows us to statistically model the effects of unknown factors and isolate a causal link between the tested treatment and patient outcomes.

Since any scientific measurement is prone to errors, a very important quality of clinical trials is that they allow us to estimate error probabilities for what we measure. For example, they allow us to say that “had the treatment had no true positive effect, we would rarely see such an extreme improvement in recovery rate after treatment X”.

Before conducting any trial, researchers and regulatory bodies agree on a certain level of acceptable risk by trying to balance the risk of falsely accepting a treatment that has little to no beneficial effect against the risk of falsely rejecting a beneficial treatment simply because the trial didn’t have the sensitivity to demonstrate the effect. As you can see, there is a trade-off, since requiring lower risk of false acceptance leads to higher risk of false rejection or, alternatively, to longer trial times (longer time to market / general use) and experimenting on more patients, which has both ethical and economic disadvantages.

While the process is good overall, it has some issues and the one I will focus on here is:

## Failure to map research claims to risk estimates

An example of a threshold for acceptable risk would be: “we would not want to approve this treatment unless the measurable risk of it being ineffective compared to current standard care is 5% or less”. This is the significance threshold to which an observed p-value will then be compared, or the confidence level for a confidence interval that estimates the plausible range of effects that cannot be ruled out by the trial.

That is all good, but what happens in most clinical trials is that the measurement error is reported not based on the risk threshold as defined above but based on the risk of “the treatment effect being exactly zero”, leading to inaccurate risk assessment relative to the research claim at hand.

So, the researcher might claim “treatment improves outcomes with error probability equal to 1%”, but in fact the 1% probability they report is for the claim “treatment either improves or harms outcomes”, not for “treatment improves outcomes”. In most cases the error probability that should be reported is half of the reported one, in this case 0.5% instead of 1% (two times less measurable risk!).

Researchers fail to use the appropriate statistical test since the non-directional statistical hypothesis does not match their directional research hypothesis. In statistical terms, researchers report two-sided p-values and two-sided confidence intervals, instead of one-sided p-values and one-sided confidence intervals.
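The relationship between the two kinds of p-values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test statistic is taken to be a standard normal z-score, a positive z is taken to mean the treatment arm did better, and the function name `p_values` and the value 2.576 are made up for the example rather than taken from any trial.

```python
from statistics import NormalDist


def p_values(z: float) -> dict[str, float]:
    """Return one-sided and two-sided p-values for an observed z statistic.

    Assumption: the statistic is standard normal under the null, and a
    positive z means the treatment arm performed better than control.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Directional null: "the treatment is no better than control" (effect <= 0)
    one_sided = 1.0 - std_normal.cdf(z)
    # Non-directional null: "the effect is exactly zero"
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return {"one_sided": one_sided, "two_sided": two_sided}


# Hypothetical observed statistic, chosen so the two-sided p-value is about 1%
result = p_values(2.576)
print(result)
```

With this hypothetical z-score the two-sided p-value comes out at roughly 1% and the one-sided p-value at roughly 0.5%, which mirrors the 1% versus 0.5% example in the text.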

This confusion is not limited to medicine and clinical trials, but is present in many behavioral sciences like psychology, psychiatry, economics, business risk management and possibly many others. I will keep to examples from clinical trials in this article.

## The profound effects of this simple error

You might be thinking: what is the big deal? After all, we are exposed to less risk, not more, so where is the harm? However, the cost is very real, and it goes both ways.

Firstly, we see beneficial treatments being rejected since the apparent risk does not meet the requirement. For example, the observed risk using an irrelevant (two-sided) estimate is 6%, against a 5% requirement. However, using the correct (one-sided) risk assessment we can see that the actual risk is 3%, which passes the regulatory requirement for demonstrating effectiveness.
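The arithmetic of this example can be written out as a short Python sketch. It assumes a symmetric test statistic and an observed effect in the claimed direction, so that the one-sided p-value is half the two-sided one; the names `ALPHA` and `meets_threshold` are hypothetical and only serve the illustration.

```python
ALPHA = 0.05  # acceptable risk threshold (the 5% requirement from the example)


def meets_threshold(p_value: float, alpha: float = ALPHA) -> bool:
    """Return True if the reported error probability satisfies the requirement."""
    return p_value <= alpha


two_sided_p = 0.06
# Assumption: symmetric statistic, observed effect in the claimed direction,
# so the one-sided p-value is half of the two-sided p-value.
one_sided_p = two_sided_p / 2

print(meets_threshold(two_sided_p))  # False: the treatment appears to fail
print(meets_threshold(one_sided_p))  # True: the directional claim passes
```

The same data lead to opposite decisions depending on which statistic is compared to the threshold.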

Many similar examples can be found in scientific research, including a big Phase III breast cancer trial (8381 patients recruited) which demonstrated a probable effect of up to 45% reduction of the hazard ratio; however, the treatment was declared ineffective at least partly due to the application of an irrelevant risk estimate. Had the correct risk estimate been applied, the treatment would have been accepted as standard practice if the side-effects (of which there was a noted increase) were deemed acceptable.

I’ve discussed this and several other examples, including some from other research areas, in my article “Examples of improper use of two-sided hypotheses”.

Secondly, we have underappreciation of risk for harmful side-effects. Like measurements of beneficial effects, measurements of harm are also prone to error, and a drug or intervention will not be declared harmful unless the risk of such an error is deemed low enough. After all, we do not want to incorrectly reject a beneficial treatment due to what can be attributed to expected measurement errors.

However, if we use an incorrect error estimate we will fail to take note of harmful effects that meet the regulatory risk standard, and which should have stopped the drug or intervention from being approved. Using a two-sided statistic, we might believe that the risk of harm is merely a measurement artefact while the proper one-sided statistic will show us that it exceeds the acceptable risk threshold and should be considered seriously.
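A sketch with confidence bounds shows how a harm signal can be missed. The numbers are entirely hypothetical (an observed 1.8 percentage point increase in adverse event rate with a standard error of 1.0 percentage point), a normal approximation is assumed, and the function name `lower_bounds` is invented for the example.

```python
from statistics import NormalDist


def lower_bounds(estimate: float, std_error: float,
                 confidence: float = 0.95) -> dict[str, float]:
    """Lower confidence bounds for an effect estimate (normal approximation).

    Returns the lower limit of a two-sided interval and the lower bound of a
    one-sided interval, both at the same nominal confidence level.
    """
    std_normal = NormalDist()
    alpha = 1.0 - confidence
    z_two_sided = std_normal.inv_cdf(1.0 - alpha / 2)  # about 1.96 at 95%
    z_one_sided = std_normal.inv_cdf(1.0 - alpha)      # about 1.645 at 95%
    return {
        "two_sided_lower": estimate - z_two_sided * std_error,
        "one_sided_lower": estimate - z_one_sided * std_error,
    }


# Hypothetical data: increase in adverse event rate (treatment minus control)
bounds = lower_bounds(estimate=0.018, std_error=0.010)
print(bounds)
```

With these made-up numbers the two-sided lower limit falls just below zero, so the increase in harm looks compatible with measurement error, while the one-sided lower bound stays above zero and flags the harm at the same 95% level.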

Finally, reporting irrelevant risk estimates robs us of the ability to correctly appraise risk when making decisions about therapeutic interventions. Not only are researchers and regulatory bodies led to wrong conclusions, but you and your physician are being provided with inflated risk estimates, which may preclude you from making an informed choice about the treatment route most suitable for your condition.

The last point is especially painful for me, since I’m a firm proponent of making personal calculations for risk versus potential harm, in medicine and beyond. No two people are the same, no two personal situations are the same and where one sees unacceptable risk another sees a good chance to improve their situation. Being provided with doubled error probabilities can have a profound effect on any such calculation.

## How is this possible and why does it happen?

This is a fascinating question with no simple answer, especially given that this issue is not due to an error in early statistical literature. All fathers of modern statistics recommended and widely used one-sided statistical tests. The error occurred later on in the transmission of statistical methods.

I have several probable explanations, among them the apparent paradox of one-sided vs. two-sided tests, which is indeed a hard one to wrap your head around. Another reason can be traced back to poor graphical presentation and supporting explanations for statistical tables published in the early 20th century. This issue continues to be present in a different form in modern-day statistical software.

Mistakes and inappropriate teaching methods might also lead to confusing the “null hypothesis” with the “nil hypothesis”, as well as to interpreting p-values as probability statements about the research hypothesis instead of as descriptions of the properties of the statistical procedure being used. These can easily lead to a wrong interpretation of the results from one-sided versus two-sided tests.

Whatever the reason, it is a fact that currently one-sided tests are incorrectly portrayed in books, textbooks and university courses on statistical methods, particularly in clinical trials and behavioral sciences. The bad press follows them in Wikipedia and multiple blogs and other online statistical resources. Given the large-scale negative portrayal of one-sided tests, some of which I have documented here, it is no wonder that researchers do not use them.

Another reason is the unclear regulatory guidelines, some of which (e.g. those of the U.S. Food and Drug Administration and the European Medicines Agency) are either not explicit in their requirements or specifically include language suggesting one-sided statistics are controversial. Some guidelines recommend justifying their use, which is not something requested for two-sided statistics.

This naturally leads most researchers to take what appears to be the safe road, and so they end up reporting two-sided risk estimates, perhaps sometimes against their own judgement and understanding. Peer pressure and seeing two-sided p-values and confidence intervals in most published research in their field of study probably take care of any remaining doubt about the practice.

## How to improve this situation

My personal attempt to combat this costly error is to educate researchers and statisticians by starting Onesided.org. It is a simple site with articles where I explain one-sided tests of significance and confidence intervals as best as I can, correcting misconceptions, explaining paradoxes, and so on. It also contains some simple simulations and references to literature on the topic as I am by no means the first one to tackle the problem.

My major proposal is to adopt a standard for scientific reporting in which p-values are always accompanied by the null hypothesis under which they were computed. This will both help ensure that the hypothesis will correspond to the claim and will also deal with several other issues of misinterpretation of error probabilities.
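As a rough illustration of what such a reporting standard could look like in software, the sketch below ties every p-value to the null hypothesis it was computed under. The class name `ReportedPValue`, its fields, and the output format are hypothetical and are not part of any existing standard or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedPValue:
    """A p-value that cannot be reported without its null hypothesis."""

    null_hypothesis: str  # plain-language statement of the null, e.g. "effect <= 0"
    p_value: float

    def __str__(self) -> str:
        return f"p = {self.p_value:.4f} (H0: {self.null_hypothesis})"


# Hypothetical report for a directional claim "treatment improves outcomes"
report = ReportedPValue(null_hypothesis="treatment effect <= 0", p_value=0.005)
print(report)  # p = 0.0050 (H0: treatment effect <= 0)
```

Making the null hypothesis a required field means a reader can check at a glance whether the reported error probability corresponds to the claim being made.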

Of course, it would be great if regulatory bodies could improve their guidelines. However, this is usually a slow and involved process, and it mostly reflects what is already happening in practice.

## In conclusion

I think the important point is to make use of error probabilities to measure risk where possible and to use the right risk measurement for the task. Failure to do so while under the delusion that we are in fact doing things correctly costs us lives, health, and wealth, as briefly demonstrated above. Whether it is a government-sponsored study or a privately-sponsored one, I know that in the end the money is being deducted from the wealth we acquire with blood, sweat and tears, and I see no reason not to get the best value for it that we can.

Furthermore, this poor statistical practice denies us the ability to correctly apply our own judgement to data, thus hindering our personal decision-making and that of any expert we may choose to recruit.

I’m optimistic that bringing the issue to light will have a positive effect on educating researchers and statisticians about it. I have no doubt most of them will be quick to improve their practice if it has been in error for one reason or another.
