Clinical Studies Could Benefit from a Different Statistical Strategy

An alternative statistical strategy developed by Cornell researchers can improve the reliability and trustworthiness of clinical trials while simultaneously addressing the scientific community’s “replicability crisis.”

In a new paper published this month in the Proceedings of the National Academy of Sciences, Cornell researchers advance the “fragility index,” a method gaining traction in the medical community as a supplement to the p-value. The p-value is a probability measurement that has been used across science since the 1920s and is cited, sometimes recklessly, as evidence of sound results.

“Clinicians trust that the procedures and protocols they carry out are informed by sound, clinical trials. Anything less makes surgeons nervous, and rightly so,” said Martin Wells, the Charles A. Alexander Professor of Statistical Sciences and a paper co-author. “We’re discovering that many of these consequential trials that showed promising results and that were published in top journals are fragile. That was a disconcerting surprise that came out of this research.”

The paper, co-authored by Cornell statisticians and doctors from Weill Cornell Medicine and the University of Toronto, proposes a new statistical toolkit that uses the fragility index as an alternative method to help researchers determine whether their trial results are strong and reliable or simply a result of chance.

“When you tell the world a treatment should or shouldn’t be used, you want that decision to be based on reliable results, not on results that can swing one way or another based on the outcomes of one or two patients,” said Benjamin Baer, Ph.D. ’21, a paper co-author and currently a postdoctoral researcher at the University of Rochester. “Such results can be considered fragile.”

Randomized clinical trials are required to determine the effectiveness of surgical operations and medicinal therapies. For decades, researchers have relied on an often-misunderstood metric called the p-value to determine whether trial results are significant or merely coincidental.

However, over the last 15 years, skepticism has grown about the p-value’s dependability when it is used alone, without supporting methodologies, because trial results that were initially judged strong have failed to replicate in follow-up trials.

In 2014, researchers used the fragility index to assess 400 randomized clinical trials and found that 1 in 4 trials with “statistically significant” p-values had disturbingly low fragility scores, indicating less dependable outcomes.

“One can see why there is a replication crisis in science. Researchers find good results, but they don’t hold up,” Wells said. “These are serious, large trials studying cutting-edge issues, with findings published in top journals. And yet, some of these big trials have low fragility indices, which raises the question of the results’ reliability.”

Cornell researchers have devised a solution by refining the fragility index, which measures how many patient outcomes would have to change to flip a trial’s result between statistically significant and not.

The lower the fragility index, the more fragile and less reliable the results. For example, a trial with 1,000 participants and an extraordinarily low fragility index could swing between statistically significant and insignificant based on the outcomes of just a few patients.
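To make the idea concrete, here is a minimal sketch of the classic fragility index for a trial with two arms and a binary, event-or-not outcome. It is an illustration only, not the generalized method from the paper. It assumes a two-sided Fisher exact test, a 0.05 significance threshold, and the common convention of converting non-events to events, one patient at a time, in the arm with fewer events. The function names `fisher_exact_p` and `fragility_index` are hypothetical, chosen for this example.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Count the outcome flips needed to lose significance.

    Returns None when the trial is not significant to begin with.
    Assumption: flips are made in the arm with fewer events.
    """
    events = [events_a, events_b]
    sizes = [n_a, n_b]

    def p_value():
        return fisher_exact_p(events[0], sizes[0] - events[0],
                              events[1], sizes[1] - events[1])

    if p_value() >= alpha:
        return None
    low = 0 if events[0] <= events[1] else 1
    flips = 0
    while p_value() < alpha and events[low] < sizes[low]:
        events[low] += 1  # one patient's non-event becomes an event
        flips += 1
    return flips


# Toy example (made-up numbers): 0 of 4 events versus 4 of 4 events.
print(fragility_index(0, 4, 4, 4))  # → 1
```

In the toy example, changing a single patient's outcome raises the p-value from about 0.029 to about 0.143, so the result is no longer significant at 0.05 and the fragility index is 1.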

The fragility index has been criticized for its rigidity since its inception in the 1990s. It applies only to data with two research groups, treatment and control, and a binary, event-or-not outcome.

The new study proposes a more adaptable fragility index that can be used to assess any type of outcome and any number of explanatory variables. Thanks to the team’s method, researchers from all fields can now calculate the fragility index based on the likelihood of specific outcomes.

“The traditional framing of statistical significance in terms of yes-no is overly simplistic, and the problems we’re investigating aren’t,” said Dr. Mary Charlson, the William Foley Distinguished Professor of Medicine at Weill Cornell Medical College and a paper co-author.

“With each clinical situation, there are different contexts you’re dealing with. This method allows us a way to test assumptions and consider implications of a much narrower range of outcomes.”