Statisticians clamor for retraction of paper by Harvard researchers they say uses a “nonsense statistic”


“Uh, hypothetical situation: you see a paper published that is based on a premise which is clearly flawed, proven by existing literature.” So began an exasperated Twitter thread by Andrew Althouse, a statistician at University of Pittsburgh, in which he debated whether a study using what he calls a “nonsense statistic” should be addressed by letters to the editor or swiftly retracted.

The thread was the latest development in an ongoing disagreement over research in surgery. In one corner, a group of Harvard researchers claim they’re improving how surgeons interpret underpowered or negative studies. In the other corner, statisticians suggest the authors are making things worse by repeatedly misusing a statistical technique called post-hoc power. The authors are giving weak surgical studies an unwarranted pass, according to critics.

Senior author David Chang feels “online trolls” are presenting one-sided arguments, which do not justify retraction. “If we had completely fabricated our data, that would be the only justifiable reason for retracting the study,” Chang told Retraction Watch. “There is no reason to demand retraction based on differences of opinion.”

The study Althouse is referencing, “Is the Power Threshold of 0.8 Applicable to Surgical Science?—Empowering the Underpowered Study,” was recently published in the Journal of Surgical Research by a group of Massachusetts General Hospital investigators, led by Yanik Bababekov.

By looking at the post-hoc (also called “observed”) power of negative studies, the article suggests surgical studies aren’t living up to the widely accepted goal in biomedical research of achieving 80% statistical power. Statistical power measures a study’s ability to detect real treatment effects. Studies with low statistical power, such as those with small samples, are more likely to miss real effects (false negatives), and when they do report significant findings, those findings are more likely to be exaggerated or spurious.

The authors conclude that “the surgical community should reconsider the power standard as it applies to surgery.” If these conclusions were acted upon, it might mean adopting new surgical techniques based on weaker evidence from smaller studies.

These conclusions are being questioned because post-hoc power is considered unreliable and misleading, according to statisticians. “The problem is that a post-hoc power calculation is just a transformation of the p-value,” Althouse told Retraction Watch, referring to the much-maligned statistic. “Observed power has nothing to do with the study’s actual designed power to detect a meaningful difference [from treatment].”
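Althouse’s point can be illustrated numerically. In the simplest case of a two-sided z-test, the “observed power” computed by plugging the observed effect back in as if it were the true effect is a deterministic function of the p-value alone: no information about the study’s design enters. The sketch below (the function name `observed_power` is ours, not from the paper) shows the well-known consequence that a p-value of exactly 0.05 always yields an observed power of about 50%:

```python
from statistics import NormalDist

def observed_power(p, alpha=0.05):
    """Post-hoc ("observed") power for a two-sided z-test, treating the
    observed effect as if it were the true effect. Note that the result
    depends on nothing but the p-value and alpha -- no sample size or
    minimally important effect size enters the calculation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_obs = nd.inv_cdf(1 - p / 2)        # z-score implied by the p-value
    return nd.cdf(z_obs - z_crit) + nd.cdf(-z_obs - z_crit)

# A p-value of exactly 0.05 maps to ~50% "observed power", regardless of
# the sample size or the effect the study was actually designed to detect.
print(round(observed_power(0.05), 2))  # 0.5
```

Because the mapping from p-value to observed power is fixed and monotone, reporting observed power adds no information beyond the p-value itself, which is why statisticians call it circular.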

Widely used research guidelines agree. The CONSORT guidelines, for example, state “there is little merit in a post hoc calculation of statistical power using the results of a trial.” Instead, experts recommend calculating the study’s power before it is performed, using the minimal effect that would warrant adoption of a treatment.

Going negative

Another concern statisticians raised is that the article only examined negative studies, which presents a biased view of the surgical literature. Althouse suggested that the authors practically guaranteed surgical studies would appear underpowered by only looking at negative results. This biased finding might mislead readers into thinking most surgical studies are too small to be meaningful.

Chang believes statisticians do not appreciate the practical context in which his paper was written. Surgeons, he emphasizes, misinterpret underpowered “negative” studies as evidence that two techniques are equally safe and effective. Surgeons “write these papers claiming A is as safe as B based on nothing other than p>0.05,” Chang says.

To discourage these misleading interpretations, Chang sought to use post-hoc power “as a damage control measure to inject some caution” into how doctors interpret negative studies, although he recognizes that larger changes are needed to how studies are conducted and reported.

In fact, statisticians critiquing the research have been sympathetic to this problem. Andrew Gelman, a statistician at Columbia University, wrote that the authors are “completely right” in bringing attention to the misinterpretation of negative studies. Althouse says that he agrees “they have identified a real problem,” but he feels their proposal is “not actually a solution.”

Scott LeMaire, editor of the Journal of Surgical Research, told us that the journal is “actively evaluating the comments that we have received about this paper.”

Not the first time

Last year, the same research group published a perspective in the prestigious Annals of Surgery also calling for surgical studies to include a post-hoc power calculation. An extended back-and-forth in the journal’s pages ensued (here, here, here, and here) between statisticians (including Althouse) and the authors. Bababekov and Chang appeared undeterred, replying that they “respectfully disagree that it is wrong to report post hoc power.”

Althouse, whose efforts previously led to the retraction of a cardiology study, is not alone in his concerns. Dozens of comments have been posted to PubPeer about the latest paper, calling it “completely flawed” and suggesting it “should be urgently retracted.”

“Papers that have had corrections or rebuttals issued often continue to be cited as though the rebuttals simply don’t exist,” Althouse told us. “If this paper remains published, the ripple effect that I fear is that people will still believe that this idea of post-hoc power has legitimacy.”

Gelman, who penned a letter opposing the Annals of Surgery article, wrote on PubPeer that “it is irresponsible for [the authors] to have written this new paper given that various people have already pointed out their error in print.”


3 thoughts on “Statisticians clamor for retraction of paper by Harvard researchers they say uses a ‘nonsense statistic’”

  1. From above: “Senior author David Chang feels ‘online trolls’ are presenting one-sided arguments…”

    That’s not an accurate description at all. The comments on the PubPeer thread are thoughtful and not trollish at all. It’s really sad to see a researcher not being able to take sincere, high-quality criticism.

    1. Regarding the PubPeer thread you mention, it should be stressed that almost all comments are signed, including some big names.

      Andrew D. Althouse (Pittsburgh)
      Russell V. Lenth (Iowa)
      Sander Greenland (UCLA)
      Paul M Brown (Alberta)
      Dwight Barry (Seattle)
      Zad Chow (NYU Langone Health)
      Maarten Van Smeden (Leiden)
      Ryan Miller (?)
      Pavlos Msaouel (Houston)
      Daniel E. Leisman (Northwell Health, NY ?)
      Andrew Gelman (Columbia Univ.)
      Aleksi Reito (Tampere University Hospital)
      Raj Mehta (Cincinnati ?)
      Yevgeniy Feyman (Boston)
      Samantha R. Seals (Univ. West Florida)
      David Nunan (Oxford)
      Frank E. Harrell (Vanderbilt Univ.)
      Timothy Feeney (Boston)
      Guillaume A. Rousselet (Glasgow)
      Thom Baguley (Nottingham)

      This is a very odd situation on PubPeer, where anonymity is the rule. It thus seems hard to believe that trolling is at work here. On the other hand, perhaps Prof. David Chang was referring to another medium.

  2. For those unaware, there was an update to this story in the last week.

    The original version of the paper included a Supplemental Table listing the PMID, mean effect size, and mean power for the studies in the review. A few others and I were combing through the reference list to see if we could reproduce their work, hoping to illustrate to the authors that the observed power is just a function of the p-values; perhaps they would be more convinced if they saw it using their own data. (We also needed to recreate their effort because the authors appear to have made mistakes in a few cases, so the data would not line up unless we could find and correct those mistakes.) However, sometime between the first version being posted online and last week, the Supplement containing the list was removed.

    Catriona Fennell responded with an explanation:
