A leading psychology research society in Germany has called for an end to PubPeer postings generated by a computer program that trawls through psychology papers for statistical errors, saying the postings are needlessly causing reputational damage to researchers.
Last month, we reported on an initiative that aimed to clean up the psychology literature by identifying statistical errors using the algorithm “statcheck.” As a result of the project, PubPeer was set to be flooded with more than 50,000 entries, one for each paper in the study’s sample — even those in which no errors were detected.
On October 20, the German Psychological Society (DGPs) issued a statement criticizing the effort, expressing concern that alleged statistical errors are posted on PubPeer before the authors of the original studies are contacted. The DGPs also argued that when mistakes detected by statcheck and posted on PubPeer turn out to be false positives, the resulting damage to researchers is “no longer controllable,” as entries on PubPeer cannot be easily removed.
Today, statcheck’s creators, led by Michèle Nuijten — a PhD student at Tilburg University in the Netherlands, whom we’ve previously interviewed about statcheck — responded to the DGPs’ criticisms, saying that there is value in
…openly discussing inconsistencies in published articles in an impersonal and factual manner, given that our own experiences in corresponding directly with authors about errors have not led to any documented corrections…
In their statement, the DGPs say:
…many researchers – especially those whose papers are among the 50,000 that were automatically screened – are worried about the fact that the screening of their article occurred (1) without the authors’ awareness, (2) without being able to actually verify whether the results of this screening are actually correct, and (3) without the opportunity to comment on the screening of their paper before the results were published on pubpeer. In addition, many colleagues are deeply concerned about the fact that it is obviously difficult to remove an entry on pubpeer after an error that had been “detected” by statcheck turned out to be a false positive.
The statement goes on to add:
…the detection of an alleged error necessarily requires a high level of sensitivity and cooperative intentions among all parties. Before a paper is publicly flagged for alleged statistical errors (on pubpeer or elsewhere), the authors of this paper should be given the opportunity to double-check and comment on the results of the screening. If an alleged error then turns out to be a false positive, any posts or comments in which the articles is flagged need to be removed or revoked at once.
The DGPs’ statement cites a paper posted on the preprint server arXiv earlier this month by Thomas Schmidt, a professor of experimental psychology at Technical University Kaiserslautern in Germany, which concludes:
The goal of this comment is to point out an important and well-documented flaw in this busily applied algorithm: It cannot handle corrected p values. As a result, statistical tests applying appropriate corrections to the p value (e.g., for multiple tests, post-hoc tests, violations of assumptions, etc.) are likely to be flagged as reporting inconsistent statistics, whereas papers omitting necessary corrections are certified as correct. The STATCHECK algorithm is thus valid for only a subset of scientific papers, and conclusions about the quality or integrity of statistical reports should never be based solely on this program.
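To see the kind of false positive Schmidt describes, consider a result whose p value has been Bonferroni-corrected for multiple comparisons: recomputing the nominal p from the test statistic and degrees of freedom will not match the (correctly) corrected value the authors report. Below is a minimal sketch of that scenario in Python; statcheck itself is an R package that parses APA-style result strings, so the code, numbers, and tolerance here are our illustration, not its implementation.

```python
# Hypothetical illustration of the false positive Schmidt describes.
# A recomputation-based check derives the nominal p value from the test
# statistic and degrees of freedom, so a correctly Bonferroni-corrected
# p value ends up looking "inconsistent."
from scipy import stats

t_stat, df = 2.50, 30   # reported t statistic and degrees of freedom
n_comparisons = 4       # number of tests behind the Bonferroni correction

nominal_p = 2 * stats.t.sf(abs(t_stat), df)       # ~0.018, what recomputation yields
reported_p = min(1.0, nominal_p * n_comparisons)  # ~0.072, what the authors correctly report

# A naive consistency check compares the two and flags the corrected value.
flagged = abs(nominal_p - reported_p) > 0.0005    # illustrative tolerance
print(f"recomputed p = {nominal_p:.3f}, reported corrected p = {reported_p:.3f}, flagged = {flagged}")
```

In this sketch the paper is flagged even though the authors did the statistics correctly — exactly the scenario Schmidt warns about.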
In their reply today, Nuijten and colleagues write:
We clearly noted statcheck’s shortcomings in our publications. We continue to further refine statcheck and investigate the influence of possible bugs or other problems on our estimates of the prevalence of inconsistencies in psychology (see e.g., Nuijten, 2016). We therefore welcome all researchers’ comments on the performance of statcheck. So far, no bugs have been found that noticeably affect estimates of inconsistencies in statistical results in psychology. Hence we see no reason to adapt our initial estimates, or to discourage using statcheck in scientific articles, given that researchers take into account the program’s limitations.
They conclude:
We as scientists have the obligation to correct reporting errors even if the tools we use are not 100% accurate.
As we previously reported, statcheck received a mixed response from psychologists on social media when the PubPeer project was inaugurated last month.
Today, Nuijten noted that the PubPeer project was separate from statcheck, and told Retraction Watch:
I would like to stress that statcheck cannot (and does not pretend to) say anything about intentional mistakes, misconduct, or even fraud. It is simply a tool that calculates if the degrees of freedom and the test statistic correspond with the p-value.
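For readers curious about the mechanics, here is a rough sketch of that kind of consistency check, assuming a two-tailed t test. statcheck itself is an R package that extracts APA-formatted results from text and handles several test types, so the helper function and rounding rule below are our simplification, not its code.

```python
# A minimal sketch of the consistency check Nuijten describes, assuming a
# two-tailed t test. The helper name, signature, and rounding rule are ours.
from scipy import stats

def p_is_consistent(t_stat: float, df: float, reported_p: float,
                    decimals: int = 3) -> bool:
    """Recompute the two-tailed p value from the t statistic and degrees of
    freedom, then check whether it rounds to the p value reported in the paper."""
    recomputed_p = 2 * stats.t.sf(abs(t_stat), df)
    return round(recomputed_p, decimals) == round(reported_p, decimals)

# "t(28) = 2.20, p = .036" is internally consistent; "p = .01" would be flagged.
print(p_is_consistent(2.20, 28, 0.036))  # True
print(p_is_consistent(2.20, 28, 0.010))  # False
```

Nothing in such a check can distinguish a typo from a deliberate misreport, which is the point Nuijten makes above.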
Chris Hartgerink — also a PhD student at Tilburg University and the second author of statcheck’s letter — previously told Retraction Watch about another possible application of statcheck: journals could run it on manuscripts before accepting them. Psychological Science is already running a pilot to incorporate statcheck into its reviewing process, Nuijten noted.
Update, 5 p.m. Eastern, 10/27/16: Hartgerink has forwarded us his response to the DGPs. It concludes:
In my opinion, the reports on PubPeer should be seen as part of the scientific debate, enabling authors and other researchers to check the accuracy of statistics in published articles. Post-publication review represents a powerful forum for such scientific debate. It complements traditional peer review that apparently has been unsuccessful in catching inconsistencies that statcheck can detect readily albeit not with a 100% accuracy.
The probability that Statcheck will find errors in a paper is sufficiently high that the fact that a checked paper contains no errors is (slightly) positive information (a rough numerical sketch of this point follows this comment). That is one reason for leaving those posts, which are very clearly worded.
Going forward, a compromise approach would be for such comments to be visibly tagged. A user has modified the browser plugins in this sense and this practice could be extended to the rest of the site. https://twitter.com/PubPeer/status/790273728005431296
If there are a high number of false positives, it will hopefully be possible for future runs of statcheck to post the new results.
In the meantime, our advice for all comments remains the same: all users should read the comments and make up their own mind about their veracity, importance, etc.
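A back-of-the-envelope way to see the probabilistic point made at the start of the comment above: if statcheck flags a reasonable share of genuinely inconsistent papers, then a “no errors detected” post modestly lowers the probability that a paper contains one. The prevalence, sensitivity, and false-alarm figures below are hypothetical, chosen only to show the direction of the update, and are not numbers from PubPeer or the statcheck papers.

```python
# Hypothetical Bayes'-rule illustration; none of these rates are real estimates.
prior = 0.50        # assumed share of papers with at least one inconsistency
sensitivity = 0.60  # assumed chance statcheck flags such a paper
false_alarm = 0.05  # assumed chance statcheck flags a paper with no inconsistency

# P(inconsistency | no flag) by Bayes' rule
posterior = ((1 - sensitivity) * prior) / (
    (1 - sensitivity) * prior + (1 - false_alarm) * (1 - prior)
)
print(f"prior = {prior:.2f}, posterior after a clean statcheck report = {posterior:.2f}")  # ~0.30
```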
In my view any possible reputational damage caused by statcheck is likely to be minimal, if only because everyone knows that statcheck is not perfect at this point. So I’d like to see statcheck continue but I think the posts should be tagged.
I just have to point out here that statcheck is, at best, a 7/10 at doing what it claims to do. It’s notable, worthwhile, and urgently required, and I tip my hat to the creators. But given that it’s not a 10/10, while its criterion for checking papers is a 10/10 (i.e., all the reported p-values are correct), statcheck’s own paper and performance would be flagged by its own algorithm. This is a mismatch and an imbalance of power when statcheck is applied to other papers. It also raises a fundamental issue with regard to science: a published paper is supposed to be 100% fool-proof, and even at the smallest mistake, cries of ‘retract!’ are common and the witch hunt ensues. As an engineer, I find this troubling; every piece of software has bugs, and there are multiple versions that correct previous errors. Where six-sigma certification is required, the in-house development process takes years with vast teams. An FDA-regulated drug takes a decade to come to market. But somehow, magically, scientists are expected to churn out perfect papers every 6 months to a year. The whole system is flawed top to bottom. Folks on this site and PubPeer have hit on the perfect solution: post on arXiv with version history. The day that becomes a reality, the need for statcheck will disappear. Sorry for the rant.
“even at the smallest mistake, cries of ‘retract!’ are common and the witch hunt ensues.”
Could you give an example in the field of psychology?
Calls for retraction over minor honest or careless errors are rarely part of the scientific discourse on PubPeer or elsewhere.
Indeed, stating that a paper should be retracted will very likely get your comment retracted on PubPeer.
stormchaser’s comment here is hyperbole. While the threshold is subjective, retractions generally require fabrication, falsification, or plagiarism, or else serious honest errors that affect the validity of the study.
A paper of mine gained an entry on PubPeer in this way. Statcheck had correctly identified a single incorrect p-value, which we now believe must have been the result of copy-pasting a sentence to preserve the formatting, but then updating only the statistics, not the p-value. That was sloppy of us and we are relieved that both the true value and the value published were statistically significant, with the true value more so. Although annoyed with ourselves, we were pleased that PubPeer provided us with an easy opportunity to respond and admit the mistake so that it is now documented in the scientific record; the journal itself would not have wanted to publish such a minor correction. Maybe readers will give us some credit for our honesty and any criticism that they still harbour is, after all, deserved. Authors of articles inappropriately flagged can also easily respond and explain. No doubt amongst all these non-mistakes and minor mistakes, the broad application of statcheck will also identify some serious errors that deserve a more radical response such as a published correction or retraction. This is an excellent thing.