Retraction Watch

Tracking retractions as a window into the scientific process

Here’s why more than 50,000 psychology studies are about to have PubPeer entries


PubPeer will see a surge of more than 50,000 entries for psychology studies in the next few weeks as part of an initiative that aims to identify statistical mistakes in the academic literature.

The detection process uses the algorithm “statcheck” — which we’ve covered previously in a guest post by one of its co-developers — to scan just under 700,000 results extracted from the large sample of psychology studies. Although the trends in the present data have yet to be explored, previous research suggests that around half of psychology papers contain at least one statistical error, and one in eight contain mistakes that affect their statistical conclusions. In the current effort, the results of the checks are posted to PubPeer regardless of whether any mistakes are found, and authors are alerted by email.
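In outline, the check is simple: parse a reported result, recompute the p-value from the test statistic and degrees of freedom, and compare it to the reported p while allowing for rounding. Here is a minimal sketch of that idea, shown for a z statistic only (statcheck itself is an R package that also handles t, F, r, and χ² tests; the function names below are illustrative, not statcheck’s):

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def check_z_result(z, reported_p, decimals=2):
    """Return True if the reported p could be the recomputed p rounded
    to `decimals` places; False marks a possible reporting error."""
    recomputed = two_tailed_p(z)
    return abs(recomputed - reported_p) <= 0.5 * 10 ** (-decimals)

print(check_z_result(1.96, 0.05))  # consistent -> True
print(check_z_result(1.96, 0.01))  # flagged    -> False
```

Even this toy version shows where false alarms can come from: any legitimate adjustment applied to p after the test statistic is computed (a Bonferroni correction, for example) breaks the expected relationship between the statistic and the reported p.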

The initiative is one of the largest post-publication peer review efforts of its kind to date. Some researchers, however, are concerned about its process for flagging potential mistakes, particularly the fact that potentially stigmatizing entries are created even when no errors are found.

Chris Hartgerink, a PhD student at Tilburg University in The Netherlands, has posted a preprint outlining the process he and others used to mine just under 700,000 results from a sample of more than 50,000 papers, all of which were run through statcheck.

The project has value for readers as well as individual academics, who can fix any mistakes in their papers accordingly, Hartgerink told Retraction Watch.

Some researchers welcomed the project on social media:

Not all researchers see the program in a positive light, however. For example, two papers co-authored by prominent psychologist Dorothy Bishop, who is based at the University of Oxford, UK, have so far been flagged by statcheck. In one of them, the program detected no statistical mistakes at all. Bishop was unhappy that the paper was flagged despite no errors being found, and took to Twitter to express her concern:

She told us:

The tone of the PubPeer comments will, I suspect, alienate many people. As I argued on Twitter, I found it irritating to get an email saying a paper of mine had been discussed on PubPeer, only to find that this referred to a comment stating that zero errors had been found in the statistics of that paper.

As for the other paper, in which statcheck found two allegedly incorrect results, Bishop said:

I’ll communicate with the first author, Thalia Eley, about this, as it does need fixing for the scientific record, but, given the sample size (on which the second, missing, degree of freedom is based), the reported p-values would appear to be accurate.

Bishop would like to see statcheck validated:

If it’s known that on 99% of occasions the automated check is accurate, then fine. If the accuracy is only 90% I’d be really unhappy about the current process as it would be leading to lots of people putting time into checking their papers on the basis of an insufficiently sensitive diagnostic.

Hartgerink said he could see why many researchers may find the process frustrating, but noted that posting PubPeer entries when no errors were detected is also “valuable” for post-publication peer review. Too often, post-publication peer review is depicted as only questioning published studies, and not enough emphasis is put on endorsing sound content, he said.

Furthermore, he noted that statcheck is by no means “definitive,” and its results always need to be checked manually. A few authors, for example, have commented on PubPeer that their papers didn’t contain the flagged mistakes, said Hartgerink; in those cases, the errors appeared to lie in the algorithm itself.

Hartgerink therefore recommends that researchers always check whether errors highlighted by statcheck actually exist. If they do, researchers can then consider contacting journal editors and issuing corrigenda where necessary, he said.

Looking ahead, Hartgerink thinks it wouldn’t hurt for journals to run statcheck on manuscripts before accepting them. Michèle Nuijten, who is also a PhD student at Tilburg University and the author of the November 2015 Retraction Watch guest post about statcheck, is speaking with several journal editors about piloting the algorithm as part of the review process, Hartgerink explained.

Originally, Hartgerink, whose doctoral research focuses on detecting potential data anomalies, intended to use the current data as a baseline of the psychology literature for other projects; for instance, one of his projects, on how extreme certain results are, draws on the present data.


Written by Dalmeet Singh Chawla

September 2nd, 2016 at 11:35 am

  • Mark Underwood September 2, 2016 at 1:10 pm

    This is a helpful post.

    Overall, the use of intelligent software to improve scientific processes is to be lauded.

    But there must be adequate transparency and provenance in these, as in any use of algorithmic reasoning systems. We know from work in the NIST Big Data Working Group that provenance and transparency are often weak as algorithms make their way from R, SPSS or TensorFlow models to third parties with greater or lesser practitioner expertise.

    Using the tools earlier in the process, as mentioned in this piece, will produce better results, but only if it is coupled with similar controls on provenance and appropriate use. For instance, would this tool be approved by its authors for 50K papers en masse? As mentioned here, “. . . statcheck is by no means “definitive,” and its results always needs to be manually checked.”

    It’s not behind a pay wall (a provenance-strengthener), so here’s the original paper cite:

    M. B. Nuijten, C. H. J. Hartgerink, M. A. L. M. van Assen, S. Epskamp, and J. M. Wicherts, “The prevalence of statistical reporting errors in psychology (1985–2013),” Behavior Research Methods, pp. 1-22, 2015.

    The goal should be to improve the quality of work, speed editorial workflow, improve objectivity — not sabotage careers, or trade one family of faulty inference-making for another.

  • Ken September 2, 2016 at 7:40 pm

    Probably a lot of these are from researchers who could find the right p value, but couldn’t find the correct statistic and definitely not the correct degrees of freedom. Typos are also more likely to occur in the test statistic as they aren’t as easy to interpret.

  • Paul Brookes September 2, 2016 at 11:29 pm

    Is PubPeer doing this as part of the normal front page feed, or in a separate channel? If the former, that’s really going to screw up navigating the site for those of us who are not in this area of science. A separate channel for these outcomes would be one possibility, or the ability to toggle a filter to remove these hits from the main page list.

    • Chris Hartgerink September 3, 2016 at 3:17 am

      Hi Paul,

      PubPeer made sure these wouldn’t flood the recents page. Only when people start responding to them will they get on there.


      • Sylvain Bernès September 3, 2016 at 1:15 pm

        The mechanism is still unclear: if threads are filtered on the basis of people’s comments, all threads must appear at some point on the main PP page, which is generally limited to 75-100 entries.

  • Michele Nuijten September 3, 2016 at 5:20 am

    As a reply to Dorothy Bishop’s concerns: an extensive validity check in which statcheck results are compared with manually extracted results can be found in our (open access) paper:

    Furthermore, statcheck is an open source program and all its code is on GitHub:

    We try to make statcheck as sensitive & specific as possible, and any suggestions or direct adaptations in the code via GitHub are welcome!

    I fully agree with Chris that an automated program is by no means definitive and for strong conclusions results should always be double checked.

  • Nick September 3, 2016 at 9:47 am

    My (pre-academic) experience of deploying big, automated tools like this is that you always find new, but less-prevalent, bugs as you go along. So while I generally agree with the principle of this initiative, I think it might be better to check 1000 articles first and post automated feedback on the 70 or so with errors; then multiply the sample size by 4 and repeat; and so on up to 700,000. This will probably substantially reduce the number of false positives, as each round will probably reveal new problems that had previously been below the radar due to low prevalence.

  • Paul Thompson September 3, 2016 at 10:31 am

    I am not 100% comfortable with this. OK, if no mistake is found, a comment is posted: “No mistake found”. 50,000 posts later, who is paying attention to this? This is a “boy-who-cries-no-wolf” situation – he’s always saying something. Speech, when uttered, is a signal. If the signal is “no problem”, pretty soon no one will pay attention. If there’s no problem, Grice says we should have no speech, since the assumption is no problem.

  • Aidan September 3, 2016 at 11:12 am

    It’s just spellcheck, but for agreement between F/t and p values, right? I’d be interested to know what portion of the mistakes it finds are typos vs. rounding errors vs. true statistical errors. For instance, not many folks are calculating p by hand these days.

    • Paul Thompson September 3, 2016 at 1:00 pm

      If you are following rules for reproducible research, you should be porting tables directly from software into the MS, not doing any manual manipulation. As such, I think the inaccuracy of a p value as a function of the F/t value is of modest interest at best. What is more important is the correct selection of tests, the use of inappropriate methods for repeated measures analysis, the lack of concern about litter effects in most basic biology analysis, the fact that most basic biologists are not enthused about statistics but do their own statistics anyway, and issues like that. Put another way, I trust the calculation of the p value more than I trust the calculation of the observed test statistic.

  • BB September 4, 2016 at 6:37 am

    50,000 entries? That’s literally spamming. PP has been declining in quality – I can’t even find the old mypubpeer section anymore, I don’t get any alerts for responses, and now this sounds like a disaster.

    A PP entry needs some kind of personal insight, an automatically generated report is not the same. “In Fig 9 the authors may have incorrectly assumed statistical significance”

    If 50,000 studies are affected, it is almost the same as if there were none. It’s easy to hide in the crowd, and I assume that the authors of truly problematic papers will find it a LOT easier to shun unwelcome criticism.

    • Anonymous September 4, 2016 at 3:59 pm

      PubPeer has been declining in quality for a long time now. The site is spammed with ‘problematic’ Western blot images. It is very difficult to find genuinely insightful posts there.

  • Robin Mayes September 16, 2016 at 12:04 pm

    It is easy to write software that can search for and find text and numbers, but having it understand what they mean is still in its infancy. As I understand it, StatCheck requires all statistical results to be in strict APA format. Not likely. Given that publishers cannot even agree on document metadata, it is unlikely that authors would fully understand and use APA stat reporting formats. Moreover, if an article had only one statistic this might work, but when there is a plethora of reported stats across several research questions, we are talking about a seemingly infinite number of possible outcomes.
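The format dependence Mayes describes is easy to demonstrate: an extractor of this kind can only recognize results written exactly the way it expects. A toy sketch (this hypothetical regular expression is far cruder than statcheck’s actual parser):

```python
import re

# Hypothetical pattern for one APA-style t-test report,
# e.g. "t(28) = 2.20, p = .036".
APA_T = re.compile(
    r"t\((?P<df>\d+(?:\.\d+)?)\)\s*=\s*(?P<stat>-?\d+\.\d+),\s*"
    r"p\s*(?P<rel>[<>=])\s*(?P<p>0?\.\d+)"
)

m = APA_T.search("The effect held, t(28) = 2.20, p = .036, d = 0.42.")
print(m.group("df"), m.group("stat"), m.group("p"))  # 28 2.20 .036

# The same result in a non-APA layout silently fails to match:
print(APA_T.search("t = 2.20 (df = 28), p = .036"))  # None
```

The second case is the crux of the objection: a paper that reports its statistics in a perfectly valid but non-APA layout is simply invisible to the checker, rather than being checked and cleared.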

  • Thomas Schmidt September 30, 2016 at 1:56 pm

    STATCHECK CANNOT DEAL WITH CORRECTED P-VALUES. This is one of the reasons the program currently sends out large numbers of false alarms. In the two papers of mine that have so far been flagged for multiple mismatches between F and p values, it turns out that almost all instances come from reporting Huynh-Feldt-corrected p values together with the original, uncorrected degrees of freedom, as is customary. (To be fair to the algorithm, there are two or three instances in its report that might point to genuine mistakes, and we are going to check them. One of the two reports, however, seems to consist entirely of false alarms about corrected p values.)

    The problem apparently applies to the Greenhouse-Geisser and Huynh-Feldt corrections, corrections for multiple testing (e.g., Bonferroni test, Tukey tests) and corrections for post-hoc testing (e.g., Scheffé tests) – basically any correction that is applied to the p value after the other test statistics have been calculated. Put another way, the algorithm seems to systematically flag papers for applying the necessary corrections to the p value, but to clear papers that omit those corrections.

    Note that the trouble with corrected p values has been explicitly acknowledged in the authors’ paper on the Statcheck project, where it is stated that “statcheck will miss some reported results and will incorrectly earmark some correct p-values as a reporting error” (Nuijten et al., 2015, p. 3, left column).

    Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, pp. 1-22. doi:10.3758/s13428-015-0664-2

  • David Levine September 30, 2016 at 9:42 pm

    It is great to have “spellcheck for test statistics.”

    At the same time, the headline is the quote: “We found that half of all published psychology papers … contained at least one p-value that was inconsistent with its test,” which Nuijten and her co-authors reported in 2015 in the journal Behavior Research Methods. And every author on the flagged list now has to justify the research.

    I like papers that adjust for clustering, multiple comparisons, autocorrelation, and so forth. Thus, I would have preferred a more conservative test that dropped articles with words indicating a possible adjustment of the degrees of freedom, test statistics, or critical values. My word list includes “multiple comparison,” “random effects,” “clustering,” “Bonferroni,” “Scheffe,” “Tukey,” “Huynh-Feldt,” “adjusted degrees of freedom,” “adjusted p value,” “longitudinal,” etc. (with multiple spellings and hyphens…) I assume skimming 100 articles with inconsistencies might uncover other terms.

    The authors found that “Bonferroni” and “Huynh Feldt” were rare strings; I still would have preferred they drop any articles with those terms, and my longer list. As other commentators note, we want to encourage authors to do appropriate adjustments, not make them defend themselves for having a different answer than Excel gives.

    As a proofreading tool, it would be great to have stricter settings, so authors can see flags for anomalies that are probably OK but worth double-checking. I would prefer that almost all publicized anomalies remain problematic after checking.
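A first pass at the conservative filter Levine describes could look like this (the term list and function name are illustrative, not part of any released tool):

```python
# Illustrative subset of Levine's proposed word list; a real filter
# would need multiple spellings, hyphenation variants, and accents.
ADJUSTMENT_TERMS = [
    "multiple comparison", "random effects", "clustering",
    "bonferroni", "scheffe", "tukey", "huynh-feldt",
    "adjusted degrees of freedom", "adjusted p value", "longitudinal",
]

def mentions_adjustment(article_text):
    """True if the article mentions any term suggesting its p-values
    may have been legitimately adjusted; such articles would be
    dropped before automated flags are publicized."""
    lower = article_text.lower()
    return any(term in lower for term in ADJUSTMENT_TERMS)

print(mentions_adjustment("All p-values were Bonferroni corrected."))  # True
print(mentions_adjustment("An independent-samples t-test was used."))  # False
```

The trade-off is deliberate: the filter will also drop some papers whose p-values were never adjusted, but as Levine argues, fewer publicized false alarms are worth some missed checks.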

  • PP October 21, 2016 at 6:01 am

    I can see the value of this in principle – for identifying real errors, for checking one’s own manuscripts, etc. No doubt, really useful.

    However, I do not at all see why anyone would want to produce – probably thousands of – posts that claim statistical errors where authors have rounded decimals of p-values. Is there any agreed-upon standard that five decimals have to be reported in Psych papers? Did I miss that? If not, it might make a lot of sense to correct this, I think – if only to reduce global workload for the respective authors of ‘cleaning up’ behind these guys …

    Second point – science is an endeavor involving many, many individuals. It might be good practice to at least try to establish some kind of consensus among interested researchers involved in such topics before rolling out such large-scale initiatives.

    Third – has there been any response from the authors of the algorithm concerning the multiple comparison issue pointed out by Thomas Schmidt? This is a second-semester undergraduate thing every psych student learns – how come this was not taken into account before making this project public? It would be interesting to get some feedback here from the authors.

