We’re pleased to present a guest post from Michèle B. Nuijten, a PhD student at Tilburg University who helped develop a program called “statcheck,” which automatically spots statistical mistakes in psychology papers, making it significantly easier to find flaws. Nuijten writes about how such a program came about, and its implications for other fields.
Readers of Retraction Watch know that the literature contains way too many errors – not least, as some research suggests, in my field of psychology. And there is evidence that the problem is only likely to get worse.
To reliably investigate these claims, we wanted to study reporting inconsistencies at a large scale. However, extracting statistical results from papers and recalculating the p-values is not only very tedious, it also takes a LOT of time.
So we created a program known as “statcheck” to do the checking for us, by automatically extracting statistics from papers and recalculating p-values. Unfortunately, we recently found that our suspicions were correct: half of the papers in psychology contain at least one statistical reporting inconsistency, and one in eight papers contains an inconsistency that might have affected the statistical conclusion.
The origins of statcheck date back to 2011, when my supervisor Jelte Wicherts and his former PhD student Marjan Bakker started checking statistical results in psychology papers. The reasons? Firstly, they thought it was strange that we have all kinds of blinding procedures during data collection to avoid expectancy effects among experimenters, but no such blinding during the analysis phase. As though the analyst is not prone to error and bias!
Secondly, they had seen a paper published in a top journal that contained so many errors that they began to wonder just how widespread such errors were. (Perhaps unsurprisingly, its results later proved not to be replicable.) Thirdly, they had already found that many psychologists were unwilling or unable to share data, which led them to study whether failure to share data was associated with reporting errors (something they later confirmed).
The only problem with this line of research was the huge amount of time it took to comb through studies and detect the errors.
Enter Sacha Epskamp. Sacha was a master’s student in the Psychological Methods department at the University of Amsterdam with a love for statistics and programming. He read one of Bakker and Wicherts’ papers from 2011, noting the frequency of statistical errors, and was inspired to automate this tedious process of manually checking p-values for consistency, using the software platform known as R. But what started out as a simple proof of concept quickly turned into a complete statistical package for R: statcheck.
Sacha created the main framework of statcheck: the algorithm that 1) converted PDF and HTML files into plain text, 2) searched the text for statistical results, 3) used the extracted numbers to recalculate the p-values, and 4) compared the reported and recalculated p-values to see whether they were consistent.
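To give a concrete sense of steps 3 and 4, here is a minimal sketch in R – my own illustration, not statcheck’s actual code – that recomputes the two-tailed p-value for a reported t-test and compares it with the reported value:

```r
# Minimal sketch of steps 3 and 4 for a two-tailed t-test: recompute the
# p-value from the reported test statistic and degrees of freedom, then
# compare it with the reported p-value. (statcheck's real consistency rule
# also accounts for how many decimals the p-value was reported with; the
# fixed tolerance below is a crude stand-in.)
check_t_result <- function(df, t, reported_p, tol = 0.0005) {
  computed_p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # two-tailed p from t and df
  consistent <- abs(computed_p - reported_p) < tol       # allow for rounding
  data.frame(df, t, reported_p,
             computed_p = round(computed_p, 5),
             consistent)
}

check_t_result(28, 1.23, 0.229)  # consistent: the recomputed p is about .229
check_t_result(28, 1.23, 0.020)  # inconsistent: .02 does not match the recomputed p
```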
At that point I became involved. I was also doing my master’s degree in Psychological Methods at the University of Amsterdam, and I thought it would be cool to learn how to create your own R package. Sacha was kind (and patient) enough to show me the ropes. And when Sacha was hired as a PhD student at the University of Amsterdam on a different project, he crowned me maintainer of statcheck.
Programming statcheck consisted of a lot of trial and error: programming something, getting an error message, then spending the rest of the day finding the cause and debugging the code.
We ran into all kinds of weird things. We needed to remove strange subscripts before we could start extracting the statistics (e.g., the use of p-rep instead of a normal p-value); we found typesetters that use an image of a mathematical symbol such as “=” instead of the ASCII character (WHY??); and there were chi-square tests that statcheck initially couldn’t read, because R doesn’t allow non-ASCII characters in the code (so no Greek letters).
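To give an idea of what that kind of pre-cleaning looks like, here is an illustrative, much simplified R snippet – not the code statcheck actually uses – that maps the Greek chi-square symbol to an ASCII placeholder and blanks out any remaining non-ASCII characters before the extraction step runs:

```r
# Illustrative pre-cleaning (not statcheck's actual code): map the Greek
# chi-square symbol to an ASCII placeholder, then replace any remaining
# non-ASCII characters with spaces.
clean_text <- function(txt) {
  txt <- gsub("\u03c7\u00b2|\u03c72", "chi2", txt)  # Greek chi + superscript or plain two
  txt <- gsub("[^\x20-\x7E]", " ", txt)             # drop remaining non-ASCII characters
  txt
}

clean_text("\u03c7\u00b2(2, N = 170) = 14.14, p < .001")
#> [1] "chi2(2, N = 170) = 14.14, p < .001"
```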
To avoid getting stuck in endless lines of code that take into account every single reporting style, we decided to only extract results reported EXACTLY in APA style (e.g., “t(28) = 1.23, p = .229”); otherwise there would be no end to the number of exceptions we’d need to program (although you would be amazed how many variations of spacing etc. all fall under APA reporting).
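As an illustration, a deliberately simplified regular expression for APA-style t-tests could look like this in R (statcheck’s real pattern is longer and also covers F, r, chi-square, and z tests, plus more spacing variations):

```r
# Simplified regular expression (not statcheck's actual pattern) that picks up
# APA-style t-test results such as "t(28) = 1.23, p = .229".
apa_t_pattern <- "t\\s*\\(\\s*\\d+\\s*\\)\\s*=\\s*-?\\d*\\.?\\d+\\s*,\\s*p\\s*[<>=]\\s*\\.\\d+"

sentence <- "The effect was not significant, t(28) = 1.23, p = .229, as predicted."
regmatches(sentence, gregexpr(apa_t_pattern, sentence))
#> [[1]]
#> [1] "t(28) = 1.23, p = .229"
```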
Once we’d worked out these (and other) issues, we had the laborious task of checking a wide range of test articles by hand, and then with statcheck. We compared the output, identified mistakes in the coding, fixed the errors, ran it again, compared it again, fixed it again, etc. (Many thanks go to Chris Hartgerink for helping in this process.)
Until finally, one day, we decided that we were done.
We knew we would never be able to program statcheck in such a way that it would be as accurate as a manual search, but that wasn’t our goal. Our goal was to create an unbiased tool that could be used to give an indication of the error prevalence in a large sample, and a tool that could be used in your own work to flag possible problems. And with statcheck we managed to do that.
It would be amazing if we managed to extend statcheck to read, for instance, epidemiological or biomedical research. In theory, that should be possible. The only thing that is absolutely essential is that a field has a very specific reporting standard for statistical results, so that we can create a regular expression that finds those specific strings of text.
For now, we will focus on fixing the bugs that will undoubtedly arise now that statcheck has been introduced to a wider audience.
If you want to work with statcheck yourself, I’d recommend first reading our paper, especially the appendix in which we perform an extensive validity study of statcheck and provide detailed examples of situations in which statcheck does or doesn’t work.
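If you just want a quick taste, a minimal session could look something like this (assuming the release on CRAN; the exact output columns may differ between versions):

```r
# Install and load statcheck (the development version lives in the GitHub
# repository linked below).
install.packages("statcheck")
library(statcheck)

# Pass it text containing APA-style results; it returns a data frame with
# the extracted statistics, the recomputed p-values, and a consistency flag.
statcheck("The effect was not significant, t(28) = 1.23, p = .229.")

# Entire papers can be checked too, for example with checkPDF() on PDF files.
```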
For more information about statcheck and how to install it, see Nuijten’s website: http://mbnuijten.com/statcheck. You can see the entire process of creating statcheck and keep track of the latest updates on GitHub: https://github.com/MicheleNuijten/statcheck. The GitHub repository tracks every single thing the developers change, add, or remove from the package. And even better: it allows anyone who is interested to “fork” the code and make his or her own adaptations. Nuijten thanks Jelte Wicherts and Sacha Epskamp for refreshing her memory about some of the earlier events in this five-year process.