How common are calculation errors in the scientific literature? And can they be caught by an algorithm? James Heathers and Nick Brown came up with two methods — GRIM and SPRITE — to find such mistakes. And a 2017 study of which we just became aware offers another approach.
Jonathan Wren and Constantin Georgescu of the Oklahoma Medical Research Foundation used an algorithmic approach to mine abstracts on MEDLINE for statistical ratios (e.g., hazard or odds ratios), as well as their associated confidence intervals and p-values. They then analyzed whether these reported values were compatible with each other. (Wren’s PhD advisor, Skip Garner, is also known for creating similar algorithms to spot duplications.)
After analyzing almost half a million such figures, the authors found that up to 7.5% were discrepant and likely represented calculation errors. When they examined p-values, they found that 1.44% of the total were discrepant enough that the study’s conclusion would have changed (i.e., significance would have flipped) had the calculation been performed correctly.
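To see how such a cross-check can work in practice, here is a minimal sketch in Python. This is not the authors’ code; the function names, the log-normal assumption for the ratio, and the 0.5 relative tolerance are illustrative assumptions about how one might recompute a p-value from a reported ratio and its 95% confidence interval and compare it with the reported p-value.

```python
import math

def recomputed_p(ratio, ci_lower, ci_upper):
    """Recompute a two-sided p-value from a ratio and its 95% CI,
    assuming the ratio is approximately log-normally distributed."""
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.959964)
    z = math.log(ratio) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

def is_discrepant(ratio, ci_lower, ci_upper, reported_p, rel_tol=0.5):
    """Flag a result if the recomputed and reported p-values differ by more
    than a generous relative tolerance (values in abstracts are rounded)."""
    p = recomputed_p(ratio, ci_lower, ci_upper)
    return abs(p - reported_p) > rel_tol * max(p, reported_p)

# A hazard ratio of 1.50 (95% CI 1.10-2.05) recomputes to p of about 0.011,
# so a reported p of 0.01 would pass, while a reported p of 0.20 would not.
print(round(recomputed_p(1.50, 1.10, 2.05), 3))          # 0.011
print(is_discrepant(1.50, 1.10, 2.05, reported_p=0.20))  # True
```

A real pipeline would have to be far more forgiving than this sketch, since published values are rounded, confidence levels other than 95% appear, and one-sided tests exist; the point is only that the three reported numbers constrain each other enough for a mismatch to be machine-detectable.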
We asked Wren — who says he thinks automatic scientific error-checkers will one day be as common as automatic spell-checkers are now — to answer a few questions about his paper’s approach. This Q&A has been slightly edited for clarity.
Retraction Watch (RW): What prompted you to perform your study?
Jonathan Wren (JW): A portion of my research involves text mining and, prior to the study, I had anecdotally noticed the presence of errors in several different areas. For example, in a study of published URL decay I noticed that 11-12% of the URLs had spelling/formatting errors that made them invalid web links. My bachelor’s degree is in management information systems and, in management, error rates are generally just accepted as a part of life and the goal is to understand what factors contribute to making errors and how to best mitigate them. In science, errors seem to be more stigmatized. But I think it is just as important to science to understand error rates for tasks that we frequently perform. Without knowing how common errors are and how big they are, we can’t even intelligently begin to fix or prioritize them.
RW: How does your method for detecting errors work?
JW: Our study relied upon the reporting of paired, dependent values. That way, one value could be compared with the other and, if they don’t agree, then that is a problem. For example, if someone says “we found 3/10 (40%) of our patients responded” then something is wrong. Another way of detecting errors is to compare reported items to external sources to see if they match. For example, we found that at least 1% of published Clinical Trial IDs were wrong because they did not link to a valid web page at clinicaltrials.gov.
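As a rough illustration of that paired-value idea, a sketch along the following lines (the regular expression and rounding tolerance are assumptions, not the paper’s actual extraction rules) could scan text for “numerator/denominator (percentage)” statements and flag the ones where the two values disagree:

```python
import re

# Matches statements like "3/10 (40%)": a fraction followed by a percentage.
PAIR = re.compile(r"(\d+)\s*/\s*(\d+)\s*\((\d+(?:\.\d+)?)\s*%\)")

def find_discrepant_pairs(text, tol=0.5):
    """Yield (numerator, denominator, reported %, actual %) for each pair
    whose percentage is off by more than a rounding margin."""
    for m in PAIR.finditer(text):
        num, den = int(m.group(1)), int(m.group(2))
        reported = float(m.group(3))
        if den == 0:
            continue
        actual = 100.0 * num / den
        if abs(actual - reported) > tol:
            yield (num, den, reported, round(actual, 1))

sentence = "We found 3/10 (40%) of our patients responded to treatment."
print(list(find_discrepant_pairs(sentence)))  # [(3, 10, 40.0, 30.0)]
```

The external-consistency check Wren describes, such as verifying that a reported clinical trial ID resolves to a valid page at clinicaltrials.gov, follows the same logic, except the second member of the pair lives outside the paper.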
RW: Your approach to detecting errors seems primarily geared toward finding inadvertent mistakes. Do you think this approach can also help detect intentional research misconduct?
JW: The system was not designed to detect or discern intentional misconduct, but we did find that a relatively small fraction (~14%) of discrepancies seemed to be systematic. In these cases, the authors of a paper disproportionately made the same type of error in one paper more often than expected by chance, suggesting they either do not know how to correctly perform the calculations or have some kind of spreadsheet problem whereby one error is propagated to all calculations. The study did uncover one case of potential misconduct, but this was something I noticed when examining the data rather than by design.
RW: Do you think researchers will respond differently to an algorithm identifying errors in their work compared to a human reviewer?
JW: I think there might be an increased level of skepticism. But I think it will be better received in the sense that we would rather have our errors caught by an algorithm than by colleagues who might think poorly of us, particularly if the errors were only caught after publication.
RW: Is simply looking for calculation errors ignoring more serious underlying problems, such as p-hacking? Are sloppy research methods what lead to technical errors?
JW: I wouldn’t say it’s ignoring those problems, just not designed to detect them yet. As I argued in a perspective, this is just proof-of-principle that some of these problems can be solved algorithmically. I envision the algorithms’ level of sophistication will rise as time goes by and more research is done. For example, we are beginning to work on a way to detect whether an inappropriate statistical test might have been used, based on how an experimental design was described. That is a very hard problem to tackle at scale, though. But I envision the technology will proceed in a similar fashion, which is to start with the most simple, basic things that can be detected and then continue advancing both the scope and the precision of the algorithms.
I don’t think algorithms will replace human reviewers anytime soon, but there are certainly aspects to reviewing that not only can, but should, be “outsourced” to algorithms.
I am so grateful for these tools. Thanks to statcheck, I recently fixed a small handful of statistical typos in a manuscript I was working on before it was accepted. Those typos had gotten through multiple author proofreads and our editors and peer reviewers. Some kinds of errors are easily detected by a computer and hard for humans to detect. In a few years there will be no excuse for a principal investigator to send out a manuscript containing errors that could have been detected by these algorithms.
At the risk of sounding holier than thou, it appears to me that a number of the retractions reported here stem from a somewhat inadequate culture of numbers. In my field (physics), experimental data are, or at least should be, scrutinized at every step of the analysis. Obviously this alone does not prevent mistakes from happening, but errors arising from blind application of some third-party software seem to be rarer than, for example, in the life sciences. It is somewhat surprising to me to see counts of negatives mistakenly reported as positives, etc. Such mistakes ought to be caught by someone involved in the paper, the PI for example. The software needed for this is ideally located behind the eyes of the people involved in the project.