In the beginning, there was Scott Reuben.

Well, not quite. Reuben, a Massachusetts anesthesiologist who fabricated data and briefly topped our list of most-retracted authors, didn’t invent research fraud, although he did spend six months in federal prison for his crimes. But his case was in no small measure responsible for the birth of this blog, and, well, the rest of human history that followed.

Although Reuben’s retractions are behind him now (his count ends at 22) and other scientists, including two anesthesiologists, Joachim Boldt and Yoshitaka Fujii, have eclipsed or likely soon will dramatically eclipse his mark, a new paper has revisited his publications to see whether statistical evidence of data manipulation could be identified. It’s the same kind of effort that Ed Yong highlighted as noteworthy about the Dirk Smeesters case, which we covered yesterday and which involved an anonymous, statistically inclined whistleblower.

Before we get to whether there was evidence of such manipulation (you already know the answer), we note this approach with a mix of enthusiasm and apprehension. Enthusiasm, because it’s precisely the kind of thing post-publication peer review entails; apprehension, because, just as plagiarism software can become a tool of inconclusive, and potentially abusive, fishing expeditions, this sort of analysis could in the wrong hands become a weapon for vendetta, intimidation and other unintended ends.

Okay, back to the study, which appeared in *Der Anaesthesist*, a German-language journal. The researchers, themselves anesthesiologists from Switzerland, used something called the Newcomb-Benford law to probe the likelihood that Reuben’s data in his 21 retracted studies (the last one was a letter to the editor) were experimentally derived. As they explain:

The Newcomb-Benford law was initially used for financial audits and by tax authorities to detect fraud in filed statements or declarations. Deviations from the expected digit frequencies provoke detailed analysis of data by tax authorities. Until now this law has not been used for the statistical review of data extracted from medical papers although abstracts were recently investigated.

So how does that apply in practice?

Financial auditors and tax authorities use statistical methods for data analysis to detect fraud based on the observations described by Newcomb and Benford and numbers from natural sources show a counter-intuitive frequency distribution, which has also been shown for medical data. According to Benford’s law the digit 1 appears as the leading number to the left of the decimal point more often compared to digits 2–9. …

The text and tables of the 21 retracted articles were manually screened for leading digits and numbers. Extracted numbers were transferred to an Excel spreadsheet and the occurrence of each number determined using the built-in Excel functions (Microsoft® Office Excel 2003). In order to reduce keyboard errors, extractions were performed 3 times and the results compared to each other. The frequencies of digits 1–9 as the leading digit to the left of the decimal point and digits 0–9 as the digit in the second leading position were determined.
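As a rough illustration of the procedure described above (and not the authors' Excel workflow), here is a minimal Python sketch of a first-digit check. Under Benford's law the expected share of leading digit d is log10(1 + 1/d), so about 30% of values should start with a 1. Several things here are assumptions: the function names are made up for this example, "leading digit" is read as the first nonzero digit, the chi-square goodness-of-fit statistic is one common way to measure deviation (the quoted passage does not say which test was used), and the input is a stand-in sequence of powers of two rather than numbers from any paper.

```python
import math
from collections import Counter

# Benford's expected share for each leading digit d: log10(1 + 1/d).
BENFORD_FIRST = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(number_text):
    """Return the first nonzero digit of a number given as plain decimal text."""
    digits = "".join(ch for ch in number_text if ch.isdigit()).lstrip("0")
    if not digits:
        raise ValueError(f"no nonzero digit in {number_text!r}")
    return int(digits[0])


def chi_square_vs_benford(number_texts):
    """Chi-square statistic comparing observed first digits with Benford."""
    counts = Counter(first_digit(t) for t in number_texts)
    n = sum(counts.values())
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD_FIRST.items()
    )


# Stand-in input (an assumption, not data from any paper): powers of two,
# a sequence whose leading digits are known to follow Benford's law closely.
powers_of_two = [str(2 ** k) for k in range(1, 201)]
statistic = chi_square_vs_benford(powers_of_two)

# 5% critical value of the chi-square distribution with 8 degrees of freedom
# (nine digit categories minus one).
CRITICAL_5_PERCENT_DF8 = 15.507
print(f"chi-square = {statistic:.2f}; flag = {statistic > CRITICAL_5_PERCENT_DF8}")
```

A statistic above the critical value only says the digits deviate from Benford's distribution. On its own it says nothing about why they deviate.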

They continue:

The procedure found anomalies in 19 of the 20 papers, therefore the approach seems to be sensitive. …

If this method is confirmed in further studies any statistical analysis with deviations from Benford’s law will throw up a red flag. Recent focus has been on the detection of plagiarism by database comparisons and substantive plausibility. Even if fabricated data sets are generally accepted as being identified by statistical analysis with reference to Benford’s law, falsifiers will not likely be able to easily create Benford-compliant records.

The authors also include this aside:

It could be argued that this paper is not suitable for publication in an anesthesiology journal and would be better suited to an applied statistics journal; however, anesthesiology journals missed these cases of fraud and will need to learn to identify other cases of fraud in the future.

In fact, as we have reported, the journal *Anaesthesia* published its own statistical investigation of Fujii’s studies, which found what appears to be ironclad evidence of fraud.

The notion that a Newcomb-Benford analysis might be useful for catching science cheats isn’t new, either. In 2007, Andreas Diekmann published a paper in the *Journal of Applied Statistics* making the case. As his article, “Not the First Digit! Using Benford’s Law to Detect Fraudulent Scientific Data,” states:

Is it possible to apply Benford tests to detect fabricated or falsified scientific data as well as fraudulent financial data? We approached this question in two ways. First, we examined the use of the Benford distribution as a standard by checking the frequencies of the nine possible first and ten possible second digits in published statistical estimates. Second, we conducted experiments in which subjects were asked to fabricate statistical estimates (regression coefficients). The digits in these experimental data were scrutinized for possible deviations from the Benford distribution. There were two main findings. First, both digits of the published regression coefficients were approximately Benford distributed or at least followed a pattern of monotonic decline. Second, the experimental results yielded new insights into the strengths and weaknesses of Benford tests. Surprisingly, first digits of faked data also exhibited a pattern of monotonic decline, while second, third, and fourth digits were distributed less in accordance with Benford’s law. At least in the case of regression coefficients, there were indications that checks for digit-preference anomalies should focus less on the first (i.e. leftmost) and more on later digits.

Although the Swiss group cited one Diekmann article on detecting fraud, it’s not this one.
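Diekmann's point about later digits is easier to appreciate with the expected second-digit distribution in hand. The sketch below computes it from the standard generalization of Benford's law, in which a second digit d2 can follow any first digit d1 from 1 to 9. The function name is hypothetical, and nothing here reproduces Diekmann's experiments; it only shows the reference distribution a second-digit check would compare against.

```python
import math


def benford_second_digit_probs():
    """Expected Benford share of each second digit 0-9.

    A second digit d2 can follow any first digit d1, so its share is the
    sum over d1 = 1..9 of log10(1 + 1 / (10*d1 + d2)).
    """
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }


# First-digit shares for comparison: log10(1 + 1/d) for d = 1..9.
first = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
second = benford_second_digit_probs()

# Print both distributions side by side; 0 cannot be a first digit.
for d in range(10):
    first_share = f"{first[d]:.3f}" if d in first else "  -  "
    print(d, first_share, f"{second[d]:.3f}")
```

The second-digit distribution is much flatter than the first-digit one, so a deviation in the second digit is smaller in absolute terms and needs more numbers to detect.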

This German study has a feel of cargo cult science. They take the case of a known fraudster and demonstrate that he might be a fraudster. Why not analyze the output of other scientists in the field as well, to see whether they pass the test with flying colors? If they don’t, either call these people fraudsters or question your own methodology.

I think you’re looking at this a bit too black and white. The authors state that by applying the Newcomb-Benford test, the fraudster’s papers would have thrown up a red flag. And they argue that the anaesthesiology field needs a way to flag suspicious papers.

I agree that they should have included currently non-suspect articles, anonymized, of course, to also evaluate the specificity of the method insofar as that’s possible without a real ground truth. But they are not allowed to start pointing fingers and yelling ‘fraudster’. That’s exactly the kind of abuse of the method that A&I point out in the third (real) paragraph.

The problem with the Newcomb-Benford law is that it should only be applied to data that encompasses a number of orders of magnitude, for instance, 10s, 100s, 1000s and so on. Very few RCTs in humans contain variables that cover multiple orders of magnitude. This is why I didn’t use it when analysing data from Fujii et al. It will be interesting to read the paper and see whether they have considered this.
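The range argument in the comment above can be checked with a small Python sketch. It compares leading-digit shares for a variable confined to a narrow range with those of a sequence spanning many orders of magnitude. Both inputs are assumptions chosen for illustration: the integers 60 to 99 are a hypothetical stand-in for something like a heart rate, and the powers of two stand in for wide-ranging data. Neither comes from any study.

```python
from collections import Counter


def first_digit_shares(values):
    """Share of each leading digit 1-9 among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical narrow-range variable: every value lies between 60 and 99,
# so the leading digit can only be 6, 7, 8 or 9.
narrow = list(range(60, 100))

# Wide-range sequence: 2**0 through 2**99 covers about 30 orders of magnitude.
wide = [2 ** k for k in range(100)]

narrow_shares = first_digit_shares(narrow)
wide_shares = first_digit_shares(wide)

print("share of leading 1, narrow range:", narrow_shares[1])
print("share of leading 1, wide range:  ", wide_shares[1])
```

Perfectly honest data of the narrow kind would fail a first-digit Benford test, which is why the choice of variables matters before any deviation is read as a red flag.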

According to the abstract, “the digits of the published regression coefficients were approximately Benford distributed.” So they applied the law to the regression coefficients, I guess. Can’t access the full paper.

19 out of 20 is a sensitivity number. We need to know the specificity, that is, how many false alarms there are. Plus, every time something like this is published it is easy for fakers to adapt and make the model useless.

Showing that your test detects fraud in known frauds isn’t a proof of the test. You would have to show that it correctly rejects fraud in “comparable” real studies, i.e. using the same techniques, stats, and units.