Bloodhound code sniffs out copied-and-pasted numerical data


Markus Englund, a software developer and sleuth based in the Netherlands, first hit paydirt with invasive plant species in China. After scanning 12 other published scientific datasets with his novel detection software and finding nothing, he came across one showing something suspicious: rows and rows of measurements of plant roots repeated across entirely different species.

“I was really excited,” he said in a recent call with Retraction Watch. “I couldn’t think of any innocent explanation for why that would be the case.” 

Englund had built a tool dedicated to “purging” fabricated data by identifying “impossible” data in spreadsheets available on open repositories, according to Science Detective, his site about the initiative. From his initial review, he has found 18 datasets containing duplicated values that are possibly serious enough to need correcting — including one from an influential paper on Parkinson’s disease, as The Transmitter recently reported. (Retraction Watch’s cofounder Ivan Oransky is that publication’s editor-in-chief.)

The idea came from reading Retraction Watch. Englund found himself particularly interested in the cases involving Nobel laureate Thomas Südhof and Jonathan Pruitt, a spider researcher whose datasets were found to contain copied and pasted values. In the Retraction Watch database, Englund noticed the “immense success” people had identifying duplicated images, and it occurred to him that the same was probably happening with numerical data. 

Englund explains that the software is structured like a funnel. He begins with datasets from Dryad, an open-access data repository, and prioritizes those from the most cited papers. The software scans them for duplications. The process doesn’t end there, however, because this step catches many false positives.
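The first stage of that funnel, catching identical runs of numbers that recur across unrelated rows, can be sketched in a few lines of Python. This is a minimal illustration, not Englund’s actual code: the function name, the `min_run` threshold, and the sample measurements are all invented for the example.

```python
from collections import defaultdict

def find_cross_group_duplicates(rows, min_run=3):
    """Flag numeric sequences that appear under more than one group.

    rows: list of (group_label, tuple_of_measurements).
    Returns a dict mapping each duplicated measurement tuple to the
    set of groups it appears under. Only tuples of at least `min_run`
    values are kept, to skip coincidental single-value matches.
    """
    seen = defaultdict(set)
    for group, values in rows:
        if len(values) >= min_run:
            seen[tuple(values)].add(group)
    return {vals: groups for vals, groups in seen.items() if len(groups) > 1}

# Hypothetical root-length measurements: two different species share
# an identical run of values, the kind of pattern that would be
# escalated for closer review.
data = [
    ("Species A", (4.1, 5.3, 6.0, 4.8)),
    ("Species B", (4.1, 5.3, 6.0, 4.8)),  # identical to Species A
    ("Species C", (7.2, 3.9, 5.5, 6.1)),
]
flags = find_cross_group_duplicates(data)
```

A real scan would have to be far more forgiving, since repeated values also arise legitimately, which is exactly why the later filtering stages of the funnel matter.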

“There are a million different perfectly explainable reasons why this would happen in a real data set,” he said. For example, the average weight of a species of mouse might be copied across every row of individual mice of that strain for purposes of a particular calculation in a spreadsheet. 

To filter out false positives, Englund has written a prompt for an AI (specifically Google’s Gemini 3.1) that tasks it with assessing whether the duplication is expected or worthy of more scrutiny. The AI gives a rating of how likely the duplication is to be a normal scientific artifact. Among the 600 papers he has fully analyzed so far, 35% were caught by the initial filter and the AI marked 33% of those as suspicious.
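Taken together, those percentages imply roughly how many datasets survive each stage of the funnel. The 600, 35% and 33% figures come from Englund; the back-of-the-envelope arithmetic below is ours:

```python
# Funnel arithmetic based on the figures quoted above (a consistency
# check on the reported percentages, not Englund's code).
analyzed = 600                               # datasets fully analyzed so far
initial_hits = round(analyzed * 0.35)        # ~35% flagged by the duplicate scan
ai_suspicious = round(initial_hits * 0.33)   # ~33% of those rated suspicious by the AI
print(initial_hits, ai_suspicious)           # 210 datasets, then ~69 left for human review
```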

But false positives still occur. For example, the AI flagged a study that calculated the biomass of termite mounds from the height of grass; the grass-height measurements differed from one another by only a few millimeters, producing apparent duplicates. Overall, he has identified problems in about 3% of the datasets that have been scrutinized by both the AI and a human, and he is now going through several additional cases flagged in another 500 datasets.

One of his key findings involves a 2016 Cell paper on Parkinson’s disease. The paper, which found evidence that the illness might originate in the gut, has accrued more than 4,000 citations. Its dataset had been publicly available on Dryad for almost a decade when Englund’s software flagged it for suspicious duplication — finding duplications in half of the motor function data for the healthy mice and about 40% of the data on mice with an altered microbiome. 

“It’s kind of newsworthy in and of itself,” Englund said. “Scientists thought this paper was interesting enough to cite, but no one ever found this particular issue.” 

On PubPeer, other researchers have also taken issue with the paper’s methodology. When we asked about the errors, the article’s corresponding author Sarkis Mazmanian called them “honest mistakes” and said the lab was working to correct the manuscript.

Shortly after Englund’s comment pointing out the duplications, a similar issue was flagged in another paper from the same lab. 

Anonymous commenters on Englund’s post about his findings say the duplications could be innocent and, in response to his PubPeer comments, many researchers have claimed the errors don’t affect their conclusions.

Regarding one flagged study, the authors said on PubPeer that the measurements of fish size had been accidentally matched to the wrong individual fish because of an issue with merging Excel files. In another case, data from one species was accidentally pasted into the spreadsheet for a different species, according to an author’s comment. The author said the conclusions still held and that they were working with the journal to issue a correction. In five other papers (out of the 20 Englund had posted PubPeer comments on at the time of writing), authors acknowledged the errors but claimed they do not affect their findings.

“We’ll never know what was actually behind the duplications,” Englund said. His goal is to get the attention of funding agencies and journals to prompt them to investigate further, but he’s cautious about jumping to conclusions. “It would be a great injustice to accuse someone of fraud when it’s just an innocent mistake.”

Englund said the response he has gotten from journals has been slow at best. The invasive plant study, published in the Journal of Ecology and flagged by Englund in May 2025, has not yet been corrected. The managing editor of the journal, Rowena Gordon, told us they have finished investigating the paper and are finalizing a resolution with the publisher and the authors. “I’m not yet in a position to share what the final outcome will be,” Gordon said.

Dryad, the data repository, has so far issued two expressions of concern on its site about repeated values in datasets detected by Englund. Dan Edwards, Dryad’s head of data publishing, said he anticipates more notices will be issued pending the outcomes of their investigations.

“Englund’s work has helped to surface previously unrecognized issues with tabular data and a new potential avenue to promote research integrity,” said Edwards, who acknowledged that such checks will “demand significant investment in the human follow-up required to ensure the resolution of each case.” 

Englund has fully examined only about 600 of the more than 24,000 datasets on Dryad that contain Excel files. If his 3% problematic rate holds, he said, he expects to find 700 more cases in those datasets alone.
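That projection is easy to verify. Using the 3% rate and the roughly 18 problematic datasets already found (both figures reported above; the subtraction is ours):

```python
# Back-of-the-envelope version of Englund's projection.
total_excel_datasets = 24_000    # "more than 24,000" Dryad datasets with Excel files
problem_rate = 0.03              # share found problematic so far
already_found = 18               # datasets flagged in the initial review
expected_remaining = round(total_excel_datasets * problem_rate) - already_found
print(expected_remaining)        # ~700 more cases, matching Englund's estimate
```

Since 24,000 is a lower bound on the number of datasets, the true figure could be higher still.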





One thought on “Bloodhound code sniffs out copied-and-pasted numerical data”

  1. This is the way. I’ve always argued that scanning for image duplications will only get us so far. Detection of duplicate data is an advancement. Ultimately what would be best, I think, is a suite of forensic tools that automates analysis of statistics at the level of stating, for example, that such a p-value or error is impossible given a data set of such characteristics; such a data set is unlikely to have been generated experimentally, and so forth. Of course these data forensics tools exist, but are not yet deployed at scale.
