What if we could scan for image duplication the way we check for plagiarism?

Paul Brookes

Paul Brookes is a biologist with a passion for sleuthing out fraud. Although he studies mitochondria at the University of Rochester, he also secretly ran science-fraud.org, a site for people to post their concerns about papers. Following legal threats, he revealed he was the author and shut the site in 2013, but didn’t stop the fight. Recently, he co-authored a paper that’s slightly outside his day job: partnering with computer scientist Daniel Acuna at Syracuse University and computational biologist Konrad Kording at the University of Pennsylvania, he helped develop software to detect duplicated images. If it works, it would provide a much-needed service to the research community, which has been clamoring for some version of this for years. So how did this paper, also described by Nature News, come about?

Retraction Watch: Dr. Brookes, you study mitochondria. What brought you to co-author a paper about software to detect duplications?

Paul Brookes: I had authored a paper on the relationship between levels of internet publicity (blogs etc.) and actions taken against problematic papers in the biosciences – retractions, corrections, etc. As such, I was sitting on a large database (500+ papers) with documented image problems. Konrad and Daniel approached me by email to request this set of examples, to use as a training set for their machine learning algorithm.

Daniel Acuna: We thought Paul was being treated very unfairly for what he was doing. We definitely wanted to have him on the team because of his expertise. Also, while I was a postdoc with Konrad, he always had an open mind and would encourage us to pursue risky but important projects. This project felt like the right thing to do.

RW: Dr. Brookes, although you run an active lab, you’ve maintained an active interest in ferreting out fraud (including maintaining the website science-fraud.org that was forced to shut down in 2013). What keeps you focused on this mission?

PB: I first got into this area while reviewing grants for the American Heart Association in 2011 (which triggered an ORI case). Nowadays I mainly just post on PubPeer (in some cases on behalf of other individuals who require anonymity), chase down older cases with editors and journals, and report new cases that show up during peer-reviewing activities, journal clubs and other reading. There’s no shortage of new cases, and a large backlog to be dealt with (see, e.g., frequent comments of “Fernando Pessoa” on RW, and the tiny fraction of PubPeer comments eliciting responses from authors or journals).  I don’t view this activity as separate from a normal research career – everyone involved in research has a role to play in policing the literature. Obviously the amount of time anyone can commit to this is dependent on other factors such as grant deadlines, teaching etc.

RW: Dr. Kording, you mentioned Dr. Brookes has been an inspiration — can you say more?

Konrad Kording: Paul kept going for the frauds despite the tremendous danger they pose to him personally. Nothing says real scientist like being willing to fight for what is right. Besides, I admire the breadth of his area of interest. Also: His biology papers are beautiful.

RW: What made you decide to develop this type of software?

DA: We were outraged that scientists would do these types of manipulations. They are highly damaging to science and to the public perception of scientists. The painstaking work manually done by Paul and others to fix the problem would never scale.  We thought there must be a better way!

KK: We realized it could be done. And that it would be fun. Also, the frauds harm everyone. I don’t like frauds.

RW: How does your tool work?

DA: The method to estimate the rate of problematic figures follows several steps. First, we split the figures into sets belonging to either the same first or last author. Within each set, we detect which portions of the image are “interesting” (high entropy) and compute fingerprints for them. To create the initial matches, we then search for similar fingerprints across all images. Unfortunately, this results in many false positives because scientists reuse arrows, labels, and other graphical elements. Therefore, we pass these duplicates through a filter that detects which of them are biologically meaningful. This final set of matches was reviewed in context by reviewers, who added comments and tags. These tags allowed us to estimate the rate of potentially problematic articles and figures.
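The first steps described above (grouping by author, keeping only high-entropy regions, fingerprinting them, and searching for similar fingerprints) can be illustrated with a toy sketch. This is not the authors' implementation: the interview does not say which fingerprint they use, so the average-hash fingerprint, the fixed 4x4 tiles, the thresholds, and every function name below are assumptions chosen for illustration. The filter for biologically meaningful matches and the human review step are left out.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(pixels):
    """Shannon entropy (in bits) of a flat list of pixel values."""
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(pixels).values())

def average_hash(pixels):
    """Assumed stand-in fingerprint: one bit per pixel, set when above the tile mean."""
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    """Number of positions where two equal-length fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))

def interesting_fingerprints(image, size=4, min_entropy=2.0):
    """Fingerprint every non-overlapping size x size tile with high entropy."""
    prints = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            tile = [image[r + i][c + j] for i in range(size) for j in range(size)]
            if entropy(tile) >= min_entropy:
                prints.append(average_hash(tile))
    return prints

def candidate_matches(figures, max_distance=0):
    """figures: list of (figure_id, author, image). Compare only within an author's set."""
    by_author = {}
    for fig_id, author, image in figures:
        by_author.setdefault(author, []).append((fig_id, interesting_fingerprints(image)))
    matches = set()
    for group in by_author.values():
        for (id_a, fps_a), (id_b, fps_b) in combinations(group, 2):
            if any(hamming(x, y) <= max_distance for x in fps_a for y in fps_b):
                matches.add(tuple(sorted((id_a, id_b))))
    return sorted(matches)

def hstack(left, right):
    """Place two equally tall grayscale images side by side."""
    return [l + r for l, r in zip(left, right)]

# Hypothetical toy data: 4x4 grayscale tiles stored as lists of rows.
textured = [[(i * 4 + j) * 16 for j in range(4)] for i in range(4)]
other = [[(j * 4 + i) * 16 for j in range(4)] for i in range(4)]
flat = [[255] * 4 for _ in range(4)]  # blank background, zero entropy

figures = [
    ("fig_a", "Smith", hstack(textured, flat)),
    ("fig_b", "Smith", hstack(flat, textured)),  # same tile, different position
    ("fig_c", "Lee", hstack(textured, flat)),    # different author set, not compared
    ("fig_d", "Smith", hstack(other, flat)),     # different texture
]

print(candidate_matches(figures))
```

In this sketch the blank tiles are dropped by the entropy threshold, so shared white background never counts as a match, and only the two figures from the same author set that contain the same textured tile are paired. A real system would need fingerprints that tolerate rotation, scaling, and compression, which this toy version does not attempt.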

RW: We’ve recently noted a new tool for checking for signs of image manipulation, especially duplication, developed by researchers supported (in part) by Elsevier. How is yours different?

KK: Ours looks for images that are reused (think “the same image shows cancerous cells in paper A and healthy cells in paper B”) while theirs looks for image manipulation. Both are frequent ways of image fraud. I love that they are doing this.

DA: From what I understand, other tools detect reused images if you know which image has a problem. Our tool detects which images should be analyzed first and then analyzes them. In the end, we love what the team at Harvard is doing and these efforts are complementary.

We think this is the tip of the iceberg in that there are other types of integrity problems that are harder to detect, such as data manipulation, misleading figures, text paraphrasing, etc. If anything, we need more people working in this area.

RW: What is the next step? Have you started working with journals or research integrity offices to develop, test and/or license the tool?

KK: We are talking with many of them. I would like the tool to be used on every paper published in the near future.

DA: We are talking to many types of entities, from individuals who want “objective” answers for suspicious cases at hand to offices of research integrity who want to test and provide feedback. We are exploring the best way to make these technologies available.

RW: You explain in the Nature story why you don’t want to make the tool public. Are you concerned that, if there are free tools available (such as the one mentioned above), they could take over the market?

DA: I would not like public shaming if a junior scientist is involved. I think the pressures to publish and get a job in academia could push someone to do something like this. So maybe we need to think about some of the root causes behind image fraud. Also, again, I think this is the tip of the iceberg in terms of research integrity problems. Having said that, we would still contact the ORIs and journals about the cases we detected, and make the tool available on a case-by-case basis.

KK: I am really just worried about the pitchfork mob. We usually cannot positively distinguish stupidity (“what was in img0078325.jpg again?”) from fraud (“I will totally not run this unreasonable control, let me pretend that old image was the new control”). Worse, our current dataset is only from the open journals, which probably have a lower fraud rate.

4 thoughts on “What if we could scan for image duplication the way we check for plagiarism?”

  1. “Nowadays I mainly just post on PubPeer (in some cases on behalf of other individuals who require anonymity)”

    Why can’t they just post them themselves anonymously on PubPeer?

    1. Because people are afraid that their anonymity will not be respected. And if my experiences in the past few years are any indication, they are correct that sometimes their names will be leaked. So putting in a proxy is a good way to ensure staying anonymous.
