Semi-automated fact-checking for scientific papers? Here’s one method.

Jennifer Byrne

Wouldn’t it be terrific if manuscripts and published papers could be checked automatically for errors? That was the premise behind an algorithmic approach we wrote about last week, and today we bring you a Q&A with Jennifer Byrne, the last author of a new paper in PLOS ONE that describes another approach, this one designed to find incorrect nucleotide sequence reagents. Byrne, a scientist at the University of Sydney, has worked with the first author of the paper, Cyril Labbé, and has become a literature watchdog. Their efforts have already led to retractions. She answered several questions about the new paper.

Retraction Watch (RW): Seek & Blastn allows for “semi-automated fact-checking of nucleotide sequence reagents.” Can you explain what these reagents are used for, and what Seek & Blastn does?

Jennifer Byrne (JB): Nucleotide sequence reagents are used in genetics and cell biology, often to measure gene activity or inhibit gene function. These reagents work by binding to either DNA or RNA, and they are made of DNA or RNA themselves. Nucleotide sequences are composed of 4 different nucleotides, which are written as A’s, G’s, C’s and T’s. These reagents are therefore defined by their exact sequences, and they can be treated as “words” made up of 4 letters. However, because of redundancies in the genetic code and different ways that sequences can be “read” within the cell, nucleotide sequence reagents cannot be easily read by eye. Scientists therefore rely on search algorithms and databases to identify these reagents. Seek & Blastn does this automatically, by scanning publications and manuscripts for nucleotide sequences, which are then extracted and submitted to blastn analysis, which is a very widely used algorithm that finds homologous sequences in databases. The surrounding text is also analysed to determine whether the sequence is intended to either target a gene, or to serve as a non-targeting control. Seek & Blastn puts this information together and predicts whether a given nucleotide sequence is either a targeting or a non-targeting reagent, and whether this identity matches that described by the authors.

RW: Where in the publication process could Seek & Blastn be used? To screen submitted manuscripts? During peer review? Post-publication peer review?

JB: We think that Seek & Blastn could be used before and after publication. However, most peer reviewers won’t want to carry out an additional screening step themselves; they would be more likely to expect the journal to do this for them, in the way that manuscript text is screened for plagiarism before being sent for peer review. To screen submitted manuscripts at scale, Seek & Blastn would need to be more reliable, with a very low false positive rate. This is something that we’re working on, with support from a US Office of Research Integrity grant. Seek & Blastn also has difficulty in detecting targeting sequences that target a different gene from that described, although these errors can be detected by manually checking Seek & Blastn outputs. We are currently applying Seek & Blastn to check reagent identities post-publication. We hope that other researchers will try the tool and provide feedback on how it works, and its ease of use.

RW: You’ve previously found that “incorrectly identified nucleotide sequence reagents are characteristic of highly similar human gene knockdown studies, some of which have been retracted from the literature on account of possible research fraud.” Can you walk us through how these would suggest fraud?

JB: Because nucleotide sequences usually don’t make visual sense, they can accumulate the equivalent of spelling errors which can pass unnoticed. These kinds of errors might occasionally result when researchers type in sequences by hand. However, some of the errors that we have detected were more unexpected. We have found many nucleotide sequence reagents that were completely misidentified, and sometimes their verified identities corresponded to genes that weren’t discussed in these papers. Some of these incorrect sequences can also be found in many papers, along with other features such as similarly structured figures, and unusual levels of textual similarity. Papers with these features can also describe misidentified or contaminated human cell lines, and contain image duplications. We therefore hypothesise that incorrect sequences in these papers represent inadvertent errors that might result from manuscripts being assembled relatively quickly, possibly with the help of third parties such as paper mills. We have written about this recently in the journal Biomarker Insights.

RW: You write that “incorrect nucleotide sequence reagents represent an under-recognized source of error within the biomedical literature.” What is your sense of how common these errors are?

JB: We don’t yet know, but we expect that nucleotide sequences with spelling mistakes might be present at similar rates as incorrect URLs in publications. We expect that there will be a low baseline of these errors within the biomedical literature. Even so, given that nucleotide sequence reagents have been described in hundreds of thousands of papers, this could still equate to many papers describing nucleotide sequence reagents with hidden typographic errors. We expect that incorrectly identified sequences might characterise particular publication types, but this also needs further study. Either way, it is always worth checking the identity of a nucleotide sequence reagent before ordering it for your own experiments: a few minutes of checking can save a lot of time in the long run.


One thought on “Semi-automated fact-checking for scientific papers? Here’s one method.”

  1. That’s interesting. I freelance edit scientific papers in medicine for authors for whom English is a second language. I tend to read the paper and always pause when the expectedly bumpy English suddenly gets very idiomatic; when I see words like “shortcoming” and “hallmark,” I start copying and pasting into Google Scholar, and at least half the time, I find they have plagiarized. I guess Seek & Blastn is a very automated form of what I do!
