Large language models should not be used to weed out retracted literature, a study of 21 chatbots concludes. Not only were the chatbots unreliable at correctly identifying retracted papers, they also spat out different results when given the same prompts.
The “very simple study,” as lead author Konradin Metze called it, used LLM chatbots such as ChatGPT, Copilot, and Gemini to see whether they would successfully identify retracted articles in a list of references.
Metze and colleagues compiled a list of 132 publications. Fifty were the most-cited retracted papers by Joachim Boldt, a prolific German researcher who sits atop the Retraction Watch Leaderboard. Another 50 were Boldt’s most-cited non-retracted papers. The rest were works by other researchers with the last name “Boldt” and the first initial “J.” The study authors prompted each chatbot to indicate which of the listed references had been retracted.
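In outline, the query pattern the study describes is simple: hand a chat model a reference list and ask which entries are retracted. Below is a minimal sketch of that pattern, assuming the OpenAI Python client; the model name, prompt wording, and placeholder references are illustrative assumptions, not the study’s actual setup.

```python
# Hypothetical sketch of the kind of query the study describes: handing a
# chat model a reference list and asking which entries are retracted.
# The client library, model name, and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

references = [
    "Boldt J, et al. <reference 1 here>",
    "Boldt J, et al. <reference 2 here>",
    # ... the study's list contained 132 entries in total
]

prompt = (
    "Which of the following references have been retracted? "
    "List only the retracted ones.\n\n" + "\n".join(references)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; the study covered 21 chatbots
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The appeal of this shortcut is obvious, which is exactly why the study’s error rates matter.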
On average, the 21 chatbots correctly identified fewer than half of the retracted papers, the authors reported October 10 in the Journal of Clinical Anesthesia. The LLMs also produced a large proportion of false positives, incorrectly classifying almost 18 percent of Boldt’s intact papers and about 4.5 percent of other authors’ valid work as retracted.
After a three-month gap, the researchers queried seven of the 21 original chatbots with both the original prompt and a different, shorter one. “We know that the wording of the prompt may influence the answer,” said Metze, a researcher at the State University of Campinas in Brazil. “If you ask a question to a chatbot, and you repeat the same prompt tomorrow, you may get different answers. This is not scientific, and it’s a great, great problem.”
Comparing responses from April and July, the researchers did find differences, as expected. Whereas the chatbots in the first round classified papers as either retracted or not retracted, in the second round they produced a new, hybrid kind of reply, flagging particular papers as possibly retracted: saying, for example, that a paper was “worth double-checking,” that it “is likely one of the retracted ones,” or that it was among the “high-risk papers to verify.” In practical terms, this hedging is useless, Metze said.
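The run-to-run variability Metze describes has a well-understood source: chat models sample their output tokens stochastically, so identical prompts can produce different answers from one run to the next. A minimal sketch of how one might observe this, again with an assumed model and prompt:

```python
# Minimal sketch: repeat one retraction-check prompt and tally the verdicts.
# Model name and prompt wording are illustrative assumptions, not the
# study's protocol; nonzero temperature makes the output stochastic.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Has the following paper been retracted? Answer only YES or NO.\n"
    "Boldt J, et al. <title and journal here>"
)

verdicts = Counter()
for _ in range(10):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,      # default-like sampling; answers may vary per run
    )
    verdicts[response.choices[0].message.content.strip().upper()] += 1

print(verdicts)  # identical prompts can yield a mix of YES, NO, and hedged replies
```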
The results of Metze and colleagues’ experiment are hardly surprising, said Serge Horbach, assistant professor in the sociology of science at Radboud University in Nijmegen, the Netherlands, who has written about generative AI in research. “I read the article as a warning: People, please don’t use LLMs this way.”
But the paper replicates just what a conscientious author might do — enter a list of references for an LLM to check as a shortcut to the time-consuming process of checking them individually, said Mike Thelwall of the University of Sheffield, in England.
In August, Thelwall and colleagues asked ChatGPT to evaluate 217 articles that had been retracted, had received expressions of concern, or had been flagged on PubPeer or another platform. They submitted each article to ChatGPT 30 times. None of the resulting 6,510 reports (30 for each of the 217 articles) mentioned the retractions or concerns, they reported in Learned Publishing.
ChatGPT “reported some retracted facts as true, and classified some retracted papers as high quality,” Thelwall said. “So it does not seem to be designed to be aware of, or careful about, retracted information at the time that we tested it.”
LLMs also draw on material from retracted scientific papers to answer questions posed through their chatbot interfaces, according to a study published in the Journal of Advanced Research in May 2025. “People are increasingly using ChatGPT or similar to summarize topics, and this shows that they risk being misled by the inclusion of retracted information,” Thelwall added.
Generative AI does have roles to play in the editorial process, Horbach said. But weeding out retracted papers is not yet among them, he said, and “it’s not going to lead to any better science.”
