Large language models should not be used to weed out retracted literature, a study of 21 chatbots concludes. Not only were the chatbots unreliable at correctly identifying retracted papers, they also returned different results when given the same prompts.
The “very simple study,” as lead author Konradin Metze called it, tested LLM chatbots including ChatGPT, Copilot, and Gemini to see whether they could successfully identify retracted articles in a list of references.
Metze and colleagues compiled a list of 132 publications. The list comprised the 50 most-cited retracted papers by Joachim Boldt, a prolific German researcher who sits at the top of the Retraction Watch Leaderboard, and 50 of Boldt’s most-cited non-retracted papers. The remaining 32 were works by other researchers with the last name “Boldt” and the first initial “J.” The study authors prompted each chatbot to indicate which of the listed references had been retracted.