Sorry, fans of papers by Maggie Simpson and I. P. Freely, your days of chortling may be coming to an end.
Springer, responding to a case last year in which it and IEEE eventually had to retract more than 120 papers created by SCIgen, is making software that detects such manuscripts freely available.
From a Springer press release:
After intensive collaboration with Dr. Cyril Labbé from Université Joseph Fourier in Grenoble, France, Springer announces the release of SciDetect, a new software program that automatically checks for fake scientific papers. The open source software discovers text that has been generated with the SCIgen computer program and other fake-paper generators like Mathgen and Physgen. Springer uses the software in its production workflow to provide additional, fail-safe checking. Springer and the University are releasing the software under the GNU General Public License, Version 3.0 (GPLv3) so others in the scientific and publishing communities can benefit.
Springer had been working with Labbé since last year, and funded a PhD student in his lab to work on the project.
Our dogs are disappointed that they will not be able to contribute to the scientific literature as easily, but SciDetect seems like a good development, presuming publishers actually use it.
Really?
This *obviously* cannot work. Any competent computer science PhD student would have given that answer in minutes. Basically, the winner of this arms race is whoever moves last.
I fear the solution is going to involve the editors and/or reviewers actually reading
the articles. The horror.
Agreed. All this does is pose a rather mild challenge problem to computer science grad students working in natural language generation. Indeed, it's a much easier problem than the one spammers solve every day, because Springer has released its code.
I haven’t looked at the code itself, but Labbé’s approach seems to be “intertextual distance,” a measure he previously applied to the Corneille–Molière authorship question. The naive assumption would seem to be that it has a natural extension to identifying an ad hoc context-free grammar as an author.
The “SciDetect” press release, however, makes it clear that the idea is premised on a reference corpus (conveniently PowerPDFed for the instant case here), and “intertextual distance” doesn’t seem to scale well with text length. I wonder whether it bothers to “lemmatize” the text, and I strongly doubt that whatever it actually does runs in polynomial time in general.
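For the curious, “intertextual distance” is essentially a normalized L1 gap between two texts’ word-frequency profiles, with the longer text’s counts rescaled to the shorter text’s length. Here is a minimal Python sketch assuming the Labbé-style definition; the helper names and the threshold are hypothetical illustrations, not SciDetect’s actual code or cutoffs:

```python
from collections import Counter
import re

def tokens(text):
    # Crude lowercase word tokenizer; a real pipeline might lemmatize here.
    return re.findall(r"[a-z]+", text.lower())

def intertextual_distance(text_a, text_b):
    """Labbé-style distance: 0 = identical word profiles, 1 = fully disjoint."""
    fa, fb = Counter(tokens(text_a)), Counter(tokens(text_b))
    na, nb = sum(fa.values()), sum(fb.values())
    if na > nb:  # keep A as the shorter text and rescale the longer one
        fa, fb, na, nb = fb, fa, nb, na
    scale = na / nb
    diff = sum(abs(fa[w] - fb[w] * scale) for w in set(fa) | set(fb))
    return diff / (2 * na)

# Hypothetical screening pass: flag a manuscript whose distance to any
# known generator sample falls below a tuned cutoff.
THRESHOLD = 0.45  # made-up number; real cutoffs must be calibrated on a corpus

def looks_generated(manuscript, generator_samples):
    return any(intertextual_distance(manuscript, s) < THRESHOLD
               for s in generator_samples)
```

Each pairwise comparison is linear in the two texts’ combined vocabulary, so screening against a reference corpus is cheap; the harder question, as noted above, is how reliably the statistic behaves as texts grow longer.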
Any additional effort to make such software available is, one hopes, a boon to publishers, but there is not much new about it. Labbé had already made an online version available, which some publishers, such as Hindawi, had been using; it’s good that he has now open-sourced the code so that publishers don’t have to keep hitting his website. ArXiv has been using its own version of this to screen papers for some time (http://www.nature.com/nature/journal/v508/n7494/full/508044a.html).
When can the scientific community expect free (self-)plagiarism detection software that is comparable to, or more powerful than, the current commercial offerings?
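The core technique, at least, is simple enough to sketch: most overlap detectors reduce documents to overlapping word n-grams (“shingles”) and score pairs by set similarity, with the commercial value lying mainly in the indexed corpora and the engineering around them. A toy illustration in Python, with placeholder file names:

```python
def shingles(text, n=5):
    """Set of overlapping word n-grams ("shingles") from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0 = no overlap, 1 = identical)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Placeholder file names: a pair sharing many 5-word shingles is a
# plagiarism (or self-plagiarism) candidate worth a human look.
with open("submission.txt") as f1, open("prior_paper.txt") as f2:
    score = jaccard(shingles(f1.read()), shingles(f2.read()))
print(f"shingle overlap: {score:.2f}")
```

Scaling that comparison to millions of prior papers (typically via MinHash/LSH indexing) and licensing those papers in the first place is where the commercial tools still hold the edge.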