Can linguistic patterns identify data cheats?

Cunning science fraudsters may not give many tells in their data, but the text of their papers may be a tipoff to bad behavior.

That’s according to a new paper in the Journal of Language and Social Psychology by a pair of linguists at Stanford University who say that the writing style of data cheats is distinct from that of honest authors. Indeed, the text of science papers known to contain fudged data tends to be more opaque, less readable and more crammed with jargon than untainted articles.

The authors, David Markowitz and Jeffrey Hancock, also found that papers with faked data appear to be larded up with references – possibly in an attempt to make the work more cumbersome for readers to wade through, or to tart up the manuscript to make it look more impressive and substantial. As Markowitz told us:

References can serve as credibility or legitimacy markers in science (and we note this in the paper). But pragmatically, references make the reader evaluate the veracity or genuineness of the author’s claim. A reader has to think about why the reference is there, what that external source is claiming, and how it fits within the author’s argument. Therefore, while amplified rates of references can attempt to enhance the paper’s credibility, references also increase the cost of evaluating the paper on the reader’s end (which has been noted in other obfuscation pieces).

Readers of Retraction Watch may recall a 2014 study by Markowitz and Hancock that served as a linguistic post mortem of the work of Diederik Stapel. Here’s more from that abstract:

The analysis revealed that Stapel’s fraudulent papers contained linguistic changes in science-related discourse dimensions, including more terms pertaining to methods, investigation, and certainty than his genuine papers. His writing style also matched patterns in other deceptive language, including fewer adjectives in fraudulent publications relative to genuine publications… This research supports recent findings that language cues vary systematically with deception, and that deception can be revealed in fraudulent scientific discourse.

The latest paper looked at 253 articles with fabricated data, 253 unretracted control articles, and another 62 articles pulled for reasons other than fabrication, such as ethics violations. (The sample includes only one of Stapel’s papers; most, Markowitz told us, were “one-off deceptions,” which allowed the researchers to analyze the language of a wide range of cheaters.)

Markowitz and Hancock calculated an “obfuscation index” from the language of the papers, which factored in the use of causal terms, abstractions and jargon. On the plus side of the ledger were general readability and something called “positive emotional terms” – such as “support, worthwhile and inspired.”
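To give a concrete feel for what an index like that involves, here is a minimal, hypothetical sketch of how such a score could be computed from raw text. The word lists, weights, and readability estimate below are invented for illustration only; they are not the dictionaries or formula Markowitz and Hancock actually used.

```python
# A toy obfuscation-style index: causal terms and jargon push the score up,
# readability and positive-emotion terms pull it down. The word lists,
# weights, and readability estimate are invented for illustration and are
# NOT the dictionaries or formula used by Markowitz and Hancock.
import re

CAUSAL_TERMS = {"because", "therefore", "thus", "hence", "consequently"}
JARGON_TERMS = {"paradigm", "heuristic", "operationalize", "multivariate"}
POSITIVE_EMOTION_TERMS = {"support", "worthwhile", "inspired"}


def flesch_reading_ease(text: str) -> float:
    """Rough Flesch Reading Ease estimate; syllables are approximated by
    counting vowel groups. Higher scores mean easier-to-read text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)


def obfuscation_index(text: str) -> float:
    """Higher values mean more opaque text, under this toy scoring scheme."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    n = max(1, len(words))
    causal = sum(w in CAUSAL_TERMS for w in words) / n
    jargon = sum(w in JARGON_TERMS for w in words) / n
    positive = sum(w in POSITIVE_EMOTION_TERMS for w in words) / n
    readability = flesch_reading_ease(text) / 100.0  # normalize to roughly 0-1
    return causal + jargon - positive - readability
```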

Their conclusion:

Scientists reporting fraudulent data wrote their reports with a significantly more obfuscated writing style than unretracted papers and papers retracted for reasons other than fraud (e.g., ethics violations, authorship issues). Furthermore, we found that linguistic obfuscation was correlated with the number of references per paper, suggesting that fraudulent scientists were using obfuscation to make claims in their papers more difficult and costly to assess.

As with their Stapel work, the authors say obfuscation isn’t ready to be used as a means of detecting misconduct:

Although this represents a statistically significant improvement over chance, it is clear that our limited model is not feasible for detecting fraudulent science with an especially problematic false-positive rate (46%). To improve the classification accuracy, more computationally sophisticated methods to analyze language patterns (e.g., machine learning, natural language processing) will be required. These steps, in addition to widening the feature set beyond the theoretically derived obfuscation dimensions, should improve deception detection accuracy.
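For readers wondering what “improvement over chance” and a 46% false-positive rate mean in practice, here is a rough, hypothetical sketch of how a simple classifier built on linguistic features might be trained and evaluated. The simulated data, the two features, and the logistic-regression model are placeholders for illustration, not the authors’ actual corpus or pipeline.

```python
# Hypothetical evaluation sketch: train a simple classifier on two linguistic
# features (an obfuscation score and references per paper) and report the
# false-positive rate, i.e., how often honest papers get flagged. The
# simulated data and the logistic-regression model are placeholders, not the
# authors' corpus or method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated features for 253 "fraudulent" (label 1) and 253 control (label 0)
# papers: columns are [obfuscation_score, references_per_paper].
X = np.vstack([
    rng.normal([0.6, 45.0], [0.2, 10.0], size=(253, 2)),  # simulated fraudulent
    rng.normal([0.4, 35.0], [0.2, 10.0], size=(253, 2)),  # simulated control
])
y = np.array([1] * 253 + [0] * 253)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print(f"false-positive rate: {fp / (fp + tn):.2f}")
```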


5 thoughts on “Can linguistic patterns identify data cheats?”

  1. One of the best obfuscating devices is to float the most likely flaw (objection) in the research, systematically shoot holes in that objection, and thereby create the impression of self-critique. For the unwitting reader, this reduces the tension and allows the fraudster to say, “we considered the objection in detail and ruled it out, as we disclosed in our paper.” Case closed.

  2. The study is really very interesting. Liars and cheats seem to give off many visual and spoken verbal cues, so it makes sense they’d do the same in their writings. I’m curious, though, what application, if any, these techniques would have once they are refined and broadly applied.

    Policing of the literature is very important, but should it come at the cost of “false positives,” even if those are reduced to some nominal percentage? I thought of this because I’ve had two lengthy publications recently. Although they were each on the same topic, one had relatively few references, while the other had a great many. The reason for the disparity is that one was very concrete in its approaches, while the other was highly abstract.

    I’m betting that, using their criteria, one of these publications would have been flagged. And then, if the method were used as a policing instrument, I would be put in the position of defending against an accusation… or at least have some publisher raise eyebrows and scrutinize my work from a negative standpoint.

    So even if the rate of false positives is reduced, do we want to subject honest and ethical scientists to an adversarial process without merit? Perhaps I do not understand the approaches they pursue, but I am scratching my head and wondering whether intensive policing of the literature is worth the potential unfair grief it could cause. I would say it isn’t, but I am open to being convinced otherwise.

  3. The situation seems very analogous to the detection of cheating (use of chess computers) in online chess. Major online chess sites have heuristic algorithms to detect such cheating. They encounter several problems. False positives are inevitable, and players banned due to false positives are infuriated; even the fear of being banned discourages some players. Furthermore, if cheaters are aware of the heuristics they can defend against them, so the heuristics are normally kept secret, but this means that players accused of cheating cannot be confronted with the evidence against them. Finally, the data set needed to detect cheating is fairly large, so the cheater has normally won many games before being banned–there is no straightforward way to compensate their opponents for the losses or to avoid damage to the structure of tournaments.

    On the other hand, cheating is sufficiently common that as far as I know no one successfully runs an online chess site without cheater detection.

    Part of the problem is that the highest penalty an online chess site can impose is banning. Perhaps we would see fewer cheaters in science if they were prosecuted for fraud? But it would be challenging to get international cooperation on this.

    1. Your comments reminded me of the kindly advice I received from a professor when I was an undergraduate. Several other students and I approached the professor to complain about a chronic cheater in our class. This person cheated all of the time, to the point (believe it or not) of standing on his chair so he could read other students’ exam answers from a distance. He was frequently excoriated by faculty and students alike, and he actually went on to graduate with a BS degree.

      The very wise professor counseled us and asked us to reflect upon the significance of cheating on an exam. Then he asked how that might compare to cheating on a spouse, an employer, or a trusted friend. His point was that all of the grief associated with potentially disciplining this person over an undergraduate exam paled in comparison to what he would confront later in life. He said, “ignore him, he’s his own worst enemy, and one day he will pay the price, in spades.”

      So I think your example of cheating in online chess is apt. I suppose if wagers are involved, then someone would be defrauded. But I think so much of this comes down to a snake oil salesman’s mentality: if you buy in, know the risks.

      1. I gave a lot of take-home exams and had students who would rely on others for their answers. I advised my students to give these students the wrong answers.

        Later, I was able to create different forms of the exams so that students could work together but had different problems. Some of my students never caught on and I knew they were cheating because their answers didn’t match the problems.
