Language of a liar named Stapel: Can word choice be used to identify scientific fraud?
A pair of Cornell researchers have analyzed the works of fraudster Diederik Stapel and found linguistic tics that stand out in his fabricated articles.
According to the abstract for the article, “Linguistic Traces of a Scientific Fraud: The Case of Diederik Stapel,” which appeared in PLoS ONE:
When scientists report false data, does their writing style reflect their deception? In this study, we investigated the linguistic patterns of fraudulent (N = 24; 170,008 words) and genuine publications (N = 25; 189,705 words) first-authored by social psychologist Diederik Stapel. The analysis revealed that Stapel’s fraudulent papers contained linguistic changes in science-related discourse dimensions, including more terms pertaining to methods, investigation, and certainty than his genuine papers. His writing style also matched patterns in other deceptive language, including fewer adjectives in fraudulent publications relative to genuine publications. Using differences in language dimensions we were able to classify Stapel’s publications with above chance accuracy. Beyond these discourse dimensions, Stapel included fewer co-authors when reporting fake data than genuine data, although other evidentiary claims (e.g., number of references and experiments) did not differ across the two article types. This research supports recent findings that language cues vary systematically with deception, and that deception can be revealed in fraudulent scientific discourse.
In more detail:
Liars have difficulty approximating the appropriate frequency of linguistic dimensions for a given genre, such as the rate of spatial details in fake hotel reviews, the frequency of positive self-descriptions in deceptive online dating profiles, or the proportion of extreme positive emotions in false statements from corporate CEOs. Here we investigated the frequency distributions for linguistic dimensions related to the scientific genre across the fake and genuine reports, including words related to causality (e.g., determine, impact), scientific methods (e.g., pattern, procedure), investigations (e.g., feedback, assess), and terms related to scientific reasoning (e.g., interpret, infer). We also considered language features used in describing scientific phenomena, such as quantities (e.g., multiple, enough), terms expressing the degree of relative differences (e.g., amplifiers and diminishers) and words related to certainty (e.g., explicit, certain, definite).
We were also interested in whether the fake reports contained patterns associated with deception in other contexts.
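To make the idea of "frequency of linguistic dimensions" concrete, here is a minimal Python sketch that counts how often words from each category appear in a text, as a percentage of all words. It is only an illustration: the word lists are the handful of examples quoted above (the lexicons used in the study are far larger), the function name `category_rates` and the sample sentence are made up, and this is not the authors' actual pipeline.

```python
import re
from collections import Counter

# Example terms taken from the quoted passage; real category lexicons are much larger.
CATEGORIES = {
    "causality": {"determine", "impact"},
    "methods": {"pattern", "procedure"},
    "investigation": {"feedback", "assess"},
    "reasoning": {"interpret", "infer"},
    "quantities": {"multiple", "enough"},
    "certainty": {"explicit", "certain", "definite"},
}


def category_rates(text):
    """Return each category's share of all word tokens, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for name, words in CATEGORIES.items():
            if tok in words:
                counts[name] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {name: 100.0 * counts[name] / total for name in CATEGORIES}


# Hypothetical sample sentence, not drawn from any of Stapel's papers.
sample = "We assess the pattern and infer a certain impact"
rates = category_rates(sample)
for name, pct in rates.items():
    print(f"{name}: {pct:.1f}%")
```

Comparing these percentages between a set of fraudulent papers and a set of genuine ones is the basic kind of contrast the passage describes.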
To probe Stapel’s studies, Markowitz and Hancock:
applied a corpus analytic method using Wmatrix, an approach that is commonly used for corpus comparisons. Wmatrix is a tool that provides standard corpus linguistics analytics, including word frequency lists and analyses of major grammatical categories and semantic domains. Wmatrix tags parts of speech (e.g., adjectives, nouns) in relation to other words within the context of a sentence (e.g., the word “store” can take the noun form as a retail establishment or a verb, as the act of supplying an object for future use).
You can see a table of Stapel’s word choices here.
But the Cornell researchers express caution about the obvious leap here, namely using linguistic tools to probe manuscripts for evidence of fraud before they’re published:
… [I]t is tempting to consider linguistic analysis as a forensic tool for identifying fraudulent science. This does not seem feasible, at least for now, for several reasons. First, nearly thirty percent of Stapel’s publications would be misclassified, with 28% of the articles incorrectly classified as fraudulent while 29% of the fraudulent articles would be missed. Second, this analysis is based only on Stapel’s research program and it is unclear how models based on his discourse style would generalize to other authors or to other disciplines.
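Those error rates can be checked against the sample sizes given in the abstract (24 fraudulent and 25 genuine papers). The short Python sketch below assumes seven errors in each group, which is our inference from the rounded percentages rather than a figure stated in the quoted text, and it reads the 28% as applying to the genuine papers.

```python
# Sample sizes from the abstract.
n_fraud, n_genuine = 24, 25

# Assumption: 7 errors in each group, the whole-number counts that
# best match the quoted 29% and 28%.
missed_fraud = 7   # fraudulent papers classified as genuine
false_alarms = 7   # genuine papers classified as fraudulent

miss_rate = missed_fraud / n_fraud
false_alarm_rate = false_alarms / n_genuine
correct = (n_fraud - missed_fraud) + (n_genuine - false_alarms)
accuracy = correct / (n_fraud + n_genuine)

print(f"fraudulent papers missed: {miss_rate:.0%}")
print(f"genuine papers flagged:   {false_alarm_rate:.0%}")
print(f"overall accuracy:         {accuracy:.0%}")
```

Under that assumption, 35 of the 49 papers would be sorted correctly, which is consistent with "nearly thirty percent" being misclassified.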