There is a tension in how we resolve inaccuracies in scientific documents once they are past a certain age.
Specifically, what should we do with old papers that are shown to be not just wrong, which is a fate that will befall most of them, but seriously misleading, fatally flawed, or overwhelmingly likely to be fabricated, i.e. when they reach the (very high) threshold we set for retraction?
To my way of thinking, there are three components of this:
(1) the continuing use of the documents themselves as citable objects in contemporary research – some research stays current and relevant, other research is consigned to obscurity, or is so completely superseded that it has no bearing on contemporary research whatsoever.
(2) the profile of the authors – some authors of such documents are alive, famous, and have theories with contemporary relevance. Others are dead, obscure, and have theories which have no continuation in any other papers. Like it or not, these authors are treated differently.
and, of course,
(3) the nature and extent of the errors – some are small but definite mistakes, some are blatant multi-paper fabrications.
After these determinations have been made, there are additional barriers to taking these worst-case scenarios to retraction. Some of these barriers are practical (if a journal has not existed for 40 years, for instance, how do you contact the editor? Where could the data ever be located?) but the primary reason seems to be that it is regarded as ‘ahistorical’ to retract older work, because at some poorly defined point a transition occurs – a document stops being ‘living’ knowledge, and starts being ‘historical’ knowledge. The mistakes, even if they invalidate the paper utterly, become context rather than problems.
Imagine we discovered evidence that BF Skinner or Claude Bernard or Gregor Mendel or Isaac Newton had seriously plagiarised a document, made a crucial error of analysis, and so on – there would be absolutely no will for the ‘correction of the scientific record’, because in some meaningful sense, their work is not the scientific record any more. It is cited, but usually more as historical context for later ideas.
And this brings us to Eysenck. Obviously he achieved some form of extreme eminence associated with historical scholars, but the papers in question here are sufficiently recent that they are modern citable objects, and contribute to modern research topics. While his PhD was in 1940, his final papers were published posthumously into the new millennium. New editions of his central works are more recent still.
Some are also, and rather obviously, extremely problematic. The foundational analyses of the effect of smoking on lung cancer reported relative risks of roughly 2 to 6 at their strongest, and these were a public health triumph. But Eysenck himself gave the relative risk of personality factors as being six times greater than that of smoking.
About this and similarly unbelievable figures, David Marks did not mince words earlier this year:
There is absolutely no scientific evidence that any of these statements are true, and Eysenck is proved by his own words to be guilty of some of the most egregious and harmful falsehoods made by any psychologist ever.
Gilbert (1985), quoted in Pelosi (2019) also had an interesting point:
My reservations are based on the limitations imposed by the less-than-perfect reliability/validity of any psychological measure. The less-than-perfect reliability/validity of Grossarth-Maticek’s rationality/antiemotionality should lead to less-than-perfect predictions. Yet he reports perfect prediction of the incidence of lung cancer from the combination of number of cigarettes smoked per day with degree of antiemotionality. I expect that the test-retest reliability of the rationality/antiemotionality scale is in the range of .64 to .81 and that the reliability of one’s saying one smokes more than 20 cigarettes per day is not much higher, (.85?). Since the maximum validity of any measure is the square root of its reliability the question becomes how can √0.81 * √0.85 = 1.00?
This is, for curiosity’s sake, the precise point that Vul et al. (2009) noted in analyses of fMRI papers, which was instrumental in changing practice around how neuroimaging results were analysed and reported!
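For the record, Gilbert’s arithmetic is easy to verify. Here is a minimal check in R (the reliability figures are Gilbert’s own estimates, not measured values):

rel_rationality <- 0.81 # Gilbert's upper estimate for the rationality/antiemotionality scale
rel_smoking <- 0.85 # Gilbert's guess for self-reported smoking of 20+ cigarettes per day
sqrt(rel_rationality) * sqrt(rel_smoking) # ~0.83, the ceiling on any observable correlation

A reported correlation of 1.00 is therefore not just unlikely but arithmetically impossible given these reliabilities.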
Also, the original papers were rather straightforwardly invalidated by van der Ploeg, who actually inspected some of the datasets involved and described what is overwhelmingly likely to represent manipulation of the relevant records.
There has never been any question that something was incredibly dubious about this body of work, and the field is littered with similar critiques and failed replications stretching back more than 30 years. The conclusion from Pelosi (2019) was particularly brutal:
There is a complicated and multi-layered scandal surrounding Hans Eysenck’s work on fatal diseases. In my opinion, it is one of the worst scandals in the history of science, not least because the Heidelberg results have sat in the peer-reviewed literature for nearly three decades while dreadful and detailed allegations have remained uninvestigated. In the meantime, these widely cited studies have had direct and indirect influences on some people’s smoking and lifestyle choices. This means that for an unknown and unknowable number of individual men and women, this programme of research has been a contributory factor in premature illness and death. How can members of the public and their policymakers turn to science for help with difficult decisions when even this most extreme of scientific disputes cannot be resolved?
The problem is, and has always been: do we have the will to do anything about it?
It seems now, finally, that King’s College has conducted its own retrospective investigation of its most famous alumnus and concluded that an enormous body of Eysenck’s work conducted at the then-Institute of Psychiatry is problematic. It is also notable that this is less than half of the papers, books, and monographs that Marks (2019) recently identified as potentially problematic.
This is a highly unusual situation. If these papers were retracted, it would vault Eysenck – more than a decade after his death – onto the Retraction Watch leaderboard, tied for 22nd. If the full body of 61 documents was retracted, Eysenck would eclipse Diederik Stapel (58) as the most retracted psychologist in history, a scarcely believable legacy for someone who was at one time the most cited psychologist on the planet.
This entire mess sat, in plain sight and painstakingly spelled out in dozens of published articles, entirely uninvestigated for three decades. So, why now? What changed? Some presumptive combination of institutional inertia, hero worship, and the unwillingness to impose sanction has finally been overcome.
This is reminiscent of the case of Cyril Burt, an influential psychologist and educationalist of the mid-20th century. Burt was interested in the heritability of IQ, and studied monozygotic (identical) twins reared apart. By the end of his career he had found 53 pairs of such twins, and reported an IQ correlation of .77, which indicates that the heritable component of variance is greater than .5. After his death in 1971, investigators were unable to locate two of his collaborators, and concluded that his research was questionable.
The correlation between monozygotic twins reared apart (in mutually independent environments) is a direct estimate of the variance that genetic factors explain, e.g. an MZA correlation of .77 means that the genetic proportion of the phenotypic variance is 77%. You don’t square the MZA correlation because, conceptually, the correlation itself is an estimate of the square of the correlation between genotypes and phenotypes.
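To make this concrete, here is a minimal simulation under the standard additive model, with .77 plugged in as the assumed true heritability (illustrative values, not Burt’s data):

h2 <- 0.77 # assumed true heritability to recover
g <- rnorm(100000, 0, sqrt(h2)) # genetic component, shared by both twins
p1 <- g + rnorm(100000, 0, sqrt(1 - h2)) # twin 1: genotype plus own environment
p2 <- g + rnorm(100000, 0, sqrt(1 - h2)) # twin 2: genotype plus an independent environment
cor(p1, p2) # ~0.77 - the raw correlation recovers h2, no squaring needed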
Burt’s case is quite different from Eysenck’s in that the twin correlation Burt reported is quite similar to those reported by others. The average correlation in five MZA studies by other research teams is .75, not significantly different from Burt’s. Burt’s data, even if fabricated, are therefore not implausible. In contrast, Eysenck and Grossarth-Maticek’s results are highly implausible in light of all other research.
If you were even moderately familiar with adoption placement policy in England in those days, you should first ask yourself how Burt managed to find his monozygotic reared-apart twins in the first place.
His “research” can be adequately summarised in the phrase “It is a crock of shite, and it stinketh.”
One has to seriously worry about those who still insist that this was real.
“Burt’s data, even if fabricated, are therefore not implausible.”
If the data are fabricated the results are not believable, except to the religious.
Interestingly, Eysenck was a defender of Burt during the period when the questionable Burt studies were coming to light.
“Imagine we discovered evidence that BF Skinner or Claude Bernard or Gregor Mendel or Isaac Newton had seriously plagiarised a document, made a crucial error of analysis, and so on – there would be absolutely no will for the ‘correction of the scientific record’, because in some meaningful sense, their work is not the scientific record any more.”
IMO, this is a dubious contention, since it assumes that there is an equivalent of the legal term “statute of limitations” in the scientific record. Besides, it is odd to see BF Skinner in the company of the other three gentlemen. It is true, though, that works which laid the foundation of a large scientific field are especially difficult to correct even when they (not their authors) are shown to have been faulty from the beginning in the light of new discoveries, because in the meantime they may have evolved from knowledge into dogma.
I am not aware of any common term that means the opposite of “scientific discovery”. And, for the record, in the post-genomic era the pioneering claim of Claude Bernard that animals can be used as models of human disease is seriously in question, and Gregor Mendel managed to completely overlook what is now known as epigenetics.
There “would be absolutely no will to correct the scientific record” if Mendel plagiarized or made an error of analysis?
Don’t tell RA Fisher:
Fisher, R. A. (1936). Has Mendel’s work been rediscovered? Annals of Science, 1(2), 115-137.
Still being considered:
Franklin, A., Edwards, A. W., Fairbanks, D. J., & Hartl, D. L. (Eds.). (2008). Ending the Mendel-Fisher Controversy. University of Pittsburgh Press.
Fisher sounds very compelling to me in his re-analysis of Mendel’s experiments, and he implies that data were indeed “filtered” and new experiments set up to enhance the presentation of the main discovery, as, in his opinion, any good lecturer would have done. This does not invalidate the basis of classical genetics; it just shows that the reality is more complex than it describes. In fact, no user of classical genetic methods ever stumbled on the world of long non-coding RNA genes; they were discovered by other means.
Fisher also implies, however, that Mendel’s work was not re-discovered in the true sense of the word, as only some of his progressive ideas were understood by his immediate followers.
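For those curious how a “too good to be true” fit is actually detected: Fisher’s argument was that Mendel’s reported ratios sit implausibly close to expectation, which shows up as chi-square statistics that are collectively far too small. A minimal illustration in R, with invented counts standing in for Mendel’s actual data:

observed <- c(449, 151) # hypothetical segregation counts, not Mendel's data
expected <- c(3, 1) / 4 * sum(observed) # expected counts under a true 3:1 ratio
chisq <- sum((observed - expected)^2 / expected) # ~0.009 on 1 df
pchisq(chisq, df = 1) # ~0.075: a fit this close or closer arises by chance under 8% of the time

One such result proves nothing, but Fisher combined these probabilities across all of Mendel’s reported experiments, and consistently tiny chi-squares multiply into overwhelming evidence that the data were adjusted towards expectation.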
How cited are the problematic papers? I never heard of them until now.
100-200 cites for the 6 or 7 most cited, so pretty well cited.
By far the most cited meta-analysis of the association of stress with cancer depends crucially on the fraudulent data of Eysenck to get the OR away from 1.0, and only barely so. I doubt the meta-analysis would otherwise have been accepted in such a prestigious journal with essentially null effects. We attempted to get a retraction of the highly cited meta-analysis. The authors replied that their protocol did not provide for the exclusion of data because it was fraudulent.
See https://www.nature.com/articles/ncponc1134-c1
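For anyone who wants to check this sort of dependence themselves, a leave-one-out sensitivity analysis is the standard tool: re-fit the meta-analysis with each study removed and see whether the pooled estimate survives. A minimal sketch using the metafor package, with invented effect sizes standing in for the real studies:

library(metafor)
yi <- c(0.05, -0.02, 0.08, 0.01, 1.20) # hypothetical log odds ratios; the last is an implausible outlier
vi <- c(0.04, 0.05, 0.03, 0.06, 0.02) # corresponding sampling variances
res <- rma(yi, vi) # random-effects meta-analysis
leave1out(res) # pooled estimate with each study dropped in turn

Dropping the outlier collapses the pooled estimate towards zero, which is precisely the pattern one would expect if a single fraudulent dataset were carrying the result.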
According to various sources, the mean height for men in the US is 176 cm while the mean for women is 162 cm. The standard deviation is about 10 cm. I ran the following R code to generate a sample of 100,000 men and women whose heights match those parameters:
sex <- rbinom(100000,1,prob=0.5) # women=0, men=1
height <- rnorm(100000,162+sex*14,10) # mean 162 cm for women, 176 cm for men; sd 10 cm
sex_and_height <- data.frame(sex,height)
Then I ran the following command to get mean heights by sex:
aggregate(sex_and_height,by=list(sex),FUN=mean)[,2:3] # mean of each column within each sex
The result is that women average about 162 cm while men average about 176 cm. Now, if I understand you correctly, this means that it is no longer believable that the mean heights of men and women are 176 cm and 162 cm in the US. After all, it’s clear that my data are fabricated. Because of my nefarious data forgery, only the religious can still give any credence to these numbers!
Note that this is a response to “Fernando Pessoa” above.