A team of researchers in England has retracted a 2014 paper after a graduate student affiliated with the group found a fatal error while trying to replicate parts of the work, an error that may affect similar studies by other scientists as well.
The article, “Perceptual load affects spatial tuning of neuronal populations in human early visual cortex,” was written by Benjamin de Haas, then of the Institute of Cognitive Neuroscience at University College London, and his colleagues at UCL.
According to the retraction notice:
In our paper, we reported a significant increase of parafoveal population receptive field (pRF) sizes and eccentricity in visual areas V1–3 under high versus low perceptual load at fixation. We have recently been notified of a potential flaw in our analysis pipeline for this paper. As described in the original manuscript, the analysis defined eccentricity bands according to one of the two conditions that were then compared (the low-load condition). This is currently an approach that is widespread in the field, but we now realize has an unappreciated potential bias. The circularity can bias the results due to a combination of regression to the mean and heteroskedastic error variance (eccentricity errors are larger in the periphery than in the central visual field). Also, as reported, the original analysis expressed changes in pRF sizes proportionally. This can curtail the negative end of the difference distribution because growth, but not shrinkage, can be greater than 100%. Simulations and re-analyses of our original data strongly suggest that these steps did indeed inflate the effects we reported. Specifically, we conducted a re-analysis using absolute rather than proportional changes in pRF sizes, and binning data according to independent probabilistic maps. Reassuringly, this analysis reproduced trends for increased pRF sizes in V1–3 under high versus low load. However, these no longer survived family-wise error correction (FWE). Likewise, a trend for the reported increase in pRF eccentricity was now only observed in V1 and failed to reach FWE significance.
Therefore, we no longer consider the reported results reliable and wish to correct the scientific record by voluntarily retracting our paper. We apologize to the scientific community for any inconvenience caused and caution fellow researchers against the use of non-independent binning practices, which appear widespread in the field. Finally, we would like to thank our colleague Susanne Stoll, who first pointed out the problem to us and plans to publish in due course a more general exposition on the difficulties of this approach.
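The second problem the notice describes, expressing pRF changes as proportions, is easy to see in a toy simulation. The sketch below (in Python, with made-up numbers rather than the authors’ data) draws two equally noisy measurements of the same underlying pRF size: the mean absolute change is about zero, as it should be, but the mean proportional change comes out positive, because shrinkage is bounded at 100% while growth is not.

```python
# Toy illustration (assumed values, not the authors' pipeline) of why proportional
# changes can inflate apparent growth: shrinkage is capped at -100%, growth is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_size = 2.0   # hypothetical "true" pRF size in degrees (assumed value)
noise_sd = 0.8    # measurement noise, identical in both conditions (assumed value)

# Two independent noisy measurements of the SAME underlying size: a pure null effect.
low = np.clip(rng.normal(true_size, noise_sd, n), 0.1, None)   # keep sizes positive
high = np.clip(rng.normal(true_size, noise_sd, n), 0.1, None)

absolute_change = high - low
proportional_change = (high - low) / low * 100                 # percent change

print(f"mean absolute change:     {absolute_change.mean():+.3f} deg")  # ~0, as it should be
print(f"mean proportional change: {proportional_change.mean():+.1f} %")  # > 0 despite the null
```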
de Haas, who is now at Justus Liebig University, in Giessen, Germany, told us that Stoll — who is finishing her PhD under the supervision of study co-author D. Samuel Schwarzkopf — contacted him in June 2020 with the bad news:
She tried to do something similar to our original experiment, but noticed that using a similar pipeline she would get opposite results depending on which condition she used for binning (which should have no effect). Susanne contacted me in June and told me about her odd results and that she thought the problem may apply to my original paper as well.
At first I was skeptical, because she mentioned regression to the mean and the pattern of our results was the opposite of that – I just couldn’t see how what we found could be the result of a selection problem. But I was worried enough to dig out the old hard drive and have a look. I wanted to know! Within a week I had reproduced the original analysis and the problem Susanne reported for her data.
The pattern of results strongly hinged on using one, but not the other condition for binning. Digging deeper into individual data sets made this puzzle more tangible. The pRF data were strongly heteroskedastic – the spread of data scaled with its amplitude – and binning based on one condition effectively curtailed that noise component for that condition but not the other.
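That mechanism can be reproduced with a few lines of simulation. The sketch below (Python, with assumed parameter values rather than the original data) generates two statistically identical “conditions” whose measurement noise grows with eccentricity, then compares them within eccentricity bins defined on one condition alone: the same null data produce apparent effects of opposite sign depending on which condition defines the bins.

```python
# Toy simulation of circular binning with heteroskedastic noise (assumed values).
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 20_000

true_ecc = rng.uniform(0.5, 9.0, n_voxels)   # hypothetical "true" eccentricities (deg)
noise_sd = 0.3 * true_ecc                    # heteroskedastic: noise grows with eccentricity

# Null scenario: both "conditions" measure the same underlying eccentricities.
ecc_low = true_ecc + rng.normal(0, noise_sd)
ecc_high = true_ecc + rng.normal(0, noise_sd)
diff = ecc_high - ecc_low                    # the comparison of interest (pure noise here)

def binned_means(values, bin_on, edges):
    """Average `values` within eccentricity bins that are defined on `bin_on` alone."""
    idx = np.digitize(bin_on, edges)
    return np.array([values[idx == k].mean() for k in range(1, len(edges))])

edges = np.linspace(0.5, 9.0, 7)             # six eccentricity bins
print("HIGH - LOW, binned on LOW :", np.round(binned_means(diff, ecc_low, edges), 2))
print("HIGH - LOW, binned on HIGH:", np.round(binned_means(diff, ecc_high, edges), 2))
# The same null data yield apparent effects of opposite sign depending on which
# condition defines the bins -- the signature Stoll noticed.
```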
de Haas said he contacted his co-authors to tell them what he’d learned:
As you may imagine this wasn’t the greatest of feelings, but fortunately everybody was understanding and supportive. My colleagues encouraged me to do an additional, unbiased analysis for which I used publicly available probabilistic retinotopic maps as a sort of surrogate condition for independent binning. This took a while to implement and gave non-significant trends in the direction of the original effects, but it was clear that the bulk of evidence was gone. So we pulled the trigger and contacted the journal.
It was also important to me to explain my mistake, because it’s so easy to miss and a scan of the literature suggests it may happen more often. Susanne already spent a lot of time and effort digging into the general issue before contacting me and planned on writing a technical paper anyway. I’m now a co-author on this and our preprint will go online in the next couple of days. I’m not proud of my original mistake and as worried about my career as most junior researchers without tenure are. But I’m very happy to have done the right thing and to have colleagues supportive of that.
In a blog post about the affair, Schwarzkopf wrote that the problem Stoll discovered may well have permeated a large swath of related research, as well as other fields:
It is so easy to make this mistake that you can find it all over the pRF literature. Clearly, neither authors nor reviewers have given it much thought. It is definitely not confined to studies of visual attention, although this is how we stumbled across it. It could be a comparison between different analysis methods or stimulus protocols. It could be studies measuring the plasticity of retinotopic maps after visual field loss. Ironically, it could even be studies that investigate the potential artifacts when mapping such plasticity incorrectly. It is not restricted to the kinds of plots I showed here but should affect any form of binning, including the binning into eccentricity bins that is most common in the literature. We suspect the problem is also pervasive in many other fields or in studies using other techniques. Only a few years ago a similar issue was described by David Shanks in the context of studying unconscious processing. It is also related to warnings you may occasionally hear about using median splits – really just a simpler version of the same approach.
I cannot tell you if the findings from other studies that made this error are spurious. To know that we would need access to the data and reanalyse these studies. Many of them were published before data and code sharing was relatively common. Moreover, you really need to have a validation dataset, like the replication data in my example figures here. The diversity of analysis pipelines and experimental designs makes this very complex – no two of these studies are alike. The error distributions may also vary between different studies, so ideally we need replication datasets for each study.
de Haas said the experience has provided a couple of “very specific and technical lessons,” which Stoll describes in a forthcoming paper, on which he’s also a co-author:
If you can, base selection and comparison on independent data.
Test your analysis pipeline on simulated data.
de Haas elaborated:
Mixing up data selection and comparison can go wrong in very unintuitive ways. In our original dataset the mechanisms usually leading to regression to the mean essentially flipped to ‘egression’ from the mean, most likely because of heteroskedasticity and the natural limitation of values to the positive range. It’s an example of how things can go awry in ways you may not expect and just won’t catch by merely looking at data plots.
The irony is that in the original paper we were quite paranoid and did a whole bunch of control analyses and experiments (for instance, we collected extra data to fit condition-specific hemodynamic response functions for each participant). But as it turned out we just missed an important one. That’s why I now think the safest bet is to apply your whole pipeline to simulated data and check whether it behaves the way it should.
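As a rough illustration of that kind of check (a hedged sketch, not the authors’ actual pipeline), the toy below bakes the circular-binning flaw into a small analysis and feeds it simulated data containing no true effect; an unbiased pipeline should flag roughly 5% of such null datasets as significant, so a much higher rate is the warning sign.

```python
# Running a (deliberately flawed) toy pipeline on simulated null data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def circular_pipeline(cond_a, cond_b, n_bins=6):
    """Toy pipeline with the flaw baked in: bin on cond_a, then test cond_b vs cond_a
    within the outermost bin, so selection and comparison use the same noisy data."""
    edges = np.quantile(cond_a, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(cond_a, edges[1:-1])
    outer = idx == n_bins - 1
    return stats.ttest_rel(cond_b[outer], cond_a[outer]).pvalue

def simulate_null(n_voxels=2_000):
    """Two 'conditions' measuring the same quantity, with eccentricity-scaled noise."""
    truth = rng.uniform(0.5, 9.0, n_voxels)
    noise = 0.3 * truth
    return truth + rng.normal(0, noise), truth + rng.normal(0, noise)

runs = 500
false_positives = sum(circular_pipeline(*simulate_null()) < 0.05 for _ in range(runs))
print(f"fraction 'significant' on pure null data: {false_positives / runs:.2f} (should be ~0.05)")
```

In this toy the circular pipeline flags far more than 5% of the null datasets; defining the bins on independent data, or on external probabilistic maps as in the re-analysis, should bring the rate back toward the nominal level.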
de Haas, who started his own lab this year, said:
Whenever my PhDs and I think of a pipeline for a tricky analysis question I encourage them to first try it on simulated data (both null and effect). We started doing this before I learned about my mistake in this paper, but this very much encourages me to carry on with this practice.
Science self-correction at its best.
Even though I know very little about statistical analysis, I had a feeling that an error affecting lots of papers would have to be an unsuitable method of statistical analysis.
Mistakes like this are unavoidable, but the way people behaved in this case when they found the mistake can be held up as a model of good science and professionalism.
(My former lab head once distributed a piece of software that, unknown to them, implemented the “method of minimum likelihood.” I comfort myself that I haven’t been quite *that* wrong yet, but gods know I’ve been wrong quite a few times.)
I’m not in any science field at all but I work as a journalist, and I have some sense for how bad corrections/retractions feel. They always hit me hard!
I just wanted to say how much I appreciate and admire de Haas and his co-authors for how well they handled this. Everyone involved is going to go on and do great things.