Psychology researcher explains how retraction-causing errors led to change in her lab

Last month, we brought you the story of two retractions by Yale’s Laurie Santos because the team discovered errors in the way the first author had coded the data. That first author, Neha Mahajan, took full responsibility for the coding problems, according to the retraction notices, and a university investigation cleared her of any “intentional, knowing, reckless, or grossly negligent action.”

But a few of our readers noted that the papers refer to a second coder on some of the experiments, and have questioned whether that’s compatible with Mahajan being solely responsible for the errors.

We asked Santos earlier this week to explain the apparent discrepancy, which she did along with a description of how her lab has made changes to prevent such errors in the future:

Both retracted papers followed what (until recently) was the typical procedure for coding looking time studies in my lab: we had one experimenter (in this case, Neha) code the duration of subjects’ looking for all the sessions and then used this first experimenter’s measurements in all the analyses we report in the paper. We then had a second experimenter code only a small subset of the sessions (JPSP paper: 10 of the sessions of Experiment 1, Developmental Science paper: 6 of the sessions of Experiment 1) to establish reliability with the first coder. Basically, we check to be sure that a second coder would get qualitatively similar results to the first coder on a random subset of the sessions by performing a correlation on the two coders’ measurements. In both of the now retracted studies, these reliability correlation scores were a bit lower than we had observed in previous published studies using the same techniques (r = 0.75 in JPSP, r = 0.85 in Developmental Science), but at the time we didn’t consider them low enough to raise any red flags (and neither did the reviewers) so we used the first coder’s measurements alone in the original analyses. Unfortunately, only double-coding a subset of the data wasn’t enough to spot the problem. We’ve now changed our lab’s coding procedures to prevent this in the future: all our studies will now have a second coder recode all sessions, so that every datapoint can be double-checked. If we had used this full dataset double-coding procedure before, we probably would have detected the problems in the retracted papers.
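To give a sense of how much uncertainty a reliability correlation from so few sessions can carry, here is a minimal Python sketch that puts an approximate 95% confidence interval around the two reported values using the Fisher z-transformation. It assumes one paired measurement per double-coded session (n = 10 and n = 6) and roughly bivariate-normal data. Neither assumption comes from the papers, and the function name `fisher_ci` is ours, chosen for illustration.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r computed from n paired observations.

    Uses the Fisher z-transformation; the approximation assumes
    bivariate-normal data and needs n > 3.
    """
    if n <= 3:
        raise ValueError("Fisher interval needs more than 3 paired observations")
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Reported reliability values; n is ASSUMED to equal the number of
# double-coded sessions, which the papers do not confirm.
reported = {"JPSP": (0.75, 10), "Developmental Science": (0.85, 6)}

for label, (r, n) in reported.items():
    low, high = fisher_ci(r, n)
    print(f"{label}: r = {r}, n = {n}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

Under those assumptions the intervals are wide, roughly 0.23 to 0.94 for the JPSP subset and 0.12 to 0.98 for the Developmental Science subset. If each session actually contributed several paired measurements, n would be larger and the intervals narrower.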

The second of the two retractions, which was not yet live when we published our first post, has been published in the Journal of Personality and Social Psychology. (The full version, available behind a paywall, is the same as we reported it would be, but the abstracted version is missing the last few lines.)

16 thoughts on “Psychology researcher explains how retraction-causing errors led to change in her lab”

  1. I am not sure that Santos has told us anything more than what we already knew (except that procedures will be different in future).

    We already knew that a second coder recoded a subset of the trials. It remains to be explained how, if Coder 1 was the only person who was coding incorrectly, Coder 2 was in strong agreement with her.

    That the agreement was somewhat less strong than in previous studies doesn’t change the fact that it was strong – not once but twice (r=0.75 and then, separately, r=0.85).

    These correlations would be impressive in any branch of psychology.

    1. Actually, no, r = 0.75 isn’t impressive for a reliability measure. It means that you can predict only about 56% of the variance (r² ≈ 0.56) in one set of codes from the other. For a reliability measure (where you are measuring the same thing twice), you are aiming for correlations above .9 and approaching 1; this level should be considered barely adequate or adequate.

      It is also easy to see how you could get high r for a subset where overall r is low. Some things will be easy to code and some hard to code, and agreement will be high for the easy sections. You could at random pick a session with a lot of easy-to-code behaviours/events. Also, it is likely that the selection of the coded sections is not random, and the selection could be biased towards highly reliable sections.

      In addition reliability measures ought to include a confidence interval. With a small sample of codes the CI will be large and reviewers will not be so tempted to conclude that reliability is high from this sort of procedure.

      1. What is the probability of getting r=0.75 AND r=0.85 in independent studies, just by chance? This is a case of joint probabilities, which become quite low very quickly. There is an unexplained correlation between coder 1 and coder 2. Perhaps coder 2 did not even do what he/she was supposed to but just saw what coder 1 did and added some “noise” to the scores. Who knows? Laziness, drive to publish quickly, a bit of megalomania perhaps… But something seems seriously off here.

    2. “It remains to be explained how, if Coder 1 was the only person who was coding incorrectly, Coder 2 was in strong agreement with her.” Yes, that does remain to be explained. 0.85 is a decent reliability value in this field. Given that there was no effect whatsoever, due to errors by coder 1, why is there such a high reliability value with independent coder 2? This is rather disturbing, actually, and this story is far from over, I think.

    3. The problem is that if both coders do not code all of the protocols, the r is an overestimate of the agreement. This simple correlation treats coder effects as if they can be systematically accounted for, but they cannot. The appropriate measure of agreement would have been the intraclass correlation, and variance due to coder should have been treated as real error.

  2. The Yale researchers originally thought the monkeys would look longer at photographs of outgroup members than those of ingroup members, but after they detected the erroneous coding found out that they looked equally long at both categories. Thinking about this, it would really make sense for any animal to look longer at outgroup members. The first look at conspecifics should serve the identification of friend or foe, and an outgroup member always has more potential of being hostile than a familiar individual. I wonder if the non-effect in this study has anything to do with the artificial nature of photographs. Do monkeys also look equally long at ingroup members and outgroup members when they meet them in person?

  3. What was the nature of the coding error? It seemed to produce a significant effect in the predicted direction multiple times. Revealing more about this error would help other labs avoid it in the future.

  4. Where is the lab director stepping up to say that it is THEIR fault? The American Psychological Association (APA) code of ethics (to which all practicing psychologists ought to be bound) clearly states that professional psychologists assume responsibility for the proper training of their assistants. Where does Santos acknowledge this? Her entire explanation is about singling out someone else for blame. Perhaps one person proximally caused this, but I’d like to know a lot more about the lab culture that was allowing this to happen. This does not seem like an isolated incident. I’m not trying to pick on Santos, I just think this should prompt some bigger discussions about the drive to publish so quickly, especially in a field as tricky as this. Maybe the conversation should really be about whether this is a hidden invitation to slow down?

    1. It’s not that we are picking on Santos. It’s just that this is a recent instance where something went seriously wrong. Not taking blame by the PI plus the still unexplained high correlation between first and second coder are rather suspicious at this point, in my opinion.

      1. Yes, that is my point, too. I think we have not heard the end of this. I just meant that I think the kind of lab culture and practice that Santos fostered is probably widespread; it’s not just her in particular. It’s the drive to publish, to be a TED talker, to be scientifically famous, that leads to this. And the fact that she won’t own up and say “I’m responsible for not having had proper training oversight etc.” speaks volumes about what’s really driving her.

  5. I have just been told by a reliable source that Laurie Santos was one of the three students who exposed Marc Hauser’s misconduct. Which was courageous. And I have to say, as a science writer I have been very fond of her research over the years, loved to write about her studies. Most of the “sinners” exposed here at RW are small figures anyway whose papers only touch marginal aspects of research subjects and would not be of any interest for journalism, were it not for the misconduct (and many are not even able to deliver that little without cheating). But Santos always delivered top notch work, highly innovative and enlightening. In one case – a story I loved deeply at that time – the media may have been guilty of misconduct. You may have heard of this. She taught monkeys to use money, and very quickly there was a downright monkey economy. And then: The monkeys invented prostitution. That was all over the media, but it may have been fake, invented by the media. Found this on the web, do not know if it is true, but if it is, it shows how big a mess we are in. Maybe we will never get rid of that false “monkey prostitution meme”:

    “sorry to ruin the fun here, but i worked in Laurie Santos’ lab at Yale, and was an undergrad in the lab when this article came out. When discussing this article, she was entirely confused where the reporter had gotten that idea from. I believe her exact words were, “I called Keith when I saw the article, and we determined that neither of us told the reporter that monkeys pay for sex. That’s just not true– we’ve never seen that. He completely made that up.”

    I hope that this affair will be cleared up and we get to know what exact kind of error happened there. I do not like the prospect of the shadow of suspicion hanging over her head.

    1. “I have just been told by a reliable source that Laurie Santos was one of the three students who exposed Marc Hauser’s misconduct.” The timeline for that does not make much sense. Santos got her PhD in 2003. Harvard started investigating Hauser in 2007. So, she was not a student at that time, had left the lab long ago, and probably no longer had access to records etc. She got tenure in 2009, according to wiki. So, in any case, I seriously doubt if she did anything to ruffle feathers before then: somebody who puts so much effort into a career knows well not to do that!

      1. One of the papers retracted in the Hauser case was published in 2002, which meant the work was carried out well before then, when Santos and others in her cohort would have been around to witness. Further taking into account the very long time (many years) these types of investigations take at various levels, I don’t see how the timeline doesn’t make sense.
        And I hope that anyone would blow the whistle when bearing witness to misconduct, no matter one’s current career status. Avoiding ruffling feathers when there are real implications for the field is not a good policy for anyone hoping to live in that field in the long term and for anyone who cares about the work.

        1. These are all speculations that are rather irrelevant to this case, in part because we cannot verify them. However, we have objective evidence: the retracted papers. We need a convincing explanation for the mysterious correlation between raters that has been pointed out above. Without that, I’m not convinced it was just an innocent mistake of some sort.
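One commenter in the thread above argues that a simple correlation overstates agreement because it ignores coder effects, and that an intraclass correlation would have been the appropriate measure. The Python sketch below illustrates that general point on made-up numbers, not on data from either study. It compares Pearson's r with one ICC variant (two-way, absolute agreement, single measure, often written ICC(A,1)); the choice of variant, the function names and the looking times are all assumptions made for illustration.

```python
from statistics import mean


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_absolute_agreement(ratings):
    """ICC(A,1): two-way, absolute agreement, single measure.

    `ratings` is a list of rows (one per session), each row holding one
    value per coder. Variance between coders counts against agreement.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-session mean square
    msc = ss_cols / (k - 1)                 # between-coder mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# HYPOTHETICAL looking times in seconds: coder 2 is always 3 s above coder 1.
coder1 = [2.0, 3.5, 5.0, 6.5, 8.0]
coder2 = [t + 3.0 for t in coder1]

print("Pearson r:", round(pearson_r(coder1, coder2), 3))
print("ICC(A,1): ", round(icc_absolute_agreement(list(zip(coder1, coder2))), 3))
```

In this toy case the two coders rank every session identically, so Pearson's r is at its maximum, while the absolute-agreement ICC is much lower because the constant 3-second offset between coders is counted as disagreement.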
