Doing the right thing: Yale psychology lab retracts monkey papers for inaccurate coding

In the midst of the holiday season, it’s a pleasure to be able to share the story of a scientist doing the right thing at significant professional cost, especially a researcher in psychology, a field that has been battered lately by scandal.

Sometime after publishing two papers — one in Developmental Science and another in the Journal of Personality and Social Psychology — Yale’s Laurie Santos and her students realized there were problems with their data. We’ll let Santos — who made sure to respond to our request for comment immediately, in the midst of holiday travel, so that we had all the details and could help get the word out — tell the story:

My students and I were trying to replicate the results of Experiments 1-3 that we published in Mahajan et al. (2011) JPSP with a larger sample of monkeys (Note: since we know replication tends to be a problem in primate studies in particular due to subject access, it is our lab’s policy to regularly try to replicate our own effects with new samples at our field site, just to be sure the effects we report hold in different samples and to serve as baseline conditions for follow-up studies).

In this case, we were also trying to extend the sample we had initially tested to look for individual differences in the magnitude of the “ingroup bias” we reported in the JPSP paper. Our JPSP paper observed that when monkeys were presented with photographs of ingroup and outgroup members, they looked significantly longer at the outgroup faces than the ingroup faces (see effects in Experiments 1-3). When we tried to replicate this pattern in a larger sample, we didn’t observe the original effect. Instead, we saw that monkeys didn’t show any consistent overall difference in looking across the ingroup and outgroup faces. We thought we could have failed to replicate the effect reported in the JPSP paper for a number of uninteresting reasons (e.g., several other research groups had recently started using photographs at the field site, and therefore monkeys may have habituated to pictures overall, etc.) but just to be on the safe side, we decided to go back to the initial videos from Mahajan et al. and have new coders recode them all from scratch.

This was when [we] detected the problem with Neha Mahajan’s (the first author) coding of the original datasets. We then quickly realized that Neha had probably used the same coding techniques in other studies she worked on and thus that similar problems might be present in other datasets that Neha had coded in the lab. So we went back and checked the original coding for all other studies Neha had coded as well. This was how we caught the second set of errors in the now-retracted Developmental Science paper you wrote to me about initially.

Since both of the coding problems resulted in results that no longer existed, we thought it was our responsibility to report the situation to the university. They did indeed perform an investigation, but found that Neha was not guilty of any misconduct or negligence. It seems it was just human error. (Sucky annoying human error that resulted in the retraction of two papers, but human error nonetheless).

The retraction notices include a good amount of detail from that narrative. Here’s the one from Developmental Science:

Neha Mahajan, Jennifer L. Barnes, Marissa Blanco, Laurie R. Santos “Enumeration of objects and substances in non-human primates: experiments with brown lemurs (Eulemur fulvus)” Developmental Science, Volume 12, Issue 6, pages 920–928, 2009

The above article, published online on 18 May 2009 in Wiley Online Library, has been retracted by agreement between the authors, the journal Editors in Chief, Michelle De Haan and Charles Nelson, and John Wiley & Sons Limited.

The authors determined that the looking time coding performed by the first author, N. Mahajan, was inaccurate and did not reflect the looking times found by other trained coders. A reanalysis of the data reported in this paper failed to verify the reported effects, and thus the authors have requested that the publication be retracted.

Ms. Mahajan takes sole responsibility for the inaccurate coding. An investigation found that the inaccurate coding was not caused by intentional, knowing, reckless, or grossly negligent action by Ms. Mahajan. All authors of the original publication joined in the request for retraction.

The now-retracted study has been cited seven times, according to Thomson Scientific’s Web of Knowledge. And here’s the retraction in the Journal of Personality and Social Psychology, which will appear in the January issue:

The following article from the March 2011 issue is being retracted: Mahajan, N., Martinez, M., Gutierrez, N. L., Diesendruck, G., Banaji, M., & Santos, L. R. (2011). The evolution of intergroup bias: Perceptions and attitudes in rhesus macaques. Journal of Personality and Social Psychology, 100, 387–405. doi: 10.1037/a0022459

The retraction is at the request of the authors. This article reported two independent sets of effects concerning monkeys’ intergroup behavior: (a) that there are differences in monkeys’ vigilance toward ingroup and outgroup members (Experiments 1–5) and (b) that monkeys show implicit associations toward ingroup and outgroup members (Experiments 6 and 7). After lab members were unable to replicate the first set of published effects for new research purposes, the authors determined that the looking time coding performed by one of the coauthors, N. Mahajan, was inaccurate and did not reflect the looking times found by other trained coders. Ms. Mahajan had been the sole coder for most of the studies, resulting in the reporting of inaccurate data for Experiments 1–5. The coding performed by the other coauthors was accurate. A small subset of the studies were double-coded for reliability, but this coding was not used in the overall analyses reported in the article.

A full recoding of the data from Experiments 6 and 7 verified the conclusions from the second set of findings reported in this article. The results of Experiments 6 and 7 will be resubmitted for publication as a separate article.

Ms. Mahajan takes sole responsibility for the inaccurate coding. A formal investigation conducted by Yale University found that the inaccurate coding was not caused by intentional, knowing, reckless, or grossly negligent action by Ms. Mahajan.

All authors of the original article joined in the request for retraction.

That study has been cited 22 times.

Santos reflected on the experience:

Having to retract papers is a scientist’s worst nightmare. Especially in the current climate in psychology right now (e.g., Hausergate, Stapelgate, etc…), this is pretty much the most awful thing that could happen to a PI. But I also hope that this awful situation can, at least in some sense, serve as a positive example of correcting the scientific record. We would have never caught the coding error without replicating the initial JPSP effects (which particularly given the subject access that plagues primate cognition work is something that more and more scientists need to do). And as soon as we found the problems, we immediately went back and checked all the other datasets too. I’m obviously embarrassed that we didn’t catch all this earlier, but I’m still glad that we caught it when we did.

The fact that Santos mentioned another Ivy League psychology researcher who studied monkeys is a reminder of the stark contrast between the straightforward way she approached these problems and the way that Marc Hauser, with whom she studied at Harvard, chose to. Kudos to Santos and her colleagues.

Hat tip: Rolf Degen

20 thoughts on “Doing the right thing: Yale psychology lab retracts monkey papers for inaccurate coding”

    1. I would be interested in knowing more about the nature of the coding inaccuracy that could produce again and again a statistically reliable effect in the predicted direction in multiple experiments across multiple papers.

      Also, it would be smart for reviewers to never accept a paper with only a single coder, or accept a paper with multiple coders that did not report a measure of inter-rater reliability. Otherwise the DV could be complete nonsense.

  1. “I would be interested in knowing more about the nature of the coding inaccuracy that could produce again and again a statistically reliable effect in the predicted direction in multiple experiments across multiple papers”

    If the experiments were not blinded then such systematic bias would be a constant worry. If the first author set up the experiments and did the coding as well, then it would be expected that there would be bias, of a greater or lesser extent.

    It seems to me that these experiments (if I understand them) could easily be done in a blinded format so that experimenter bias could be removed.

  2. There are some puzzles here. The Methods in the JPSP paper state that the coder was blind to the experimental condition. If that’s correct, then even if she was coding differently from other people, any difference between groups that emerged using her data would have to reflect a real difference in the populations. Attributing the discrepancy to “human error” is not really an adequate explanation.

    Also, the methods for Experiment 1 state that the data from 10 sessions were coded by a second coder in order to check the reliability of coding — which came out at 0.75, whatever that might mean.

    Bottom line: there seems to be more going on here than meets the eye.

    1. An astute comment. The JPSP paper, Experiment 1, says that a second (blinded) coder re-scored 10 sessions of data, and that they agreed with the main rater r=0.75.

      In the Dev Sci paper, Experiment 1, they say that a second rater re-scored 6 lemurs’ data, and agreed with the main rater r=0.85.

      It is very hard to see how this is compatible with the story of a single bad rater.

      1. We can’t judge Santos by the fact that Hauser was her boss.

        After all it was one of Hauser’s juniors who blew the whistle on Marc Hauser. You could equally well say she comes from a distinguished lineage of whistle-blowers.

        Judge her on her own merits…

        1. We are talking about her own merits. At the same time, a prosecutor would make the following argument. Hauser was not just her boss, he was her main mentor during the PhD years, the person who “made” her. If a successful mentor has a “casual” attitude about doing rigorous studies and keeping records, it is quite likely that attitude will trickle down to his students, at least implicitly: the guy is superfamous, it means his approach to science works, so why should I do things differently?
          Now for the paper. Perhaps there is an entirely innocent explanation (though Santos has to address the above comments about correlation between raters).
          But let’s assume for a minute that there was something fishy in those studies (this is just a SPECULATIVE EXERCISE). A level 1 cheater would simply deny everything and hope that the thing goes away, the standard approach until recently, before people like Hauser were busted. However, perhaps there was a rumor developing in the field that those results did not replicate (there has been talk about lack of replication of some of Hauser’s work for a long time, for example). The strategy of a level 2 cheater is to come out with a nice story, put it in the noble context that the lab always replicates findings etc, and blame the assistant. In other words, you can never take these things at face value, they can always be explained with a more sophisticated level of cheating. One of the effects of sites like RW is probably to push for selection of better and more sophisticated cheaters.
          With regard to the “significant professional cost”, look, once you are a tenured professor, the only significant professional cost is getting caught cheating, which could result in you losing your job. There would be a more significant professional cost for a postdoc, or a nontenured faculty member, not for a full professor. If there was cheating, retracting and spinning an “error” story is the best proactive scenario.

  3. I disagree that this situation was handled in a model way.

    “[T]his is pretty much the most awful thing that could happen to a PI.”

    Wrong. This didn’t happen *to* the PI. It happened because of the PI. Why can’t the corresponding author just accept the responsibility that she should have for the paper? A simple, “I’m sorry, it was my responsibility to supervise the coding, train the coder and look at the data, but I didn’t do that” would suffice and make this a model case. As it stands… it’s just another PI throwing a poorly trained/supervised student under the bus and taking a woe-is-me stance.

    1. I doubt if her mentor in grad school, Marc Hauser, had the best practices in the lab. Well, we know how that story ended.

    2. First, a disclosure: I was an undergraduate student in Professor Santos’s lab many years ago.

      While I agree that PIs should always look at a study’s data, it is unreasonable to expect a PI to look at the raw data for such a study. In the case of these looking time studies, coding the raw data generally involves going frame by frame through hours of videotape and recording what the primate is looking at in every single frame. It is a time-consuming and laborious process that is usually delegated to graduate students and undergraduate research assistants. It’s just not feasible for a PI to undertake such a task, and I bet most PIs never do it once they get their faculty positions because numerous other responsibilities take precedence. Professor Santos likely did not go over the data until after they had been coded (e.g., in condition A, monkey 2 looked at stimulus X for some number of frames and stimulus Y for some number of frames).

      Furthermore, a graduate student who is a first author on a paper should take responsibility for his or her data analysis, which Dr. Mahajan notably did and should be commended for. Part of the process of graduate school is to learn to become an independent thinker and researcher. Labs could not function if PIs were micromanagers who did not trust their students.

  4. These studies did make a big splash in the media because they seemed to show the evolutionary origin of prejudice and xenophobia. It will be interesting to see if these media will now set the record straight. Here are a few examples:

    Just look up “The Evolution of Intergroup Bias” at Google and you will find many others. There are also some academic books by big shots of the field who cite the study. This thing is embarrassing for the field. And it would be nice if someone connected with the study would explain how a neutral coding error constantly yields results that lead in one direction. This is not accusatory, it would be very important to know the mechanism to prevent this thing in the future.

  5. She learned from the best: She got her PhD with Marc Hauser during the years when the fishy stuff was going on in his lab. Just saying…

    1. Wouldn’t jump to conclusions. Looks like RW is in touch with the author, maybe she can help reconcile the apparent discrepancy between the retraction notices, the narrative provided to RW, and what was stated in the article.

  6. Thank you Retraction Watch for this excellent post – the article and comments below are very interesting for the reproducibility debate.

    A study recently found that scientists reward authors who report their own errors. Citation frequencies decline less drastically when authors self-retract their paper – compared to retractions that were not self-reported.

    By making the replication problems public, initiating an investigation, requesting a retraction and commenting on the whole process openly, Santos and her lab definitely did the right thing – even if we do not know if really just a single coder made errors or what the whole story is.

    1. It is good that a false finding is no longer in the literature. Probably, the community already doubted the finding, however.

  7. “I would be interested in knowing more about the nature of the coding inaccuracy that could produce again and again a statistically reliable effect in the predicted direction in multiple experiments across multiple papers.”

    Experimenter bias is a (disturbingly) common phenomenon across many branches of science. By “experimenter bias,” I mean to refer to a general trend for people to create and run studies in such a way that they are more likely to get a desired or hypothesized result than they should be. As one of the other commenters mentioned, if the coder had access to the different experimental conditions when coding the data, this could easily have led to bias of this type. One of the interesting things about experimenter bias is that it is often not conscious and experimenters are often unaware that their hypothesis can lead to distorted results (see the old work of Robert Rosenthal for more detail on this). So even if there was systematic inaccuracy in the predicted direction across multiple papers, that is not necessarily proof that there was conscious, volitional misconduct going on. To me, more than anything, I think the PI learned that the lab needed to be even more prudent in guarding against experimenter bias, and likely implemented protocols that would prevent something like this from happening again.

    1. Yes, something like this may possibly have happened. This makes it so pertinent to find out what went wrong, to learn from the experience. Supposedly the coder (and the control coders) were blind to the experimental condition. The study wanted to find out if the monkeys were looking longer at the photographs of ingroup or outgroup members. Perhaps some information “leaked” unintentionally. That even happens in placebo controlled double blind studies in pharmacological research: The “real” drugs produce significant side effects which make the subjects aware of being in the experimental group. Whatever, here are some nice pictures of the test subjects.
