Retraction Watch

Tracking retractions as a window into the scientific process

Let’s not mischaracterize replication studies: authors

Brian Nosek

Scientists have been abuzz over a report in last week’s Science questioning the results of a recent landmark effort to replicate 100 published studies in top psychology journals. The critique of this effort – which suggested the authors couldn’t replicate most of the research because they didn’t adhere closely enough to the original studies – was debated in many outlets, including Nature, The New York Times, and Wired. Below, two of the authors of the original reproducibility project – Brian Nosek and Elizabeth Gilbert – use the example of one replicated study to show why it is important to describe accurately the nature of a study in order to assess whether the differences from the original should be considered consequential. In fact, they argue that one of the purposes of replication is to help assess whether differences presumed to be irrelevant are actually irrelevant, all of which brings us closer to the truth.

Published in Fall 2015, the Reproducibility Project: Psychology reported the first systematic effort to generate an estimate of reproducibility by replicating 100 published studies from 2008 issues of three prominent journals in psychology.  We are two of the co-authors of this 270-author project.  Overall, the reproducibility rate was ~40% across 5 distinct criteria.  In their critique published last week, Dan Gilbert (no relation), Gary King, Stephen Pettigrew, and Tim Wilson (2016) suggested that the effective reproducibility rate was not distinguishable from 100% reproducibility because of flaws in the methodology.

Many co-authors of the Reproducibility Project: Psychology published a response countering the critique.  Within days, independent commentaries challenging Gilbert and colleagues’ methodology and conclusion also emerged from Sanjay Srivastava, Uri Simonsohn, Daniel Lakens, Simine Vazire, Andrew Gelman, David Funder, Rolf Zwaan, and Dorothy Bishop.

We will address a simple point not covered by these commentaries.  Gilbert and colleagues suggested that dramatic differences between the original and replication studies caused lower reproducibility.  To support their point, they briefly described differences between six original and replication studies including this one:

An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon

They then elaborated in their on-line rebuttal to our response:

In one original study, researchers asked Israelis to imagine the consequences of taking a leave from mandatory military service (Shnabel & Nadler, 2008). The replication study asked Americans to imagine the consequences of a taking a leave to get married and go on a honeymoon. … And not surprisingly, the replication study failed.

Based on Gilbert and colleagues’ description of such a dramatic difference, one can’t help but conclude that the replication team was incompetent.  And if one replication could be fouled up this badly, it would seem likely that there are more problems in the other 99.  With this description, it is easy to believe that Gilbert and colleagues’ press release headline “Researchers overturn landmark study on the replicability of psychological science” is not just bloviating.

Full disclosure: One of us (Elizabeth) led this replication attempt, and the other of us (Brian) may be the only person on the planet who thinks that military service and honeymoons are the same thing.  Together, we disagree with their characterization of this study.  It is certainly possible that we are just feeling defensive about our work.  So we will describe this original study and replication, and you can assess whether Gilbert and colleagues described the study accurately and came to a reasonable conclusion that the replication was “dramatically different” from the original.

The Original Study

The original study was about what psychological resources increase the chance that perpetrators and victims will reconcile. Across four vignette studies, the original authors found evidence that restoring victims’ sense of status and power and restoring perpetrators’ sense of public moral image increased the likelihood that they were willing to reconcile. These results supported their hypothesis that victims and perpetrators have different psychological needs.  This is an important topic and a creative study, and our single failure to replicate it does not mean that the original finding is wrong.

How did the original authors test this question?  All the original and replication materials are available on the Open Science Framework.  Here is a summary of the key parts.  In the original study, 75 female and 19 male Israeli students (average age 23.5) came to the laboratory, read a short story, and answered questions about the story including what they would do in one character’s position.  The story was about a coworker (the perpetrator) who took credit for another coworker’s work (the victim).  The two coworkers had worked together for several years and recently had been collaborating on a joint project, but shortly before the project was due, the victim had to leave work for a while.  The perpetrator then turned in the joint work, receiving all the credit for it and a promotion, whereas the victim got demoted when s/he returned. The key part of the experiment was whether the participant imagined being the victim or the perpetrator in the story.

Here are the original “victim” stories, altered slightly depending on whether it was a male or female participant.  (The “perpetrator” story is identical except for switching Assaf/Anat and the participant’s role (“you”) in the story.) The materials were translated by the original author from Hebrew to English. Male participants read about Assaf and reserve duty; female participants read about Anat and maternity leave.

Original Study “Victim” condition:

You and Assaf/Anat are working together for several years in a successful advertising company. Recently you have been collaborating on a joint project that included a campaign to a thriving fashion brand.  Just prior to the completion of the project you had to leave for [reserve duty]/[maternity leave] earlier than expected and you asked Assaf/Anat to take care of two tasks that you haven’t managed to complete.  Assaf/Anat has completed the tasks with distinction, and thus, one month after you left for the [reserve duty]/[maternity leave] the boss offered him/her to take over your role and responsibilities and demote you to an underling role in a different department. Although Assaf/Anat knows you might be hurt by it, s/he accepts the promotion including everything that goes with it. When you learn about it you feel betrayed, hurt, and very angry at him/her because s/he agreed to accept your role. You know that now, despite your proven successes, you will be far less confident in job interviews.

After reading this story or the “perpetrator” version, all participants answered survey questions such as how much “I was hurt by Assaf/Anat” and whether “Assaf/Anat perceived me as completely moral.” Then, participants read another short vignette in which Assaf/Anat later provided public feedback that praised either the participant’s professional skills (the condition intended to increase sense of power) or their interpersonal skills (the condition intended to increase public moral image). Participants then reported their willingness to reconcile with Assaf/Anat.

The Replication Study

We conducted the replication study with 144 students at the University of Virginia (82 men, 62 women).  Based on existing theory and evidence, we had no reason to expect cultural differences in whether restoring status and power to victims and restoring moral acceptance to perpetrators increases chances of reconciliation.  However, the scenario text was not perfect for our sample.  In the U.S., reserve duty isn’t common for men, and demotions for maternity leave are illegal.  So, we needed to alter the scenarios to have a reason for being away from work that was relatable.  We chose one that would work for both men and women: a honeymoon. We retained gender matching between the participant and the co-worker, but also altered the names to be more relatable to Americans. We made a couple of additional phrasing edits.  Here is the full text, again just the “victim” condition, with the same revisions made to the “perpetrator” condition:

Replication “Victim” condition:

You have graduated from college and have been working at a successful advertising company for several years. During the last year, you and your colleague, Amy/Andy, have been collaborating on a joint project that included a campaign for a thriving fashion brand. You also are recently engaged to be married, and unfortunately your wedding and honeymoon are scheduled to occur right before the completion of the project. You need to take a leave of absence for this, so you ask Amy/Andy to take care of two tasks that you haven’t managed to complete. Amy/Andy has completed the tasks with distinction. Before you return two weeks later, your boss offered her/him to take over your role and responsibilities and demote you to an underling role in a different department. Although Amy/Andy knows you might be hurt, she/he accepts the promotion, along with all benefits that come with it. When you return you feel betrayed, hurt, and very angry at her/him because she/he agreed to accept your position. You know that now, despite your proven successes, you will be far less confident in job interviews.

The follow-up survey, vignette about public praising, and final questions about possible reconciliation were the same as the original study.  We shared this revised design with the lead author of the original research for feedback.  She asked us to pilot test the materials to make sure our U.S. participants responded to them the same way her Israeli participants responded to the original.  That is, did U.S. participants find it easy to imagine themselves in the situation, how angry and hurt would they feel in this situation, etc.?  We pilot tested the revised scenario and confirmed that it met those criteria.  The original author approved this revision to the design for conducting the replication.  We pre-registered the protocol noting the changes, conducted the replication, and wrote the final report. Unfortunately, as that report notes, the original findings did not replicate:

This study failed to replicate the primary original result of interest.  That is, victims were not more willing to reconcile after receiving the empowerment message compared to the acceptance message, and perpetrators were not more willing to reconcile after receiving the acceptance message compared to the empowerment message.

Gilbert and colleagues’ (mis-)characterization

Gilbert and colleagues asserted that this study illustrated “dramatic differences” between the original and the replication.  They had described the study as being about imagining “the consequences of military service” versus “the consequences of a honeymoon.”  In our opinion, this is quite misleading.  First, the study was about how victims and perpetrators respond when someone else takes credit for their work.  Second, the reason for being away from the office (reserve duty, maternity leave, or honeymoon) was an incidental feature of the scenario to allow someone to take credit for another person’s work.  Third, that incidental feature was held constant across the experimental conditions (victim or perpetrator).  Fourth, 80% of the original participants were women, so their reason for being away was maternity leave, not military service.  Fifth, we conducted manipulation checks to make sure that the scenario had similar effects as the originals before conducting the study.  And, sixth, the lead original author was very generous with time and advice, and ultimately approved the changes – something that Gilbert and colleagues argued was essential for reproducibility.

The Conclusion

We believe that Gilbert, King, Pettigrew, and Wilson’s characterization of this replication and others is unfair.  And, as we pointed out in our published response, among the other five studies that they call out as flawed, some were endorsed by the original authors, and another replicated successfully.  So who is right about the “fidelity” of these replications?  You don’t need to take our word for it, or theirs.  You can review all the protocols, materials, data, and scripts yourself on the Open Science Framework.

We suggest that this was a fair replication.  Simultaneously, our assertion does not mean that the replication overturned the original result.  It is perfectly reasonable to wonder if the effect is, for example, culturally constrained to Israelis or other more collectivistic societies.  It is even reasonable to wonder if it is dependent on the reason the victim is out of the office.  The fact that new interpretations are generated after the fact is not a threat to science; it is the engine of discovery.

When a replication produces a different result than an original study it may challenge our existing understanding of the phenomenon.  As we said in our original article:

It is also too easy to conclude that a failure to replicate a result means that the original evidence was a false positive. Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect. We conducted replications designed to minimize a priori reasons to expect a different result by using original materials, engaging original authors for review of the designs, and conducting internal reviews. Nonetheless, unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes.

We can use the different result to update our beliefs or try to preserve our pre-existing beliefs by identifying methodological differences that could explain the difference. As long as we recognize that such exploration is a means of generating hypotheses, then we don’t risk overconfidence and premature conclusions.

The Reproducibility Project: Psychology offered descriptive evidence of challenges for reproducibility.  We and our 268 co-authors recognized that our evidence was not sufficient to draw strong conclusions about why that reproducibility rate was observed.  To emphasize those limitations our original article’s conclusion opened thusly:

After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. Humans desire certainty, and science infrequently provides it. As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation. The original studies examined here offered tentative evidence; the replications we conducted offered additional, confirmatory evidence. In some cases, the replications increase confidence in the reliability of the original results; in other cases, the replications suggest that more investigation is needed to establish the validity of the original findings. Scientific progress is a cumulative process of uncertainty reduction that can only succeed if science itself remains the greatest skeptic of its explanatory claims.

Gilbert, King, Pettigrew, and Wilson’s error was not that they offered an optimistic interpretation of reproducibility from the Reproducibility Project: Psychology data.  Their error was mistaking their hypothesis that was generated from the data for a conclusion that was demanded by the data.  Accurate calibration of claims with evidence is what makes science special, and credible, as we all continue stumbling toward understanding.

Written by Alison McCook

March 7th, 2016 at 2:05 pm

Posted in not reproducible

Comments
  • Benson Honig March 7, 2016 at 2:20 pm

    I find it very distressing, perhaps even opportunistic, that serious scholars would levy these sorts of accusations without understanding, or explaining, the context. It really smacks of marketing. They got an “A” hit for their efforts. Yea! But why whitewash someone else’s work? At a minimum, they should have given the original authors a chance to verify and interpret, much like was done with the first set of replications. It is ironic that they failed to do what they called others to do.

  • Marcel van Assen March 7, 2016 at 2:27 pm

    Mmm, interesting!
    In my opinion, it was a conceptual and not a direct replication.
    Note, however, and I believe this is important; in principle, a conceptual replication may elicit a smaller, the same, or a LARGER effect.
    Of the 100 replications only very few (17) effect sizes were larger than the original effect sizes.
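
The asymmetry described in the comment above can be made concrete with a simple sign test. The sketch below is only an illustration, not an analysis from the Reproducibility Project or from the commenter. It assumes 100 comparable, independent original/replication pairs and uses the count of 17 quoted in the comment; the function name `binomial_cdf` and the constants are hypothetical. It asks how likely it would be to see 17 or fewer larger replication effects if larger and smaller outcomes were equally probable.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Assumption for illustration: 100 comparable original/replication pairs,
# 17 of which showed a larger effect in the replication (count taken from
# the comment above, not re-derived from the project data).
N_PAIRS = 100
N_LARGER = 17

# Probability of 17 or fewer "larger" outcomes if, in every pair, a larger
# and a smaller replication effect were equally likely.
p_at_most_17 = binomial_cdf(N_LARGER, N_PAIRS)
print(f"P(X <= {N_LARGER} | n = {N_PAIRS}, p = 0.5) = {p_at_most_17:.2e}")
```

Under these assumptions the probability is vanishingly small, which is the sense in which context changes alone, if they pushed effects up as often as down, would be a poor account of the pattern.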

    • Neuroskeptic March 7, 2016 at 3:16 pm

      I see your point, but doesn’t that mean that it would be impossible to conduct a direct replication of this study in any country where men are never called away on military reserve service… or perhaps, only in Israel?

      • Marcel van Assen March 7, 2016 at 4:49 pm

        Yes, those are the implications. No problem, if you ask me. One can conduct direct replications in Israel, and/or highly or less related conceptual replications. This is all relevant for the phenomenon to be examined.

        And, Gilbert et al or anyone else cannot get away with ‘of course it does not replicate, because it is no direct replication’, for two reasons:
        (I) if context matters, some changes of context may also yield stronger effects than the original finding
        (II) if the effect only holds in the very specific context it was originally studied, what was the mechanism that generated the finding?
        Theories are also about stipulating the conditions under which the effect can be found. If the theory is not clear about these conditions, I do not regard such a theory highly.

    • Andrew Gelman March 7, 2016 at 4:47 pm

      Marcel:

      Yes, this is a conceptual replication. But as Nosek et al. pointed out in their recent article, _all_ replications are conceptual. There is no such thing as an exact replication. Conditions will always differ.

      • Boris Barbour March 7, 2016 at 5:19 pm

        That’s simply reducing the notion of a conceptual replication to being completely useless. Even if it’s not a binary issue, the idea does have some value, surely? The more different the conditions, the more conceptual the replication?

  • Susa March 7, 2016 at 2:51 pm

    Thank you for taking the time and making this so clear. Very helpful to put this specific critique into perspective! The line between a conceptual and a direct replication seems to be quite fine in an intercultural context, since adjustments always have to be made, because some phrases might not exist in one language and specific social constructs may not be present in a specific culture.

  • Andrew Gelman March 7, 2016 at 4:57 pm

    Brian and Elizabeth:

    Well put. I basically agree with everything you wrote. Gilbert et al. really seem to have overreached; I’m glad you went to the trouble of explaining what was going on in this example.

    Just one thing. You wrote, “This is an important topic and a creative study, and our single failure to replicate it does not mean that the original finding is wrong.” Perhaps worth emphasizing that the original finding might also be in the wrong direction. If researchers are obtaining statistically significant conclusions by capitalizing on noise (perhaps inadvertently so), then they could well be getting the direction of effects wrong too. This is something we worry about in medical research, but, to the extent that this is an important topic, it should be a concern in psychology research too.

    To put it another way: the debate is sometimes framed as, some people say there’s no effect, others say there is an effect. It’s good to remember the third possibility which is that the published effect is in the wrong direction. So there is a potential cost in believing a conclusion that was constructed from noise.
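
The "third possibility" in the comment above can be illustrated with a short calculation. This is a minimal sketch under stated assumptions, not something from the comment or the original article: the estimate is taken to be normally distributed around a positive true effect, a two-sided test at alpha = 0.05 is used, and the function name `sign_error_rate` and the example numbers are hypothetical. It computes how often a statistically significant estimate points in the wrong direction.

```python
from statistics import NormalDist


def sign_error_rate(true_effect: float, std_error: float, alpha: float = 0.05):
    """Return (power, share of significant results with the wrong sign).

    Assumes true_effect > 0 and an estimate ~ Normal(true_effect, std_error),
    tested two-sided at level alpha.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)      # critical value, about 1.96
    shift = true_effect / std_error      # true effect in standard-error units
    p_right = 1 - z.cdf(crit - shift)    # significant and positive
    p_wrong = z.cdf(-crit - shift)       # significant but negative
    power = p_right + p_wrong
    return power, p_wrong / power


# Hypothetical noisy study: true effect 0.1, standard error 0.5.
noisy_power, noisy_wrong = sign_error_rate(0.1, 0.5)
# Hypothetical precise study: true effect 0.5, standard error 0.1.
precise_power, precise_wrong = sign_error_rate(0.5, 0.1)

print(f"noisy:   power={noisy_power:.3f}, wrong-sign share={noisy_wrong:.3f}")
print(f"precise: power={precise_power:.3f}, wrong-sign share={precise_wrong:.2e}")
```

In the noisy hypothetical case, more than a quarter of the significant results have the wrong sign, while in the precise case the share is negligible. The specific numbers depend entirely on the assumed effect and standard error.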

  • Dr. R March 7, 2016 at 5:35 pm

    The RP:P picked 100 studies published in 2008 in three psych journals.
    They replicated the studies as closely as possible.
    The original studies reported 97% significant results.
    The replication studies reported 35% significant results.
    What explains this difference?

    The answer has been well known since 1959. Sterling (1959, 1995) found success rates over 90% in psychology journals. Thus, there is no sampling bias in the OSF studies. This is the phenomenal success rate in psych journals.

    What is the true probability of obtaining a significant result and what is the true probability of a successful replication? Cohen (1962) estimated that power, when an effect is present, is about 50%, a number that was later replicated (Sedlmeier & Gigerenzer, 1989). Thus, we would expect only 50% significant results in replications if power remained the same.

    A new method that corrects for publication bias in published significant results produced an estimate of 54% power.

    https://replicationindex.wordpress.com/2016/02/26/reported-success-rates-actual-success-rates-and-publication-bias-in-psychology-honoring-sterling-et-al-1995/

    Neither Gilbert nor Nosek mention this important line of research on the replicability of psychological findings, which leads to the heated but pointless debate about the quality of the replication studies.

    Even in the best-case scenario of exact replication studies, the expected success rate is at best 50%. Some less than perfect replications may explain the discrepancy between the actual rate of 35% and the expected rate of 50%, but bad replication studies do not explain why the success rate dropped from 97% to 35%.

    Moreover, things are even worse for Gilbert’s area of research, experimental social psychology (ESP). The success rate for ESP studies was only 25%. Only 50% of ESP researchers approved the replication protocol. However, when I limited the analysis to approved protocols the success rate remained at 25%. Thus, there is no basis for the claim that problems with the replication studies explain the low success rate in the replication studies.

    Any further discussion of this issue that does not explicitly address the problem of publication bias is a waste of time and does not advance the replicability debate.

    For further readings on statistical approaches to the estimation of replicability, please visit my website https://replicationindex.wordpress.com/
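
The publication-bias argument in the comment above can be sketched as a small simulation. It is an illustration under simplifying assumptions, not the commenter's method and not the bias-correction technique he links to: every study has a real effect and exactly 50% power, only significant originals are published, and each published study gets one exact replication with the same power. The function name `simulate_selection` and all parameters are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_selection(n_studies: int = 20000, ncp: float = 1.96,
                       alpha: float = 0.05, seed: int = 1):
    """Return (share of originals published, replication success rate).

    Each study's z statistic is drawn from Normal(ncp, 1). With ncp equal to
    the critical value, power is about 50%. Only significant originals are
    "published"; each is then replicated once with identical power.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    published = 0
    replicated = 0
    for _ in range(n_studies):
        z_orig = rng.gauss(ncp, 1.0)
        if abs(z_orig) < crit:
            continue  # nonsignificant original: never published
        published += 1
        z_rep = rng.gauss(ncp, 1.0)
        same_sign = (z_rep > 0) == (z_orig > 0)
        if abs(z_rep) >= crit and same_sign:
            replicated += 1
    return published / n_studies, replicated / published


share_published, replication_rate = simulate_selection()
print(f"originals reaching print:   {share_published:.1%}")
print(f"published originals that are significant: 100.0% (by construction)")
print(f"replications significant:   {replication_rate:.1%}")
```

In this toy setup the published record shows a 100% success rate while replications succeed about half the time, even though every replication is exact. Real literatures mix true and null effects and varying power, so the sketch shows only the direction of the argument, not its magnitude.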

  • Steven McKinney March 7, 2016 at 9:51 pm

    Potti et al. made the same kind of misguided argument a decade ago, when Baggerly and Coombes were unable to replicate the Duke genomics outcomes and described their failure in a letter to the Journal of Clinical Oncology. The Duke team replied:

    “IN REPLY: To ‘reproduce’ means to repeat, following the methods outlined in an original report. In their correspondence, Baggerly et al conclude that they are unable to reproduce the results reported in our study that aimed to use gene expression data to predict platinum response in patients with ovarian cancer, coupled with pathway analyses to reveal other therapeutic opportunities. This is an erroneous claim since in fact they did not repeat our methods.”
    DOI: 10.1200/JCO.2007.15.3676

    If an exact repeat under overly-precise conditions is required, is it even still science? Science is about describing general phenomena, that occur in general settings. A type IA supernova is a type IA supernova, whether it takes place in the Milky Way galaxy (anywhere) or the Andromeda galaxy or any other galaxy. One volt of electricity is one volt, whether it is measured in China or Canada.

    If a social psychology study describes an event that only happens in Israel, involving military personnel who work only in advertising companies on campaigns involving thriving fashion brands, is it still science?

    This type of insistence on overly-exact replication conditions is generally a good sign that an anecdote is at play, not a generalizable scientific finding.

    Many thanks to Brian Nosek and Elizabeth Gilbert for taking the time to explain their entirely reasonable attempt to recapitulate an experiment in a general fashion that maintains scientific integrity.

  • anonymous March 8, 2016 at 2:01 am

    Steven McKinney

    If an exact repeat under overly-precise conditions is required, is it even still science? Science is about describing general phenomena, that occur in general settings. A type IA supernova is a type IA supernova, whether it takes place in the Milky Way galaxy (anywhere) or the Andromeda galaxy or any other galaxy. One volt of electricity is one volt, whether it is measured in China or Canada.
    If a social psychology study describes an event that only happens in Israel, involving military personnel who work only in advertising companies on campaigns involving thriving fashion brands, is it still science?

    If a phenomenon can be observed in a specific setting only, it is still general IN that specific setting, e.g. observable in ALL Israeli military personnel working in advertising companies on campaigns involving thriving fashion brands. So your generality requirement would still be satisfied even if specific conditions have to be met.

  • Andrew Gelman March 8, 2016 at 8:17 am

    Out of curiosity I found the original article on the web (http://socsci.tau.ac.il/psy/images/stories/staff-academic/NuritS/5.pdf).

    Here’s the abstract, in its entirety:

    “The authors propose that conflict threatens different psychological resources of victims and perpetrators and that these threats contribute to the maintenance of conflict (A. Nadler, 2002; A. Nadler & I. Liviatan, 2004; A. Nadler & N. Shnabel, in press). On the basis of this general proposition, the authors developed a needs-based model of reconciliation that posits that being a victim is associated with a threat to one’s status and power, whereas being a perpetrator threatens one’s image as moral and socially acceptable. To counter these threats, victims must restore their sense of power, whereas perpetrators must restore their public moral image. A social exchange interaction in which these threats are removed should enhance the parties’ willingness to reconcile. The results of 4 studies on interpersonal reconciliation support these hypotheses. Applied and theoretical implications of this model are discussed.”

    Neither military service nor maternity leave are mentioned in the abstract. So, at the very least, the general claims that someone might draw from reading the abstract of the paper did not replicate.

    Nor were the terms “military service” and “maternity leave” mentioned in the discussion section of the paper. Indeed, the _only_ place these terms came up was here:

    “They were asked to read a short vignette about an employee in an advertising company who was absent from work for 2 weeks due to maternity leave (for women) or military reserve duty (for men)—the most common reasons for extended work absences in Israeli society.”

    This is not a criticism of Shnabel and Nadler: any experiment is full of conditions that are required to get the experiment going, but don’t seem essential. It just illustrates the usual story that an experiment done under very specific conditions for a very specific group of people is used to make general claims about human nature.

  • Niels March 8, 2016 at 11:50 am

    I too find this discussion about replication details beside the point. The main issue here is that scientists should focus on getting results that replicate, not on getting results that have p < .05. That is the main issue and a far more important debate to have.

  • K.A. March 8, 2016 at 1:01 pm

    I am a bit confused here. Isn’t Nosek the guy behind the OSF project, which aims to build a framework for more reproducible science? Wasn’t it strange to anyone, when the study initially came out, that the conclusions of their work greatly benefited the OSF project? What I mean is, if the study had shown research to be reproducible, then the whole argumentation for the OSF project would have been weaker. But since it showed the research to be irreproducible, that was in my mind a great argument in favor of the OSF project. Does anyone else see a conflict of interest here, or is it just me?

    • Susa March 8, 2016 at 2:13 pm

      For many researchers it seems to be quite hard to store materials or data. The OSC developed an easy to use FREE tool for the community to do exactly that. It’s a great platform to organize your workflow within your lab, no need for low reproducibility to make this helpful.

      For the discussion about openness and transparency in the community the results of the study helped to make the problems easier to understand for a broader audience that did not think in detail about the already discussed problems of publication bias, QRP, underpowered studies and their interaction … So in some sense it helped the discussion to be more focused at finding new solutions for old problems.

      Can’t see the conflict of interest for us 270 involved researchers of the RP:P. We are users of a platform, which you could be too – maybe you want to check it out for your work http://www.osf.io.

      • K.A. March 8, 2016 at 3:31 pm

        Oh… I am very familiar with the platform. I think it does solve some problems of the research workflow, but by far not all. There are more advanced concepts out there. But the platform itself or its usefulness are not the issue here. The issue is whether anybody questioned the motivation behind the original work at the time it was published. I found it weird, to put it mildly, but if nobody else sees a conflict of interest here, then I guess I am the only one.

        • Susa March 8, 2016 at 3:52 pm

          Oh great — can you point me to these concepts?

          Btw. the developers in the center are really open for feedback, if you have ideas you should let them know via support@osf.io.

          Well, my motivation as one of the replicators was to contribute to the public good of reproducible science, and I thought one good step toward that was to get an idea of how big the problem actually is. Until the RP:P, I talked about the problems of replicating only in the breaks at conferences or with close colleagues. Corrections for publication bias were often disregarded as a statistical toy, because multiple different methods were used and were not well understood. So I thought this selective sample was not good enough to get an idea of the extent of the problem, and I put my time and effort into this collective effort. An additional motivation for me was to see how hard it is to actually redo another researcher’s study, because that helps a lot in writing up your own work much more clearly. I don’t know about every single person within this group, but that might give you an idea.

    • Brian Nosek March 9, 2016 at 6:34 am

      Greetings K.A. –

      A journalist asked me the same question. This was my response to her:

      “Yes, it is absolutely the case that my research practices and interpretation are likely to be influenced by my preconceptions, assumptions, and biases. That must be so because I am human. Psychology has demonstrated how we humans are likely to use motivated reasoning in order to shape evidence to conform to the conclusions that we want rather than the conclusions that are correct.

      Science offers some good tools to try to mitigate these biases. For example, one tool is transparency. If others can observe how I made my claims, then there is more opportunity to identify potential biases in the methodology, reasoning, and conclusions. That is why we made the entire project public right from the start – all protocols, methods, data, analysis scripts, etc. are available for review and critique on the Open Science Framework. Another tool is preregistration. Even if I desire not to be biased, once I observe the data, if I have multiple ways that I could analyze, and choices about what I should report, then I am more likely to use motivated reasoning to justify – even unintentionally – reporting the analyses and outcomes that support my point of view, rather than those that counter it. So, what we did in the Reproducibility Project is seek the advice of original authors to maximize the quality of the designs before running them, and then preregister the design and analysis plan in advance. In that way, we put constraint on ourselves to follow the plan we pre-specified and removed the opportunity for flexibility in how we interpreted the data.

      Do these steps guarantee that we removed all biases from how we did the study? Certainly not. What we can do is make good-faith efforts to minimize those biases, and show what we did. Ultimately, confidence in findings requires independent verification. Others have to repeat our studies to see if they observe the same results as we did. With independent verification, the various individual biases will fade in their potential for influencing the results.”

      • K.A. March 9, 2016 at 9:43 am

        Hi Brian. I am not sure I understand your response. “Psychology has demonstrated how we humans are likely to use motivated reasoning in order to shape evidence to conform to the conclusions that we want rather than the conclusions that are correct.” Aren’t scientists humans, educated and trained to do exactly the opposite? To form conclusions based on evidence?
        Am I getting your argument right, in that you developed OSF as a tool to help bad scientists do their research a little less bad?

        • Brian Nosek March 9, 2016 at 10:46 am

          Yes, scientific training aims to increase skills for objective interpretation of evidence. But, training and intention to be objective is not sufficient to overcome reasoning biases – particularly when we are not even aware of them. A wonderful introduction to this topic is a classic paper by the late Ziva Kunda: http://www.arts.uwaterloo.ca/~pthagard/ziva/psychbul1990.pdf

          Transparency, preregistration, replication are all tools to help mitigate the effects of reasoning biases that affect my judgment and reduce my objectivity.

          • K.A. March 9, 2016 at 2:06 pm

            So if I am motivated to be a successful scientist and execute my experiments in a way that will result in a high impact paper, but may not be legitimate or objective or correct, then this can be justified by motivated reasoning? Where is the line between motivated reasoning being behind poorly executed research, as opposed to incompetence or fraud? Does that mean that people accused of misconduct here on this blog have a way of scientifically proving their innocence? Blaming it all on motivated reasoning.

      • Keith O'Rourke March 9, 2016 at 10:58 am

        Agree that there is not much more you can say about this unavoidable issue.

        Some background for those who might be interested http://jama.jamanetwork.com/article.aspx?articleid=182115

    • Mayo March 10, 2016 at 3:20 am

      This was one of the ironies I brought out in first discussing the way, conceivably, non-significance could become the new significance: http://errorstatistics.com/2014/06/30/some-ironies-in-the-replication-crisis-in-social-psychology-1st-installment/
      “Another irony enters: some of the people working on the replication project in social psych are the same people who hypothesize that a large part of the blame for lack of replication may be traced to the reward structure, to incentives to publish surprising and sexy studies, and to an overly flexible methodology opening the door to promiscuous QRPs … Call this the “rewards and flexibility” hypothesis. If the rewards/flex hypothesis is correct, as is quite plausible, then wouldn’t it follow that the same incentives are operative in the new psych replication movement?”

      However, in psych, I think the effect is overdetermined: “the rewards and flexibility” hypothesis is plausible and operative AND many of the studies really and truly don’t replicate (they fail to describe genuine regularities.)

      That said, I think blindness would help. Critical outsiders might also spot some less recognized flaws, e.g., when participants are told the experiment was actually an attempt to replicate such and such, the instructor’s attitude toward the thesis under test may be conveyed; and while students are asked not to share this info with any students who might potentially opt to participate in the study over the months of the replication project, students do talk. Questions about stopping rules for the replications might also arise.

    • Richard Tomsett March 10, 2016 at 11:02 am

      I don’t think there’s a COI: why do the conclusions benefit the OSF project? Experiments should be replicated, and *one* of the advertised benefits of the OSF is that it could help with replication efforts. If the Reproducibility Project had successfully replicated the majority of studies, the utility of the OSF wouldn’t be diminished in any way.

    • Z Basehore October 26, 2016 at 12:52 pm

      An interesting thing to note is that Nosek is also the person behind the Many Labs Project, which I believe to be a superior replication methodology. You can find the info for the MLP at https://osf.io/wx7ck/ The paper that came from this project was published in 2014; the link is here: http://psycnet.apa.org/journals/zsp/45/3/142.pdf, and there’s a nice summary here: http://www.nature.com/news/psychologists-strike-a-blow-for-reproducibility-1.14232 A good blog post discussing this is here: http://www.talyarkoni.org/blog/2013/12/27/what-we-can-and-cant-learn-from-the-many-labs-replication-project/

      Essentially, MLP was far more optimistic about the reproducibility of psychological science. 10 out of 13 studies successfully replicated; possibly 11 out of 13 (there is apparently a lack of clarity in the analyses over the 11th).

      In the 2015 OSC article, the authors characterized their own approach as ‘shallow and broad’ (http://science.sciencemag.org/content/349/6251/aac4716.full), as opposed to ‘narrow and deep.’ I think there’s a place for both, but the media blitz seems to have been much bigger over the OSC article than the MLP article. I also find it interesting to note that in the OSC article, they called the MLP’s ‘narrow and deep’ approach “complementary.”

      I think it should be the other way around–MLP gives us a more useful approach to replication; _the shallow and broad OSC approach_ is the complementary one. I ultimately believe that the ‘narrow and deep’ approach of MLP leads us to draw stronger conclusions: if some effect fails to replicate once, as in OSC, it’s unclear which is right–the original or the replication. But if it fails to replicate after many attempts, then the original study is likely the unusual one.

  • Costa Vakalopoulos March 11, 2016 at 5:58 pm

    Thanks so much for clarifying the study; one can only describe the reply by Gilbert as disingenuous. I agree with some commentators that, to be useful, principles ought to be generalizable; otherwise there is little benefit in a replication that is required to be absolutely faithful to context.
    A couple more points that may not have been the focus: Gilbert et al. appear to identify in their replies with ‘experts’ who should be consulted to ‘endorse’ replication methods. This suggests a pervasive sense of entitlement within certain elite academic circles that has little to do with science itself. Finally, the fact that such polar opposite interpretations of the same statistical inferences can appear in a journal like Science might be regarded as healthy skepticism, but I would argue it is indicative of a true crisis for any number of reasons, both sociological (i.e. vested interests) and relating to the nature of scientific ideas in their own right.

  • Mats Stafseng Einarsen August 12, 2016 at 3:52 pm

    My impression about the exact example is definitely that Gilbert is right that this isn’t a replication. I also buy his argument on more counts and other examples, it has to be said.

    It feels like a separate experiment and the conclusion could be that the initial effect does not hold when the victim is seen as hedonistic rather than duty bound, which is not a crazy hypothesis by any means.

    However, is this a question of opinion? Why not replicate the experiment again, but (assuming US undergrad participants) use jury duty in one condition and honeymoon in another condition, and look for a difference?
