We all know replicability is a problem – consistently, many papers in various fields fail to replicate when put to the test. But instead of testing findings after they’ve gone through the rigorous and laborious process of publication, why not verify them beforehand, so that only replicable findings make their way into the literature? That is the principle behind a recent initiative called The Pipeline Project (covered in The Atlantic today), in which 25 labs checked 10 unpublished studies from the lab of one researcher in social psychology. We spoke with that researcher, Eric Uhlmann (also last author on the paper), and first author Martin Schweinsberg, both based at INSEAD.
Retraction Watch: What made you decide to embark upon this project?
Martin Schweinsberg and Eric Uhlmann: A substantial proportion of published findings have been difficult to replicate in independent laboratories (Klein et al., 2014; Open Science Collaboration, 2015). The idea behind pre-publication independent replications (PPIRs) is to replicate findings in expert laboratories selected by the original authors before the research is submitted for publication (Schooler, 2014). This ensures that published findings are reliable before they are widely disseminated, rather than checking after-the-fact.
RW: You note this was a “crowdsourced” project involving 25 labs around the world. How did you make that work, logistically? What were some of the issues that arose?
MS and EU: We expected to carry out a much smaller-scale project across just a few universities, and were thrilled and delighted when two dozen laboratories decided to join us. It was possible to carry out a crowdsourced project on a limited budget due to the enthusiasm and hard work of our coordination team and many collaborators from around the world. Clear and ongoing communication with our globally distributed teams (weekly Skype calls between coordinators and monthly emails to each research team) was another key to the successful crowdsourced initiative.
RW: In “The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline,” you examined 10 unpublished studies from the lab of last author Uhlmann, all concerning “moral judgment effects” — such as nuances in how people judge someone who tips poorly, or a manager who mistreats his employees. You looked at multiple replication criteria — for instance, for the “bad tipping” study, would people who leave little money be consistently judged more harshly — and by the same amount — if they left the pittance in pennies rather than in bills? You found that 60% of the unpublished studies (including the “bad tipping” study) met all replication criteria — they showed the same findings, and the same size of the effect. Was that a surprising finding?
MS and EU: We were not particularly surprised by the overall replication rate of 60%, but we were very surprised by which particular original findings did and did not replicate. One unanticipated cultural difference was with the bad tipper effect, which replicated consistently in U.S. samples but not outside the United States.
RW: How does your finding of 60% differ from other studies that have reported on rates of replicability? What possible explanations might there be for the differences?
MS and EU: The replication rates across different crowdsourced initiatives are not directly comparable due to differences in sampling procedures and the relatively small number of original studies typically examined. It is noteworthy that our replication rate was moderate despite the best possible conditions: the replications were carried out by highly qualified experts, and the original studies were carried out transparently with all subjects and measures reported (the data and materials are publicly posted online; see https://osf.io/q25xa/). Our results underscore the reality that irreproducible results are a natural part of science, and failed replications should not imply incompetence or bad motives on the part of either original authors or replicators.
RW: All of the studies you tried to replicate came from one lab — could that limit the generalizability of these findings to other areas of social psychology?
MS and EU: Absolutely, and we would not argue that the findings necessarily generalize to other research programs. The purpose of the Pipeline Project was to demonstrate that it is feasible to systematically replicate findings in independent laboratories prior to publication.
RW: You argue that this project offers more “informational value” than previous replication efforts. How so?
MS and EU: We suggest that failed pre-publication independent replications are particularly informative because the original authors select replication labs they consider experts with subject populations they expect to show the effect. This leaves little room to dismiss the replication results after-the-fact as the result of low replicator expertise or differences between the subject populations.
RW: There are already so many pressures for researchers to publish, and publish quickly — to get promotions, grants, etc. What would you say to scientists who would argue that they simply don’t have time to add another task to a paper (replication) before publishing it?
MS and EU: All scientists have limited resources and need to make difficult trade-offs regarding whether to pursue new findings or confirm the reliability of completed studies. As described in more detail below, in a follow-up initiative (the Pipeline Project 2) we will offer to conduct PPIRs of original authors’ findings for them in graduate methods classes (including our own), the idea being to open up pre-publication independent replication to more researchers around the world at no cost to them.
RW: Are there any other barriers to introducing a system of pre-publication replication more widely?
MS and EU: Replicating research is not yet sufficiently rewarded under the current incentive system. For our next initiative (the Pipeline Project 2) we are currently recruiting graduate methods classes to carry out pre-publication independent replications of studies nominated by the original authors. The students, their professors, the original authors, and the project coordinators will then publish a co-authored empirical report of the crowdsourced replication results. We hope that by offering educational and publication opportunities, free course materials, and financial and logistical support we can help make pre-publication independent replication an attractive option for many researchers.
RW: Let’s talk about how this might work more widely, logistically. You mention an “online marketplace” of willing labs in a particular field, who would offer to replicate whichever unpublished finding strikes their interest. What are the major concerns with that system, and how would you address them? For instance, I imagine you’d have to find ways to incentivize labs for taking the time to replicate someone’s research, and prevent researchers from getting “scooped” by disseminating their unpublished findings so widely.
MS and EU: In the JESP paper we speculate that it may be possible to create an online marketplace for laboratories interested in replicating each other’s work. This would allow even studies that require a high level of expertise to be independently replicated before they are submitted for publication. To incentivize this, replicators might be offered the opportunity to submit a registered replication report together with the original paper, with the original paper and replication report appearing together in the journal.
Fear of getting scooped is a disincentive to participate in PPIRs. Our current approach to allaying this concern is to only share the original findings within the authorship team for the PPIR project. In the future it may be possible to establish a registration system such that pre-registering your hypothesis and analysis plan (e.g., in the context of a registered report; Chambers, 2014; Nosek & Lakens, 2014) establishes intellectual priority over a finding.
It is mentioned in this article that, “Our results underscore the reality that irreproducible results are a natural part of science, and failed replications should not imply incompetence or bad motives on the part of either original authors or replicators.” Failed replications also do not mean the initial results were wrong. A failed replication only means that a particular study did not find the same results. It is possible that both are correct interpretations of the data; the discrepancy could simply reflect variability in the phenomenon. Multiple replications across multiple settings can lead to greater depth of understanding (an in-the-moment, dynamic meta-analysis), but we shouldn’t be quick to assume that different results imply that either one is wrong. Refer back to the Retraction Watch article at http://retractionwatch.com/2016/03/07/lets-not-mischaracterize-replication-studies-authors/.
Great point! We note in the JESP paper that one of the main reasons findings may fail to generalize from one lab to another is the heterogeneity of human populations. This is highlighted by the bad tipper effect (described above), which replicated in U.S. samples but not outside the United States. However, we really should have emphasized heterogeneity of populations as an important cause of replication failures in our responses above. In future Pipeline Projects we hope to use pre-publication independent replications to test a priori hypotheses about cultural and subcultural differences in effects.
I think we are also facing the fact that we tend to overgeneralize the results of our studies. A psychological or diet study on US students may be representative of the US population, but maybe not of European, African, or Asian populations.
Agreed — given the variability in replication results across different populations, it makes sense to make more cautious claims regarding how generalizable any particular published effect might be beyond the original population. Notably, it is possible to empirically distinguish whether failures to replicate are due to population differences or to the original study representing a false positive finding. We attempted to do this in the Pipeline Project 1 by replicating each original finding in both new populations and the original population. For example, studies originally done at Northwestern University were re-run a few years later at Northwestern but also at the University of Washington and the Sorbonne. In the Pipeline Project 1, one original finding replicated in U.S. samples but not in non-U.S. samples (reflecting population differences), while two original findings failed to replicate in both the original population and new populations (suggesting those original findings were false positives).
A dose of reality may be needed here. Grant-funding rates are roughly 10% now. In order to replicate every study, the grant-funding rate would have to fall by half, because replication should be done in a different lab. This is neither realistic nor desirable.
Simply asking authors to increase their sample N and to accept P < 0.01 as significant might achieve a similar end for far less cost. In clinical studies, P < 0.001 is preferable, though this may not be practical in softer sciences.
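For concreteness, here is a rough sketch of how the required sample size grows as the significance threshold tightens; the medium effect size (d = 0.5) and 80% power are illustrative assumptions, not figures from the paper, and the calculation uses the standard two-sample normal approximation rather than anything specific to the Pipeline Project.

```python
# Rough illustration (illustrative assumptions, not figures from the paper):
# approximate participants per group for a two-sided, two-sample comparison,
# using the standard normal approximation to the power calculation.
from scipy.stats import norm

def n_per_group(d, alpha, power=0.80):
    """Approximate per-group n to detect effect size d in a two-sided two-sample test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # quantile corresponding to the desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: ~{n_per_group(0.5, alpha):.0f} participants per group")

# Roughly 63, 93, and 137 participants per group, respectively -- a stricter
# threshold makes each study larger, but far cheaper than re-running it in a
# second, independent laboratory.
```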
The issue of resource constraints is an important one. A few thoughts:
1. Irreproducible research costs an estimated $28 billion a year in U.S. biomedicine alone (http://www.nature.com/news/irreproducible-biology-research-costs-put-at-28-billion-per-year-1.17711). Reallocating some research funds and energy to replication would likely improve the ratio between costs and scientific truth value produced.
2. Running a larger sample and using a more stringent significance cutoff would be a more cost-effective alternative to PPIR in biomedical and vision research, to give a couple of examples. But for studies involving social processes that vary across human populations (for example, practically all social psychology and organizational psychology findings), there is real value added in running the study in multiple laboratories with different kinds of populations to test how generalizable the results are.
3. In the Pipeline Project 2, we hope to reduce the financial costs to original authors to zero by replicating their unpublished research for them in graduate and undergraduate methods courses. This approach is obviously best suited to low cost studies that do not require high levels of expertise to carry out.
A few chemistry journals, such as Organic Syntheses and Inorganic Syntheses, are devoted (the right word) to publishing more reliable methods to make molecules. Part of the review process is that another lab replicates the entire synthesis to check that it works. If there are problems with the replication that cannot be worked out, the synthesis won’t make it into the journal.
This sounds like a great initiative! The American Journal of Political Science is doing something similar by having an in-house statistician reproduce the findings of accepted papers before they are published. For more details see this link:
https://politicalsciencereplication.wordpress.com/2015/05/04/leading-journal-verifies-articles-before-publication-so-far-all-replications-failed/
At the least, it should be mentioned when an attempt to replicate a study was unsuccessful.
Thanks for your interest! Which specific studies worked and which did not is summarized in Table 1 of the Pipeline Project 1 report; please see page 6 here:
http://home.uchicago.edu/davetannenbaum/documents/pipeline%20project.pdf
Agree with both Amy and Grant here, and I still think this is a good idea, for the social sciences at least, for the following reason: a lot of research is done that isn’t interesting or groundbreaking and eventually heads nowhere. If different groups could consult on which areas or questions are most important or most interesting to pursue, multiple groups could “duplicate” one another’s work without fearing the wrath of senators asking why we are studying this.
Anything that improves the direction(s) different groups are heading in will improve the standing of research, its reputation, and its perceived reliability, and will increase the amount of money that government spends on research, thus improving both the rate of funding and the total funding for the field.
Thanks for your positive feedback! Crowdsourcing original data collections is a great idea (one also proposed recently by Klein et al., 2014), and something we intend to pursue in a planned project.