Approximately six out of 10 economics studies published in the field’s most reputable journals — American Economic Review and the Quarterly Journal of Economics — are replicable, according to a study published today in Science.
The authors re-ran the experiments of 18 papers published between 2011 and 2014 and found that 11 — approximately 61% — lived up to their claims. But the replicated effects averaged only 66% of the size reported in the original studies, which suggests that the authors of the original papers may have overstated the effects they found.
Colin Camerer, a behavioral economist at the California Institute of Technology in Pasadena, who co-authored the study, “Evaluating replicability of laboratory experiments in economics,” told us:
Four clearly failed to replicate, three were near misses (large effects but not highly significant by conventional p-value) and 11 replicated rather well.
As he and his co-authors note in the paper:
…replication in this sample of experiments is generally successful, though there is room for improvement.
Interestingly, before the replication experiments were conducted, the authors set up prediction markets in which peers (other experimental economists) traded on whether each study would replicate. They also surveyed the traders directly about the probability that each result would replicate.
On average, the market prediction of the replication rate was 75.2%, and the survey belief was 71.1%, both of which turned out to be higher than the actual replication rate of 61.1%.
Camerer reflected on the results, adding:
We found that both prediction market prices, and simply asking traders for subjective probabilities of replication, correlated with later replication. This means that after experiments are published, peers actually know which ones are likely to later replicate.
The authors point out, however:
… the correlation does not reach significance for the prediction market beliefs
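For readers unfamiliar with that kind of check, here is a minimal sketch of how a correlation between peer forecasts and binary replication outcomes can be computed. The forecast values and outcomes below are invented for illustration; they are not the study's data.

```python
# Minimal sketch: point-biserial correlation between peer forecasts of
# replication and what actually happened. All numbers here are made up.
import numpy as np
from scipy import stats

# Hypothetical forecasts (probability each study replicates), e.g. market prices
forecast = np.array([0.85, 0.62, 0.71, 0.40, 0.90, 0.55])
# Hypothetical outcomes of the replication attempts (1 = replicated, 0 = did not)
outcome = np.array([1, 1, 0, 0, 1, 1])

# Pearson r against a binary variable is the point-biserial correlation
r, p = stats.pearsonr(forecast, outcome)
print(f"r = {r:.2f}, p = {p:.3f}")
# With only a handful of studies, r can be clearly positive while p still misses
# 0.05, which is the situation the authors describe for the market beliefs.
```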
The study based its methods on a similar recent project that replicated 100 experiments from 98 high-profile psychology papers (two papers were replicated twice each), which found that only 39% of the tested papers replicated the original results.
Camerer described irreproducibility as a “small problem” in economics, but declined to say whether research in the field is more or less replicable than that in other disciplines, given the study’s small sample and limited number of replications (one for each study).
Bob Reed, an economist at the University of Canterbury in Christchurch, New Zealand, who co-founded The Replication Network, a website dedicated to discussing replication studies in economics, told us:
It is a major undertaking to simultaneously run 18 sets of laboratory experiments based on different studies. [But] it is not clear how much one can learn [from] these 18 studies [about] the thousands of other studies that have been published in experimental economics.
One limitation of the paper, said Reed, is that it includes only high-profile papers from the field's top journals, which do not represent the vast body of the economics literature. Another limitation, he added, is that the sampled studies all come from experimental economics, which involves research conducted in a laboratory, usually with human subjects, rather than observations from the real world.
Sarah Necker, an economist at the University of Freiburg in Germany, pointed out another limitation of the study:
At least some of the studies were run in different countries than the original studies. It is well known that behavior has a cultural component, [so] lab experiments conducted in different countries often show (at least small) differences in behavior.
Necker, however, called the study “exceptional,” noting that replication experiments are rare but “extremely important” in economics.
The current findings contradict a 2015 paper that could replicate fewer than half of the economics papers it examined from 13 top journals. From the abstract:
We successfully replicate the key qualitative result of 22 of 67 papers (33%) without contacting the authors. Excluding the 6 papers that use confidential data and the 2 papers that use software we do not possess, we replicate 29 of 59 papers (49%) with assistance from the authors. Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable. We conclude with recommendations on improving replication of economics research.
Another 2015 report, from Econ Journal Watch, which analyzed 162 replication studies published in economics journals between 1977 and 2014, found that:
Across all categories of journals and studies, 127 of 162 (78%) replication studies disconfirm a major finding from the original study. Interpretation of this number is difficult. One cannot assume that the studies treated to replication are a random sample. Also, researchers who confirm the results of original studies may face difficulty in getting their results published since they have nothing ‘new’ to report. On the other hand, journal editors are loath to offend influential researchers or editors at other journals.
Facilitating a suitable environment for replication is important, Camerer said:
Leading journals now require experimenters to archive instructions, original data, and software, but not everyone complies. Software and methods also become obsolete. Many economists could do a better job of designing and documenting what they have done along the way, to facilitate replication much later.
Another vital issue, he said, was to make the gatekeeping process to publication — by journal editors and peer reviewers — more robust. One way to achieve this might be to teach more people the craft of replication, he said. His institution, Caltech, for instance, plans to offer students an “Experimental Replications” course starting next year. Camerer added:
The fact is that a lot of papers that reviewers presumably liked (which got them published), did not replicate strongly in our study. And peers in the markets predicted those which would not replicate strongly. This suggests that the current process might be using too few opinions. It is certainly possible to use a wider range of opinions in the reviewing process, technologically. Doing so might also reduce favoritism if it exists.
A study advertised as at the 95% confidence interval (the standard for all economics journals) that replicates at 1 SD lower (66%) does not, in fact, succeed at a lower level; it fails. So you have to count these not as lower successes, but as failures, it would seem to me. While there is a good argument for abandoning the 95% confidence interval as the only marker of an important discovery or outcome, as long as the standard is in place, a study that over-reports the significance of its own results fails on its own terms. There is no “cutting some slack” here…
I was involved in the psychology project, more particularly, in the analyses of the effect sizes of the 100 psychology studies. After reading the paper on the economics project, I come to the conclusion that the same trends are visible in both fields.
First, the original studies overestimate effect size considerably relative to the replication studies.
Second, findings that tend to replicate (i) have larger sample size, and (ii) are ‘highly statistically significant’ (e.g., p < .001 or p < .01).
Check for yourself: findings with 0.025 < p < 0.05 very seldom replicate. The implication for readers, reviewers, and editors is not to blindly trust findings based on small sample sizes, particularly if p is close to .05. Unfortunately, studies with small sample sizes (e.g., fewer than 100 observations) are still conducted frequently. Studies with large sample sizes have high power, and their findings are more robust (i.e., replicate better). This is a statistical law, and will therefore hold generally, not only in psychology and economics (the simulation sketch after this comment illustrates the point).
Third, ironically, scientists can predict quite well which findings replicate, in both fields.
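As a rough illustration of the point above about p-values and power, here is a minimal simulation sketch. The 50/50 mix of real and null effects, the assumed effect size (Cohen's d = 0.5), and the per-group sample size of 50 are all invented for illustration and are not taken from either replication project.

```python
# Minimal simulation sketch: exact replications of two-group experiments,
# grouped by how significant the original result was. All parameters are
# illustrative assumptions, not values from the economics or psychology projects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50          # per-group sample size (assumed)
runs = 10_000   # number of simulated original studies

def p_value(effect):
    """Run one two-group experiment with the given true effect (Cohen's d)."""
    treated = rng.normal(effect, 1, n)
    control = rng.normal(0, 1, n)
    return stats.ttest_ind(treated, control).pvalue

weak, strong = [], []   # original 0.025 < p < 0.05 vs. original p < 0.01
for _ in range(runs):
    effect = 0.5 if rng.random() < 0.5 else 0.0   # half the effects are real, half null
    p_original = p_value(effect)
    p_replication = p_value(effect)               # an exact, same-sized replication
    if 0.025 < p_original < 0.05:
        weak.append(p_replication < 0.05)
    elif p_original < 0.01:
        strong.append(p_replication < 0.05)

print(f"replication rate when original 0.025 < p < 0.05: {np.mean(weak):.2f}")
print(f"replication rate when original p < 0.01:         {np.mean(strong):.2f}")
# Originals near the .05 threshold are a mix of real effects and lucky nulls,
# so they replicate less often than the highly significant originals.
```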
Thank you for covering replications! It should be noted that a failure to replicate does not mean the original finding was false (nor does a smaller effect mean the original effect was overstated). I really like Uri Simonsohn’s primer on interpreting replications at Data Colada: http://datacolada.org/2016/03/03/47/
I do not like Uri’s primer. He gets lost in statistical details, does not accurately represent RPP psychology, and does not focus on the big picture.
The study is exclusively about experimental economics and in just two journals, only 18 studies tested. The title you chose is way too general.
Your headline and lede imply that this is about top-tier economics papers generally; further on it becomes clear that the study is restricted to experimental economics. Most empirical studies in economics are not experimental.
Your link took me to a Science paper on “Highly stretchable electroluminescent skin” ?!
Fixed, thanks. Here’s the correct link: http://dx.doi.org/10.1126/science.aaf0918