“The Replication Paradox:” Sans other fixes, replication may cause more harm than good, says new paper

Marcel A. L. M. van Assen

In a paper that might be filed under “careful what you wish for,” a group of psychology researchers is warning that the push to replicate more research — the focus of a lot of attention recently — won’t do enough to improve the scientific literature. And in fact, it could actually worsen some problems — namely, the bias towards positive findings.

Here’s more from “The replication paradox: Combining studies can decrease accuracy of effect size estimates,” by Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts, all of Tilburg University:

Replication is often viewed as the demarcation between science and nonscience. However, contrary to the commonly held view, we show that in the current (selective) publication system replications may increase bias in effect size estimates.

In the study, published in the Review of General Psychology, the authors looked at the effect of replication on the bias towards positive findings, taking into account the additional effects of publication bias — the tendency of journals to favor publishing studies that show a statistically significant effect — and the sample size, or power, of the research.

We analytically show that incorporating the results of published replication studies will in general not lead to less bias in the estimated population effect size. We therefore conclude that mere replication will not solve the problem of overestimation of effect sizes.

In other words, replications, co-author van Assen tells Retraction Watch:

may actually increase bias of effect size estimation. Scientists’ intuitions on the effect of replication on accuracy of effect size estimation are wrong (or weak at best). Replications are beneficial for and improve effect size estimation when these replications have high statistical power.

As long as journals prefer to publish papers that show a positive effect, the authors argue, journals will also prefer to publish replication studies that show a positive effect, regardless of whether it truly exists. Since many scientists believe in the ability of replication studies to reveal the truth about research, some replication studies may actually cloud the truth more than reveal it. Here’s more from the paper:

According to the responses to the questionnaire, most researchers believe that a combination of one large and one small study yield a more accurate estimate than one large study. Again, this intuition is wrong when there is publication bias. Because a small study contains more bias than a large study, the weighted average…of the effect sizes in a large and a small study is more biased than the estimate in a single large study.
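
To make the quoted arithmetic concrete, here is a minimal simulation sketch of that claim. It is not from the paper: the true effect of d = 0.2, the per-group sample sizes of 100 and 20, and the "publish only significant, positive results" selection rule are illustrative assumptions.

```python
# Minimal sketch: how selective publication inflates effect size estimates,
# and why averaging in a small published study can make things worse.
# All numbers below are illustrative assumptions, not values from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

TRUE_D = 0.2                  # assumed true standardized effect size
N_LARGE, N_SMALL = 100, 20    # assumed per-group sample sizes
N_SIM = 10_000                # simulated studies per condition

def published_effects(n_per_group, n_sim):
    """Simulate two-group studies and keep only the 'publishable' ones:
    significant (p < .05) results in the positive direction."""
    kept = []
    for _ in range(n_sim):
        a = rng.normal(TRUE_D, 1.0, n_per_group)  # treatment group
        b = rng.normal(0.0, 1.0, n_per_group)     # control group
        t, p = stats.ttest_ind(a, b)
        if p < 0.05 and t > 0:                    # publication-bias filter
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            kept.append((a.mean() - b.mean()) / pooled_sd)  # Cohen's d
    return np.array(kept)

d_large = published_effects(N_LARGE, N_SIM)
d_small = published_effects(N_SMALL, N_SIM)

# Combine one published large and one published small study,
# weighting by sample size (a rough stand-in for inverse-variance weights).
m = min(len(d_large), len(d_small))
d_combined = (N_LARGE * d_large[:m] + N_SMALL * d_small[:m]) / (N_LARGE + N_SMALL)

print(f"true effect:                 {TRUE_D:.2f}")
print(f"mean published large study:  {d_large.mean():.2f}")
print(f"mean published small study:  {d_small.mean():.2f}")
print(f"mean large + small combined: {d_combined.mean():.2f}")
```

Under these assumptions, the published small studies overestimate the effect far more than the published large studies, so the sample-size-weighted combination lands further from the true value than the large study alone, which is the paradox the quoted passage describes.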

All of that, van Assen says, is an argument against positive publication bias, which he calls

currently one of the largest threats to the validity of scientific findings.

Brian Nosek, of the Center for Open Science, has been active in the push towards reproducibility and replications. He tells us:

The key part of the paper is the answer to the question – what is the culprit?

The answer is publication bias.  Replications will increase precision of estimation as long as publication is not contingent on obtaining significant results in original or replication studies.  But, if original studies are biased by the publication demand to obtain positive results, then including them in aggregation may not improve the accuracy of estimation.

So, it isn’t really a story about replication.  It is a story about how the negative effects of publication bias are hard to undo, even with the presence of some replications.
Much like a bad first date, it is really hard to recover and develop a relationship that you can believe in.

Until the inherent flaws of the publishing system are corrected, scientists should stick to high-powered research, the authors advise:

To solve the problem of overestimated effect sizes, mere replication is not enough. Until there are ways to eliminate publication bias or correct for overestimation because of publication bias, researchers are wise to only incorporate and perform studies with high power, whether they are replications or not.


25 thoughts on "“The Replication Paradox:” Sans other fixes, replication may cause more harm than good, says new paper"

    1. PS: I forgot to add. It’s not so much the bias towards positive findings as it is against negative ones. The problem is two-pronged.

  1. I think this study makes an implicit assumption that I’m not sure is true. It assumes that publication bias is always a bias towards publication of positive findings.
    This is plausible in a general sense, as journals tend to publish “interesting” results and a positive result is generally more interesting.

    But is that true for replications? I think it makes for a far catchier headline to claim that "important study is probably false" than "important study is confirmed". It seems plausible to me that, with replications, publication bias is more likely to favor negative findings. It would be interesting to check whether that is true, and whether there is a significant difference in publication bias between replications and original studies.

    Anyway, the obvious conclusion is still that publication bias should be avoided whenever possible, no matter which direction it leans. One way to do so is preregistration. If you want to improve the scientific record via replications, make sure you have a preregistration system in place for your replications.

    1. Hanno, you make an excellent point: it is indeed the case that journals seem to want to publish the most eye-catching results, and a non-replication of an important finding definitely seems to be worth publishing.

      However, there are two main reasons why this might not be the case. First and foremost, replications often do not identify themselves as such. Archival research in the published literature has shown that almost no replication studies are published. However, if you look a bit further, it turns out that a lot of replication studies are actually done, but they are embedded in multi-study papers, or are conceptual replications that don't have "replication" in their title. Much of the work of Greg Francis shows that multi-study papers (a.k.a. a collection of replications) contain way too many significant findings. Furthermore, research on meta-analyses (a.k.a. another bunch of "conceptual replications") also shows a lot of evidence for publication bias and inflated effect sizes.

      The second reason why replications that don't find the original effect might not be published very often is that it is very easy to come up with reasons why the replication "didn't work". Some (but I fear many) scientists feel that negative results are just not very informative at all (see for instance the discussion concerning Jason Mitchell's blog about the emptiness of failed replications http://blogs.discovermagazine.com/neuroskeptic/2014/07/07/emptiness-failed-replications/#.VZrXYxPtmko).

      So in sum, even though, per scientific principle, a failed replication should in many cases be considered controversial and interesting:
      – very few replication studies that identify themselves as such end up in the literature
      – if they do end up in the literature it is in multi-study papers or meta-analyses, and we have evidence that these are severely biased
      – failed replications are easily ignored

    2. Negative results that contradict prior positive results apparently will be favored, at least in some circumstances. See:
      Ioannidis JP, Trikalinos TA. 2005. Early extreme contradictory estimates may appear in published research: the Proteus phenomenon in molecular genetics research and randomized trials. Journal of Clinical Epidemiology 58(6):543-9.

      1. Yes, I’ve seen that paper, and I think that there might be a difference between fields in the probability that a failed replication will be seen as controversial. In my own field, the social sciences, theories are often multifaceted, and alternative explanations for failures to replicate are relatively easy to come up with. This might be different in epidemiology, in which theories (as far as I know) are more clear cut.

  2. I’m not a scientist, so please correct me if I’m wrong. It seems to me that the description of a positive finding is likely to seem more convincing than the description of a negative funding because a positive finding – “the data support the hypothesis” – implicitly supports the hypothesis and challenges innumerable possible hypotheses. On the other hand, negative results – “the data do not support the hypothesis” – does not imply anything positive. In other words, positive findings feel definitive and error-free, while negative findings feel tenuous because one tiny misstep in the process can produce an incorrect false negative. Science is confirmed in part by discovering and understanding false positives, which can sharpen a hypothesis.

    Whether the above makes any sense, the point that I’m trying to make in this context is that positive and negative findings are not symmetrical; they are different beasts that probably should be dealt with in quite different ways.

  3. Great, all this attention for replication and publication bias. I want to emphasize, however, that Michèle Nuijten is first author of the paper.

    Later on Michèle may react to some comments and questions raised here, concerning the analyses and interpretations of their results.

  4. I recall someone suggesting that the confirmation / disconfirmation biases run in different directions across time. At first, only positive findings of confirmation are publishable. Eventually, though, the craving for headlines makes mere confirmation unpublishable, while findings that disconfirm the received wisdom become newsworthy. So it may be a question of timeframe.

    1. Hi Ed, yes that’s called the Proteus phenomenon (see the discussion under Hanno’s comment above). This is a phenomenon that has been established at least once in epidemiology, but so far, in the social sciences we mainly see evidence for a steady preference for “positive” results (we base this on evidence from multi-study papers and meta-analyses).

  5. So, am I correct in thinking that we need 1) highly powered studies and 2) publication of all results (so both "positive" and "negative")?

    If this is correct, why are journals/scientists not doing this?

    1. That is absolutely right. Even one of the two would greatly improve things. So indeed, why is science not doing this?

      Firstly, running a highly powered study is very costly, and at this point it doesn't have very clear rewards. In a very interesting paper called "The Rules of the Game Called Psychological Science", Marjan Bakker et al. show that if your goal is to obtain a statistically significant result and you have money for 100 participants, you have a higher chance of finding significance if you split this up into 5 small studies of 20 participants. This greatly reduces the power of each individual study, and any significant effect you find is probably a type I error. But in this climate it will probably get you published.
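
A rough worked example of the arithmetic in this comment: this is a back-of-the-envelope sketch, not Bakker et al.'s actual analysis, and the assumed true effect of d = 0.2, the two-group design, and the normal-approximation power formula are illustrative assumptions.

```python
# Back-of-the-envelope power arithmetic: "one study of 100 participants"
# versus "5 studies of 20 participants", two-sided test at alpha = .05.
# The true effect size (d = 0.2) is an illustrative assumption.
from scipy.stats import norm

Z_CRIT = norm.ppf(0.975)  # two-sided 5% critical value, about 1.96

def power_two_group(d, n_per_group):
    """Approximate power of a two-group comparison (normal approximation)."""
    ncp = d * (n_per_group / 2) ** 0.5  # noncentrality for equal group sizes
    return norm.sf(Z_CRIT - ncp) + norm.cdf(-Z_CRIT - ncp)

d = 0.2
p_big = power_two_group(d, 50)            # 100 participants total, 50 per group
p_small = power_two_group(d, 10)          # 20 participants total, 10 per group
p_any_of_5 = 1 - (1 - p_small) ** 5       # >= 1 of 5 small studies significant
p_false_any = 1 - 0.95 ** 5               # >= 1 false positive if the effect is zero

print(f"power, one study of 100 participants:      {p_big:.2f}")
print(f"power, one study of 20 participants:       {p_small:.2f}")
print(f"P(>= 1 of 5 small studies significant):    {p_any_of_5:.2f}")
print(f"P(>= 1 false positive if effect is zero):  {p_false_any:.2f}")
```

With these illustrative numbers, the five small studies together give a better chance of producing at least one significant result than the single, better-powered study, and if the true effect were zero, the chance of at least one false positive across the five studies would be about 23%, well above the nominal 5%.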

      So, secondly, why don't publishers stop focusing so much on statistical significance? It might be a misunderstanding of statistics; several studies show that statistically significant findings are often interpreted as being more interesting and newsworthy than nonsignificant findings. On top of that, authors themselves are also less inclined to even write up and submit nonsignificant findings. Maybe they think they're not interesting, maybe they think the journal will not accept them anyway.

      Luckily, more and more scientists are starting to see the importance of fighting publication bias and encouraging high power, and there are more and more initiatives to support this. There are large-scale replication projects (e.g., https://osf.io/ezcuj/wiki/home/), and journals are starting to offer the possibility to preregister your study: they review your proposal, and if it's good enough, the paper will be published no matter what the results are.

      So in short, running studies with high power has high costs and low rewards, and there seems to be a general view that nonsignificant findings are not interesting. Luckily, things seem to be changing!

      1. So, am I correct in reasoning that, because of low-powered studies and publication bias over the past decades, we simply don't know which findings published in the past 50-100 years are "true", if any?

        I reason that even for findings that have been replicated over the past 50-100 years, there may be large publication bias concerning them, so even replicated findings don't necessarily say much about their validity.

        1. Well, of course that depends on the specific line of research, but assuming that publication bias is there AND all studies have low power, then it’s likely that the effects are highly overestimated, and you would have a hard time estimating which effects are actually “true”.

          However, I don’t think that we have to throw away everything published in the last decades. I think it is safe to assume that the amount of publication bias differs per field/line of research, meaning that also the amount of overestimation will be different per field.

          Furthermore, not every single study of the last decades was underpowered. If a study has high power, it will not be affected by publication bias (or at least much less so), and it will give an essentially unbiased estimate of the effect.

          So in short, how much trust you can place in published scientific findings depends on a lot of things, of which publication bias and power are just two examples.

  6. I’m not warm and fuzzy about the premise that we should be wary of replication efforts, in the context of behavioral science especially – a field that has generated some blockbusters only to fall to the ground in flames later on.

    This reminds me of Meehl's article, "Theory-Testing in Psychology and Physics: A Methodological Paradox," Philosophy of Science, 1967, Vol. 34, 103–115:

    "In the physical sciences, the usual result of an improvement in experimental design, instrumentation, or numerical mass of data, is to increase the difficulty of the "observational hurdle" which the physical theory of interest must successfully surmount; whereas, in psychology and some of the allied behavior sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount."

    I’d like to stick with the alleged dangers of replication.

  7. Again and again and again, the psychological sciences (the social sciences in particular, apparently) fail to confront (or avoid, or simply cannot clearly comprehend) the serious culprit at work in the so-called replication crisis. It is the sad state of theory in the discipline, and the marked tendency toward demonstration-driven rather than theory-driven research, that enables and perpetuates much of the apparent distress voiced by the discipline. Time to take a serious look inward.

  8. Given these results, one is inclined to wonder why methodological requirements are in practice hardly, if at all, imposed on (psychological) research. Why do those requirements not act as thresholds for publication in scientific journals?
    Science should be about how things actually work, and certainly not about what sells well.

    On the method of scientific (statistical) research into human behavior, I refer to the useful little book by H.A.P. Swart, "Over het begrijpen van menselijk gedrag," pp. 68-73. ISBN 90-6009-523-5.

  9. Having reviewed more meta-analytic papers than I care to think about, I feel the entire endeavor is of questionable worth. Time and time again, studies are combined in which the factor(s) under analysis is so amorphously conceptualized that the presumed summary conclusions hold no apparent warrant. I would say it is like comparing apples and oranges, but unlike psychological constructs, we have a good idea what apples and oranges consist of. By contrast (as one example familiar to me), meta-analyses of the self compare studies in which the key player (the self) is so poorly specified that there simply is "no there there". Psychology's key problems are not method or replication; they are conceptual.
