While the presence of publication bias – the selective publishing of positive studies – in science is well known, debate continues about how extensive such bias truly is and the best way to identify it.
The most recent entrant in the debate is a paper by Robbie van Aert and co-authors, titled “Publication bias examined in meta-analyses from psychology and medicine: A meta-meta-analysis,” published in PLoS ONE. Van Aert, a postdoc at the Meta-Research Center in the Department of Methodology and Statistics at Tilburg University, Netherlands, has been involved in the Open Science Collaboration’s psychology reproducibility project but has now turned his attention to understanding the extent of publication bias in the literature.
Using a sample of studies of psychology and medicine, the new “meta-meta-analysis” diverges from “previous research showing rather strong indications for publication bias” and instead suggests “only weak evidence for the prevalence of publication bias.” The analysis found mild publication bias influences psychology and medicine similarly.
Retraction Watch asked van Aert about his study’s findings. His answers have been lightly edited for clarity and length.
RW: How much are empiric analyses of publication bias influenced by the methods used? Based on your work, do you believe there is a preferred method to look at bias?
Empiric analyses of publication bias are definitely influenced by the selected methods. The methods use different procedures for testing for publication bias or correcting effect size for it, so different methods yield different results depending on the characteristics of the primary studies in a meta-analysis. Research on the statistical properties of these methods, for instance in Monte-Carlo simulation studies, has also shown this. This is also the reason why we, together with others, recommend in our paper not only to routinely assess publication bias in every meta-analysis, but also to apply multiple publication bias methods.
I would argue that there is currently no preferred method to look at publication bias; no method outperforms all the other available methods in all possible situations. There are, however, promising results for selection model approaches that correct effect size for publication bias. These methods are nowadays seen as the state of the art. We also recently developed the p-uniform* method, which is also a selection model approach and has shown promise in correcting effect size for publication bias (preprint: https://osf.io/preprints/metaarxiv/zqjr9/).
RW: Could you explain why you analyzed only meta-analyses with homogeneous effect sizes? Do you think this choice significantly limits the generalizability of the study?
We decided to analyze only homogeneous subsets of effect sizes because the vast majority of publication bias methods do not perform well if the true effect size is heterogeneous. At the time we preregistered this study, no regularly used publication bias method could deal with heterogeneity in true effect size. If we had not created homogeneous subsets, this could have resulted in completely different and unwarranted conclusions. For example, statistical significance of some of the included tests is interpreted as evidence for publication bias in a meta-analysis, whereas the significance of such a test can also be caused by heterogeneity in true effect size. Moreover, publication bias methods that correct effect size for publication bias generally overestimate effect size if the true effect size is heterogeneous.
Yes, this choice definitely limits the generalizability of our paper because our results can only be generalized to subsets of effect sizes without evidence for medium or higher heterogeneity, as we also state in our paper. However, the recent attention to selection model approaches and the development of the p-uniform* method (see above) provide a great opportunity for future research, because these methods can actually deal with heterogeneity in true effect size. We are planning to reanalyze the large-scale dataset of meta-analyses by applying these methods to study the presence and severity of publication bias in both the homogeneous and heterogeneous subsets.
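To make the idea of a "homogeneous subset" concrete, here is a minimal sketch of one common way to quantify heterogeneity among effect sizes: Cochran's Q and the I² statistic. This is an illustration only, not the procedure from the paper. The function names, the example numbers, and the 50% cutoff are all assumptions chosen for the example.

```python
import numpy as np


def i_squared(y, se):
    """Return I^2 (in percent) for effect sizes y with standard errors se.

    Uses the fixed-effect (inverse-variance) pooled estimate and
    Cochran's Q: I^2 = max(0, (Q - df) / Q) * 100.
    """
    y = np.asarray(y, dtype=float)
    se = np.asarray(se, dtype=float)
    w = 1.0 / se**2                        # inverse-variance weights
    pooled = np.sum(w * y) / np.sum(w)     # fixed-effect pooled estimate
    q = np.sum(w * (y - pooled) ** 2)      # Cochran's Q
    df = len(y) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def looks_homogeneous(y, se, cutoff=50.0):
    """Hypothetical screening rule: treat I^2 below `cutoff` as homogeneous."""
    return i_squared(y, se) < cutoff


if __name__ == "__main__":
    # Made-up effect sizes and standard errors, for illustration only.
    effects = [0.21, 0.18, 0.25, 0.30]
    std_errors = [0.10, 0.12, 0.09, 0.15]
    print(f"I^2 = {i_squared(effects, std_errors):.1f}%")
    print("homogeneous:", looks_homogeneous(effects, std_errors))
```

With only a handful of studies, I² is itself estimated imprecisely, so a screening rule like this is a rough filter and does not guarantee that the true effect size is homogeneous.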
RW: You wrote “The simulation study showed that the publication bias tests were only reasonably powered to detect extreme publication bias where all statistically nonsignificant effect sizes remain unpublished.” Does this imply that the methods you used may have limited sensitivity in detecting meaningful publication bias?
We conducted a Monte-Carlo simulation study to examine the statistical properties of the publication bias methods under conditions that were as similar as possible to the conditions observed in the analyzed homogeneous subsets. It turned out that the publication bias tests indeed had low statistical power to detect publication bias. One of the reasons for this low statistical power was the small number of primary studies in the homogeneous subsets (the median number of studies was 6). However, the lack of evidence for publication bias may also be caused by publication bias being absent in the homogeneous subsets. An indication of this is the small percentage of statistically significant effect sizes in the homogeneous subsets, which was 28.9% and 18.9% for homogeneous subsets representing meta-analyses from psychology and medicine, respectively. This percentage is expected to be substantially larger if publication bias is severe. Another indication is that overestimation caused by publication bias was hardly present in the homogeneous subsets.
A possible reason for not observing strong evidence for publication bias in our paper is that publication bias is less of an issue if the relationship of interest in a meta-analysis is not the main focus of the primary studies. That is, statistical significance of the main result in a primary study probably determines whether a result gets published, rather than whether a secondary outcome or supplementary result is significant.
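As an illustration of the kind of Monte-Carlo power check described in this answer, here is a minimal sketch. It is not the simulation from the paper. It assumes Egger's regression test as the publication bias test, a true effect of zero, per-group sample sizes between 10 and 99, and a simple selection rule in which nonsignificant results are suppressed with probability `suppress`. All function and parameter names are hypothetical.

```python
import numpy as np
from scipy import stats


def egger_pvalue(y, se):
    """Two-sided p-value for the intercept in Egger's regression.

    Regresses the standardized effect y/se on precision 1/se by OLS;
    an intercept far from zero suggests funnel-plot asymmetry.
    """
    z = y / se
    prec = 1.0 / se
    X = np.column_stack([np.ones_like(prec), prec])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    df = len(y) - 2
    sigma2 = resid @ resid / df
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = beta[0] / np.sqrt(cov[0, 0])
    return 2 * stats.t.sf(abs(t), df)


def simulate_meta(rng, k, theta, suppress):
    """Draw studies until k are 'published'.

    Assumption: a study is always published if it is significant in the
    positive direction (z > 1.96); otherwise it is suppressed with
    probability `suppress` (1.0 = extreme publication bias).
    """
    ys, ses = [], []
    while len(ys) < k:
        n = rng.integers(10, 100)      # per-group sample size (assumed range)
        se = np.sqrt(2.0 / n)          # approximate SE of a standardized mean difference
        y = rng.normal(theta, se)
        significant = y / se > 1.96
        if significant or rng.random() >= suppress:
            ys.append(y)
            ses.append(se)
    return np.array(ys), np.array(ses)


def rejection_rate(k, theta, suppress, reps=500, alpha=0.10, seed=0):
    """Share of simulated meta-analyses in which Egger's test rejects."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        y, se = simulate_meta(rng, k, theta, suppress)
        hits += egger_pvalue(y, se) < alpha
    return hits / reps


if __name__ == "__main__":
    # k=6 mirrors the median subset size mentioned in the interview.
    for suppress in (0.0, 0.5, 1.0):
        rate = rejection_rate(k=6, theta=0.0, suppress=suppress)
        print(f"suppress={suppress:.1f}: rejection rate {rate:.2f}")
```

With `suppress=0.0` the rejection rate estimates the false-positive rate of the test, and with larger values it estimates power under that degree of selection. Changing `k`, `theta`, or the sample-size range shows how strongly the results depend on the assumed conditions.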
RW: What has it been like for you training in science in the era of the so-called “reproducibility crisis”?
During my studies, the Stapel affair took place at my university, resulting in an increased awareness of how research should be properly conducted. This awareness led to more emphasis on how high-quality research should be conducted and helped me to critically evaluate research by others and, even more importantly, my own research. I believe that topics like preregistration and how to conduct a power analysis to determine the required sample size of a study should be part of any program that prepares students for a job in academia. Researchers can, of course, also improve the way they conduct research during their careers, but it would be desirable for this to be part of their training.
RW: Do you think younger scientists are developing new perspectives or methods for reproducibility?
My impression is that younger researchers pay more attention to improving their research practices and using new methods. However, I may well be living in a bubble, collaborating and speaking with like-minded researchers, whereas other (younger) researchers are not really changing the way they are doing research.