
President Trump recently issued an executive order calling for improvement in the reproducibility of scientific research and asking federal agencies to propose how they will make that happen. I imagine that the National Institutes of Health’s response will include replication studies, in which NIH would fund attempts to repeat published experiments from the ground up, to see if they generate consistent results.
Both Robert F. Kennedy Jr., the Secretary of Health and Human Services, and NIH director Jay Bhattacharya have already proposed such studies with the objective of determining which NIH-funded research findings are reliable. The goals are presumably to boost public trust in science, improve health-policy decision making, and prevent wasting additional funds on research that relies on unreliable findings.
As a former biomedical researcher, editor, and publisher, and a current consultant on image data integrity, I would argue that conducting systematic replication studies of pre-clinical research is neither an effective nor an efficient way to identify reliable research. Such studies would be an impractical use of NIH funds, especially in the face of extensive proposed budget cuts.
When I interviewed for my first job in scholarly publishing more than 30 years ago, one of the questions I was asked was, “Are you OK with the fact that half of what you publish will turn out to be wrong?” We didn’t call the pervasiveness of unreliable research a “reproducibility crisis” then. We just accepted as fact that science is an inefficient, iterative process. Researchers have never been under the illusion that everything published is correct. The enterprise of science progresses by means of a messy combination of error-correction, replication, and accumulation of corroborating lines of evidence, the latter of which is vital for any work with the potential to change medical care or health policy.
The “crisis” came to light with the publication of replication studies by authors from Bayer Healthcare in 2011 and Amgen in 2012. Bayer and Amgen reported on their efforts to replicate published pre-clinical research, primarily in the field of cancer biology, before initiating their own drug-development programs based on the work. The companies found that about 66 percent and 89 percent, respectively, of published studies could not be replicated. Those rates were higher than the folklore had led anyone to expect, although a more recent study put the figure closer to the 50 percent mark.
The consequences of irreplicability, in which studies gain attention or even traction despite the inability of others to replicate or meaningfully build on them, are serious, and addressing this issue is worthwhile. Improving replicability could save money, decrease time to discovery (which leads to decreased human suffering and health-care costs), and increase public trust in science.
But there are numerous problems with (and questions raised by) using systematic, post-publication efforts to improve replicability in biomedical research.
Sheer volume: Replicating a small number of studies will not make a dent in the problem when more than 1 million articles per year are published in biomedical journals. Granted, only a small percentage of published studies will interest other researchers enough that they will want to build on them and thus need the work to be replicable. How will those studies be identified for replication? Dr. Bhattacharya has alluded to a selection process “by the scientific community at large”— in other words, a post-publication peer-review process. Does NIH have the capacity to take on another review process? Can the review be done quickly as part of an initiative that is already time-sensitive?
Selection criteria: There is already an NIH replication initiative, under which researchers can elect to have their own studies replicated. This process is likely self-selecting for robust studies, creating a bias toward replicable work. Someone who does careless work is not likely to volunteer to have it replicated. It is notable that uptake of the existing program, which has been in place for a year, has been sparse. Will the replication studies envisioned by RFK Jr. and Dr. Bhattacharya be forced on authors? It may be hard to draw reliable scientific inferences from replication efforts that take place in the absence of author cooperation.
What constitutes replication? The criteria for defining “replication” are unclear. Do they entail remaking every recombinant DNA construct? Rerunning every assay? Resequencing every genome? Remaking every mouse knockout? And how many attempts at replication constitute failure? Failing to replicate once does not mean the work won’t replicate on a second, third, or fourth attempt.
Sufficient expertise: Many types of experiments require substantial expertise and experience, and it may be difficult to find researchers with the appropriate expertise who are willing to function as service providers to replicate studies. This could lead to reliable studies being misclassified as irreplicable.
Dissemination of replication studies: Will all NIH-sponsored replication studies have to be published in the new journal proposed by Dr. Bhattacharya? If not, how will the public be informed of the results of the replication studies? The results of the existing NIH replication initiative are not being made public.
Consequences for the original study: What are the consequences if a study cannot be replicated? Should the original paper be retracted? Should it at least get an “Expression of Concern”? Should the authors be denied funding in the future?
What happens if a study can be replicated? At the very least, the original publication should get some sort of gold-star badge indicating that fact and linking to the replication study, assuming the latter is published somewhere. Either way, the public should be informed.
Timing is everything: If a study is deemed irreplicable, that information will probably arrive too late to be of use to any interested party. By the time NIH has chosen the studies to replicate, and replication efforts have conclusively failed, any pharma or biotech scientist interested in the work will likely have already failed to replicate it and moved on to another approach. The same goes for any academic trying to build on the work. In other words, people will have already spent time figuring out for themselves whatever the replication initiative later reveals. Thus, timing is a fundamental problem with creating a systematic post-publication replication program.
Public trust: If history is any guide, failure to replicate will not serve to correct public misunderstandings of science. For example, the Wakefield study linking the MMR vaccine to autism has failed to replicate many times, yet a substantial portion of the public continues to believe it. How many replications are sufficient to convince people? Perhaps the focus should instead be on public communication around known failures to replicate.
What can be done to improve replicability in pre-clinical research? I believe the focus should be upstream of publication, before irreplicable studies get into the published literature.
There are many stages in the process of scientific research where flaws can be introduced, all of which are targets for intervention: poor study design and execution; poor characterization of materials such as antibodies or cell lines; inadequate description of methods; inappropriate statistical analysis; and selective data presentation. There’s also outright fraud, including the new scourge of wholly fabricated articles for sale by “paper mills,” which is far more prevalent in biomedical research than the old “few bad apples” notion would suggest. And there’s the distorted incentive of publish-or-perish, which can lead people to cut corners deliberately, publishing quickly and frequently without replicating their own work enough times before deciding it is ready for publication.
In my own field of image forensics, I have advocated for more than 20 years for systematic and universal screening of image data before publication. There are now automated ways to perform such screens, which should lower the bar for participation in this type of quality control by all stakeholders, including funders, institutions, and publishers. These algorithms are still far from perfect, but they allow for rapid screening at scale.
There are also automated ways to screen for numerous other indicators of the robustness and reliability of a study — for example, inappropriate statistical analysis, use of generative AI, or evidence of paper mill activity — that should be deployed more widely. Screening, of course, is much easier to apply universally to all articles before publication compared to replication studies after publication.
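To give a flavor of what automated image screening can look like in practice (a deliberately simplified sketch, not any particular commercial tool; the folder name and distance threshold below are hypothetical), a perceptual-hash comparison can flag pairs of figure images that are nearly identical:

```python
# Illustrative sketch only: flag pairs of figure images whose perceptual hashes
# are nearly identical, a crude stand-in for the automated screens described above.
# The directory path and the distance threshold are hypothetical examples.
from itertools import combinations
from pathlib import Path

import imagehash            # pip install imagehash
from PIL import Image       # pip install pillow

FIGURE_DIR = Path("manuscript_figures")   # hypothetical folder of extracted figure panels
MAX_DISTANCE = 5                          # Hamming distance at or below which a pair is flagged

# Compute a 64-bit perceptual hash for each image in the folder.
hashes = {
    path: imagehash.phash(Image.open(path))
    for path in sorted(FIGURE_DIR.glob("*.png"))
}

# Compare every pair of images; near-identical hashes suggest possible duplication.
for (path_a, hash_a), (path_b, hash_b) in combinations(hashes.items(), 2):
    distance = hash_a - hash_b            # Hamming distance between the two hashes
    if distance <= MAX_DISTANCE:
        print(f"Possible duplication: {path_a.name} vs {path_b.name} (distance {distance})")
```

Real forensic screens go well beyond whole-image hashing; they look for partial duplication, rotation, splicing, and other manipulations. But even this toy example illustrates why screening can be applied at scale in a way that wet-lab replication cannot.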
Unfortunately, neither replication efforts nor pre-publication screening of data will change the underlying problem of the publish-or-perish culture. This perverse incentive costs taxpayers billions of dollars a year, funding non-robust science for the personal gain of researchers in desperate need of funding, hiring, or promotion, to the detriment of science’s proper pursuit of truth.
Successfully reforming that incentive structure to reward quality over quantity would motivate researchers to replicate their own work more times before publication. Perhaps a minimum number of replications could be codified and formally paid for as a line item in grant funding. Showing all replications in a published paper and making all of the underlying data available to the public would also help address public trust in science. Funders, institutions, and publishers all have roles to play in addressing this problem and reducing the number of irreplicable studies that are published.
Mike Rossner is the founder of Image Data Integrity, Inc.
Wise words, Mike! When researchers cannot reproduce published findings, it is seldom possible to publish this in the same journal in which the original paper appeared. Publishers should have a policy of publishing papers that challenge major results they have previously published – and they should also put that policy into practice.
These arguments are weak and the author doesn’t offer a better solution.
“In my own field of image forensics, I have advocated for more than 20 years for systematic and universal screening of image data before publication.”
That’s what leads to retractions and the correction of the scientific record. There is some replication-study initiative at present (in addition to the NIH one); replications are performed and results are published, but no retractions have happened. If I am wrong, please correct me.
A bit left field. “Timing is everything.” What about testing centres, where the scientific claims in a paper would have to be replicated BEFORE publication?
There is a famous precedent for this. It took two years for Friedrich Miescher to publish his work on the isolation of DNA, because the discovery was so unlike anything seen at the time that the editor, the well-known chemist Hoppe-Seyler, repeated all of Miescher’s work before publishing it. Editors as active scientists, not secretaries for business models/front organisations.
Science should be the central activity of science publishing. This would have the added bonus of putting a stop to the highly lucrative publishing houses such as Elsevier, Springer Nature, and Wiley, let alone the so-called paper mills. I don’t think any of these publishing houses in fact performs a single experiment, unlike Hoppe-Seyler, who performed many. How would they know if they were being systematically lied to?
Less left field: a national audit/screen of U.S. scientific image data? When the data are either duplicated or manipulated, it means that the authors are not speaking/using the scientific language correctly. That alone is enough not to believe their results. Whether by mistake or because of misconduct is beside the point. Gaspard de Prony’s 1791 work organising the production of logarithmic and trigonometric tables for the French Cadastre (geographic survey) provides a well-documented and well-worn blueprint for how to achieve this.
> The goals are presumably to boost public trust in science, improve health-policy decision making, and prevent wasting additional funds on research that relies on unreliable findings.
This seems rather naive.
> Dr. Bhattacharya has alluded to a selection process “by the scientific community at large”— in other words, a post-publication peer-review process.
In practice, I suspect this means “selection by self-appointed representatives of the scientific community at large”.
Sensible points – I do sometimes wonder whether, rather than funding standalone replication studies, an element of replicating previous studies should be built into all new grants. In other words, if a new project relies on some other recent finding, part of the funding for the new project should go toward replicating the previous work it relies upon.
An easy solution to a significant part of the reproducibility crisis would be to credit only publications in journals where the bar for a positive result is set to a p value lower than 0.05. This could be done in grant evaluation with little further ado, saving huge amounts of the lab work otherwise needed. The present 5% value is an arbitrary number decided a century ago on a whim.
A reduction to a value of 0.02 would remove a mass of false positives. The increased statistical requirement is a factor of a little more than six. Given the irreproducibility rates of 66 or 89 percent quoted above, this seems like a decent number. Problems will remain, but the worst sneeze plots will be weeded out.
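A rough back-of-the-envelope sketch of that point (assuming, purely for illustration, that 10 percent of tested hypotheses are true and that studies have 80 percent power) shows how the false-positive share of “significant” results shrinks when the threshold is tightened:

```python
# Toy calculation: what fraction of "significant" results are false positives
# at two different significance thresholds? The prior (share of tested
# hypotheses that are actually true) and the statistical power are assumed
# values chosen only for illustration.
def false_positive_share(alpha: float, power: float = 0.8, prior_true: float = 0.1) -> float:
    true_positives = power * prior_true          # real effects that are detected
    false_positives = alpha * (1 - prior_true)   # null effects that clear the bar
    return false_positives / (true_positives + false_positives)

for alpha in (0.05, 0.02):
    share = false_positive_share(alpha)
    print(f"alpha = {alpha:.2f}: ~{share:.0%} of significant results are false positives")
# Under these assumptions: roughly 36% at alpha = 0.05 versus roughly 18% at alpha = 0.02.
```

This toy calculation says nothing about the p-hacking, selective reporting, and incentive problems discussed in the post above, which presumably explains why the observed irreproducibility rates are so much higher.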
Problems go much deeper than this. The real issues revolve around default markers of quality (i.e. publishing in an elite journal, regardless of real outcomes); COIs between these journals, a hierarchical model of research institutions, and complicit media; and a lack of democratic structures within those institutions (i.e. tenure for life, a lack of rotating PIs, and self-regulation). The system is so broken that revolutionary change is needed, and defunding is a first step, brutal as it may appear.
I agree with the last sentence.