Retraction Watch

Tracking retractions as a window into the scientific process

What does “reproducibility” mean? New paper seeks to standardize the lexicon

What is the difference between “reproducible” and “replicable”? And how does each relate to results that are “generalizable” and “robust”?

Researchers are using these terms interchangeably, creating confusion over what exactly is needed to confirm a scientific result, argues a new paper published today in Science Translational Medicine.

Here’s how the US National Science Foundation (NSF) defines “reproducibility,” according to the authors:

…reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results…. Reproducibility is a minimum necessary condition for a finding to be believable and informative.

And here’s how the NSF defines “replicability,” the authors say:

…the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.

But the authors note that these definitions are not universally used and are in some cases reversed. They add:

If one looks at the terminology being used across the scientific literature, one finds similar variation and intermingling of concepts.

The authors propose splitting the most widely used term — “reproducibility” — into three underlying terms: methods reproducibility, results reproducibility, and inferential reproducibility.

First author Steven Goodman from the Meta-Research Innovation Center at Stanford (METRICS) at Stanford University summarized the three subcategories to us as follows:

We need to clearly distinguish issues related to transparency and complete reporting of methods (methods repro..) vs. production of new evidence (results repro…) and drawing the same conclusions (inferential).

Here’s how the authors define “methods reproducibility”:

Methods reproducibility is meant to capture the original meaning of reproducibility, that is, the ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results.

As Goodman explained:

Ensuring that the original study was performed as described adds no additional evidence because one is not collecting new data, only ensuring that the data and results already described would in theory have been obtained by independent observers if they followed exactly the same procedures as the first experiment…for this kind of reproducibility – which we call methods reproducibility – one does not gather new data, and hence there is no new evidence, although one can enhance the trust one has of the evidence in hand.  One is merely confirming that all the methods used in the study are accurately described, and could in theory be employed again in a new study. Thus, this kind of reproducibility is about completeness and transparency in reporting, including data sharing.

Here is how Goodman and his co-authors (John Ioannidis and Daniele Fanelli, both also at Stanford) define “results reproducibility”:

Results reproducibility refers to what was previously described as “replication,” that is, the production of corroborating results in a new study, having followed the same experimental methods.

As Goodman explained:

Results reproducibility refers to following the same procedures as an original study in a new and independent study, and finding results that are “the same”. We point out in the article how definition of “the same” is neither standardized, nor in many cases possible, so in a sense reproducibility is not a good paradigm to use for results reproducibility.

There has been a lot of confusion around these terms, Goodman noted:

The point is that some people use replication to mean methods reproducibility, and others to be results reproducibility. And the term “reproducibility” by itself is similarly used in both senses by different groups or people.

Finally, they outline the meaning of “inferential reproducibility”:

Inferential reproducibility, not often recognized as a separate concept, is the making of knowledge claims of similar strength from a study replication or reanalysis. This is not identical to results reproducibility, because not all investigators will draw the same conclusions from the same results, or they might make different analytic choices that lead to different inferences from the same data.

Although the authors apply the terms in the biomedical field, they argue that their underlying principles have “utility across many domains of science.”

“Inferential reproducibility,” according to the authors, is under-recognized:

…scientists might draw the same conclusions from different sets of studies and data or could draw different conclusions from the same original data, sometimes even if they agree on the analytical results.

Goodman told us “inferential reproducibility” may be the most important element:

Because what people actually conclude/recommend after a study is often the only thing that is paid attention to.

Ultimately, as he and his colleagues note in the article, inferential reproducibility might be an “unattainable ideal.” Goodman explained:

Science doesn’t always produce 100% agreement about claims based on data; the vigorous debate about any given interpretation is part of science, and what makes it better.

“Robustness” and “generalizability” are two terms that are sometimes used interchangeably with reproducibility, the authors note. But they are different, as Goodman explained to us:

Robustness refers to the sensitivity of results to mild change [in] research design. Generalizability is whether the results apply in non-experimental situations or in persons unlike those in the study.

For more, see today’s column about the vocabulary of reproducibility by co-founders Adam Marcus and Ivan Oransky in STAT.

Written by Dalmeet Singh Chawla

June 1st, 2016 at 2:20 pm

  • Jaime A. Teixeira da Silva June 1, 2016 at 2:58 pm

    A useful and timely paper, especially for resolving some issues in PPPR. The clarity between definitions allows specific problems in papers to be more accurately described in a PPPR report. Although most of the paper describes the redifferentiation of RRR (reproducibility, replicability, reliability), I felt that what differentiates most of the 11 terms used in Table 2 was not described or defined in the text, nor linked to the other parts of the paper or to other aspects of RRR. That leaves the reader no option but to seek out the original sources where these terms were coined in order to understand them better. Despite this, a practically useful paper.
