Time for a scientific journal Reproducibility Index

Retraction Watch readers are by now more than likely familiar with the growing concerns over reproducibility in science. In response to issues in fields from cancer research to psychology, scientists have come up with programs such as the Reproducibility Initiative and the Open Science Framework.

These sorts of efforts are important experiments in ensuring that findings are robust. We think there’s another potential way to encourage reproducibility: Giving journals an incentive to publish results that hold up.

As we write in our latest LabTimes column, we have already called for a Transparency Index to supplement the standard “Impact Factor” – a gauge of how often papers are cited by other papers, which journals use to create a hierarchy of prestige.

As we note, the “Transparency Index won’t solve the problem of bad data.” But we’d like to suggest another metric that could help: the Reproducibility Index.

Rather than rate journals on how often their articles are cited by other researchers, let’s grade them on how well those papers pass the most important test of science: does the work stand up to scrutiny?

The idea is to encourage “slow science” and careful peer review, whilst discouraging journals from publishing papers based on flimsy results that are likely to be heavily cited. Like the Transparency Index, the Reproducibility Index could supplement the impact factor. In fact, one way to judge average reproducibility would be to calculate what percentage of a given paper’s citations reflect successful replication versus an inability to reproduce the results.
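As a purely illustrative sketch, such a citation-based index might be computed along these lines; the labels, function names, and toy numbers below are assumptions for the sake of the example, not an established metric.

```python
# Hypothetical sketch: score each paper by the fraction of its "informative"
# citations that report a successful replication, then average over a journal.
from statistics import mean

def paper_reproducibility(citations):
    """citations: list of labels, each 'replicated', 'failed', or 'neutral'."""
    informative = [c for c in citations if c in ("replicated", "failed")]
    if not informative:
        return None  # no citing paper says anything about reproducibility
    return sum(c == "replicated" for c in informative) / len(informative)

def journal_reproducibility_index(papers):
    """papers: dict mapping paper IDs to lists of citation labels."""
    scores = [s for s in (paper_reproducibility(c) for c in papers.values())
              if s is not None]
    return mean(scores) if scores else None

# Toy example with made-up labels:
journal = {
    "paperA": ["replicated", "replicated", "neutral", "replicated", "failed"],  # 3/4
    "paperB": ["neutral", "neutral"],                                           # uninformative
    "paperC": ["failed", "failed", "neutral", "replicated", "failed"],          # 1/4
}
print(journal_reproducibility_index(journal))  # (0.75 + 0.25) / 2 = 0.5
```

The hard part, of course, is assigning those labels in the first place; everything after that is arithmetic.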

As Brenda Maddox, wife of the late Nature editor John Maddox, said of her husband:

Someone once asked him, ‘how much of what you print is wrong?’ referring to Nature. John answered immediately, ‘all of it.  That’s what science is about – new knowledge constantly arriving to correct the old.’

But journals — particularly the high-impact ones — seem to need an incentive to publish such correctives. The Reproducibility Index could “take into account how often journals are willing to publish replications and negative findings,” as Science did in a case we discuss in the new column. What else might it include? As we conclude:

There are, of course, a lot of details to work out and we look forward to help on doing that from readers of Lab Times and Retraction Watch. Isn’t the reproducibility of science worth it?

44 thoughts on “Time for a scientific journal Reproducibility Index”

  1. Perhaps some funding bodies should club together to get a lab and group of scientists to reproduce important experiments from major papers? This “reproducibility institute” (c)(tm) could rank/grade papers and act like an F1000 for the actual experiments, rather than for the papers in which they are presented.

    Doubtless funding bodies will be queuing up to throw money at this 🙂

    1. Reproducibility is only one side of a hexagon in the publishing scene. Reproducibility would take years to achieve, depending on the discipline. It could be one measure to enforce quality control, but it wouldn’t work across the board. For example, it would be rare to find any group willing to repeat 3-5 seasons/years of agronomic experiments. An RI would make sense for fields like organic chemistry, where analyses can be performed and repeated several times fairly quickly to ensure precision and where the equipment can ensure a high level of accuracy. So that makes sense, but the practical issue of WHO would conduct such repetitions remains the biggest stumbling block. There are bigger issues that need to be pursued first, I believe, both related to fraud by scientists and to fraud by publishers as determined by the Predatory Score (http://www.globalsciencebooks.info/JournalsSup/images/2013/AAJPSB_7(SI1)/AAJPSB_7(SI1)21-34o.pdf). Personally, I just see the RI as another toy used by non-academics to add more noise to an already overly noisy situation.

    2. @xtaldave said: “Doubtless funding bodies will be queuing up to throw money at this..”

      In the past, some were: they were known as Pharma shareholders. But their tolerance for research that only yields publications is at an all-time low.

      Consequently, at an individual level, there is little incentive among Pharma scientists to publish dead ends, mainly because in-depth research is abandoned by most organizations once fatal flaws become evident. Furthermore, journal editors aren’t typically disposed to publish incomplete “stories” like those we tend to generate during target validation. So unless one of these studies uses new analytical methods or data analysis strategies that might be of some use to the community, I don’t even bother writing it up.

      Perhaps there needs to be a way to make the publication of pharmacological “dead ends” profitable to individual scientists, and in turn increase the reputational damage of junk science. Maybe a negative impact score tied to both the initial impact factor and the speed of refutation? A bonus for those who quickly overturn “big splash” papers?

      I suspect that the latter would quickly devolve into a mess, but I think it would be useful for the editors of higher impact journals to agree to redefine the “least publishable unit” to allow for shorter papers when they are focused on debunking previous “big splash” reports. I know my CV would thank them.

  2. Reblogged this on lab ant and commented:
    One way to encourage slow science. Remember, it should be about (solid) science, not only about news, especially in fields where people’s lives are at risk.

  3. I think a Reproducibility Index for journals is a great idea Ivan! Have you got any feedback from journals that they would be interested? We have spoken to several journals about independently replicating important studies using the Independent Validation service (www.scienceexchange.com/validation). So far the funding of the validation studies seems to be the main issue preventing adoption.

  4. Great idea, but this is totally unworkable in reality….

    What would be the score/status of a paper in which one figure cannot be reproduced but the others can? It then boils down to the relative importance of the different figures, but who gets to decide which figures are important or not? Ask 20 different people about a paper and they’ll all have different opinions on which pieces of the data are most important. This is highly subjective and not even constant over time… I’ve often realized many years later that an obscure figure in the supplement of one of my papers is far more important than I initially gave it credit for.

    Discounting the huge resources necessary to establish a ranking system for which bits of data matter, and which conclusions do not, the whole issue of reproducibility is a hot potato that no publisher’s legal division will want to touch… I foresee litigation in which an author sues a journal for defamation because they down-ranked a paper based on inability to reproduce, when actually the experiment was perfectly sound and the “reproducers” were just not paying attention to experimental details. Having the occasional letter-to-the-editor discounting a paper’s findings is one thing, but wholesale ranking/metrics on this issue is going to result in a lot of displeased authors, and lawsuits.

    And as others have mentioned, this is likely expensive and not something that can be automated. At the very least it would require journals to pay for Mechanical Turk or other such services to sift through all those citations and decide which ones are affirmative (see above for problems). With consistent 30-40% profit margins, there’s some weight to the argument that they can afford it, but generally speaking large corporations with big margins are not in the business of diverting profit toward increased accountability unless they get hit in the face with a frying pan (vis-à-vis the NIH public access policy).

      1. Agreed. Holy cow, do you know how much time it takes to perform and submit a study? These things can take years and can require the skills of a very experienced postdoc. If the experiments were easy, the whole thing would be a non-issue, but good science, even well-documented good science, is not something that comes out of a cookbook.

        In addition, the idea that journals will somehow willingly accept manuscripts that conflict with established results or confirm what is already known is laughable. Such journals, like the Journal of Neurogenetics, already exist, but nobody reads them. Honestly, would any of you read papers from such a journal, unless a particular paper had a direct impact on your ongoing experiments? That is a pretty narrow readership.

        Perhaps the view that science is self-correcting is naïve, and possibly wrong, but until a system develops that takes into account reader interests and the reward systems of scientists’ institutions, it is what we’ve got.

        1. PLOS ONE has signed on as a publishing partner for the Reproducibility Initiative, and PeerJ, eLife, and F1000 Research have all agreed to publish direct replications. Additionally, Science, Nature, and Cell have all said they have nothing against publishing a direct replication, even if it conflicts with established results, ESPECIALLY if the original was published by a competing journal. Publishing confirmations is less likely, but many journals, such as The Lancet, have indicated interest in hypothesis pre-registration: http://www.cogsci.nl/blog/miscellaneous/215-the-pros-and-cons-of-pre-registration-in-fundamental-research

          So I would suggest that maybe there’s more change and momentum going on than you’re aware of, and if you’re interested in getting on board, there are many ways to get involved. The various projects would all be much improved for your efforts.

          1. I wish you luck, but will not hold my breath. PLOS ONE’s dubious editorial policies, such as overriding reviewers’ concerns regarding the quality of data, have largely undermined the journal’s credibility. With regard to the Big 3, “they have nothing against publishing a direct replication” sounds like a bit of a non-statement.

            It would not be the first time I have missed a trend, so I could be wrong.

          2. Do you have a link re: “overriding reviewers’ concerns regarding the quality of data” at PLOS ONE? I know PLOS ONE frequently has to remind reviewers that they’re looking for methodological rigor and not novelty or high impact, but your comment sounds like this was something else, so I’d like to know more.

          3. drgunn: just curious to know whether you are associated with PLoSONE in any capacity?

          4. I’m afraid there’s no link, just specific instances of colleagues being upset that their valid concerns regarding data and analysis were ignored. My own experience with the editors at PLOS ONE has given me the impression that they are in over their heads.

            I keep hearing the proponents talking about how it will bypass the “old guard,” but it looks to me as though it is becoming a dumping ground. JMO.

          5. stpnrzr: yes, I agree with your point. I had a similar experience with PLoS ONE; I even wrote to the handling editors about this on several occasions. I stopped submitting manuscripts… as simple as that.

          6. No, I don’t work for PLOS. I just like a lot of what they do. They’re pushing for innovation in scholarly communication from the publisher side of things, which is really valuable. Before PLOS ONE, there was little traction with OA, and now every major publisher has launched a clone and is falling all over itself for Gold OA.

            I’ll ask some people I know there if someone wants to comment on this thread re: editorial issues.

          7. Whenever I think of PLoS and George Soros, I get a stomach ache. Not because the project isn’t good. No, it is because it is a cash cow. And who wouldn’t be able to start a good OA publisher if someone threw them a few million bucks for free? The very first sentence above stinks of dirty money and wasted funding. Someone is going to benefit financially here. There is definitely a deeper, more sinister story than we know about when backroom deals have already taken place without the participation (or knowledge) of the wider scientific public. In the words of an inspiring Eastern European philosopher, Slavoj Zizek: http://www.youtube.com/watch?v=hpAMbpQ8J7g

          8. It’s actually a lot worse than I thought. According to Wikipedia, PLOS One published 23,468 articles in 2012. At $1350 a pop, that’s well over $31 million.

            The phrase “vanity press” comes to mind. How can this model not lead to a conflict of interest?

            Apologies for the PLOS One bashing hijack. I will stop now.

          9. @stpnrazr & JATdS
            PLOS ONE waived my fee. They do that for a lot of authors, for reasons like the PI being between grants, or being in a nation that simply cannot afford it.

            That being said, I’m likely to go with PeerJ next time around, for various reasons.

  5. As a scientist, I’m really excited to see this issue get wider recognition. I struggled for quite some time against a dominant paradigm that said I wasn’t supposed to be getting the results I was actually getting. If I could have gotten recognition for all the careful work I had to do to prove my work was done properly, I would have advanced much faster in my career relative to the people who were working without the burden of having to overturn established dogma.

    I think we have to be very careful to avoid running into the same problems that we ran into with the impact factor, and to make sure everyone’s voice is heard. But for far too long the people who do slapdash, highly novel work have gotten all the glory, and anything that could balance the scales towards more careful work would be an improvement, in my opinion.

    Elizabeth Iorns and I are actually doing something about it, too. Last year, we started the Reproducibility Initiative. We started it on our own because we just wanted to actually do something beyond just talking about it, but we don’t think we have all the answers or that everyone should just follow our lead. Rather, we’d like to propose this as a starting point. Many of the points made above are commonly raised when we talk about the Initiative, and here’s what the current thinking is:

    It really can only work for experiments that are feasibly reproducible. Those involving analysis of soil samples from Jupiter or 100-year longitudinal studies just couldn’t be included, but that still leaves the vast majority of published reports.

    Who conducts the research is another important issue. All the labs that are most qualified to reproduce a given bit of work are most likely competitors of the lab that did the work being reproduced, so we think it’s best to avoid that issue entirely by using independent service providers who work on a fee-for-service basis. Their only incentive will be to do quality work and get paid, not to get a specific result. A nice side effect of using the service providers is that they are professionals who do their specific technique all day, making it both much more cost effective and much harder to argue with the results they obtain. The validation service that Science Exchange has made available fits these requirements quite well.

    The other point raised above, about what to do when some experiments reproduce and some don’t, is a scenario that will definitely arise. We couldn’t possibly say we have all the answers to every situation, just that we feel it’s time to start doing something and that we think we can solve these issues as we go.

    Already nearly 2000 researchers have indicated their interest in having their own studies reproduced, and many more validation studies are under way. With enough researchers paying attention to this and making their voices heard, we’ll be able to address the very real crisis in reproducibility with a solution that is of our own design.

    1. Without trying to seem pessimistic or discouraging, 2000 is an infinitesimally small number. Let’s put that into perspective. PubMed = 20 million records. ScienceDirect = 11.7 million records. SpringerLink = ~6-7 million records, and about the same number for Wiley-Blackwell and Taylor and Francis/Informa. So we are looking at a global total of at least 60 million records from core mainstream publishers alone. And we could consider these to be the “safe” options. What about the total mess in the OA world, with such a deep well of fraudulent, non-peer-reviewed results in predatory publishers (www.scholarlyoa.com)? It is evident that the 2000 studies come from scientists who know that their work is excellent and who want to fortify that quality through repetition and collaboration. By getting their results reproduced in 2-3 labs around the world, they basically guarantee that the paper is never retracted, or challenged. Kind of like post-publication peer review of the lab. It’s a farce to think that this system could work, because you would only get the elite involved. The plebs of science would treat it as irrelevant. So, let’s assume that there are globally about 100 million+ papers (all sources and publishers). 2000/100,000,000 = 0.002%. Hardly significant (even if we are off by a factor of 10 in the estimate of total papers).

  6. You don’t necessarily have to “do” anything. Why can’t you just analyze citations of the work to determine whether others have reproduced or attempted to reproduce it? Of course this won’t cover everything, but highly cited work will be reproduced often enough to have a footprint in the literature. Of course, the biggest roadblock here is curating such an index, because it’d probably require a lot of reading (or a genius to write an accurate text-reading algorithm… key word being accurate).
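    As a rough illustration of the kind of text-reading heuristic this comment gestures at, here is a deliberately naive Python sketch; the cue phrases, function name, and example sentences are assumptions, and a real system would need negation handling and far more sophistication to be anywhere near accurate.

    ```python
    # Naive keyword heuristic over citation-context sentences (illustrative only).
    REPLICATED_CUES = ["we replicated", "consistent with", "confirmed the findings",
                       "in agreement with", "reproduced the"]
    FAILED_CUES = ["could not reproduce", "failed to replicate",
                   "were unable to confirm", "inconsistent with"]

    def classify_citation_context(sentence: str) -> str:
        s = sentence.lower()
        if any(cue in s for cue in FAILED_CUES):
            return "failed"
        if any(cue in s for cue in REPLICATED_CUES):
            return "replicated"
        return "unclear"  # the vast majority of citations will land here

    contexts = [
        "Consistent with Smith et al., we observed increased expression.",
        "We were unable to confirm the effect reported by Smith et al.",
        "Smith et al. first described this pathway.",
    ]
    print([classify_citation_context(c) for c in contexts])
    # ['replicated', 'failed', 'unclear']
    ```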

    1. Brian,

      Analyzing citations will only show you work that has been produced by other academic labs, and that work is subject to the influences that stpnrazr mentions above. It won’t be a direct replication and it won’t be as solid and indisputable as a specific experiment replicated by an independent fee-for-service lab.

    2. The work published from Bayer (1) and Amgen (2) showed that citation counts are not associated with reproducibility. Many studies that were not reproducible had hundreds (and in some cases thousands) of citations. In addition, the citations were not made because someone had failed to reproduce a finding and then published that failure. The Mobley survey (3) indicated that the majority of researchers who can’t replicate published findings do not publish them.

      1. http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html
      2. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
      3. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0063221

  7. When governments are ready to multiply the cost of funding science by a factor of N (where N is the number of independent replications needed before a finding is considered “replicated”), then it will happen. Until then, none of these initiatives will get any real traction. Sure, there will be some web sites where a few people may report the work of their enemies as work they think should be replicated, but that’s about it.

    1. Actually, one of the key strengths of the Reproducibility Initiative is that it leverages the Science Exchange network, so replications can be done for a small fraction of the cost of the original experiment. The reason it’s cheaper is scale – professional labs do this all the time, so they have all the materials and expertise on hand and market forces drive the price down – and because they already know what they’re looking for, there’s no need to repeat all the trial-and-error and exploratory work that happened in the course of the original study.

      Also, you wouldn’t need to audit 100% of the experiments to get the positive incentive system working. Just a few good examples of funders saying they’re looking for reproducibility and will consider validation when they review grant proposals would be enough. Combine that with the deterrence effect (e.g., the IRS audits only about 0.5% of tax returns but gets high compliance) and you have a very effective system. If you could move the 60% irreproducibility rate to 50%, that would mean about $5 billion of US federal research funding was spent on more robust research. Surely any funder would love to peel off a tiny fraction of that to get such a return, and to get hard data on what they’re actually getting in return for their outlays as well.
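      For what it’s worth, the back-of-envelope arithmetic behind that $5 billion figure can be written out as below; the total annual federal research spend used here is an assumed round number chosen to make the comment’s numbers work, not an official budget figure.

      ```python
      # Illustrative back-of-envelope calculation (assumed figures, not official data).
      federal_research_budget = 50e9   # assumed annual US federal research spend, USD
      irreproducible_before = 0.60     # irreproducibility rate cited from the Bayer/Amgen reports
      irreproducible_after = 0.50      # the hoped-for improvement

      freed_up = federal_research_budget * (irreproducible_before - irreproducible_after)
      print(f"${freed_up / 1e9:.1f} billion per year redirected toward robust research")
      # $5.0 billion per year redirected toward robust research
      ```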

      1. That’s why people do formal meta-analyses of the literature. Besides, the best way to stick it to your enemy is to volunteer to replicate their work, give it to some incompetent graduate student, and use the failure to replicate to smear your enemy. Of course, they can do the same to you. Pretty gloomy scenario, but it WILL happen if anything of this sort starts to unfold. There is always a percentage of cheaters out there, and they will destabilize any system that has components based on trust.

        1. Using a research audit system to smear a colleague is totally a possibility, which is why we designed the system to use professional fee-for-service labs. These guys don’t care about who’s competing with who because they’re not in the game. Their only incentive is to do quality work and get paid for it. Also, it’s not just anyone who can replicate an experiment and declare it to be crap. Researchers are largely volunteering their own studies for replication, and they don’t get to choose which facility does the work. So certainly someone could nominate their enemy’s work for replication, but they have no control over the result and it could easily backfire. They’re far more likely to target the weakest work, if they take that approach, and likewise to nominate their own strongest work, and that’s exactly the dynamic we want to foster. If we have people volunteering which of their work is the best and most robust, well, that collection in itself is valuable, right?

          1. There is a simple solution, rather than toing and froing… use CROs.

            There are hundreds of CROs with staff more than capable of reproducing any experiment.

            Science that can change people’s lives is a profession; it’s not for amateurs, or politics, or gamesmanship. The problem with mainstream science as it stands is clear for all to see: there are too many non-scientists involved in science.

            Real scientists don’t do fraud.

  8. “The idea is to encourage slow science”: of course, slow science and tortoises are always winners, and harebrained science never succeeds.

  9. This is a marvelous idea. I would love to see it happen. However, of course, reproducibility can be in the eye of the beholder.

    For instance, consider medical research in the area covered by a company like Genentech. They have hinted that reproducibility is one of their concerns. But what happens if they take a leading role in rating papers or even journals for reproducibility?

    First, they may be shot at for their opinions.

    Second, they are helping their own competitors, who don’t have to invest in checking those particular papers or results.

    Third, they do often have the expense and trouble of reproducing large bodies of work.

    Fourth, aren’t we always more motivated to speak up about problems rather than positive results?

    Solutions? Perhaps rating journals based on a sample of the papers therein. That sample would probably be highly selective. It’s far from easy to satisfy everybody in a question like this.

  10. If we were just to use the existing literature as our source, I can think of one distinction among citations that may be relevant: some papers are cited for technical reasons, while others are cited for rhetorical reasons. The technical citations (in the methods) typically indicate that someone has expanded on previous work, thereby implying that they were able to reproduce the results. The rhetorical citations (intro and discussion only) typically are not validated (a rough sketch of this distinction follows at the end of this comment).

    Another issue that comes to mind is that sometimes researchers make it excessively difficult to reproduce their data. I’ve encountered a situation where someone used an unnecessarily complicated software package to perform statistical analyses of a dataset (it required the use of large computer clusters and a lot of time optimizing the software), which made it very hard for me to evaluate their analysis.

    Finally, I’ve seen people reach wrong conclusions simply because they didn’t understand the literature well and therefore misinterpreted their analysis.
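    Here is the rough sketch of the technical-versus-rhetorical distinction promised above, assuming each citation record already notes the section it appears in; the field names and the choice of qualifying sections are illustrative assumptions, not a validated rule.

    ```python
    # Treat only Methods/Results citations as (weak) evidence of reproduction.
    from dataclasses import dataclass

    @dataclass
    class Citation:
        citing_paper: str
        section: str  # e.g. "methods", "introduction", "discussion", "results"

    def implied_replications(citations):
        """Keep citations made from Methods or Results: reusing a protocol there
        implies the cited work was at least partly reproduced. Intro/discussion
        citations are treated as rhetorical and carry no such signal."""
        return [c for c in citations if c.section in ("methods", "results")]

    cites = [
        Citation("paperX", "methods"),
        Citation("paperY", "introduction"),
        Citation("paperZ", "discussion"),
    ]
    print(len(implied_replications(cites)))  # 1
    ```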

  11. A far easier solution is for journals and funding agencies to insist — require — that methods be clearly documented and data and code be accessible for examination.

      1. Methods are documented for publication, code is written to facilitate the research, and the data have to exist for the research to be done. Except for very large data sets (which might be prohibitive/impractical to share), the incremental cost of providing access for examination is very small.

    1. Nice thought, Alice.

      Here’s a policy for journals to consider if they are serious about rooting out science fraud:

      Put all standard operating procedures (SOPs) online. If an author is unwilling to do that, they can publish elsewhere.

  12. With the tenure track and NIH paylines as brutal as ever, the prospect of anything called “slow science” succeeding seems about as likely as bringing a “slow food” concept to success at McDonald’s corporate HQ. I can just hear my APT committee now, when I tell them I spent the last 3 years of my tenure eligibility doing “slow science,” in order to systematically reproduce or refute the already published, major findings of my particular discipline, with only a few papers to show for it in the lowest of the low-impact journals. The entire academic scientific enterprise is built on positive reinforcement of so-called “high-risk, high-reward” science, in which you publish the highest-profile (i.e., most novel) science you can get done, ahead of your competitors. Thus, I see no way for this movement to achieve any significant traction unless the entire academic scientific culture is rebuilt.

    About the only way I can see a systematic effort to document reproducibility being implemented is if a small army of postdocs and junior faculty who were culled from the ranks (after doing years of careful science without the sort of big-splash success needed to land or keep a PI job) could be assembled and paid by philanthropic sources to form a “Reproducibility Institute” of the sort proposed by xtaldave above, possibly with an in-house publishing wing. Even for a modest operation, you’d need a hefty self-perpetuating endowment that would generate $1-2 million USD per year for operating costs. And given some of the responses we’ve seen to public-domain-based information on this site and now-shuttered sites like Abnormal Science, this entity would also require some sort of endowment for legal defense. But given philanthropy’s general tendency to only support the newest, sexiest science that can be done, I don’t see a Gates Foundation “Scientific Reproducibility Institute” happening anytime soon.

    1. No need for a crystal ball in this case. Such an institute is already here. The Reproducibility Initiative (http://reproducibilityinitiative.org) is using academic core facilities and fee-for-service research labs to do exactly this, in a cost-effective and efficient way. This means we have no need to assemble an army of out-of-work postdocs & we can start right away. In fact, we already have. We’re working with the Center for Open Science on this, which has already received significant funding for their efforts.

      1. It appears that people with funding for it can hire, through the Reproducibility Initiative, a lab to verify their results. Is that correct? And then what? If the fee-for-service research lab’s results agree with the hiring researcher’s, the hiring client can say, “My results are reproducible; *this* lab says so.” Good for the results, and I presume good for the fee-for-service research lab; they have a happy client. If the results from the fee-for-service research lab don’t agree with the paying client’s, then what? How would anyone know, unless the paying client — the researcher — lets people know?

        1. Alice, you’re correct. Basically we’re doing what Stewart, further up, has suggested. If a lab has their own research replicated and it doesn’t replicate, they may choose not to publicize that data, but I would argue they’re still getting value in a few ways. Maybe the PI wasn’t watching too closely what was going on with a particular grad student’s work and this will allow them to see that there may be an issue before investing too much more time in an approach or area that isn’t working out, and maybe they’ll even protect themselves from being on the wrong side of a blog post at this here blog. It will certainly look good if there’s ever any attention from research integrity investigators, and at the very least, they’ll know themselves that it didn’t replicate and can start working to figure out why. Almost no one is deliberately trying to “get away” with publishing stuff that doesn’t replicate. The vast majority of stuff that doesn’t replicate is simply sampling error or methodological error and not fraud.

          The other side of the issue is that it might not always be a researcher themselves that has their own work replicated. A funding agency might take a sample of studies they’ve funded in the past and run them through the Initiative. On a more atomic level, a company might take their products and have them run through assays by an independent lab and use that in their marketing – we’ve independently verified that this antibody works for western blotting! – so there’s a lot of ways this could go.

  13. The “reproducibility problem” has a far simpler solution. Paying others simply to reproduce work is fraught with difficulties, which other commentators have described. Many journal publishers, particularly those who have spent the last 20-30 years twisting the impact factor to serve their commercial purposes (while the science community actively collaborated in the enterprise), will not change their ways; you only have to read David Vaux’s excellent guest post on Retraction Watch to see that they speak with a forked tongue.
    The solutions are quite simple.
    1. Can I see your raw data/reagents, please? If these are not available, then the paper is retracted immediately. If the paper is reasonably recent, say 5-10 years old, then you cannot apply for funding for, say, 5 years, because you need to learn how to maintain and manage data. Back to school for you.
    2. An intriguing way forward when work is deemed not to be reproducible is for the PI to be given a sabbatical to re-do the key experiments over a year. After all, if you are supervising an experiment, you should be capable of doing it yourself. Otherwise you are a manager who could not possibly quality-control the work of graduate students and postdocs, and who is also incapable of training a graduate student.
