Does peer review ferret out the best science? New study tries to answer

Grant reviewers at the U.S. National Institutes of Health are doing a pretty good job of spotting the best proposals and ranking them appropriately, according to a new study in Science out today.

Danielle Li at Harvard and Leila Agha at Boston University found that grant proposals that earn good scores lead to research that produces more papers, earns more citations, and appears more often in high-impact journals. These findings held even when they controlled for potentially confounding factors, such as the applicant’s institutional quality, gender, funding history and experience, and field.

Taking all those factors into consideration, grant scores that were 1 standard deviation lower (10.17 points, in the analysis) led to research that earned 15% fewer citations and 7% fewer papers, along with 19% fewer papers in top journals.
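To make those percentages concrete, here is a minimal sketch of the arithmetic. It assumes the reported figures can be read as simple proportional reductions from a baseline, and the baseline grant below is purely hypothetical, chosen only for illustration; none of these counts come from the study.

```python
# Reported reductions for a score one standard deviation (10.17 points) worse.
REDUCTIONS = {"citations": 0.15, "papers": 0.07, "top_journal_papers": 0.19}


def expected_outcomes(baseline, reductions=REDUCTIONS):
    """Scale each baseline count down by its reported percentage reduction."""
    return {name: baseline[name] * (1 - reductions[name]) for name in baseline}


# Hypothetical baseline grant (an assumption, not data from the paper).
baseline = {"citations": 100.0, "papers": 10.0, "top_journal_papers": 2.0}
result = expected_outcomes(baseline)

for name, value in result.items():
    print(f"{name}: {baseline[name]:g} -> {value:.2f}")
```

Under these assumptions, a grant that would otherwise yield 10 papers would be expected to yield about 9.3, so the size of the effect in absolute terms depends heavily on the baseline one has in mind.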

Li tells Retraction Watch that, while some scientists may not be surprised by these findings, previous research has suggested there isn’t much of a correlation between grant scores and outcomes:

Regardless of whether people thought there was a correlation between scores and outcomes or not, I think there are 2 common critiques that our paper to some extent refutes.

First, is that the NIH responds only to prestige — we find that even if you compare people from the same institutions, with similar grant and publication histories, who are both funded and who both get similar amounts of grant funding, the person with the better NIH score tends to produce more highly cited work. It may still be the case that peer reviewers overweight prestige (or underweight it — our study is not designed to address that question), what we can say is that reviewers have insights that are not just reflective of someone’s CV.

Second, is that with shrinking paylines, NIH funding is now essentially a lottery — that reviewers really can’t tell the difference between a 5th percentile grant and a 10th percentile grant. Our study shows that in fact percentile scores are more predictive of application quality at very low (i.e. very good) percentiles.

Interestingly, the researchers found one exception: Among the approximately 1% of grants that received poor scores (higher than 50, when lower numbers are better), grants with worse scores earned more citations. Normally, these grants would not be funded, but they received money “out of order,” which happens when a program officer steps in and awards the grant. The authors write:

We find higher average quality for this set of selected grants, suggesting that when program officers make rare exceptions to peer-review decisions, they are identifying a small fraction of applications that end up performing better than their initial scores would suggest.

Of course, these findings are predicated on the belief that citations, papers, and patents are the best measurements of research outcomes.

Li and Agha caution that their analysis does not rule out another criticism of peer review, that it “systematically rejects high-potential applications”:

Our results, however, suggest that this is unlikely to be the case, because we observe a positive relationship between better scores and higher-impact research among the set of funded applications.

Li and Agha analyzed 137,215 R01 grants funded between 1980 and 2008. More than half of the grants were for new projects; the remainder were renewals of existing grants.

Update 4/24/15 10:10 a.m. Eastern: We heard from Pierre Azoulay, an MIT economist who is acknowledged in the paper. He told us the study “unambiguously” validates the peer review process at NIH, and should cause scientists to take note.

In the current environment, where it is simply assumed that peer reviewers are not better than pagan priests reading pig entrails, I believe it is news.

But of course, we should always work to make peer review better, he added.

There is a relationship between scores and the outcomes we care about. But of course that does not mean that reform of the system could not make this relationship even more predictive. Moreover, the Li & Agha evidence points to study sections adding value *on average.* But there is heterogeneity – some of these committees actually do worse at discriminating between proposals than a robot would. Some are at par. And many do better. How do we get rid of this tail of “bad committees”? What makes a more or less effective committee? That is the next step of the agenda using these basic data.


8 thoughts on “Does peer review ferret out the best science? New study tries to answer”

  1. “We looked at the job and work that we do, and we think we’re doing a good job. But we will publish all kinds of editorials saying that there is a problem and wring our hands desperately in trying to formulate a so-called “solution” that will ultimately only solidify our own security… And then we’ll tell you again what a great job we’re doing!”

  2. Here’s another way of interpreting the main conclusion of this paper, i.e. “Taking all those factors into consideration, grant scores that were 1 standard deviation lower (10.17 points, in the analysis) led to research that earned 15% fewer citations and 7% fewer papers, along with 19% fewer papers in top journals.”:

    A large difference in grant scores leads to slight to trivial differences in citations, papers etc.

    Such small differences could amount to, say, differences in writing ability/tolerance (or the amount of money they have to pay someone to help them with writing).

    Does anyone expect Science to be a bastion of impartiality in assessing this type of study? The title of the editor’s summary is “Proof that peer review picks promising proposals”. The conclusions seem like rather weak proof, at best. Note that the title of the original research article (Li and Agha 2015) poses a question, which honestly, doesn’t seem to have been answered.

    Fig. 1. of (Li and Agha 2015) is shocking: given the number of data points displayed, it’s impossible to determine densities or trends in citations or number of publications vs. peer review percentile scores from what’s shown. It’s mystifying why it was even shown given the subsequent analyses.

    Fig. 2A shows that researchers with relatively bad review scores (> 60th percentile) garner more citations than those with intermediate review scores (~ 20-60th percentile). This is disturbing.

    In the associated "In Depth" article at Science, co-author Danielle Li is quoted as saying "Experts add value" (to the review process, presumably). Perhaps the more important question is: at what cost?

    From the same "In Depth" article, there's a paragraph on Richard Nakamura's views, which seem more tempered with realism:

    'The head of NIH's massive grant-review enterprise, Richard Nakamura, agrees that the research appears to bolster the case for enlisting thousands of scientists as reviewers. But the data are hardly definitive, he says. The Science paper “says that, unlike what other studies have found, there is a relationship between scores and outcome measure if you look at enough grants,” Nakamura says. “But it's a very noisy measure. And the debate over how to measure the outcome of grants remains very much alive.”'

    I eagerly await post-publication peer review of this article from Pubpeer et al.

  3. I have some serious disagreement with how these results are being interpreted. It is almost trite to show that priority scores from peer reviewers correlate with publication and citation productivity. No one disputes that reviewers can distinguish good grants from mediocre or poor ones. The question is, with paylines hovering around 10%, can reviewers accurately rank highly meritorious grants, or do the rankings end up largely arbitrary and prone to bias? Previous studies such as those by Berg, Lauer, Johnson and Kaplan, suggest the latter. In this study, Li is claiming that ‘percentile scores are more predictive of application quality at very good percentiles’. However this only pertains to a rarified group of applications that had very high numbers of publications that could not be accounted for by other factors such as field of research, year and applicant qualifications. If one looks at the total data in figure 1 of the paper, a very different conclusion is reached– the grants in the top 10% and the next 10% exhibit a wide range of productivity, with highly productive projects in both deciles. Rather than ‘Peer Review Ferrets Out the Best Science,’ the headline could just as easily read ‘Current Paylines Leave Some of the Best Science Behind’.

    Other important caveats to keep in mind:
    – Citation productivity and patents are not the only or necessarily the most important benefits of research, they just happen to be easier to measure than other outputs.
    – This study lumps together data from 1980-2008, but success rates and the peer review process have changed markedly over that time period. This is probably the reason that some grants with very poor percentile scores were able to receive funding. Notably, the results are not directly relevant to current peer review criteria in place since 2009.
    – The number of publications attributed to each grant seems markedly higher than what has been reported by Berg (median ~6). Some grants in this study report more than 100 attributable publications; this is difficult to understand, and the relevance of such projects to ordinary research grants is questionable. If Berg’s figures are closer to the truth, then ‘a one-standard deviation worse peer-review score… associated with 7% fewer publications’ amounts to a fraction of a paper. This seems trivial.
    – Studies of NIH peer review must rely on the analysis of funded grants, as lack of funding adversely affects productivity. However such an analysis cannot determine how much research opportunity is missed because it never receives funding. For this one must use indirect approaches. There is a concern that the current peer review system encourages researchers to be conservative and to conform to convention. In support of this notion, Tatsioni found that about 30% of papers that led to Nobel Prizes did not have funding support, and Nicholson and Ioannidis found a poor correlation between very high-impact papers and NIH funding.

    Ignoring the hype, I don’t find the results of this study to be particularly surprising. Peer review can discriminate between good and bad grants. But at current levels of funding, funding decisions remain rather arbitrary. Many grants in the top 10%ile exhibit ordinary productivity, while many among the next 10-15% show high productivity. Productivity starts to noticeably drop off after the 25th-30th %ile. Failure to accurately predict which of the good grants will be most productive is not the fault of the reviewers or the researchers. It is simply not possible, given the unpredictability of science. Most importantly, like earlier studies, these data suggest that current paylines are inadequate to provide support for many research projects that could be highly productive if funded.

  4. I would like to know (broadly):
    a) how many of the papers that originated from those grants were subjected to errata, corrigenda, expressions of concern, or retractions.
    b) how many papers that were awarded grants were published in so-called “predatory” OA journals listed at .
    c) how the IF scores were factored into this study when they had not existed or only started to exist during 1980-2008.

    I would also like to know, based on Li’s comment “the NIH responds only to prestige”:
    a) where does the NIH define “prestige”.
    b) if impact factors did not exist, how could “prestige” be calculated?

  5. There is potential for circular logic here. What this really says is that reviewers of grants and people who review and/or cite papers in the same field seem to like the same studies. It seems to me that those two populations will comprise many of the same people. So what this really means is that people who give a study proposal a good score are also more likely to cite that study when it is published (or even to give the manuscript a good review to get it published in the first place), which is exactly what one would expect if they liked it enough to give it a good score when it was a proposal. Basically, this just shows that reviewers are consistent in liking something, not that they can actually judge whether it’s good.

  6. Ferric Fang
    Nicholson and Ioannidis found a poor correlation between very high-impact papers and NIH funding.

    No, they didn’t.

    First of all, for clarity, they did not look at whether the “high-impact” (highly cited) papers were funded by NIH. They looked at whether the first and last authors had NIH grants years after the publication.

    Second, in no sense did they look at a “correlation” between these papers and funding. They did not compare funding rates of these authors to anybody else. They just looked at the fraction of them with current funding, which they declared to be too low. If an author had left academic science, or had returned to his or her home country to pursue science there, or was funded by HHMI and so did not seek NIH funding, this was somehow a failure of NIH.

    Not only was the premise of this study flawed, but the execution can be criticized on several counts, such as the inclusion of review articles, which tend to be highly cited.
