Grant reviewers at the U.S. National Institutes of Health are doing a pretty good job of spotting the best proposals and ranking them appropriately, according to a new study in Science out today.
Danielle Li at Harvard and Leila Agha at Boston University found that grant proposals that earn good scores lead to research that is cited more often, yields more papers, and appears more often in high-impact journals. These findings held up even when the authors controlled for notoriously confounding factors, such as the applicant’s institutional quality, gender, funding history, experience, and field.
Taking all those factors into consideration, grant scores that were 1 standard deviation worse (10.17 points, in the analysis) led to research that earned 15% fewer citations and 7% fewer papers, along with 19% fewer papers in top journals.
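To make those effect sizes concrete, here is a minimal back-of-the-envelope sketch in Python. The baseline outcome numbers are hypothetical, invented purely for illustration; only the standard deviation (10.17 points) and the percentage reductions (15%, 7%, 19%) come from the reported results.

```python
# Back-of-the-envelope illustration of the reported effect sizes.
# The baseline figures below are hypothetical, not from the paper;
# only the percentage effects (15%, 7%, 19%) come from the study.

SD_POINTS = 10.17  # one standard deviation of the score, per the analysis

baseline = {
    "citations": 100,        # hypothetical citation count for a funded grant
    "papers": 10,            # hypothetical number of papers
    "top_journal_papers": 3, # hypothetical papers in top journals
}

# Reported reductions associated with a score that is one SD worse
effects = {
    "citations": 0.15,
    "papers": 0.07,
    "top_journal_papers": 0.19,
}

for outcome, base in baseline.items():
    reduced = base * (1 - effects[outcome])
    print(f"{outcome}: {base} -> {reduced:.1f} "
          f"(a {effects[outcome]:.0%} drop for a {SD_POINTS}-point worse score)")
```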
Li tells Retraction Watch that, while some scientists may not be surprised by these findings, previous research has suggested there isn’t much of a correlation between grant scores and outcomes:
Regardless of whether people thought there was a correlation between scores and outcomes or not, I think there are 2 common critiques that our paper to some extent refutes.
First, is that the NIH responds only to prestige — we find that even if you compare people from the same institutions, with similar grant and publication histories, who are both funded and who both get similar amounts of grant funding, the person with the better NIH score tends to produce more highly cited work. It may still be the case that peer reviewers overweight prestige (or underweight it — our study is not designed to address that question); what we can say is that reviewers have insights that are not just reflective of someone’s CV.
Second, is that with shrinking paylines, NIH funding is now essentially a lottery — that reviewers really can’t tell the difference between a 5th percentile grant and a 10th percentile grant. Our study shows that in fact percentile scores are more predictive of application quality at very low (i.e. very good) percentiles.
Interestingly, the researchers found one exception: Among the approximately 1% of grants that received poor scores (higher than 50, when lower numbers are better), grants with worse scores earned more citations. Normally, these grants would not be funded, but they received money “out of order” when a program officer stepped in and awarded the grant. The authors write:
We find higher average quality for this set of selected grants, suggesting that when program officers make rare exceptions to peer-review decisions, they are identifying a small fraction of applications that end up performing better than their initial scores would suggest.
Of course, these findings are predicated on the belief that citations, papers, and patents are the best measurements of research outcomes.
Li and Agha caution that their analysis does not rule out another criticism of peer review, that it “systematically rejects high-potential applications”:
Our results, however, suggest that this is unlikely to be the case, because we observe a positive relationship between better scores and higher-impact research among the set of funded applications.
Li and Agha analyzed 137,215 R01 grants funded between 1980 and 2008. More than half of the grants were for new projects; the rest were renewals of existing grants.
Update 4/24/15 10:10 a.m. Eastern: We heard from Pierre Azoulay, an MIT economist who is acknowledged in the paper. He told us the study “unambiguously” validates the peer review process at NIH, and should cause scientists to take note:
In the current environment, where it is simply assumed that peer reviewers are not better than pagan priests reading pig entrails, I believe it is news.
But of course, we should always work to make peer review better, he added:
There is a relationship between scores and the outcomes we care about. But of course that does not mean that reform of the system could not make this relationship even more predictive. Moreover, the Li & Agha evidence points to study sections adding value *on average.* But there is heterogeneity – some of these committees actually do worse at discriminating between proposals than a robot would. Some are at par. And many do better. How do we get rid of this tail of “bad committees”? What makes a more or less effective committee? That is the next step of the agenda using these basic data.