Last year, two psychology researchers set out to figure out whether the statistical results psychologists were reporting in the literature were distributed the way you’d expect. We’ll let the authors, E.J. Masicampo, of Wake Forest, and Daniel Lalande, of the Université du Québec à Chicoutimi, explain why they did that:
The psychology literature is meant to comprise scientific observations that further people’s understanding of the human mind and human behaviour. However, due to strong incentives to publish, the main focus of psychological scientists may often shift from practising rigorous and informative science to meeting standards for publication. One such standard is obtaining statistically significant results. In line with null hypothesis significance testing (NHST), for an effect to be considered statistically significant, its corresponding p value must be less than .05.
When Masicampo and Lalande looked at a year’s worth of articles, published from 2007 to 2008, in three highly cited psychology journals (the Journal of Experimental Psychology: General; the Journal of Personality and Social Psychology; and Psychological Science), they found:
…p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals.
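How striking is that pile-up? One way to get a feel for it is a caliper-style check: count the reported p values that land in narrow windows on either side of .05 and test whether the split is even. Here’s a minimal sketch in Python, with invented counts; it illustrates the idea and is not Masicampo and Lalande’s actual method or data.

```python
# Toy caliper-style check, assuming we have already extracted the reported
# p values from a set of articles. The counts below are invented.
from scipy import stats

just_below = 92  # reported p values falling in [0.045, 0.050)
just_above = 54  # reported p values falling in (0.050, 0.055]

# Absent any pull toward significance, a p value landing in this narrow
# window should fall on either side of .05 with roughly equal probability,
# so a lopsided split is suspicious.
result = stats.binomtest(just_below, n=just_below + just_above, p=0.5)
print(f"share just below .05: {just_below / (just_below + just_above):.2f}, "
      f"binomial test p = {result.pvalue:.4f}")
```

Real analyses, including the ones described here, bin the whole range of reported p values and model the expected shape of the distribution; the two-window comparison is just the simplest version of the same intuition.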
What could this mean? In a post about those findings for the British Psychological Society’s Research Digest, Christian Jarrett noted:
The pattern of results could be indicative of dubious research practices, in which researchers nudge their results towards significance, for example by excluding troublesome outliers or adding new participants. Or it could reflect a selective publication bias in the discipline – an obsession with reporting results that have the magic stamp of statistical significance. Most likely it reflects a combination of both these influences. On a positive note, psychology, perhaps more than any other branch of science, is showing an admirable desire and ability to police itself and to raise its own standards.
Now a group of researchers has looked further back in time and found evidence of the same issue going back to 1965. Here’s part of the abstract of the new paper, “The life of p: ‘Just significant’ results are on the rise,” by Flinders University graduate student Nathan Leggett and colleagues, which appears in the Quarterly Journal of Experimental Psychology, the same journal that published the earlier paper:
Articles published in 1965 and 2005 from two prominent psychology journals were examined. Like previous research, the frequency of p values at and just below .05 was greater than expected compared to p frequencies in other ranges. While this over-representation was found for values published in both 1965 and 2005, it was much greater in 2005. Additionally, p values close to but over .05 were more likely to be rounded down to, or incorrectly reported as, significant in 2005 compared to 1965. Modern statistical software and an increased pressure to publish may explain this pattern.
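That rounding-down finding points to an audit anyone can run on an article that reports its test statistics: recompute the exact p value and compare it with the significance claim. Here’s a hypothetical sketch; the t value, degrees of freedom, and reported claim are invented, and this is not the authors’ procedure.

```python
# Hypothetical illustration: recompute a two-tailed p value from a reported
# t statistic and flag results called "significant" when the exact p is
# over .05. The numbers are made up, not drawn from the study.
from scipy import stats

t_reported = 1.99            # t statistic as reported in a (made-up) article
df = 30                      # degrees of freedom reported alongside it
claimed_significant = True   # article describes the result as p < .05

exact_p = 2 * stats.t.sf(abs(t_reported), df)  # two-tailed p

if claimed_significant and exact_p > 0.05:
    print(f"flag: exact p = {exact_p:.4f}, but reported as significant")
```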
In short, then: the “just significant” p value problem has been around in psychology for a while, and it got worse, at least through 2005. How to fix it? The authors write:
The problem may be alleviated by reduced reliance on p values and increased reporting of confidence intervals and effect sizes.
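To make that recommendation concrete, here’s a minimal sketch for two independent groups with simulated data: it reports the mean difference with a 95% confidence interval and Cohen’s d instead of a bare p value. The group sizes and the pooled-standard-deviation formula for d are standard textbook choices, not anything prescribed by the paper.

```python
# Minimal sketch of the recommendation: report an effect size (Cohen's d)
# and a 95% confidence interval, not just a p value.
# The data are simulated; nothing here comes from the study itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 40)  # simulated control group
b = rng.normal(0.4, 1.0, 40)  # simulated treatment group

diff = b.mean() - a.mean()

# Pooled standard deviation, used both for Cohen's d and the CI
df = len(a) + len(b) - 2
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
              (len(b) - 1) * b.var(ddof=1)) / df)
d = diff / sp

# 95% CI for the mean difference from the two-sample t distribution
se = sp * np.sqrt(1 / len(a) + 1 / len(b))
half = stats.t.ppf(0.975, df) * se
print(f"Cohen's d = {d:.2f}, "
      f"95% CI for the difference: [{diff - half:.2f}, {diff + half:.2f}]")
```

A confidence interval that barely excludes zero tells a reader much the same cautionary tale as p = .049, which is exactly why the authors prefer it to a bare verdict of “significant.”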
Daniele Fanelli, of the Université de Montréal, has studied bias among behavioral science researchers. We asked him for his take:
It is a small but certainly significant piece of the puzzle, whose picture consistently suggests a worsening of biases in the scientific literature. In addition to pressures to publish and file-drawer effects, the authors repeatedly suggest that the abuse of software might be to blame, and I think they make a good point.
However, if I understand the results correctly, the authors observed spikes of “almost significant findings” in both journals examined, but only in one of them had such a spike increased since 1965. This suggests to me that the picture might be more complex than what can be captured by any single narrative. The fact that we are increasingly unable to give space to such subtleties might be just another symptom of the problem that this study, and others before it, suggest is worsening.
Bonus: For an unrelated but broader discussion of statistical significance and related issues, see Hilda Bastian’s post on the subject at Scientific American.