Last year, two psychology researchers set out to figure out whether the statistical results psychologists were reporting in the literature were distributed the way you’d expect. We’ll let the authors, E.J. Masicampo, of Wake Forest, and Daniel Lalande, of the Université du Québec à Chicoutimi, explain why they did that:
The psychology literature is meant to comprise scientific observations that further people’s understanding of the human mind and human behaviour. However, due to strong incentives to publish, the main focus of psychological scientists may often shift from practising rigorous and informative science to meeting standards for publication. One such standard is obtaining statistically significant results. In line with null hypothesis significance testing (NHST), for an effect to be considered statistically significant, its corresponding p value must be less than .05.
When Masicampo and Lalande looked at a year’s worth of three highly cited psychology journals — the Journal of Experimental Psychology: General; Journal of Personality and Social Psychology; and Psychological Science — from 2007 to 2008, they found:
…p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals.
What could this mean? In a post about those findings for the British Psychological Society’s Research Digest, Christian Jarrett noted:
The pattern of results could be indicative of dubious research practices, in which researchers nudge their results towards significance, for example by excluding troublesome outliers or adding new participants. Or it could reflect a selective publication bias in the discipline – an obsession with reporting results that have the magic stamp of statistical significance. Most likely it reflects a combination of both these influences. On a positive note, psychology, perhaps more than any other branch of science, is showing an admirable desire and ability to police itself and to raise its own standards.
Now, a group of researchers has looked further back in time and found evidence of the same issue going back to 1965. Here’s part of the abstract of the new paper, “The life of p: ‘Just significant’ results are on the rise,” by Flinders University graduate student Nathan Leggett and colleagues, which appears in the Quarterly Journal of Experimental Psychology, the same journal as the earlier paper:
Articles published in 1965 and 2005 from two prominent psychology journals were examined. Like previous research, the frequency of p values at and just below .05 was greater than expected compared to p frequencies in other ranges. While this over-representation was found for values published in both 1965 and 2005, it was much greater in 2005. Additionally, p values close to but over .05 were more likely to be rounded down to, or incorrectly reported as, significant in 2005 compared to 1965. Modern statistical software and an increased pressure to publish may explain this pattern.
In other words, the “just significant” p value problem has been around for a while in psychology, and it got worse, at least through 2005. How to fix it? The authors write:
The problem may be alleviated by reduced reliance on p values and increased reporting of confidence intervals and effect sizes.
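To make that recommendation concrete, here is a minimal sketch in Python of what reporting an effect size (Cohen’s d) and a 95% confidence interval alongside the p value might look like for a simple two-group comparison. This is our illustration on simulated data, not code from the paper:

```python
# A minimal, illustrative sketch: report an effect size (Cohen's d) and a
# 95% confidence interval for a two-group mean difference, rather than
# relying on the p value alone. The data here are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=40)   # e.g. control scores
group_b = rng.normal(loc=0.5, scale=1.0, size=40)   # e.g. treatment scores

# Cohen's d with a pooled standard deviation
n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * group_a.var(ddof=1) +
                     (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2))
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

# 95% CI for the raw mean difference, using the t distribution
diff = group_b.mean() - group_a.mean()
se = pooled_sd * np.sqrt(1 / n_a + 1 / n_b)
t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

t_stat, p_value = stats.ttest_ind(group_b, group_a)
print(f"d = {cohens_d:.2f}, 95% CI for difference = "
      f"[{ci[0]:.2f}, {ci[1]:.2f}], p = {p_value:.3f}")
```

The point is that the interval and the effect size convey magnitude and uncertainty in a way a bare “p < .05” does not.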
Daniele Fanelli, of the Université de Montréal, has studied bias among behavioral science researchers. We asked him for his take:
It is a small but certainly significant piece of the puzzle, whose picture consistently suggests a worsening of biases in the scientific literature. In addition to pressures to publish and file-drawer effects, the authors repeatedly suggest that the abuse of software might be to blame, and I think they make a good point.
However, if I understand the results correctly, the authors observed spikes of “almost significant findings” in both journals examined, but the spike had grown since 1965 in only one of them. This suggests to me that the picture might be more complex than what can be captured by any single narrative. The fact that we are increasingly unable to give space to such subtleties might be just another symptom of the problem that this study, and others before it, suggest is worsening.
Bonus: For an unrelated but broader discussion of statistical significance and related issues, see Hilda Bastian’s post on the subject at Scientific American.
I’m a little surprised that so trustworthy a blog as this would endorse Bastian’s “broader discussion,” which incorrectly defines statistical significance and is full of much else that is misleading and careless.
“Testing for statistical significance only estimates the probability of getting a similar result if you repeat the experiment, given the same circumstances.”
Made me think of this video: http://www.youtube.com/watch?v=yptXkLglKkA
(Who is he talking to?) But this stuff is taken as gospel in the land of criticisms of significance tests and such.
Even if Bastian retracts, 5 new articles will be published tomorrow with a variety of old and new howlers. Worse, the abusers are regularly defended because, after all, there’s so much confusion and scientific disagreement about the proper collection, modeling, and analysis of statistical data. Her article took the additional step, also common, of implying that a shift to reporting Bayesian beliefs would obviously be more sensible.
“On a positive note, psychology, perhaps more than any other branch of science, is showing an admirable desire and ability to police itself and to raise its own standards.” — am I glad that someone out there acknowledges that we’re trying to do better instead of gloating only about the likes of Stapel, Smeesters, and so on. (S)he who is without sin…
If you need further proof of how serious psychology is about fixing the abuse of statistics and other forms of fraud, look at these papers:
http://onlinelibrary.wiley.com/doi/10.1002/per.1919/abstract
and
http://psr.sagepub.com/content/early/2013/10/31/1088868313507536.abstract
And there are many more such papers by now. Let’s hope that things get better.
P.S.: I once had a paper rejected by the journal Psychoneuroendocrinology in which two studies yielded essentially the same effect with very similar effect sizes. However, due to the lower statistical power in one study, the effect was just barely over the .05 significance level, whereas it was fully significant in the other, larger study. This prompted one reviewer to state that “Trends are not significant findings. The correct conclusion is that no relationship has been established.” Largely on the basis of this reviewer’s response, the paper was rejected. Small wonder that, when faced with this type of response, or learning from it for future studies, some researchers start becoming “creative” in pushing their effects just below the .05 level. My coauthors and I didn’t do that, of course. Instead, we eventually published the paper more or less unchanged in Hormones and Behavior.
Yes, that’s the trouble with every “reform” I’ve seen, including the journals that “ban” the use of significance tests. They tend to stem from a shallow understanding of the interpretation of statistics. I’m writing a book now on “How to tell what’s true about statistical inference” from a deeper perspective.
Ivan, isn’t this just typical of the statistics saga? I really wish there were an easier way for lay persons to distinguish good from bad statistical data. I hate the mushiness of it all.
I usually first look at exactly how many times they repeated the experiment. Now, bio experiments can get pricey and time-consuming; as a chemist, I was used to just ordering up a batch from Sigma and doing 100 measurements in a day! Can’t do that with mice; the Ethics Committees get annoyed (and with good reason!). So I’m usually willing to let bio folks do five-ish replicates and publish in working journals, but I always look at the raw data myself to see the spread/variance in the measurements.
Reblogged this on Smooth Pebbles and commented:
Retraction Watch, which does such a splendid job covering retractions and other signs of trouble in the scientific literature, has done a fine round-up on recent studies suggesting that the psychology literature seems to set a low bar for ‘significance’ in many of its publications.
Quite an interesting post. However, I wonder if the over-representation of just-significant P values could be partly explained by the following: researchers find some effect of moderate size, but still not significant in their cohort => they increase the cohort size => the effect does not diminish, but the P value gets just below .05 => they stop at this cohort size. In that case researchers should provide some statistical grounding for their cohort size choice based on statistical power criteria. And of course multi-center studies should resolve most of the ambiguity here.
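As a rough illustration of the power-based justification the commenter is asking for, here is a simulation sketch in Python (ours; the assumed effect size of d = 0.5 and the 80% power target are illustrative choices, not values from any of the papers) that estimates how many participants per group a two-sample t test would need:

```python
# Illustrative only: estimate, by simulation, the per-group sample size a
# two-sample t test needs to detect an assumed effect (d = 0.5) with ~80%
# power at alpha = .05. Effect size and target power are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
effect_size, alpha, target_power, n_sims = 0.5, 0.05, 0.80, 2000

def estimated_power(n_per_group):
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

n = 10
while estimated_power(n) < target_power:
    n += 5
print(f"Roughly {n} participants per group for ~{target_power:.0%} power")
```

For d = 0.5 this comes out at roughly 64 per group, in line with the standard analytical answer; the simulation just makes the logic of fixing the sample size in advance explicit.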
As an experimental psychologist, I agree with Mike’s comment above, though I was told that this increase-participants-to-get-significance approach is troublesome in itself. That is, some researchers have some idea (probably from previous studies or from a power calculation) of how many participants they would need, and they will normally run more if there is a promising but >.05 trend. When they get p < .05, they stop, and then there is a just-below-.05 p value. Another practice is that researchers have no idea how many participants they would need, so they just try a bunch (say 24) and, based on the results they get, keep adding participants until the p value drops below .05.
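To see why that run-more-until-significant practice is troublesome, here is a small simulation sketch of the stopping rule just described when there is in fact no effect at all. The starting sample of 24 per group, the batch size of 8, and the cap of 120 are assumptions we chose for illustration:

```python
# Illustrative simulation of "optional stopping" under a true null effect:
# start with 24 per group, test, and add participants in batches until
# p < .05 or a cap is hit. Starting n, batch size, and cap are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_start, batch, n_max, alpha, n_sims = 24, 8, 120, 0.05, 2000

false_positives = 0
for _ in range(n_sims):
    a = list(rng.normal(0.0, 1.0, n_start))
    b = list(rng.normal(0.0, 1.0, n_start))   # same population: no real effect
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            false_positives += 1
            break
        if len(a) >= n_max:
            break
        a.extend(rng.normal(0.0, 1.0, batch))
        b.extend(rng.normal(0.0, 1.0, batch))

print(f"False positive rate with optional stopping: "
      f"{false_positives / n_sims:.3f} (nominal alpha = {alpha})")
```

With these settings the rate of “significant” results comes out well above the nominal 5%, and many of the resulting p values sit only just under .05, which is one plausible mechanism for the excess the post describes.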
Some of the most egregious flaws I see in applying statistics to psychology experiments, especially social psychology, could not be remedied by better statistics because of the huge gap between what they’re studying and what they purport to infer. There is nothing in significance tests that licenses the jump from statistical to substantive, and the gap permits all manner of latitude to enter. The researchers need to show they have done even a fairly decent job of self-criticism as they fill those huge gaps with a lot of (often) flabby, just-so stories. errorstatistics.com
There’s a psychology paper exactly on this issue, showing why it’s completely wrong and can lead to erroneous conclusions if you fill up your sample until you finally “hit” .05:
http://pss.sagepub.com/content/22/11/1359.abstract
Recently, there was also a special issue of Current Directions in Psychological Science that dealt with this and other issues and recommendations on how to improve research and reporting practices in psychology. I wonder whether other disciplines that are also frequently featured on Retraction Watch are showing a similar push to improve their practices. Perhaps someone from those other disciplines could comment on that…?
The evaluation of a hypothesis involves, firstly, translating a complicated problem into an accept-or-reject decision and, secondly, determining a number, ‘P’, that quantifies it. The quantification could be questioned, but at least it captures a lot of statistical background. The basic problem comes in when a threshold needs to be specified to reach this yes/no decision. What the study shows is that authors and reviewers look at the result and are selective in what they want to bring to the general public: they want an interesting result, so a P value just below the threshold is attractive. This is human, and I think it happens everywhere; at least I can speak for the physical sciences. Without data manipulation, these ‘interesting results’ are simply selected for publication. Nothing wrong with that.
The real problem is when active data manipulation is done to achieve these ‘interesting results’, including continuing an experiment until a desired outcome is achieved. The studies mentioned cannot give good estimates of that.
Can one do something about it? Maybe discourage the expression ‘statistically (in)significant’ and only report P values, leaving it to the audience to decide what they find significant; after all, mushiness (as one of the commenters mentions) is also a fact of life.
I would welcome Miguel Roig’s comments on this. The ORI should also provide some formal response to this issue of the P value.
Psychologists Masicampo et al. (2012) claim: “Instead, our aim in the current paper was to contribute a new consideration to all sides of the debate.” However, the method employed was originally introduced by Alan S. Gerber and Neil Malhotra (2008), “Publication Bias in Empirical Social Research,” Sociological Methods and Research 37: 3-30. In this and another paper, Gerber and Malhotra analyzed the distribution of p values in various journals. They call their method the “caliper test” because of the peculiar shape of the distribution around p = 0.05. I wonder why Masicampo et al. (as well as Leggett et al. 2014) do not cite the inventors of this method?
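For readers who haven’t seen it, the caliper test boils down to counting reported p values in a narrow band just below .05 and comparing that count with the one in an equally narrow band just above .05; absent selection, the two should be roughly comparable for a narrow enough band. Here is a minimal sketch of that comparison in Python, our illustration of the general idea on made-up p values, not Gerber and Malhotra’s code:

```python
# A minimal sketch of a caliper-style comparison: count reported p values in
# a narrow band just below .05 versus an equally narrow band just above it,
# and ask whether the split is more lopsided than chance would suggest.
# The p values and caliper width below are made up for illustration.
from scipy.stats import binomtest

reported_p = [0.049, 0.047, 0.012, 0.048, 0.051, 0.046, 0.032,
              0.044, 0.053, 0.049, 0.003, 0.045, 0.048, 0.061]
caliper = 0.005  # half-width of the band on each side of .05

below = sum(1 for p in reported_p if 0.05 - caliper <= p < 0.05)
above = sum(1 for p in reported_p if 0.05 <= p < 0.05 + caliper)

# Under no selection, a p value that lands in the narrow window should be
# about equally likely to fall on either side of .05.
result = binomtest(below, below + above, p=0.5, alternative='greater')
print(f"{below} just below .05 vs {above} just above; "
      f"binomial test p = {result.pvalue:.3f}")
```

A split heavily skewed toward the just-below side is the kind of asymmetry both the 2012 paper and the new one report.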