We’re using a common statistical test all wrong. Statisticians want to fix that.

After reading too many papers that either are not reproducible or contain statistical errors (or both), the American Statistical Association (ASA) has been roused to action. Today the group released six principles for the use and interpretation of p-values. P-values are used to search for differences between groups or treatments, to evaluate relationships between variables of interest, and for many other purposes. But the ASA says they are widely misused. Here are the six principles from the ASA statement:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

We spoke with Ron Wasserstein, ASA’s executive director, about the new principles.

Ron Wasserstein: We were inspired to act because of the growing recognition of a reproducibility crisis in science (see, for example, the National Academy of Sciences' recent report) and a tendency to blame statistical methods for the problem. The fact that the editors of a scholarly journal, Basic and Applied Social Psychology, were so frustrated with research that misused and misinterpreted p-values that they decided to ban them in 2015 confirmed that a crisis of confidence was at hand, and we could no longer stand idly by.

Retraction Watch: Some of the principles seem straightforward, but I was curious about #2 – I often hear people describe the purpose of a p-value as a way to estimate the probability that the data were produced by random chance alone. Why is that a false belief?

Ron Wasserstein: Let’s think about what that statement would mean for a simplistic example. Suppose a new treatment for a serious disease is alleged to work better than the current treatment. We test the claim by matching 5 pairs of similarly ill patients and randomly assigning one to the current and one to the new treatment in each pair. The null hypothesis is that the new treatment and the old each have a 50-50 chance of producing the better outcome for any pair. If that’s true, the probability the new treatment will win for all five pairs is (1/2)^5 = 1/32, or about 0.03. If the data show that the new treatment does produce a better outcome for all five pairs, the p-value is 0.03. It represents the probability of that result, under the assumption that the new and old treatments are equally likely to win. It is not the probability that the new treatment and the old treatment are equally likely to win.

This is perhaps subtle, but it is not quibbling.  It is a most basic logical fallacy to conclude something is true that you had to assume to be true in order to reach that conclusion.  If you fall for that fallacy, then you will conclude there is only a 3% chance that the treatments are equally likely to produce the better outcome, and assign a 97% chance that the new treatment is better. You will have committed, as Vizzini says in “The Princess Bride,” a classic (and serious) blunder.
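Wasserstein's five-pair example can be checked directly with a few lines of Python (an editorial illustration; the function name is ours):

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: probability of observing `successes` or more
    wins in `trials` pairs if each treatment is equally likely to win."""
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))

# New treatment wins in all 5 pairs: p = (1/2)^5 = 1/32
print(binomial_p_value(5, 5))  # 0.03125
```

The computed value is the probability of the data assuming the null hypothesis, not the probability of the null hypothesis given the data; the code makes that plain, since the null probability of 0.5 is an input, not an output.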

Retraction Watch: What are the biggest mistakes you see researchers make when using and interpreting p-values?

Ron Wasserstein: There are several misinterpretations that are prevalent and problematic. The one I just mentioned is common. Another frequent misinterpretation is concluding that a null hypothesis is true because a computed p-value is large.  There are other common misinterpretations as well.  However, what concerns us even more are the misuses, particularly the misuse of statistical significance as an arbiter of scientific validity. Such misuse contributes to poor decision making and lack of reproducibility, and ultimately erodes not only the advance of science but also public confidence in science.

Retraction Watch: Do some fields publish more mistakes than others?

Ron Wasserstein: As far as I know, that question hasn’t been studied.  My sense is that all scientific fields have glaring examples of mistakes, and all fields have beautiful examples of statistics done well. However, in general, the fields in which it is easiest to misuse p-values and statistical significance are those which have a lot of studies with multiple measurements on each participant or experimental unit. Such research presents the opportunity to p-hack your way to findings that likely have no scientific merit.

Retraction Watch: Can you elaborate on #4: “Proper inference requires full reporting and transparency”?

Ron Wasserstein: There is a lot to this, of course, but in short, from a statistical standpoint this means to keep track of and report all the decisions you made about your data, including the design and execution of the data collection and everything you did with that data during the data analysis process.  Did you average across groups or combine groups in some way? Did you use the data to determine which variables to examine or control, or which data to include or exclude in the final analysis? How are missing observations handled?  Did you add and drop variables until your regression models and coefficients passed a bright-line level of significance? Those decisions, and any other decisions you made about statistical analysis based on the data itself, need to be accounted for.

Retraction Watch: You note in a press release accompanying the ASA statement that you’re hoping research moves into a “post p<0.05” era – what do you mean by that? And if we don’t use p-values, what do we use instead?

Ron Wasserstein: In the post p<0.05 era, scientific argumentation is not based on whether a p-value is small enough or not. Attention is paid to effect sizes and confidence intervals. Evidence is thought of as being continuous rather than some sort of dichotomy. (As a start to that thinking, if p-values are reported, we would see their numeric value rather than an inequality: p = 0.0168 rather than p < 0.05.) All of the assumptions that contribute information to inference should be examined, including the choices made regarding which data are analyzed and how. In the post p<0.05 era, sound statistical analysis will still be important, but no single numerical value, and certainly not the p-value, will substitute for thoughtful statistical and scientific reasoning.
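The reporting shift described here, effect sizes with confidence intervals rather than a bare significance verdict, can be sketched in a few lines of Python. This is a hypothetical illustration: the data and the helper name are invented, and a real analysis would use a t-based interval (e.g. from scipy.stats) rather than this normal approximation.

```python
import math
from statistics import mean, stdev

def mean_diff_ci(a, b, z=1.96):
    """Difference in means with an approximate 95% confidence interval
    (normal approximation; a real analysis would use a t-interval)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a)**2 / len(a) + stdev(b)**2 / len(b))
    return diff, (diff - z * se, diff + z * se)

# Invented example data: outcomes under a new vs. old treatment
new = [5.1, 4.8, 5.6, 5.0, 5.3]
old = [4.4, 4.9, 4.2, 4.6, 4.5]
diff, (low, high) = mean_diff_ci(new, old)
print(f"difference = {diff:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reported this way, the reader sees how large the effect is and how precisely it is estimated, rather than only whether a threshold was crossed.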

Retraction Watch: Anything else you’d like to add?

Ron Wasserstein: If the statement succeeds in its purpose, we will know it because journals will stop using statistical significance to determine whether to accept an article. Instead, journals will accept papers based on a clear and detailed description of the study design, execution, and analysis, with conclusions based on valid statistical interpretations and scientific arguments, reported transparently and thoroughly enough to be rigorously scrutinized by others. I think this is what journal editors want to do, and some already do, but others are captivated by the seeming simplicity of statistical significance.


15 thoughts on “We’re using a common statistical test all wrong. Statisticians want to fix that.”

1. PJTV says:

These discussions miss the advantage of using p-values: no knowledge of underlying distributions required, and everything expressed in terms of probabilities, which has a certain intuitive feel for everyone. Further, decisions have to be taken in engineering, and that means yes/no; continuity is not an option. So, let us keep the p-values, but with due care. Principle number 3, I propose to paraphrase: business or policy decisions should not be based on p-values or $-values alone.

1. ARGalloni says:

I think the point is that the feeling of simplicity you get from the p-value to make a secure yes/no decision is false comfort. If you make yes/no decisions from nothing but the p-value you will often make the wrong decision. The yes/no decision can still be made, but you need to make that decision based on an evaluation of the evidence from the paper as a whole (including effect sizes etc.), of which the p-value should be only one of many things that you need to consider.

2. Ron Wasserstein says:

Thanks for your comment. Decisions are often dichotomous, but strength of evidence is usually not.

3. VM says:

What do you mean by “no knowledge of underlying distributions required”? A p-value is the probability that the observed result, or a result even more extreme, would be produced if the null hypothesis were true. That is of course based on a probability distribution chosen under the null hypothesis.

4. What if the p-value were 0.051 or 0.049?
What would you conclude in each case?

What if the p-value = 0.0001, but, for example, the difference between your means is just 0.1%, so there are no practical implications?

2. Sylvain Bernès says:

For those interested, I warmly recommend the Wikipedia article in English on the Null hypothesis, one of the best articles available for this topic (including references and the Talk section):

http://en.wikipedia.org/wiki/Null_hypothesis

I feel that too many researchers are under the impression that their article will appear more “positive”, and will thus be accepted, if they successfully “prove” the null hypothesis. At least in a pure Fisher’s approach, such a strategy makes no sense. I’m not sure, however, if the “Princess Bride” blunder mentioned by Dr. Wasserstein is related to this (wrong) strategy.

3. Kevin Costello says:

For point #2, one thing I like to bring up when teaching hypothesis testing is the XKCD “jelly bean” comic at https://xkcd.com/882/ . The comic also ties in with the problems caused in general by a research culture that publishes “statistically significant” results while ignoring negative ones.
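The jelly-bean scenario can be quantified directly: with 20 independent tests at alpha = 0.05 and no real effects anywhere, the chance of at least one spurious "significant" result is 1 - 0.95^20, or about 64%. A minimal Python sketch (illustrative only, assuming independent tests):

```python
# Family-wise error: probability of at least one false positive across
# m independent tests at level alpha, when every null hypothesis is true.
def familywise_error(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

print(round(familywise_error(20), 2))  # 0.64 -- the jelly-bean scenario
# A Bonferroni-corrected threshold keeps the family-wise rate near alpha:
print(round(familywise_error(20, alpha=0.05 / 20), 3))
```

This is exactly the multiple-comparisons problem the comic dramatizes: run enough tests and "significance" is nearly guaranteed somewhere.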

4. Ed Rigdon says:

Thanks a lot for the interview! Much appreciated.

5. The p-value debate is confounded with the null/nil hypothesis debate. Why are we always using a ‘nil’ hypothesis as the null hypothesis (the hypothesis to be tentatively accepted unless a result is observed that is so improbable, conditional on the null hypothesis being true, that we reject it)? In structural equation modeling we have the freedom of asserting null hypotheses with non-zero values for parameters, and if we do assert zero values by hypothesis for some of the parameters, we do not do so for all, so that the particular choice of which parameters get fixed zero values is theoretically (substantively) driven rather than framed as a hypothesis to be rejected. We want our null hypotheses to be accepted.

In physics they hypothesize certain specific values for physical constants, some derived from theory and others from the results of prior studies in other experimental contexts. The idea that you are testing the alternative hypothesis with a nil hypothesis (parameter = 0), and will accept the theory you hold under the alternative hypothesis (parameter ≠ 0), rests on the fallacy of forgetting that the alternative hypothesis could be satisfied by any model that does not fix the parameter to zero. So merely rejecting the nil hypothesis does not establish that your substantive hypothesis, with its non-zero (but unknown) parameter value, is uniquely true; some other model (among infinitely many, in some substantive cases, asserting a nonzero value) could be true instead. P-values tell us little other than that the results are so deviant under the assumption that the null hypothesis is true (if, say, p is less than .05) as to stretch the bounds of credibility.

Remember that the null hypothesis could be that b1 = .75 (a specific nonzero value). When sample sizes get very large, the power to reject the null hypothesis becomes quite large, meaning a small difference may be detected with a very small p-value as ‘significantly different’ under the assumed sampling distributions. But the small size of the significant difference may mean that the distributional assumptions (e.g. multivariate normality) are not satisfied, which is more readily detected in huge samples, while the parameter value fixed by hypothesis may still be true. A null hypothesis is complex: it combines sampling-distribution assumptions with substantive model-parameter assumptions. One could be false while the other is true.

6. PJTV says:

There are some interesting publications by David Colquhoun on this issue, among them: http://rsos.royalsocietypublishing.org/content/1/3/140216. Some numerical examples show how wrong one can be, and he also points to an approach that gives a better level of significance. Roughly speaking: don’t apply the 2-sigma rule, but a 3-sigma rule.
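Colquhoun's point can be illustrated with a simple false-discovery-rate calculation, a sketch of the kind of arithmetic in the linked paper. The prior (fraction of tested hypotheses that are real effects) and the power are assumed values chosen for illustration:

```python
def false_discovery_rate(prior_real, power, alpha):
    """Fraction of 'significant' results that are false positives, given
    the fraction of tested hypotheses that are real effects (prior_real),
    the test's power, and its significance level (alpha)."""
    false_pos = (1 - prior_real) * alpha   # nulls that slip through
    true_pos = prior_real * power          # real effects detected
    return false_pos / (false_pos + true_pos)

# With 10% of hypotheses real, 80% power, and alpha = 0.05,
# about a third of "significant" findings are false positives:
print(round(false_discovery_rate(0.1, 0.8, 0.05), 2))  # 0.36
```

The calculation shows why p < 0.05 is far weaker evidence than it looks: the 5% error rate applies per test, not per published "discovery."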

7. Andy chimz says:

Thanks for the article, but I am more confused. We need to have a baseline for statistical judgement. Most constant values in physics are not constants, but their variability is considered insignificant, because we must have a baseline to say move this way or the other. If the p-value is no longer p*, and H0 is not nil, it will be difficult to make statistical summaries, and most decisions will be subject to subjective reasoning.

8. Samer Faraj says:

Thank you for the courage of tackling this issue, or at least problematizing it. I was wondering if the problem has gotten worse in the age of big data. Too often, we see studies published where a (very) large N necessarily generates high significance levels for most model variables. Indeed, the whole research enterprise devolves into testing models (all significant) until finding one that the author prefers and chooses to wrap the paper around. The problem is not easily solvable by asking for effect size. It becomes a debate: authors will argue that their effect size is big enough, which leaves reviewers/editors with no “objective” basis to argue the opposite.
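The large-N point can be made concrete: under a normal approximation, a fixed and practically trivial difference yields an arbitrarily small p-value once n is large enough. A sketch (illustrative only; a z-test approximation with invented numbers):

```python
import math

def z_test_p(diff, sd, n):
    """Two-sided p-value for a mean difference, normal (z) approximation."""
    z = diff / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# The same trivial difference (0.001 on data with sd = 1) goes from
# unremarkable to "highly significant" as the sample size grows:
for n in (1_000, 100_000, 100_000_000):
    print(n, z_test_p(0.001, 1.0, n))
```

Nothing about the effect changes across the three rows; only the sample size does, which is why a p-value alone says nothing about practical importance (ASA principle #5).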

9. Samuel Zagarella says:

Great article. For a still deeper understanding of p-values and the fallacies, I strongly recommend the book “Statistics Done Wrong” by Alex Reinhart, especially the chapter on the base rate fallacy. This will complete most people’s understanding of the fallacy of assuming the p-value is the “chance this result was a false positive”.
