Science and the significant trend towards spin and fairytales

Simon Gandevia

What do fairytales and scientific papers have in common? Consider the story of Rumpelstiltskin. 

A poor miller tries to impress the king by claiming his daughter can spin straw into gold. The avaricious king locks up the girl and tells her to spin out the gold. She fails, until a goblin, Rumpelstiltskin, comes to her rescue.  

In science, publishers and editors of academic journals prefer to publish demonstrably new findings – gold – rather than replications or refutations of findings which have been published already. This “novelty pressure” requires presentation of results that are “significant” – usually that includes being “statistically significant.”  

In the conventional realm of null-hypothesis significance testing, this means using a threshold probability. Usually in biology and medicine the accepted cutoff is a probability of 0.05 (a chance of 5%, or one in 20), and its use is explicitly written into the methods sections of publications. Some branches of science, such as genetics and physics, use more stringent probability thresholds. But the necessity of having a threshold remains.
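As a rough illustration of how that decision rule operates in practice, here is a minimal sketch; the data, group sizes and use of a two-sample t-test are invented assumptions, not drawn from any audited paper.

```python
# Minimal sketch of the conventional decision rule. All numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=10.0, scale=2.0, size=12)  # hypothetical control measurements
treated = rng.normal(loc=11.5, scale=2.0, size=12)  # hypothetical treated measurements

alpha = 0.05  # the conventional cutoff in biology and medicine
t_stat, p_value = stats.ttest_ind(treated, control)

# The reported outcome is binary: the p-value either clears the threshold or it does not.
verdict = "statistically significant" if p_value <= alpha else "not statistically significant"
print(f"p = {p_value:.3f} -> {verdict}")
```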

How do researchers create the illusion of novelty in a result when the finding has a probability value close to, but on the wrong side of, the stated probability threshold – for example, a probability of 0.06? Talk it up, spin out a story! It’s the fairytale of Rumpelstiltskin in modern garb.  

Here are more than 500 examples of pretzel logic researchers have used to make claims of significance despite p values higher than .05. It would be comical if not for the serious obfuscation of science which the stories cause.  

In recent years, the practice of claiming importance and true significance for such results has been termed “spin.” More formally, we call it “reporting that could distort the interpretation of results and mislead readers.”

Increasingly, scholars are quantifying and analyzing the practice of spinning probability values. Linked to our development of a “Quality Output Checklist and Content Assessment” (QuOCCA) as a tool for assessing research quality and reproducibility, my colleagues and I have measured how often spin occurs in three prestigious journals: the Journal of Physiology, the British Journal of Pharmacology and the Journal of Neurophysiology.

We found when probability values were presented in the results section of the publication, but were not quite statistically significant (greater than 0.05 but less than 0.10), authors talked up the findings and spun out a story in about 55%-65% of publications. Often, they wrote results “trended” to significance. Thus, results of straw can become results of gold! Attractive to the researchers, editors, publishing houses and universities.  

Putting spin on insignificant probability values is an egregious and shonky – that’s dubious, for our friends outside of Australia – scientific practice. It shows the authors’ failure to appreciate the requirement of an absolute threshold for claiming the presence (or not) of an effect, or for supporting (or not) a hypothesis. It reveals an entrenched and incorrigible capacity for bias. Furthermore, the authors seem unaware of the fact that a probability value of, say, 0.07 is not even justifiable as a trend: The addition of further samples or participants does not inexorably move the probability value below the 0.05 threshold.
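To see why, consider a toy simulation. The numbers below are invented for illustration and assume no true effect at all: recruiting more participants and re-testing does not steadily drag the probability value towards 0.05; it simply wanders.

```python
# Toy simulation, assuming no true difference between the groups: the p-value
# does not march toward 0.05 as more participants are added; it wanders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = list(rng.normal(0.0, 1.0, 10))  # group A, drawn from the same distribution as B
b = list(rng.normal(0.0, 1.0, 10))  # group B

for _ in range(6):
    p = stats.ttest_ind(a, b).pvalue
    print(f"n per group = {len(a):3d}  p = {p:.3f}")
    a += list(rng.normal(0.0, 1.0, 20))  # recruit 20 more per group and re-test
    b += list(rng.normal(0.0, 1.0, 20))
```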

The number of instances of spin within a publication has no theoretical limit; any probability value above 0.05 could be talked up. However, while our previous audits of publications in three journals have occasionally found more than one example of spin within a single publication, such cases seemed rare.

A 2022 paper in the British Journal of Pharmacology, titled “Deferiprone attenuates neuropathology and improves outcome following traumatic brain injury,” has obliterated this impression. On at least 25 occasions, the authors overhype results linked to a probability value exceeding 0.05. Some of the offending explanations use phrases such as: “did not reach significance but showed a strong trend (p=0.075);” “a trending yet non-significant preservation of neurons was seen;” “no significant changes were seen in proBDNF despite an increased trend.”

In the publication, by Daglas and colleagues, many probability values between 0.05 and 0.10 were spun, but even values above 0.10 were considered “trendy.” These included values of 0.11, 0.14, 0.16, 0.17, 0.23 and 0.24. The authors have not responded to my request for comment.

As the 2024 Paris Olympics get underway, it is tempting to ask: Does the featured publication set a World Record for scientific spin? Comment with your entries, please.

What should be done about the prevalence of spinning probability values? This question is part of a bigger dilemma. All levels of the “industry” of science know the problems caused by perpetuating shonky science, but their attempts at regulation and improvement are fraught with difficulty and impeded by self-interest. Education about science publication and mandatory requirements before publication are potentially helpful steps. 

The messages from Rumpelstiltskin should be that spinning straw can lead to trouble, and science is not a fairytale.  

Simon Gandevia is deputy director of Neuroscience Research Australia.


65 thoughts on “Science and the significant trend towards spin and fairytales”

  1. Let’s be honest. The idea that a finding that’s 95% likely to be true rather than caused by variance is “gold” but a finding that’s 94% likely to be true is dross is a polite fiction. It’s useful to have a cutoff, but it’s also purely arbitrary. In a better world, more people would understand p values, and scientists would be free to mention their p value results but talk about their findings anyway without worrying about misleading people.

    Maybe we need a surgeon general warning on every media report and social media post about science to the effect of “Even if this study was performed flawlessly, there is a X% chance it is still wrong.”

    https://xkcd.com/882/

    1. A p-value of .05 doesn’t mean a finding is “95% likely to be true.” A p-value doesn’t tell you the percent chance of a finding being right or wrong.
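A toy simulation makes that distinction concrete. Every number in it is an assumption chosen for illustration (10% of tested hypotheses truly non-null, a standardized effect of 0.5, 20 subjects per group):

```python
# Rough simulation: "p < 0.05" is not the same as "95% likely to be true".
# All parameters below are assumptions chosen for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group, effect, base_rate = 5_000, 20, 0.5, 0.10

true_hits = false_hits = 0
for _ in range(n_tests):
    real = rng.random() < base_rate                      # is there a true effect this time?
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(effect if real else 0.0, 1.0, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        true_hits += real
        false_hits += not real

share_real = true_hits / (true_hits + false_hits)
print(f"share of p < 0.05 findings that reflect a real effect: {share_real:.2f}")
```

With these made-up inputs, well under half of the p < 0.05 findings reflect a real effect, which is exactly why a p-value cannot be read as the probability that a finding is true.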

      1. I dimly recall a movement(?) to make the standard more stringent than 5%. An article on addressing the inadequacies of this rule would be more interesting.

  2. MISLEADING and INCORRECT on so many levels. Obviously, the author of this essay (Simon Gandevia) doesn’t know the basics of P values and hypothesis testing. The very fact that he thinks a threshold that is originally set haphazardly and without any real reason behind it (i.e., 0.05) suddenly makes a result “gold” versus “not gold” shows that he is very unqualified to even write about this. Not to mention the irrelevant example of gold and goblin he provides as well as the irrelevant link he suggests between statistical significance and NOVELTY.

    1. Agreed. The author’s premise that the p < 0.05 is the ultimate test of "significance" or veracity is entirely unfounded.

      The p-value threshold is indeed arbitrary. There are also plenty of examples on the other side of this coin where results with p<0.05 are touted as "gold" when in reality, they are statistically significant but clinically meaningless duds.

      Effect sizes, confidence intervals, and nuanced clinical interpretation of these estimates are what we need. Not more dogmatic adherence to arbitrary p-value thresholds.

    2. YES, Thank you! You are 100% Correct (not statistically speaking of course). I’m about to write an email to this chap to point out his confusion with the interpretation of statistics, p-values, and their value overall.

      1. Awesome! Perhaps you can write an open letter to be published here and send the link to him.

    3. Agreed, as long as the p-values themselves are correctly reported, there really is no cause for complaint.

  3. It is pretty surprising to see such a ‘dodgy’ interpretation of data in a scientific paper – even more amazing that this spin passed muster with reviewers & the editor also…

    Australia has traditionally had a reputation of training excellent scientists. As an ex-pat it saddens me to see the interpretation of results in publications such as the one featured in this article finding their way into the mainstream literature. I think we need to go back to basics, not only in training scientists to interpret scientific data, but also in training reviewers & editors to think more critically about what is served up to them in a manuscript.

    All that glitters is not gold…

  4. “Some of the offending explanations use phrases such as: “did not reach significance but showed a strong trend (p=0.075);” “a trending yet non-significant preservation of neurons was seen;” “no significant changes were seen in proBDNF despite an increased trend.””

    These are not offending phrases to me. I think describing comparisons in the data that are close to, but not quite, 0.05 as “trending toward significance” is acceptable. It’s up to the reader to understand that this comparison does not meet the accepted standard of significance, and “trending” suggests this to the reader, IMO.

    What would be wrong is if they didn’t distinguish these data from p values less than 0.05, or if they committed fraud, which they apparently have not done.

    1. The phrase “trending toward significance” is absurd. A p-value of .075 is no more trending “toward” significance than trending “away” from significance.

      1. Yet a p-value of 0.075 does not have any more or less meaning than a p-value of 0.000001 if you’re operating at the level of the author.
        There is nothing in the article or the supportive comments except a desperate need to validate oneself.
        This is so, so bad, Retraction Watch.

    2. Therein lies a problem, semantics and grammar.

      Trending towards is in the present continuous tense. It should not be used in this context.

      Maths and science education and training will only get you so far if you cannot communicate them accurately using the art of language.

  5. P values are nice because they are simple, but behind that magic number are several assumptions. While I agree with the premise of the article, biological changes are too complex to be defined by a single number.
    A factor often ignored is the study power, and in small sample sizes a few extra replicates can make a huge difference to p values. And sometimes sample numbers are inadequate for valid reasons, yet the findings are instructive.
    Do they warrant a long discussion? That may be an entirely different matter.
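A quick sketch of that sensitivity, using invented numbers (six replicates per group to start, an assumed standardized effect of 0.8): adding a handful of replicates can move the p-value a long way in either direction.

```python
# Illustration with made-up numbers: in small samples, a few extra replicates
# can swing the p-value substantially, in either direction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
effect = 0.8                         # assumed true standardized effect
a = list(rng.normal(0.0, 1.0, 6))    # six replicates per group to start
b = list(rng.normal(effect, 1.0, 6))

for _ in range(5):
    print(f"n per group = {len(a)}  p = {stats.ttest_ind(a, b).pvalue:.3f}")
    a.append(rng.normal(0.0, 1.0))   # one more replicate in each group
    b.append(rng.normal(effect, 1.0))
```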

  6. All peer-review records should be published and open access, including rejected manuscripts. I’ve rejected papers for having no significant findings, only to see them later published in even higher impact journals. The shameful reviewers that perpetuate this should have their names permanently attached to their sins-against-science.

      1. I would presume that Penelope is mentioning how some authors perform false modifications of manuscripts to make them more acceptable. I have seen some of them in the wild too! There was an already completed case control study miraculously birthing more cases with results favoring the alternative hypothesis in the revised manuscript, just after my first decision as major revision due to insignificance of results and obvious confounding factors. Some authors truly are shameless about manipulating their data. And unifying-publicizing peer reviews would be the perfect way to prevent trickery by resubmitting or selecting another journal with the false manuscript containing fabricated data.

    1. Penelope: “I’ve rejected papers for having no significant findings, only to see them later published in even higher impact journals. The shameful reviewers that perpetuate this should have their names permanently attached to their sins-against-science.”

      WOW! In this scenario, YOU are the shameful reviewer not them. YOUR name should be permanently attached to your sins against science. What you did there is called PUBLICATION BIAS, which is very disruptive and indeed a sin against science. https://en.wikipedia.org/wiki/Publication_bias

      1. You misunderstand the Wikipedia article. Also, Wikipedia has never been a valid citation.
        You are suggesting that we litter the scientific literature with junk, because you don’t understand that replication failure is significant.

        1. Laurie Coombs: “You misunderstand the Wikipedia article.”

          Me:

          Have I misunderstood the Wikipedia article, which reads:

          “Positive-results bias, a type of publication bias, occurs when authors are more likely to submit, or editors are more likely to accept, positive results than negative or inconclusive results.[15]”

          or says:

          “Definition: Publication bias occurs when the publication of research results depends not just on the quality of the research but also on the hypothesis tested, and the significance and direction of effects detected.[10] The subject was first discussed in 1959 by statistician Theodore Sterling to refer to fields in which “successful” research is more likely to be published. As a result, “the literature of such a field consists in substantial part of false conclusions resulting from errors of the first kind in statistical tests of significance”.[11] In the worst case, false conclusions could canonize as being true if the publication rate of negative results is too low.[12] ”

          ———————————–

          Laurie Coombs: “Also, Wikipedia has never been a valid citation.”

          Me:

          OK, what about a JAMA paper?

          “Publication bias is the tendency on the parts of investigators, REVIEWERS, and editors to submit or accept manuscripts for publication based on the direction or strength of the study findings.”

          K. Dickersin (March 1990). “The existence of publication bias and risk factors for its occurrence”. JAMA. 263 (10): 1385–9. doi:10.1001/jama.263.10.1385. PMID 2406472. https://pubmed.ncbi.nlm.nih.gov/2406472/

          ——————————————-
          Or for example this one saying:

          “Publication bias is defined as the failure to publish the results of a study on the basis of the direction or strength of the study findings.[1] This may mean that only studies which have statistically significant positive results get published and the statistically insignificant or negative studies does not get published. Of the several reasons of this bias the important ones are rejection (by editors, REVIEWERS), lack of interest to revise, competing interests, lack of motivation to write in spite of conducting the study.[2]”

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6573059/

          ———————————–

          Laurie Coombs: “You are suggesting that we litter the scientific literature with junk, because you don’t understand that replication failure is significant.”

          Me:

          1. Where did I suggest that we litter the scientific literature with junk?! Please quote me implying that because I can’t find such a suggestion.

          2. Where did I or Penelope talk about REPLICATION?! Please quote us implying anything related to REPLICATION.

          Before embarrassing yourself with an idiotic comment, at least make sure (1) you carefully read the conversation and the Wikipedia article you are criticizing, (2) you know what publication bias is, and (3) overall, you know what you are talking about – e.g., you are not, for example, a 15-yr-old junior student or someone without any expertise.

  7. Poor study design, improperly applied statistical testing, and overgeneralization of results are to blame for the vast majority of scientific inflation. The distinction between results that yield p-values just below and just above 0.05 is a minor contributor. Perhaps the author is unaware that such results are usually not significantly different *from each other.*
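A worked example of that last point, using made-up summary statistics for two independent results with equal standard errors:

```python
# Made-up summary statistics for two independent results with equal standard
# errors: one "significant", one not, yet not significantly different from
# each other.
import math
from scipy import stats

z1, z2 = 2.00, 1.80                   # estimate / standard error for each study
p1 = 2 * stats.norm.sf(z1)            # ~0.046
p2 = 2 * stats.norm.sf(z2)            # ~0.072

z_diff = (z1 - z2) / math.sqrt(2)     # difference of the two estimates, equal SEs assumed
p_diff = 2 * stats.norm.sf(z_diff)    # ~0.89: no evidence the two results differ

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, p for the difference = {p_diff:.2f}")
```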

    1. Agreed.
      But the Church of the Alpha, the article author, and their acolytes in the other comments might never understand this.

  8. This is by far the most empty-headed, self-righteous insistence upon a frequentist alpha I’ve seen anywhere in a long time. I’m very disappointed to see it published here.
    The author’s most basic point might almost make some sense if he weren’t so painfully, arrogantly, dumbly wedded to an arbitrary alpha and to all the pats on the head he got in grad school for telling everybody his favorite number is 0.05. But he makes clear that his allegiance is to the number 0.05 alone, and that nothing else will matter unless and until we all join him in his Church of the Alpha. He has nothing to say about where he thinks the universal, unalterable, pure meaning that he sees in 0.05 comes from (if he considered this, he might have reconsidered what he was writing).
    I am going to use this article in my teaching as an example of what happens when people FORGET what a p value is.
    This is very, very bad, Retraction Watch.

    1. What Dr. Gandevia said is not even frequentist. It is simply under-educated and ill-informed.
      Even frequentists do agree that Alpha (even if we *ASSUME for a moment* that an arbitrary 0.05 alpha is a valid threshold) should not be treated as a dogmatic all-or-nothing thin cutoff line. This has nothing to do with overselling the article, let alone faking NOVELTY; it is simply the correct way to report statistical results.
      Dismissing a result as “garbage” just because it was slightly above 0.05 derails scientific progress by discarding potentially useful information that deserves to be re-tested later in larger and better studies.
      Not to mention again that the very 0.05 threshold is itself irrelevant and groundless. Perhaps Retraction Watch should *retract* this incorrect article.

    1. You shouldn’t reference your own paper for this statistic, since you provide no explanation or evidence for it beyond a citation. You should reference the article you cited for it.

  9. This should not have been published by Retraction Watch and is an example of the poor peer review and scientific miscommunication that Retraction Watch supposedly, amongst other things, stands against. I am seeing too many weak/inaccurate/shoddy “research pieces” (which read like ideological opinion pieces) being reported by Retraction Watch by people with a poor background in both research and science communication.

  10. It’s almost ironic that such a piece is published in 2024, after more than 10 years (at least) of hand-wringing about all the things that go wrong in science, of which NHST with its fetishized .05 cutoff is just one problem. After all the discussions that have happened in the last decade (e.g., false-positive publications, failures to replicate, inadequate powering of samples, gardens of forking paths, Bayesian approaches, open science and open data) — are we back to square one now?

    1. I don’t think it’s ironic. Retractionwatch seems to have been moving toward publishing uncontroversial, tepid content like this for some time now. I think it is remarkable in its banality. Academic research is in utter crisis. Pieces like this merely recycle age-old, once-contentious debate in the scientific community rather than risk provoking or offending.

  11. As a mathematician I often wonder how science got to this sorry state of affairs.
    The scientific method as it is used today has never been revised since the advent of general relativity and quantum mechanics.
    High energy physics, astronomy and cosmology, medical science, biology, social sciences and even neuroscience use statistical methods for determining the reliability of measured data and results, yet we are only now starting to understand the true nature, scope and magnitude of bias in the pursuit of science.
    The use of p-values is just another bias to add to the list used by the artificial intelligence research community.
    And because artificial intelligence is increasingly used in all fields of science, this use of p-values only magnifies the bias and inherently renders the results questionable.
    It’s not only chatbots that hallucinate, but a lot of modern day scientists as well.
    Fortunately parts of global academia are starting to discard or discourage the use of p-values, but the bias intrinsic to its use seems a hard habit to kick even faced with a growing body of literature showing its inadequacy.

  12. The post certainly is rather poorly written and certainly deserves some of the criticism it’s getting. But I think some of the comments are overly harsh and may be misinterpreting what the author is saying. Some commenters seem to suggest the author is endorsing the use of .05 as a SCIENTIFICALLY meaningful cutoff that separates important results from unimportant results. But I don’t think that’s what the author is saying. The author is correct that researchers are often under pressure to find statistical significance due to publication bias that favors statistically significant results. Thus, statistical significance is “gold” in the sense that it is highly sought after because it makes results more PUBLISHABLE (which of course doesn’t necessarily make the results scientifically INTERESTING or IMPORTANT). The author is also correct that the pressure to obtain novel and statistically significant results can lead to researchers making silly post-hoc statements in an attempt to rationalize unimpressive results that didn’t live up to a priori hopes (e.g., referring to a p-value of .29 as a “trend” in the hypothesized direction). That said, the author should have emphasized that passing the .05 cutoff, though often a necessary condition for considering an effect notable (like it or not, we have to draw the line somewhere when binary decision-making is required), is not a SUFFICIENT condition for considering an effect notable. Other statistics, such as confidence intervals, as well as non-statistical information (e.g., plausibility of mechanism of action) should be considered also. In short, while I agree with others that this post is not one of Retraction Watch’s best, lambasting the author as a fool who doesn’t understand basic frequentist statistics may be unfair.

    1. This would be a fair defense if the author didn’t spend most of the article focusing on p-values between 0.05 and 0.1, e.g. claiming that “the authors seem unaware of the fact that a probability value of, say, 0.07 is not even justifiable as a trend” and “for example, a probability of 0.06? Talk it up, spin out a story!” The article should have focused on the attempt to spin p-values above 0.1 or even 0.2 beyond the brief mention they currently get. The author clearly has an unfounded respect for 0.05, which betrays a flawed understanding of frequentist statistics.

        1. I don’t think it’s about a “respect for 0.05” per se. It’s about respect for whatever evidential criteria were designated a priori, either explicitly (preferably in a preregistered study protocol and analysis plan) or implicitly (due to standard conventions in the field). The purpose of null hypothesis significance testing (NHST) is to control the Type I error rate at no higher than the predesignated maximum level, thus screening out findings that don’t even meet the bare minimum standard of evidence. If researchers fudge the standard of evidence post-hoc out of convenience, so that their results can squeak by even when they didn’t meet the bare minimum standard, then that defeats the purpose of NHST. If you don’t think NHST testing is the right approach for your circumstance, that’s fine. But if you’re using NHST, you should use it correctly. The conventional .05 standard is fairly lenient already, since it allows statistical significance 1 out of every 20 tries when there’s no actual effect at all. So for most research contexts, I think it’s hard to justify claims that even p-values of .07 or .08—let alone p-values as high as .24 (which occur nearly 1 in 4 tries when there’s no effect)—should be considered as indicative of some kind of “trend.”
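The rates mentioned above are easy to reproduce in a toy simulation in which no true effect exists at all (the group size of 15 is an arbitrary choice):

```python
# Toy simulation with no true effect: how often p-values fall below various
# thresholds. Group size (15 per group) is an arbitrary choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, 15), rng.normal(0, 1, 15)).pvalue
    for _ in range(10_000)
])

for cut in (0.05, 0.10, 0.24):
    print(f"fraction of null experiments with p < {cut:.2f}: {(pvals < cut).mean():.3f}")
```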

        1. Andrew, you have a couple of errors here, the second one being very serious. Your 1st error is thinking that the hypothetical researcher is calling p = 0.06 “significant”, which is not the case. He is calling it something else, eg, trending towards significance, etc. So he doesn’t violate the rule set for the reserved keyword “significant”.

          P.S. I am not talking about p = 0.24, which is in no way even “trending” and is irrelevant here; I am talking about p-values between 0.05 and 0.1 or thereabouts, especially (ESPECIALLY) in a small sample and where the power is low. The researcher’s good practice serves 2 purposes: 1. He does agree that the result did not reach the level of significance by not calling it “significant”; 2. He still highlights it as potentially useful for future research.
          ——————————————-
          Your 2nd serious error is that you mistakenly think that the alpha should be followed strictly and dogmatically as an all-or-nothing rule, once it is set a priori. No, the alpha should be treated more like a band, a continuum, not a thin line, even in NHST framework.

          That is p-values should be interpreted and highlighted POST HOC depending on the situation.

          Even in NHST, one should definitely look at the EFFECT SIZE measures together with the p value. This again means that a p-value should be INTERPRETED post hoc.

          Even if a p is slightly below 0.05 (say, p = 0.04) but the sample is too large and the effect size is small, the researcher should properly interpret p = 0.04 as “significant but a possible type I (false positive) error”.

          Of course, p > alpha should not be named “significant” which is a keyword reserved for p =< alpha but it should definitely be named something like "likely to become significant in better circumstances".

          When a researcher sees a p = 0.06 in a pilot study with a small sample, he should definitely highlight it with whatever name except "statistically significant", because it is very likely that that variable might have become significant if the sample was slightly larger or if the confounding variables had been controlled better, etc.

          1. “MISLEADING and INCORRECT”: So your position is that it is appropriate to refer to a p-value between .05 and .1 as “trending toward significance” but not appropriate to refer to a p-value above .1 the same way? That is simply a rule you made up. And for someone who criticized “arbitrary” cutoffs elsewhere in the thread, I find that quite peculiar.

            You say I made an error in “thinking that the hypothetical researcher is calling p = 0.06 ‘significant.'” I made no such claim.

            You also stated the following: “Your 2nd serious error is that you mistakenly think that the alpha should be followed strictly and dogmatically as an all-or-nothing rule, once it is set a priori. No, the alpha should be treated more like a band, a continuum, not a thin line, even in NHST framework.” If you want to interpret the p-value in a “Fisherian” way, i.e., as a continuous representation of evidence against the null hypothesis, that’s fine. It’s certainly justifiable to say a p-value of .000004 is generally stronger evidence than a p-value of .04. But when we’re using an alpha level to make a decision—which is the context the post is referring to—we’re basically saying “beyond this point, the evidence doesn’t meet the bare minimum standard required to support a claim.”

            You also say that “even in NHST, one should definitely look at the EFFECT SIZE measures together with the p value. This again means that a p-value should be INTERPRETED post hoc.” You are absolutely correct that inferences should not generally be based on p-values alone. But to clarify, one should generally look at the CONFIDENCE INTERVAL, since the p-value and the point estimate of the effect size don’t provide a margin of error for the estimation. For example, when the p-value is nonsignificant, a confidence interval that is tight around zero suggests an effect that is either zero or too close to zero to merit further investigation, whereas a wide confidence interval is less conclusive and suggests that the effect may or may not look interesting in higher-powered investigations.

            Perhaps your most concerning claim is this: “Of course, p > alpha should not be named “significant” which is a keyword reserved for p =< alpha but it should definitely be named something like 'likely to become significant in better circumstances'." That view is exactly what the author was rightly (albeit clumsily) warning against with regard to the abused word "trending." A p-value between .05 and .1 does not inherently imply that with a larger sample significance will likely be obtained.

            You similarly claimed that when a p-value is .06 in a pilot study, "it is very likely that that variable might have become significant if the sample was slightly larger or if the confounding variables had been controlled better, etc." But that is not generally justifiable. Although it is possible that the effect will be significant in a larger follow-up study, it is also quite possible that the p-value will be even higher, since the p-value is uniformly distributed under the null hypothesis.

            Note also that pilot studies are a completely different context than addressed by the post, since the post is referring to studies for publication. When a lab conducts an exploratory pilot study with a small sample for internal purposes to determine which variables to explore in the future, a totally different type of inference is being made (in fact, formal NHST may often not be desired at all in such cases).
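A sketch of that replication point, with every parameter assumed for illustration (20 subjects per group, true standardized effects of 0 and 0.5): given a "marginal" pilot p-value between 0.05 and 0.10, how often does an identically sized follow-up study reach p < 0.05?

```python
# Sketch of the follow-up question above. All parameters are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def follow_up_rate(effect, n=20, reps=10_000):
    hits = marginal = 0
    for _ in range(reps):
        pilot = stats.ttest_ind(rng.normal(0, 1, n), rng.normal(effect, 1, n)).pvalue
        if 0.05 < pilot < 0.10:
            marginal += 1
            replica = stats.ttest_ind(rng.normal(0, 1, n), rng.normal(effect, 1, n)).pvalue
            hits += replica < 0.05
    return hits / marginal

print(f"no true effect:     {follow_up_rate(0.0):.2f}")  # close to 0.05
print(f"modest true effect: {follow_up_rate(0.5):.2f}")  # well short of certainty
```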

    2. I think you are sugarcoating Dr. Gandevia’s article for no reason other than being nice and kind, which is very good but has no place in science, especially in medicine. Science should be fair but firm. Such a huge mistake by Dr. Simon Gandevia is not easily forgivable, especially when he appears to be a legitimately knowledgeable and experienced professor and, more importantly, when his huge mistake appears on a site like Retraction Watch that is considered some sort of authority or point of reference for fighting scientific misconduct.
      I re-read Dr. Gandevia’s article, and found it completely against your glossed-over depiction of him. He is not ONLY saying that researchers do this out of their concern about publication bias (the way you try to paint him). Instead, he HIMSELF is repeatedly asserting that he DOES BELIEVE that all p-values smaller than 0.05 are gold and all those above 0.05 are trash (as he puts, Alpha is an “ABSOLUTE threshold”). Now **THIS** is where I for one call him an under-educated ill-informed researcher who is not even qualified to talk about P values, let alone lecturing others about them as if he really knows something. (the Dunning-Kruger effect?)
      ——————————————–
      I didn’t try to make his article look bad. It was terrible in the first place. To quote his exact sentences, which 100% confirm my HARSH view at several points and reject your view at several points, he says the following.
      He writes (all caps and emphases by me): “Here are more than 500 examples of pretzel logic researchers have used to make claims of significance despite p values higher than .05. It would be COMICAL if not for the SERIOUS OBFUSCATION OF SCIENCE which the stories cause.
      In recent years, the practice of claiming importance and TRUE significance for such results has been termed “spin.” More formally, WE call it “reporting that could distort the interpretation of results and MISLEAD readers.””
      Me: Quite the opposite: Failing to report a p-value like 0.06 is the misleading practice, not the other way around. When someone says 0.06 is BORDERLINE significant, he does not mean it as significant. Yet he properly highlights it as being worth future research. On the contrary, if someone simply dismisses p = 0.06 as garbage, now THIS is misleading.
      ———————————-
      He continues to confirm that he too believes in this erroneous view: “WE found when probability values were presented in the results section of the publication, but were not quite statistically significant (greater than 0.05 but less than 0.10), authors talked up the findings and spun out a story in about 55%-65% of publications. Often, they wrote results “trended” to significance. Thus, results of STRAW can become results of GOLD! Attractive to the researchers, editors, publishing houses and universities. ”
      Me: (1) He doesn’t know that straw does not SUDDENLY flip to gold or vice versa, but that there is a continuum. (2) He doesn’t know that they are NOT results of straw, to begin with. He simply doesn’t know. That is being ill-informed. (3) Finally, he doesn’t know that there is no real reason behind using 0.05 to separate gold from straw!
      ———————————————
      Again he 100% rejects your varnished view of him, when talking about HIS very OWN view that:
      “Putting spin on insignificant probability values is an EGREGIOUS and shonky – that’s DUBIOUS, for our friends outside of Australia – scientific practice. It shows the authors’ failure to appreciate the requirement of an **ABSOLUTE THRESHOLD** for claiming the presence (or not) of an effect, or for supporting (or not) a hypothesis. It reveals an entrenched and incorrigible capacity for bias. Furthermore, the authors seem unaware of the fact that a probability value of, say, 0.07 is not even justifiable as a trend: The addition of further samples or participants does not inexorably move the probability value below the 0.05 threshold.
      The number of instances of spin within a publication has no theoretical limit; any probability value above 0.05 could be talked up. However, while our previous audits of publications in three journals have occasionally found more than one example of spin within a single publication, such cases seemed rare.”
      ———————————–
      Bottom line, he indeed is not qualified at all to talk about p values, let alone research them or lecture about them.


  13. I think the core problem the author brings up is an important one: there is pressure to spin every paper into gold, to oversell the results even if that means shifting the goalposts. Setting an alpha of 5% and then entertaining p>.05 may have seemed like the most blindingly obvious form of such shifting goalposts. However, as many of the other comments point out, having such rigid and arbitrary goals in the first place is also problematic.
    When addressing the dangers of spin, trending p-values are at the bottom of the list of problems. Authors may be overselling their results, but readers will not take them seriously; after all, their results were only trending. The real danger comes when authors celebrate p < .05 despite not having sufficient power for the effect sizes they have, or obfuscate that this result was one of 20 tests conducted. These are the results that make it into high impact journals, get described in the abstract as "statistically significant", and effectively mislead science.
    If anything, trending p-values offer the opportunity for healthy skepticism, which we could use more, not less of, in science.
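The "one of 20 tests" scenario above is easy to quantify: if all 20 null hypotheses are true and the tests are independent, the chance of at least one p < 0.05 is already about 64%.

```python
# Chance of at least one p < 0.05 among 20 independent tests of true null
# hypotheses (the multiple-testing scenario mentioned above).
p_family = 1 - 0.95 ** 20
print(f"{p_family:.2f}")  # about 0.64
```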

    1. Sophia, I agree with most of your comment. Just about this part, where you said: “I think the core problem the author brings up is an important one: there is pressure to spin every paper into gold, to oversell the results even if that means shifting the goalposts. Setting an alpha of 5% and then entertaining p>.05 may have seemed like the most blindingly obvious form of such shifting goalposts.”

      Your “shifting the goalposts” analogy is not correct because the researcher never calls his result “significant”, in the first place. He calls his result “trending to significance” etc., which is never the same as “significant”.

      Calling a p value = 0.06 “trending towards significance” is NOT “moving the goalposts” or even overselling. The more accurate analogy is saying “Oh I missed it by an inch”.

      This never means that “Oh I scored a goal (shifts the goalpost secretly!)”. It simply means that “if I were lucky enough, I might have scored a goal, instead”.
      ———————————————-
      Not to mention that in statistics, there is NO goalpost. The alpha should not be treated as an ABSOLUTE threshold the way Simon Gandevia says.

      Accordingly, people should actually take p = 0.06 seriously, because they are important and likely false negative results. They suggest “room for improvement” in future studies.

  14. I agree. The whole significant/non-significant binary is frustrating. I would say, let’s stop using this language. Report the results as matters of fact with the p value stated and let the reader decide how much confidence they are going to place in that result. In the world of pre-clinical and basic research, most people are not hanging their careers on the results of one paper anyways. Personally, I generally hold novel publications as “operational knowledge” until they are ultimately validated BY USE in future studies (or not). Treating any one paper or single experiment within a paper as Truth is not wise, regardless of the p-value.

  15. No, an arbitrary threshold is _not_ necessary, and it conveys much less information than the actual numerical p-value or other metric used.

    What can’t be accurately expressed by a number is the assessment of
    (i) the statistical procedure applied,
    (ii) the experimental procedure and measurements that generated the data,
    (iii) the state of theoretical knowledge surrounding the experiment.

  16. There’s a huge amount of serious research in statistics and the philosophy of statistics dealing with the problems with p-values, how setting any kind of threshold is ultimately subjective – and alternatives on how to be open and honest about your results and how you communicate them in face of that.

    I’d highly recommend taking a good look at that when talking about such things. Oversimplification often does as much harm as ‘hunting for publishable results’. Possibly even more.

  17. @Andrew Coggan, many thanks for your good and elaborate response. I will try to break down your points into a couple of comments.
    ——————- 1 ——————–

    Andrew: “So your position is that it is appropriate to refer to a p-value between .05 and .1 as “trending toward significance” but not appropriate to refer to a p-value above .1 the same way? That is simply a rule you made up. And for someone who criticized “arbitrary” cutoffs elsewhere in the thread, I find that quite peculiar.”

    Me:

    Well actually I call it “MARGINALLY significant” or “BORDERLINE significant”. But this “trending toward significance” thing is fine too.

    Regarding “between .05 and .1”, if you re-read my comments here, you’ll see my emphasis on the word “about” and on the phrase “especially with a small sample or low power”. Of course, if the sample is too small, even a p greater than 0.1 can be considered worthwhile and a potential case of a false negative error; hence, my stress on “about” and “depending on the situation”.

    Then about it being made up. Of course, it is made up. Any threshold is made up. But at least, my version is a made-up “continuum” and not an absolute exact “goal post”. Not to mention that I did say that p values are to be INTERPRETED post hoc along with other stats parameters.

    ——————- 2 ——————–

    Andrew: And for someone who criticized “arbitrary” cutoffs elsewhere in the thread, I find that quite peculiar.

    Me:

    Why do you find it peculiar? As a matter of fact,

    1. I never talked about any strict cut-off, in the first place, in any of my comments. Please carefully re-read my words; you will always see the word “about”, which blurs the cut-off.

    2. Also you should see my 4 or 5 comments about the cut-off thing not being absolute but a continuum.

    3. Furthermore, you should see my emphasis on the fact that we ASSUME 0.05 as the threshold. So I do know that it is something assumed, or made up if you will.

    So despite all the above clues, why did you find my comment quite peculiar? Perhaps because you didn’t carefully read the comments, in which case, you should find your own level of attention peculiar.

    ——————– 3 ———————–

    Andrew: You say I made an error in “thinking that the hypothetical researcher is calling p = 0.06 ‘significant.’” I made no such claim.

    Me:

    I think you did when writing this:
    Andrew: “If researchers fudge the standard of evidence post-hoc out of convenience, so that their results can squeak by even when they didn’t meet the bare minimum standard, then that defeats the purpose of NHST. If you don’t think NHST testing is the right approach for your circumstance, that’s fine. But if you’re using NHST, you should use it correctly.”

    I think that by “fudge the standard of evidence post-hoc” or “that defeats the purpose of NHST” you meant something like shifting the goalposts post hoc. Otherwise, you would agree that those researchers do NOT fudge the standard of evidence and what they do does NOT defeat the purpose of NHST because they do NOT call their results “significant”. The standard of evidence (i.e., the alpha) is NOT changed at all. Those researchers are just suggesting that their p = 0.06 might be a false negative error (hence, trending toward significance).

    If I am mistaken, please correct me. Also apologies if hypothetically, I were mistaken.

  18. @ Andrew Coggan
    ——————- 4 ————————

    Andrew: If you want to interpret the p-value in a “Fisherian” way, i.e., as a continuous representation of evidence against the null hypothesis, that’s fine. It’s certainly justifiable to say a p-value of .000004 is generally stronger evidence than a p-value of .04. But when we’re using an alpha level to make a decision—which is the context the post is referring to—we’re basically saying “beyond this point, the evidence doesn’t meet the bare minimum standard required to support a claim.”

    Me:

    Agreed on the first part. The last part is AGAIN implying that people who call their p = 0.06 as “trending toward significance” are calling it “significant”. No, they aren’t. They too know that their result is NOT significant.

    But their result CAN be EVIDENCE. Though NOT significant, it is worthwhile because it might be actually a false negative result, deserving future investigation.

    Look at the next post for much more useful info.

    ——————- 5 ————————

    Andrew: “You also say that “even in NHST, one should definitely look at the EFFECT SIZE measures together with the p value. This again means that a p-value should be INTERPRETED post hoc.” You are absolutely correct that inferences should not generally be based on p-values alone. But to clarify, one should generally look at the CONFIDENCE INTERVAL, since the p-value and the point estimate of the effect size don’t provide a margin of error for the estimation. For example, when the p-value is nonsignificant, a confidence interval that is tight around zero suggests an effect that is either zero or too close to zero to merit further investigation, whereas a wide confidence interval is less conclusive and suggests that the effect may or may not look interesting in higher-powered investigations.”

    Regarding your note on CI:

    1. Wow! You are “teaching” me CI. Man, this is a given; it’s stats 101. I already DID TALK about CI in my comments, without mentioning its name. Whenever someone talks about a continuum over alpha (instead of an absolute threshold), he is automatically referring to *UNCERTAINTY*. And the most popular parameter of UNCERTAINTY is CONFIDENCE INTERVAL. If you double-check my comments here, you would see that I talked about a band or a continuum over alpha, perhaps 10 times.

    Now let me tell you (teach you?) an important point that is missed by you and Simon Gandevia:

    2. What you said about CI is exactly true. Now allow me to use it to lecture you about something closely related: How and Why p-values should not be treated as thin absolute thresholds, but as bands or continuums.

    If we assume an effect size or a statistic that can have both negative and positive values (like a simple Mean Difference):

    When a CI is heavily asymmetrical around zero, with most of the CI being on one side of zero and a tiny part of CI on the other side, this form of CI mirrors and parallels a MARGINALLY significant p value, e.g., p = 0.060 or p = 0.070.

    These are schematic examples of BORDERLINE significant p-values; for example, p = 0.073 or 0.067, whatever. In the “shapes” below, 0 marks zero and | marks a CI bound:

    Lower CI bound |-0——————————————| Upper CI bound

    Lower CI bound |————– 0| Upper CI bound

    Lower CI bound |——————————– 0-| Upper CI bound

    The above examples are very likely false negative errors (not 100% sure but very likely still). Therefore, instead of incorrectly throwing them out as garbage (the way you and Simon Gandevia do), I do report them. They should be definitely highlighted as MARGINALLY SIGNIFICANT, warranting future research with better methods and larger samples.

  19. @Andrew Coggan:
    ——————- 6 ————————
    Compare the two examples below. The lines are 95% CI ranges, and zero is 0 obviously.

    |———————————-|0

    In the above example, zero is outside the 95% CI but very close to the CI bound. The above 95% CI parallels p ~ 0.045 or so.

    |———————————0|

    In the above example, zero is inside the 95% CI but very close to the CI bound. The above 95% CI parallels p ~ 0.055 or so.

    So what makes you and Simon Gandevia think that the first scenario (p ~ 0.045) is GOLD while the second one (p ~ 0.055) is WORTHLESS STRAW?

    The only imaginable answer can be “Wow! I didn’t know that!” Otherwise, it is impossible for someone to understand the above CI examples and yet consider p > 0.05 GARBAGE while at the same time considering p ≤ 0.05 GOLD.
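The correspondence sketched in these diagrams can be put in numbers. Assuming a normally distributed estimate with the standard error fixed at 1 for illustration, p-values just under and just over 0.05 give 95% confidence intervals that are nearly identical, one barely excluding zero and the other barely including it:

```python
# Numerical version of the two pictures above (standard error fixed at 1 for
# illustration): p ~ 0.045 and p ~ 0.055 correspond to nearly identical 95% CIs.
from scipy import stats

se = 1.0
for p_two_sided in (0.045, 0.055):
    z = stats.norm.isf(p_two_sided / 2)        # |estimate| / SE that yields this p-value
    estimate = z * se
    lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
    # The first interval barely excludes zero; the second barely includes it.
    print(f"p = {p_two_sided:.3f}  estimate = {estimate:.2f}  95% CI = ({lo:+.2f}, {hi:+.2f})")
```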

  20. Andrew Coggan:
    ——————- 7 ————————

    Andrew: Perhaps your most concerning claim is this: “Of course, p > alpha should not be named “significant” which is a keyword reserved for p =< alpha but it should definitely be named something like 'likely to become significant in better circumstances'." That view is exactly what the author was rightly (albeit clumsily) warning against with regard to the abused word "trending." A p-value between .05 and .1 does not inherently imply that with a larger sample significance will likely be obtained.

    Andrew: You similarly claimed that when a p-value is .06 in a pilot study, "it is very likely that that variable might have become significant if the sample was slightly larger or if the confounding variables had been controlled better, etc." But that is not generally justifiable. Although it is possible that the effect will be significant in a larger follow-up study, it is also quite possible that the p-value will be even higher, since the p-value is uniformly distributed under the null hypothesis.

    Me:

    Sorry but what you said above is very wrong on so many levels. Sadly I don’t have much more time left to elaborate. And I think I made myself clear. Sadly you don’t pay enough attention and miss details.

    I did not say that it DOES IMPLY that with a larger sample, it will necessarily become significant. I said, repeatedly, that it suggests that it MIGHT become significant with a better methodology (larger samples, etc). In the English language, MIGHT means “a possibility”, even a “weak possibility”. MIGHT does not mean “necessity”, nor “DOES”, “can”, “may”, “would”.

    I repeatedly used the keyword MIGHT in my various comments. MIGHT means “with larger samples, a borderline significant variable MIGHT or MIGHT NOT turn significant”.

    Unfortunately, I have to correct you about not just stats but also about your misunderstandings of your own native language (English), which is not my first or second tongue!

    Perhaps, we are lost in translation.

    ——————- 8 ————————

    Andrew: Note also that pilot studies are a completely different context than addressed by the post, since the post is referring to STUDIES FOR PUBLICATION. When a lab conducts an exploratory pilot study with a small sample for internal purposes to determine which variables to explore in the future, a totally different type of inference is being made (in fact, formal NHST may often not be desired at all in such cases).

    Me:

    Wow! Pilot studies are not only those exploratory ones that are conducted before a study in order to configure the study parameters, sample size, etc.

    The name “pilot study” is ALSO used for studies that are the first ever, and/or studies that lack sample size calculations. Pilot studies, too, are studies for publication! I myself have published tens of pilot studies. And they are not the kind you think.

    If you search Pubmed for Pilot Study, you will find 125,000 papers!

    https://pubmed.ncbi.nlm.nih.gov/?term=%22pilot+study%22

  21. The START of the solution to this issue is to STOP using the term “significance” altogether. Please see Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”, The American Statistician, DOI: 10.1080/00031305.2019.1583913

    1. Excellent read, thank you. Two points:

      (1) This paper completely backs up our comments (looking at Andrew Coggan). Simon Gandevia should read it. It says:

      “Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. … … therefore, whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight.”

      Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”, The American Statistician, DOI: 10.1080/00031305.2019.1583913

      ——————————————

      (2)

      Yet my critique of the above-cited paper is that it completely rejects the idea of ANY threshold or ANY categorization at all; though I completely understand what they say, I know that it is not practical at all for us humans in most cases. Perhaps future AI can operate on a basis of continuous-looking spectrums of fuzzy logic, but we humans can't. Not to mention that even AI too defuzzifies!

      But for us humans, at the end of the day, we need to ASSUME some thresholds for the sake of practicality and simplicity. It would be better if there are more categories than 2 (more flexibility) and the thresholds are blurred and not absolute. But we still need categories to escape complexity. We need categories for the sake of simplicity and practicality.

      For example, a doctor can't tell his patient that you have 63% cancer and 37% non-cancer, then the malignancy of that possible 63% cancer is 82%. The doctor needs to categorize the diagnosis and malignancy of cancer into cancer stages and grades so that he doesn't get confused by all the complexity, and be able to treat the patient according to flow charts and guidelines. Any other (human) decision making as well relies on flow charts and categorization, which allows simplicity and practicality.

      1. After reading an article that opposes any reference to statistical “significance” when reporting p-values, the commenter then concludes that the article “completely backs” up the practice of referring to p-values as “marginally significant” or “trending toward significance.”

        1. Andrew, you have not even read most of my comments, that article’s text, or at least the excerpt I quoted from it!

      2. I think Wasserstein et al. were referring to decision thresholds around a “point” null hypothesis, but your 2nd comment about that article is important. Concerning your 2nd comment, Blume is refining a second-generation p-value that relies on an interval null instead of a point null. Researchers may find this useful in addressing the “bright light” or significance-nonsignificance problem that Wasserstein et al. mention.
        Blume JD, et al. (2018). Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLoS ONE 13(3). Link: journals.plos.org/plosone/article?id=10.1371/journal.pone.0188299
        Stewart TG, & Blume JD (2019). Second-Generation p-Values, Shrinkage, and Regularized Models. Front. Ecol. Evol. 7:486. doi: 10.3389/fevo.2019.00486
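
        To make the interval-null idea concrete, here is a minimal sketch of the second-generation p-value as I understand it from Blume et al. (2018): report how much of the interval estimate (e.g., a 95% CI) overlaps an interval of effects deemed practically negligible. The function name and the example numbers are my own illustration, not code from the papers.

        ```python
        def second_gen_p(ci_lo, ci_hi, null_lo, null_hi):
            """Proportion of the interval estimate (e.g. a 95% CI) that overlaps the
            interval null: 0 = incompatible with negligible effects, 1 = compatible
            only with negligible effects, values near 0.5 = inconclusive."""
            ci_len = ci_hi - ci_lo
            null_len = null_hi - null_lo
            overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
            # Blume et al. (2018) add a correction when the interval estimate is more
            # than twice as wide as the null interval; otherwise it is plain overlap.
            return (overlap / ci_len) * max(ci_len / (2.0 * null_len), 1.0)

        # Example (made-up numbers): a 95% CI of (0.01, 0.35) for a mean difference,
        # with effects inside (-0.10, 0.10) treated as practically negligible.
        print(second_gen_p(0.01, 0.35, -0.10, 0.10))  # ~0.26: mostly outside the null zone
        ```

        Unlike a point-null p-value, a value of 0 or 1 here is a statement about a whole band of negligible effects, which is why it speaks to the “bright line” problem discussed above.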

        1. Thanks for this awesome “interval null” to be used instead of a point null. In my previous comments here (criticizing Simon Gandevia and Andrew Coggan), I repeatedly talked about an area / spectrum / continuum / band of uncertainty around alpha, instead of a thin, absolute alpha line or point.
          I didn’t know the concept of an interval null was so fresh, published only recently. I thought it had been introduced eons ago.
          Of course, a very similar approach was introduced decades ago: the “marginally significant INTERVAL” already serves much the same purpose as the interval null you cited.

  22. What a hot topic!
    So sorry for a naive comment – isn’t the actual hypothesis testing with the p-value a binary outcome? In a good experiment you set the power, calculate the number of data points needed and decide at which p-value the results are “significant”. If the analysis of the data computes that the p-value is larger than chosen – the results are simply “not significant”. The p-value may be incremental but the hypothesis testing is still binary – or what have I been missing [probably not grammatically correct] during my three decades of evaluating scientific endeavours?
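
    In code, the workflow I have in mind looks roughly like this (a toy sketch with made-up numbers, not anything from the blog post); however close the p-value lands to the threshold, the decision at the end is binary:

    ```python
    # Plan the study up front, then make a yes/no call at the pre-chosen alpha.
    import numpy as np
    from scipy import stats

    alpha, power, d = 0.05, 0.80, 0.5            # two-sided alpha, target power, expected effect size
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    n_per_group = int(np.ceil(2 * ((z_a + z_b) / d) ** 2))   # normal-approximation sample size (~63 here)

    rng = np.random.default_rng(1)
    group_a = rng.normal(0.0, 1.0, n_per_group)
    group_b = rng.normal(d, 1.0, n_per_group)    # simulate a true effect of size d
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    # However close p lands to alpha, the pre-specified decision is binary:
    decision = "significant" if p_value < alpha else "not significant"
    print(f"n per group = {n_per_group}, p = {p_value:.3f}, decision: {decision}")
    ```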

    1. I do agree – provided the sample size is calculated properly (in many studies it is not) and the sample remains as large as needed (in many studies there is too much attrition or missing data). If everything is ideal, then, as you said, “if the analysis of the data computes that the p-value is larger than chosen – the results are simply “not significant””.

      No one disputes your statement when they report a marginally significant result. A marginally significant p-value is NOT the same as a “significant” one; that is, it is still “NOT significant”. So it doesn’t violate the NHST standard that you have been practicing for the last 30 years.

      But at the same time, by flagging marginal significance, the researcher highlights the possibility, or even probability, that a result (e.g., p = 0.062) is a FALSE NEGATIVE error, and therefore highly deserving of future study.

      This is much better than simply dismissing p = 0.062 altogether as garbage. Dismissing a result just because it is slightly above the significance level is misleading, because it ignores the high likelihood of a FALSE NEGATIVE error. (A rough simulation of this point is sketched at the end of this comment.)

      The below example paper shows that the “marginally significant” is becoming more and more popular. It was NOT as popular back then 30 years ago when you finished your university education and started your practice. But nowadays, it is quite popular and it is for a reason:

      Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades https://pubmed.ncbi.nlm.nih.gov/27207874/

      Also please see comments by J. Reed here.
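
      Here is the rough simulation of that false-negative point (my own toy numbers, not from the cited paper): with a real effect and only modest power, p-values that just miss 0.05 are common, so a near miss is weak grounds for declaring “no effect”.

      ```python
      # Many simulated two-group studies with a REAL effect but only modest power:
      # how often does p miss 0.05 entirely, and how often does it "just" miss?
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n, d, sims = 30, 0.5, 20_000                 # 30 per group, true effect size 0.5
      p = np.array([stats.ttest_ind(rng.normal(0.0, 1.0, n),
                                    rng.normal(d, 1.0, n)).pvalue
                    for _ in range(sims)])

      print("power (p < 0.05):             ", np.mean(p < 0.05))                   # roughly 0.48
      print("false negatives (p >= 0.05):  ", np.mean(p >= 0.05))                  # roughly 0.52
      print("near misses (0.05 <= p < 0.10):", np.mean((p >= 0.05) & (p < 0.10)))  # roughly 0.13
      ```

      None of this says p = 0.062 demonstrates an effect; it only says that, at this kind of power, such a value is exactly what a real effect often produces.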

      1. Thanks for the explanation! I still agree with the blog post, though, as I do not see the same interpretation among authors, colleagues or students as you do. I see the interpretation described in the blog post.
        However, I do agree that (and I never said non-significant results should simply be dismissed!) a non-significant p-value should automatically make you look again at all the input data and the methodology of the experiment at hand, to evaluate a possible false negative. As my field is medicine and biology, this usually leads to tweaking and further studies – sometimes giving statistical significance and sometimes not.

        1. Pharmacist: “Thanks for the explanation! I still agree with the blog post, though, as I do not see the same interpretation among authors, colleagues or students as you do. I see the interpretation described in the blog post.”

          Me:

          You’re welcome. I don’t know what you mean by “interpretation”. Regardless, you actually seem to DIS-agree with the blog post, as your next paragraph shows:

          ——————————————

          Pharmacist: “However, I do agree that (and I never said non-significant results should simply be dismissed!) a non-significant p-value should automatically make you look again at all the input data and the methodology of the experiment at hand, to evaluate a possible false negative. As my field is medicine and biology, this usually leads to tweaking and further studies – sometimes giving statistical significance and sometimes not.”

          Me:

          Nice! Well, the blog post did say the opposite of your view: that non-significant results should be dismissed as garbage, or, as Dr. Simon Gandevia put it, straw. So if you do not believe in this dogmatism, then you do NOT agree with the blog post. Right? Am I missing something?

          1. Again, we seem to interpret things differently. I agree with the blog post that wordings such as “marginally significant” and “trending towards significance” are spin and should not be used, as my experience is that they are overinterpreted as positive support. Results that are non-significant but deemed worthy of publication should be published as such. I do not read the blog post as saying that non-significance disqualifies such results from having other potential merit, such as [revision in] hypothesis generation.

      2. The fact that authors are touting their “marginally significant” effects with increased frequency doesn’t mean it’s a good idea. Once again, the “MISLEADING and INCORRECT” commenter appears to misinterpret an article (https://pubmed.ncbi.nlm.nih.gov/27207874/) as supporting his or her position, when the article in fact does the opposite (which is ironic given the commenter’s propensity for accusing others of failing to read or comprehend things).

        Indeed, the article cited by the commenter says “we suspect that any attempt to articulate prescriptions for the use of marginal significance will reveal that this practice is rooted in serious statistical misconceptions” and “the concept of marginal significance is dubious.” The article goes on to conclude that “researchers’ increased willingness to describe marginally significant effects as evidence for hypotheses owes to a tacit relaxation of the criterion employed to control the Type I error rate, which may lead to an increased prevalence of findings that provide weak evidence, at best, against the null hypothesis.”

        Note also this passage from the cited article, which I think articulates the point better than Gandevia did: “The use of marginal significance is not just a violation of statistical orthodoxy. Researchers often claim that near-threshold p values are approaching significance, apparently assuming that the p value associated with their statistical test will trend toward zero as data are collected. However, this will be true only if the population effect is nonzero. Thus, this reasoning is circular: Inferring that an effect exists on the basis of a p value approaching significance presumes that the effect exists, which is, of course, the very question at issue. Yet even experienced psychologists are liable to make this mistake.” That is in direct opposition to the commenter’s position that nonsignificant p-values “should definitely be named something like ‘likely to become significant in better circumstances’.” (A quick simulation at the end of this comment illustrates the circularity.)

        Moreover, it is difficult to find any logical consistency in the proposition that “statistical significance” is meaningless yet “marginal statistical significance” is important. Indeed, if we take the position that alpha levels are completely meaningless thresholds that shouldn’t be used, then it is nonsensical to also insist on celebrating so-called marginally significant p-values merely because they are near that meaningless threshold.
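
        For anyone who wants to see that circularity concretely, here is a quick simulation (an illustration with arbitrary numbers, not code from the cited article): when the null is true, collecting more data after a “near miss” does not pull the p-value toward zero, it simply wanders; only when a real effect exists does p reliably shrink as the sample grows.

        ```python
        # Track the p-value of a one-sample t-test as the sample accumulates,
        # once under a true null (effect = 0) and once under a real effect.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)

        def p_path(true_effect, checkpoints=range(20, 401, 20)):
            data = rng.normal(true_effect, 1.0, max(checkpoints))
            return [round(stats.ttest_1samp(data[:n], 0.0).pvalue, 2) for n in checkpoints]

        print("Null true (effect = 0.0):", p_path(0.0))  # no systematic drift toward zero
        print("Real effect (d = 0.2):   ", p_path(0.2))  # p tends to shrink as n grows
        ```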

        1. Andrew Coggan, thanks for your comment. I see you have again incorrectly accused me of something without even reading my comments. Here are my responses to your 4 paragraphs:

          ————————- 1 ——————————

          Andrew’s 1st paragraph: “Once again, the “MISLEADING and INCORRECT” commenter appears to misinterpret an article (https://pubmed.ncbi.nlm.nih.gov/27207874/) as supporting his or her position, when the article in fact does the opposite (which is ironic given the commenter’s propensity for accusing others of failing to read or comprehend things).”

          Me:

          Once again? When were the other times? Regarding your claim that I misinterpreted this article: no, Andrew, your claim is INCORRECT again. Please see my next point for details.

          —————— 2 ————————

          Regarding Andrew’s 2nd paragraph:

          Me:

          1. Well, actually, the lines you quoted are not in that article’s CONCLUSIONS; they are in its introduction. But I do agree with you that this article is indeed AGAINST the notion of marginally significant results.

          2. Yet you fail to see the point: I did not misinterpret it, and that article does NOT contradict what I cited it for. I cited it for the increased popularity of the marginal-significance concept, which is an accurate reading:

          Its Discussion reads “We observed a large increase in the proportion of articles describing p values as marginally significant over the last four decades”.

          Its Conclusions section reads: “It appears that statistical standards have indeed changed, echoing calls for the critical evaluation of the statistical practices used in psychological science.”

          That article’s results and conclusions are indeed IN LINE with my claim to “Pharmacist in Exile”. My claim was: “The below example paper shows that the “marginally significant” is becoming more and more popular. It was NOT as popular back then 30 years ago when you finished your university education and started your practice. But nowadays, it is quite popular and it is for a reason”

          So, you are incorrect again in accusing me of misinterpretation. It is actually you who misinterpreted me.

          ————————- 3 ——————————

          Regarding your 3rd paragraph.

          Me:

          Exactly. Well, again, you and Simon Gandevia and this article are all assuming, incorrectly, that everyone who reports marginally significant results believes that increasing the sample size will necessarily make the result significant.

          Well, some of us may think so, which is indeed incorrect. But I don’t assume so. I told you this before, but apparently you have not even read my comment.

          On this very page, please search for the word “might” and you will find my and other commenters’ previous comments that used that keyword (if the sample size increases, the results MIGHT turn significant).

          Then you will also find my response to your previous similar criticism, where I said: “In the English language, MIGHT means “a possibility”, even a “weak possibility”. MIGHT does not mean “necessity”, nor “DOES”, “can”, “may”, “would”.”

          ————————— 4 —————————–

          Regarding your 4th paragraph.

          Me:

          Well, again, you have NOT read my previous response to your similar position. In my previous comments, I did say that alpha can be used as an assumption, but as a loose and blurred one, not a dogmatic, strict, hard, thin line.

          I also said that, in the same sense, a blurred and loose area of marginal significance is good.

  23. God that is so cringe to be like “It wasn’t significant but it was like going towards it”

    it’s so embarrassing and desperate and ugh I can’t imagine typing that and then submitting it and not just like crying into my cereal at 2 in the morning.
