Weekend reads: Arguments for abandoning “statistically significant,” boorish behavior, and useless clinical trials

The week at Retraction Watch featured developments in the retraction of a paper claiming the dangers of GMOs, and claims of censorship by a Nature journal. Here’s what was happening elsewhere:

14 thoughts on “Weekend reads: Arguments for abandoning “statistically significant,” boorish behavior, and useless clinical trials”

  1. Re “despite the BMJ‘s strong data sharing policy, sharing rates are low,” is there a link to the article?

  2. “Predatory Conferences Undermine Science And Scam Academics,” say Madhukar Pai and Eduardo Franco. (The Huffington Post)

    From the original post:

    we hope to raise awareness about the growing menace of bogus conferences, organized by predatory publishers as well as specialized conference groups such as BIT Congress Inc, Conference Series Ltd, Event Series (both owned by OMICS International), PSC Conference and many others.

    The authors seem to mean PCS Conference (Pioneer Century Science), who’ve been turning up in my spamtray a lot lately. At their website the scammers go by a different alias (Global Century Science Group), which helps to build up trust in them, as does their general appearance of being primarily a travel agency.

    Anyway, the coordinator / signatory only goes by a first name and an initial, in the manner of Victorian pornographic novels:

    Ms. Maria E. (Coordinator)

    and she is evidently unable to coordinate her spam templates, as the message alternates between inviting me to a “Mental Health Forum”, a “Health Care Conference”, and an “International Conference of Neuroscience” (i.e. they have booked one room which they are advertising under three different labels).

  3. To be clear, Rickettsia helvetica is NOT a potential cause of Lyme disease, and Willy Burgdorfer rejected it as the possible Lyme agent soon after discovering and testing Borrelia burgdorferi. R. helvetica also has never been found in North America and does not infect the blacklegged tick, responsible for Lyme disease transmission, under experimental conditions.

  4. “The Problem with P-values:

    The arbitrary cut-off currently used is inadequate, argues David Colquhoun.”

    Oh dear, here we go again. The compilers of the Weekend Reads have a penchant for posting links to dodgy discussions of statistical methodology. At a time when many problems in scientific discourse and publishing need fixing, confused discussions of statistical methodologies only serve to delay progress in patching up these problems.

    DC: “The aim of science is to establish facts, as accurately as possible.”

    Indeed – a good starting premise. Statistics has proven useful not because it tells the truth, but because we can know our error rates when we apply statistical methods to data and interpret the findings with appropriate competence in statistical philosophy. It is this latter step that eludes so many.

    DC: “Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. ”

    This is an erroneous argument, unfortunately one that is frequently stated.

    From: The ASA’s Statement on p-Values: Context, Process, and Purpose

    “The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold.”

    “2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

    Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”

    Small p-values indicate that the observed data were unlikely to have been generated by the data distribution posited in the null hypothesis. Statements such as the one by this author are often found in treatises proclaiming the uselessness of frequentist or error statistical methods and the benevolence of the only possible alternative, Bayesian based methods. My Spidey-senses are piqued.

    DC: “All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. ”

    No, you must also decide how large a departure from the null hypothesis is meaningful. Small p-values can always be obtained by collecting larger sets of data: eventually a large enough data set will yield a small p-value, even for departures from the null so small as to be of no scientific, biological, or medical relevance.
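    This point is easy to demonstrate with a quick simulation (a hypothetical sketch, not from the original discussion): a difference of 0.02 standard deviations is of no practical relevance, yet with a large enough sample the t-test flags it as "significant".

```python
# Hypothetical illustration: a scientifically irrelevant effect (0.02 SD)
# yields an arbitrarily small p-value once the sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tiny_effect = 0.02  # far too small to matter scientifically

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(tiny_effect, 1.0, n)
    p = stats.ttest_ind(a, b).pvalue
    print(f"n = {n:>9,}: p = {p:.2g}")
```

    The effect size never changes; only the sample size does, and the p-value marches toward zero regardless of relevance.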

    DC: “The problem is that the p-value gives the right answer to the wrong question. ”

    Actually the problem is loose statements about p-values.

    DC: “But the dichotomy between ‘significant’ and ‘not significant’ is absurd. There’s obviously very little difference between the implication of a p-value of 4.7 per cent and of 5.3 per cent, yet the former has come to be regarded as success and the latter as failure. And ‘success’ will get your work published, even in the most prestigious journals.”

    Dichotomies are forced upon us in all kinds of circumstances. Do I take that pill for my medical condition? Do I buy an airplane ticket and climb on board that contraption or not? Do we build that bridge or not?

    We have to make absurd dichotomous decisions all the time. Statistics affords us the opportunity to assess how often we will make erroneous decisions based on the available data. Regardless of whether the error rate is set at 0.05, or 0.001, or 0.000001, we have to make some decision as to acceptable risk rates, and get on with a decision. We can’t guarantee that every bridge built will never fall down. But we can perform engineering analyses that will allow us to have confidence that only one in a thousand, or one in a million bridges will fall down. The lower the type I error rate, the more expensive the bridge will be, so we have to make tradeoffs.

    DC: “P-values of less than 5 per cent have come to be called ‘statistically significant’, a term that’s ubiquitous in the biomedical literature, and is now used to suggest that an effect is real, not just chance.”

    The term is ubiquitous. If and when it is interpreted to mean that an effect is real, such interpretations need to be countered. When people erroneously interpret the outcome of a useful methodology, the right approach is to correct the erroneous interpretation, not decry the methodology as useless.

    DC: “The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since. ”

    Ah – here it is. With a good thrashing of frequentist or error statistical methodology stated, some other methodology is suggested, with the underlying implication that the other methodology stated solves all the ills associated with the initial thoroughly whipped methodology. This is unfortunately a fallacious line of reasoning that all too many readers fall for.

    DC: “Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s [sic] theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’. ”

    Ah – another dichotomy. (Never mind the subtle distinction that the Sun and Earth actually both orbit a common central focal point.)

    Now when people are convinced, as so many were centuries ago, that the Sun goes around the Earth, what Bayesian prior would be accepted as a reasonable starting prior? Most people would only accept a prior with probability 1 for “the Sun goes around the Earth”, so where would a Bayesian analysis get us? There’s a certain arrogance to assigning probabilities to outcomes before any observations have been made. It’s one thing to set up a prior based on previously observed data, then update our position by combining that prior with new data via Bayes’ theorem. But to snatch priors out of thin air in the absence of prior data does little to get us out of such situations.

    If one examines data on the position of stars, our moon and the other planets, the data are in discordance with the hypothesis that the Sun is closer to the centre of our solar system to a far lesser degree than their discordance with the hypothesis that the Earth is closer to the centre of our solar system. No Bayesian prior required to settle the issue. One would have to entertain a variety of prior distributions to clearly demonstrate that the Sun is closer to the centre using Bayesian methods. If minds are closed enough so that the degenerate prior distribution signifying an Earth-centric system is the only one tolerated, as a certain church attempted to impose centuries ago, then Bayesian methods aren’t going to give any better of an answer.
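    A two-line Bayes computation (my own illustrative sketch, with made-up likelihood numbers) makes the point about degenerate priors concrete: a prior of zero on the heliocentric hypothesis can never be updated, however strongly the data favour it.

```python
def posterior_h1(prior_h1, lik_h1, lik_h0):
    """Bayes' theorem for two hypotheses: returns P(H1 | data)."""
    num = prior_h1 * lik_h1
    return num / (num + (1.0 - prior_h1) * lik_h0)

# Evidence strongly favouring H1 (heliocentrism) moves an open mind...
print(posterior_h1(0.5, lik_h1=0.99, lik_h0=0.01))   # 0.99
# ...but a degenerate prior of 0 on H1 never budges, whatever the data.
print(posterior_h1(0.0, lik_h1=0.99, lik_h0=0.01))   # 0.0
```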

    The author will here skip any discussion about which flavour of Bayesian methodologies is the correct one – Objective Bayes, Subjective Bayes, or some other? Any number of prior probability distributions can be posited at the outset of a Bayesian statistical analysis, and the outcome will be affected by the choice of prior. But that won’t be discussed here, because that’s not the point of this article.

    Now it’s time for another fallacious discussion:

    DC: “An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. ”

    Indeed if one is testing 1000 drugs one should set an overall error rate for the whole exercise. Doing so will reduce the threshold for declaring a rejection of the null hypothesis for a given drug. If testing 1000 drugs and desiring an overall 5% type I error rate, the smallest p-value observed should be less than 0.05/1000 = 0.00005, the commonly used and highly useful Bonferroni correction. Benjamini and Hochberg have developed excellent methodologies allowing us to correct the other 999 p-values so that we maintain an overall error rate of 5% or 1% or whatever rate we deem acceptable, as we must accept some error rate in this uncertain universe.
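    The arithmetic behind these error rates is simple enough to sketch (a back-of-envelope using the example’s round numbers: 10% prevalence and a 5% significance level, with 80% power assumed for the effective drugs):

```python
# Back-of-envelope expected counts for the 1000-drug screening example.
n_drugs = 1000
prevalence = 0.10   # 10% of drugs assumed truly effective
alpha = 0.05        # per-test significance level
power = 0.80        # assumed power for the effective drugs

n_real = n_drugs * prevalence        # 100 effective drugs
n_null = n_drugs - n_real            # 900 ineffective drugs

false_pos = n_null * alpha           # 45 expected false positives
true_pos = n_real * power            # 80 expected true positives

fdr = false_pos / (false_pos + true_pos)   # expected false discovery rate: 0.36
bonferroni = alpha / n_drugs               # per-test threshold: 0.00005

print(f"expected FDR = {fdr:.0%}, Bonferroni threshold = {bonferroni:g}")
```

    Note that the 36% expected false discovery rate and the 0.00005 Bonferroni threshold both fall straight out of this arithmetic.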

    So the “disaster” that is labeled here is not a property of frequentist / error statistical methodology, but rather a description of the interpretive capabilities of the evaluator of said body of evidence.

    DC: “What is to be done? For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of P < 0.05 that's almost universal in biomedical sciences is entirely arbitrary – and, as we've seen, it's quite inadequate as evidence for a real effect."

    Shall we also abandon the well-worn term 'antibiotic'? Sometimes there's a reason that a term is well-worn: the term is appropriate and highly useful. The real problem here is conflating 'statistical significance' with 'scientific relevance'. That's the problem that needs to be addressed.

    Of course an error rate of 5% is arbitrary. The fix here is to have better discussion about what rates of erroneous decisions can be tolerated in this or that circumstance, not throw out an entirely useful methodology because someone doesn't know how to use it properly. We don't ban automobiles because someone has a crash. We look for ways to educate automobile users so they will have fewer crashes, and we look for improvements to the automobile.

    The improvement here for frequentist / error statistical methodology is to insist that scientific discussions include evaluation of how large an effect needs to be, to be of scientific or medical or biological value, and to describe error rates accurately by performing multiple corrections methodologies appropriately.

    Journal editors banishing negative findings and other such problems are aptly described in John Ioannidis' excellent discussions of this topic of current importance. The problem is not with a well-fleshed-out analysis methodology, but with improper application and interpretation of results therefrom. Fallacious arguments pitting one statistical methodology against another will not solve this problem, merely shift it about. There's nothing magical about Bayesian methods that solve all our problems – they are just another useful tool in a statistical toolbox, and can be misapplied as readily as any other.

  5. Oh dear. I fear that Steven McKinney’s criticisms represent the sort of carping that gets statisticians a bad name. I guess that anyone who tries to steer a line between frequentists and Bayesians can expect a bashing from both sides.

    For a start, it’s odd that he should choose to criticise a popular magazine article, rather than the paper on which it’s based: http://rsos.royalsocietypublishing.org/content/1/3/140216

    “The author will here skip any discussion about which flavour of Bayesian methodologies is the correct one – Objective Bayes, Subjective Bayes, or some other?” It was a magazine article for the public, not a mathematical dissertation!

    McKinney says

    DC: “Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. ”

    This is an erroneous argument, unfortunately one that is frequently stated.

    From: The ASA’s Statement on p-Values: Context, Process, and Purpose

    “The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold.”

    Apart from the self-evident bit about assumptions holding, the ASA statement and mine say essentially the same thing. Why describe my version as erroneous?

    My 1000 drug example is described as “fallacious”. It was an attempt to give an example in which, in principle, a normal frequentist interpretation could be given to the prior probability, by analogy with the prevalence of the condition in screening tests. Surely it’s obvious that the reason I say “in a single test” is to rule out the problem of multiple comparisons. I allude to Benjamini & Hochberg in the original paper, but the whole object of the paper was to discuss the problems of interpretation of P values before we even get to multiple comparisons. I’m sorry if that was not clear to Steven McKinney. It seems to have been obvious to most people.

    McKinney’s criticisms read as though I was trying to abolish P values which, of course, I’m not. It would be more helpful if he said what he thought is wrong about the simulations which give rise to my statements about false positive rates. As it is, his little rant sounds exactly like the sort of internecine warfare among statisticians that has given rise to the present unfortunate situation.

    I recall talking to a statistician at a recent meeting of the Royal Statistical Society. He was involved in analysis of clinical trials. I asked why he allowed the paper to make claims of discoveries based on P values close to 0.05. His answer was that if he didn’t allow that he’d lose his job and the clinician would hire a more compliant statistician. That lies at the heart of the problem.

  6. Dr. Colquhoun,

    I chose to criticise the writings at web URL aeon.co, and not rsos.royalsocietypublishing.org, because the aeon URL is the one provided in the Weekend Reads above. I see no link here, nor on the aeon webpage, to the Royal Society, so how is it odd that I am commenting on a linked web page, and not an unlinked one? Odd rejoinder indeed. But I’ll critique the Royal Society paper as well.

    Statisticians do indeed get a bad name when they describe disparate scenarios. If we as statisticians can’t even get this stuff right, then how will the rest of the scientific community get it right? I respectfully request that you modify your writing style by dropping statements crafted for shock and surprise value, in order to help bring an end to all this confusion.

    An example is your opening sentence in the Royal Society version.

    “If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.”

    Such a sentence is provocative, and of course opens an article with a “pop”. But what on earth does it mean? It conflates multiple concepts in a way that will mislead non-statisticians into thinking that a type I error rate of 5% is really 30% at least. How is this helpful in the context of our current conundrums?

    The second sentence adds more conflation to the confusion.

    “If, as is often the case, experiments are underpowered, you will be wrong most of the time”.

    The type I error rate is fixed, regardless of the sample size and the associated type II error rate, upon which power is based. Power has nothing to do with the type I error rate (save that the type I error rate is the power under the null hypothesis). Underpowered studies have large type II error rates, nothing to do with stated type I error rates.

    If you use p = 0.05 to suggest you have made a discovery, then under the scenario that no discovery was there to be made, you will be wrong 5% of the time. That’s what a type I error means, and it is entirely accurate under the conditions to which it applies.

    Your own Royal Society article demonstrates this fact. Your Figure 3 shows results of 100,000 simulated t-tests, when the null hypothesis is true. Your Figure 3 caption even states “(b) The distribution of the 100,000 p-values. As expected, 5% of the tests give (false) positives (p <= 0.05), but the distribution is flat (uniform)." Proven by your own simulation – if you use p = 0.05 to suggest that you have made a discovery, you will be wrong exactly 5% of the time, under the stated condition of the null hypothesis. Not more than 30% of the time, as your opening sentence implies (conflating false discovery rate with type I error rate).
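    This simulation is easy to reproduce (a hypothetical Python sketch rather than the paper's R code; the group size of 16 per arm is an illustrative assumption):

```python
# Simulate many two-sample t-tests when the null hypothesis is true:
# both groups are drawn from the same normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 100_000, 16

a = rng.normal(0.0, 1.0, (n_sims, n_per_group))
b = rng.normal(0.0, 1.0, (n_sims, n_per_group))
pvals = stats.ttest_ind(a, b, axis=1).pvalue

# About 5% of p-values fall below 0.05, and every interval of width
# 0.05 holds about 5% of them: the distribution is flat (uniform).
print(f"fraction with p <= 0.05:     {np.mean(pvals <= 0.05):.4f}")
print(f"fraction in (0.55, 0.60]:    {np.mean((pvals > 0.55) & (pvals <= 0.60)):.4f}")
```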

    I then read in section 5 "So there is the same number of p-values between 0.55 and 0.6, and in every other interval of the same width. This means that p-values are not at all reproducible: all values of p are equally likely."

    To state that p-values are not at all reproducible is entirely misleading. It's odd to state this after running a simulation that shows how rock-solidly p-values perform, yielding exactly 5% of them less than 0.05 under the null hypothesis. It is this phenomenon, not the individual values, that is entirely reproducible, and the very reason that statistical methods have proved so valuable when properly applied and interpreted. Indeed, if this phenomenon were not entirely reproducible, why would your most useful software be of any value? (I do view your software as most valuable – readers can use it to reproduce these phenomena and learn how to assess them properly, given proper guidance.)

    The fact that the p-values have a Uniform distribution under the null hypothesis is entirely well known in statistical distributional theory. So it is odd that your sentences describing the distribution of p-values under the null hypothesis use the word "but".

    Finally, in Section 6, there is the sentence "The false discovery rate is 36%, not 5%." But 5% was never claimed to be the false discovery rate. 5% is the rate of falsely rejecting the null hypothesis when in fact it is true. That is the type I error rate, not the false discovery rate. So that is what I think is wrong with your statements about "false positive rates": you conflate false discovery rates with false rejection rates, and refer to them as false positive rates in your reply. Lack of clarity leads to all this confusion. I'll continue with my "little rants" as long as people keep writing unclear articles that falsely portray statistical phenomena.

    If you want to help people truly understand these entirely reproducible phenomena, you need to describe them accurately, and not conflate concepts to yield arousing sound bites. That is my issue with your articles. Such writings will not make these concepts "obvious to most people".

  7. Dr. Colquhoun

    You are correct that I misinterpreted your statement “Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. ” This is an accurate wording in concordance with the ASA statement. Likelihood is indeed the primary measure of statistical compatibility. My apologies for that misinterpretation.

  8. Thanks for admitting at least one mistake in your rather acerbic criticisms.

    You say that you saw no links in the Aeon piece to the original paper. It is actually linked in no fewer than eight different places, so perhaps you should look again.

    I think that the paper makes it abundantly clear that its aim was to distinguish clearly between the Type 1 error rate and the false positive rate. The latter is clearly defined as the fraction of all “positive” tests that are false positives. It’s also described as the complement of the PPV (a term that I find less self-explanatory than false positive rate, and so avoid).

    Some of the links in the Aeon piece lead directly to section 10 of the RS paper, where I justify the opening statement, namely that if you observe P=0.047 in a single test, and claim on that basis to have made a discovery, you’ll be wrong 26% of the time (rounded to one significant figure, 30%, in the abstract). If you think that result is wrong, please say why.

    (In the RS paper I called that the false discovery rate, though I noted in a comment that false positive rate might be a preferable term because FDR is used in the multiple comparison world and I wanted it to be clear that I wasn’t talking about multiple comparisons.)

    You say that “You conflate false discovery rates with false rejection rates, and refer to them as false positive rates”. The whole purpose of the paper was to distinguish false positive rates (which are what experimenters want) from type 1 error rates (which are often mistakenly taken to be false positive rates). I’m sorry if this was not clear to you. The paper has been out for almost two years now. It’s had over 140,000 full-text views, and 21,000 pdf downloads. So far you are the only person who has complained that it was not clear.

  9. Dr. Colquhoun,

    My labeling of the 1000 drug example as fallacious is motivated thusly:

    Any serious attempt to assess a collection of 1000 different drugs will of course necessitate using multiple comparisons methods. A single p-value of 0.047 is not an appropriate piece of evidence in this scenario. That’s why I call this example fallacious. To bash a single p-value with a scenario requiring handling 1000 p-values and assessing them appropriately so as to maintain an overall type I error rate of 5%, or 1%, or 0.1% is a fallacious argument with which to lay mockery upon a useful statistic. I also find it odd to back this discussion with a reference to your Royal Society paper in which you declare “I should clarify that this paper is not about multiple comparisons”.

    The problem is the common belief – not the performance of an entirely sound statistic when used appropriately.

    So the logical train of thought conveyed in the aeon article is

    – People commonly misuse a method designed to assess a single event, in a willy-nilly fashion across multiple events

    – Therefore “it is time to pull the plug”, we should “abandon the well-worn term ‘statistically significant'”.

    I strongly disagree.

    Rather, it is time to redouble our efforts to educate those people who commonly do these things by conveying the appropriate concepts to them in understandable terms, rather than shocking soundbites.

    Conflating a significance level with a p-value with a false discovery rate isn’t going to get us there.

    As you state in your aeon article, Fisher said that p=0.05 was a low standard of significance and that a scientific fact should be regarded as experimentally established only if repeating the experiment rarely fails to give this level of significance. The fact that today a single experiment that gives p=0.045 will get a discovery published in the most glamorous journals is no reason to denigrate a useful statistic and advocate for its abandonment (as Trafimow has done at that brilliant journal Basic and Applied Social Psychology) or suppression of words necessary in describing the basic paradigm to which p-values belong (as you do in the title and body of this article).

    Rather it is time for scientists of good will to collectively publicly shame the most glamorous journals, and decry the bad business practices into which said journals have devolved. Journals used to be produced by learned societies, to distribute the writings of their talented members. Now they are glossy brochures designed to return double digit profits. Such journals need to be shunned, and well meaning scientists need to return to the good practices described by Fisher.

    There’s nothing wrong with chasing up an exploratory finding in a preliminary experiment with a reasonably small p-value combined with an effect size of biological or medical or other scientific relevance. Initial follow-up experiments should be conducted and described, and their statistical significance assessed. Then further validating experiments, preferably in other labs, should be undertaken, so that indeed the scientific community can assess whether the experiment rarely fails to give a statistically significant and scientifically relevant measure. Journals should insist upon such evidence in submitted papers. The scientific community should reward cooperative efforts that yield such repeated evidence, not reward fast findings by individuals, exploratory in nature only, and portrayed with pretty pictures of little relevance.

    So this is what we need to discuss, rather than repeating silly statements by some physicist in Birmingham. Fisher’s feats at the turn of the last century are nothing short of remarkable. He almost single-handedly constructed all of modern statistics, figuring out the mathematics that accurately portrayed concepts that dozens of other scientists were struggling with and describing poorly. To describe that effort as a machine for turning baloney into breakthroughs shows how little Robert Matthews understands about the value of statistics, properly applied.

    A significance level of 0.05 (one in twenty) is not entirely arbitrary. Significance levels in that territory were reasoned out a hundred years ago. Few people felt that flipping a coin was a reasonable way to decide on the validity of scientific phenomena – that’s one in two odds. One in three, one in five, even one in ten – just not a convincing level of odds for most serious evaluators of scientific phenomena. One in twenty? Now we’re getting somewhere. One in a hundred was also commonly entertained. This happened in an era where the number of experiments was small. People did not have the means to crunch numbers for tens of thousands of experiments at one time. So a significance level of one in twenty, or one in a hundred was not arbitrary, but rather agreed upon as the gateway to the arena of compelling evidence by many people involved in the scientific effort.

    Of course we need to advance our methodology in our current era, where it is possible to assess thousands of experiments quickly, and we have the computing infrastructure to crunch all those numbers. That requires multiple comparisons methodologies, and better education for scientists about such methodologies. This is something I care about deeply, hence my sharp, harsh, biting, acrid and scathing little rants. P-values aren’t the problem. Improper interpretation and description of them is.

  10. Dr. Colquhoun

    I’ve been using some R code similar to that which you make available from your Royal Society paper linked in the aeon article. (I’ll be happy to send the code to anyone who would like to check it over).

    For the 10000 drugs example, I used your 90/10 ratio, so that 90 percent of the drugs are placebos (null hypothesis is true) and 10 percent are drugs with a useful effect (alternative hypothesis is true).

    I simulated 10000 trials with 17 people per drug (so 34 people per trial), with an unpaired t-test to compare the data in the two groups. This gives me 80 percent power for the useful drugs, as you describe in your paper. (16 subjects gave me less than 80% power.) For each of the 10000 experiments, I reject the null hypothesis at the 5 percent significance level.

    I get rates very close to those discussed in your paper, not surprising, because statistics are reproducible in the aggregate even if randomness makes individual trials somewhat unique.

    From 9000 placebos I get 454 false positive statistically significant test results (5 percent type I error rate, as advertised).

    From 1000 useful drugs I get 817 true positive statistically significant test results (80 percent power, as advertised).

    So as you put it in your paper, “I make a fool of myself 454/(454+817) = 35.7 percent of the time”.

    I then used the Benjamini-Hochberg procedure to adjust all the 10000 p-values. Now,

    from 9000 placebos I get 19 false positives (19 adjusted p-values less than the advertised significance level of 0.05 from the 9000 placebo experiments)

    from 1000 useful drugs I get 355 true positives (355 adjusted p-values less than the advertised significance level of 0.05 from the 1000 useful drugs experiments).

    Thus “I make a fool of myself 19/(19+355) = 5.08 percent of the time”, exactly the false discovery rate I wanted, controlled at the 5 percent level. I didn’t have to change my significance level to 0.001 from 0.05 (the Benjamini-Hochberg procedure essentially did that for me).
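    A Python version of this simulation (a sketch, not the commenter's actual R code; the one-standard-deviation effect size is an assumption chosen to give roughly 80% power at 17 subjects per group):

```python
# Reproduce the 90/10 drug-screening simulation and apply Benjamini-Hochberg.
import numpy as np
from scipy import stats

def bh_adjust(p):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(p)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

rng = np.random.default_rng(0)
n_null, n_real, n_per_group, alpha = 9000, 1000, 17, 0.05
effect = 1.0  # assumed effect size (in SDs), giving ~80% power here

def simulate(n_trials, shift):
    """Unpaired t-test p-values for n_trials two-group experiments."""
    a = rng.normal(0.0, 1.0, (n_trials, n_per_group))
    b = rng.normal(shift, 1.0, (n_trials, n_per_group))
    return stats.ttest_ind(a, b, axis=1).pvalue

p_null = simulate(n_null, 0.0)    # placebos: null hypothesis true
p_real = simulate(n_real, effect) # useful drugs: alternative true

raw_fp = np.sum(p_null <= alpha)
raw_tp = np.sum(p_real <= alpha)
print(f"raw: false discovery proportion = {raw_fp / (raw_fp + raw_tp):.1%}")

adj = bh_adjust(np.concatenate([p_null, p_real]))
bh_fp = np.sum(adj[:n_null] <= alpha)
bh_tp = np.sum(adj[n_null:] <= alpha)
print(f"BH:  false discovery proportion = {bh_fp / (bh_fp + bh_tp):.1%}")
```

    On raw p-values the false discovery proportion lands in the mid-30-percent range; after the Benjamini-Hochberg adjustment it drops to roughly the advertised 5 percent, at the cost of fewer true positives.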

    P-values work beautifully when they are handled correctly. The problem is not p-values, the problem is people who do not process and interpret them properly. When properly handled, we can correctly control the error rates of our statistical procedures with remarkable and knowable precision. In this example, I was able to pick out a goodly number of the efficacious drugs, while keeping my type I error rate at the advertised 5 percent level.

    Science is difficult, and in trying to learn how the universe and everything in it works, people will make mistakes. I don’t consider them foolish when they make mistakes, if they make an honest effort to disclose the rate at which I can expect errors in their findings. Even Einstein didn’t get it right all the time, and I’m not going to call him a fool for making a mistake. I admire the courage that scientists display, able to pick themselves up after failure and carry on.

    Statistics has been such a useful tool in the arsenal of the capable scientist, allowing progress to happen by understanding the rate at which errors are made, and allowing scientists to plan proper studies to determine answers in arenas of the unknown.

    Long live the p-value, and proclamations of statistical significance, when properly carried out.

    1. I only just noticed that I never replied to your last comment.

      Since then, I did the math and the results agree exactly with my simulations, published in 2017,
      and in 2019 I developed the ideas a bit further: https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622
      I now realise a bit better the assumptions I was making in the 2014 paper. First, the high false positive risk (FPR) occurs when testing a (near) point null hypothesis. The Sellke & Berger (1987) minimum BF approach gives a similar answer to mine: for p=0.05 they suggest FPR = 0.29 (when prior odds are 1), compared with 0.26 in my approach. Of course the FPR will increase a lot if the hypothesis is implausible (prior odds on H1 less than 1).
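      The Sellke & Berger bound quoted here is a one-liner to check (my own sketch; the bound is valid for p < 1/e):

```python
# Sellke & Berger (1987) minimum Bayes factor bound: BF(H0) >= -e * p * ln(p).
import math

def min_bayes_factor_h0(p):
    """Lower bound on the Bayes factor in favour of H0 (valid for p < 1/e)."""
    return -math.e * p * math.log(p)

def false_positive_risk(p, prior_odds_h1=1.0):
    """Posterior probability of H0, using the minimum Bayes factor bound."""
    post_odds_h0 = min_bayes_factor_h0(p) / prior_odds_h1
    return post_odds_h0 / (1.0 + post_odds_h0)

print(f"{false_positive_risk(0.05):.2f}")  # 0.29, the figure quoted above
```

      Lowering the prior odds on H1 below 1 (an implausible hypothesis) pushes the false positive risk higher, as noted in the comment above.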

      Given the low success rate with candidate drugs in several fields (like Alzheimer’s and cancer), the prior odds on H1 might well be less than one. This may be part of the reason why so many not-very-effective treatments reach the market.
