Weekend reads: Arguments for abandoning “statistically significant,” boorish behavior, and useless clinical trials

The week at Retraction Watch featured developments in the retraction of a paper claiming the dangers of GMOs, and claims of censorship by a Nature journal. Here’s what was happening elsewhere:

“It’s high time that we abandoned the well-worn term ‘statistically significant’.” The arbitrary cut-off currently used is inadequate, argues David Colquhoun. (Aeon)
“We must actively teach our students and each other by example about responsibility and civility in relationships in research, not only because it makes life more pleasant but also because boorish behavior holds back the advancement of science and engineering,” says Tee Guidotti.
At the University of California, San Francisco, “Research is judged more on its quantity (numbers of investigators and federal and private dollars) than on its goals, achievements, or scientific quality,” say two retired faculty from the institution. (bioRxiv)
Researchers should conduct fewer clinical trials, argues Kirstin Borgerson. (The Hastings Center Report, sub req’d)
In politics, there’s too much negativity. But journals need more of it, argue our co-founders in STAT.
“Prioritizing replicability does not mean that you must prioritize replication attempts,” says Joe Simmons.
“There are no official boundaries on what could be a reason for a conflict of interest.” Instead, conflict of interest exists as a “continuum of moral jeopardy” for researchers, says Sylvia R. Karasu. (Psychology Today)
Bob Dylan just won the Nobel Prize for Literature. Maybe that’s because researchers love to quote his songs in paper titles. (David Malakoff, Science)
“Misuse and misinterpretation of [Null Hypothesis Significance Testing] soon turned to be regarded as a problem of NHST itself.” But NHST doesn’t deserve this reputation, argues Miguel A. Garcia-Perez. (Educational and Psychological Measurement)
A potential cause of Lyme disease may have evaded notice because it was never mentioned by name when researchers published their results on a different potential cause. (Charles Piller, STAT)
“All the data, on all the trials.” OpenTrials is a new tool to link all available information on every trial conducted, and you can contribute.
Not content with a career in research, Mike Rossner carved out a niche by tackling the problem of image manipulation. (Jennifer Couzin-Frankel, Science) See our Q&A with Rossner from earlier in the year here.
After a dozen researchers in China say they can’t replicate a researcher’s “breakthrough” genetic engineering results, they call for an independent investigation. (Stephen Chen, South China Morning Post)
“I feel under pressure from this directive that if we don’t do what they say we won’t get it in the future, and I need this for my job.” British researchers voice concerns about a new rule that certain research must be sent to government officials before publication. (John Dickens, Schools Week)
Data sharing will improve treatment options and help develop new treatments, says a new paper in The BMJ.
Can highly selective open-access journals survive on article processing charges? asks David Crotty. (The Scholarly Kitchen)
Scenes from the replication crisis: John Borghi chronicles his experiences with psychology’s century-old problem with p-values. (Medium)
The NIH is requesting information on preprints and other “interim research products” to determine how they can improve the rigor and impact of NIH-funded research.
Clinical trials have improved over time — but fundamental changes are needed to improve efficiency, accountability, and transparency, a group of heavy hitters, including the director of the NIH, argues. (JAMA)
“Does better peer review reduce the number of retractions?” That’s one of the questions David Moher and Philippe Ravaud hope an international best practice journal network can answer. (BMC Medicine)
“Publishing papers and books for profit, without any genuine concern for content, but with the pretence of applying authentic academic procedures of critical scrutiny, brings about a worrying erosion of trust in scientific publishing.” Stefan Eriksson and Gert Helgesson discuss predatory publishing. (Medicine, Health Care and Philosophy)
“Despite the BMJ’s strong data sharing policy, sharing rates are low.” (Anisa Rowhani-Farid, Adrian G Barnett, The BMJ)
“Predatory Conferences Undermine Science And Scam Academics,” say Madhukar Pai and Eduardo Franco. (The Huffington Post)
Papers by Chinese scientists are among the most-cited globally, according to a report from the Xinhua state news agency.
Have ideas for making research more reproducible? The American Statistical Association wants to hear them.
Why we need more transparency about the U.S. FDA approval process: An interview with Erick Turner. (Open Trials)
“What I’d like to say is that it is OK to criticize a paper, even it isn’t horrible,” says Andrew Gelman.
“Disquiet about the lack of high quality evidence cannot be dismissed as the grumblings of a disgruntled few.” The EBM – evidence-based medicine – manifesto.
A group of researchers at the Nordic Cochrane Center is continuing their complaint over how the safety of HPV vaccines was determined.
Undergraduate academic journals “provide valuable ways for students to acquaint themselves with academic writing, but also face continuing problems of relevance that mirror broader academic trends.” (Elaine Jiang, The Columbia Spectator)
Is scientific publishing “self-regulating so poorly that we invite external regulation?” asks Kent Anderson. (The Scholarly Kitchen)
What determines who gets access to data? Troubling findings from studies of the past several years. (Michael Krawczyk, The Replication Network)
“Everything Is Bogus At The Journal of Nature And Science,” says Jeffrey Beall.
“Much academic research is never cited and may be rarely read, indicating wasted effort from the authors, referees and publishers.” Who gets cited, by Mike Thelwall. (Scientometrics)
“Publishing and sharing data papers can increase impact and benefits researchers, publishers, funders and libraries,” writes Fiona Murphy. (LSE Impact Blog)
Which countries publish the most in which subjects? A visualization from Phil Davis. (The Scholarly Kitchen)


14 thoughts on “Weekend reads: Arguments for abandoning “statistically significant,” boorish behavior, and useless clinical trials”

Chris Mebane says:

October 15, 2016 at 11:31 am

Re “despite the BMJ’s strong data sharing policy, sharing rates are low,” is there a link to the article?

1. Ivan Oransky says:
  
  October 15, 2016 at 11:38 am
  
  The link was misbehaving, fixed now. Thanks. http://bmjopen.bmj.com/content/6/10/e011784.abstract?sid=1ade0ba9-fe85-4c35-9a17-3124dc302843
  
herr doktor bimler says:

October 16, 2016 at 10:55 pm

“Predatory Conferences Undermine Science And Scam Academics,” say Madhukar Pai and Eduardo Franco. (The Huffington Post)

From the original post:

we hope to raise awareness about the growing menace of bogus conferences, organized by predatory publishers as well as specialized conference groups such as BIT Congress Inc, Conference Series Ltd, Event Series (both owned by OMICS International), PSC Conference and many others.

The authors seem to mean PCS Conference (Pioneer Century Science), who’ve been turning up in my spamtray a lot lately. At their website the scammers go by a different alias (Global Century Science Group), which helps to build up trust in them, as does their general appearance of being primarily a travel agency.
http://www.pcscongress.com/group/Destination.html

Anyway, the coordinator / signatory only goes by a first name and an initial, in the manner of Victorian pornographic novels:

Ms. Maria E. (Coordinator)

and she is evidently unable to coordinate her spam templates, as the message alternates between inviting me to a “Mental Health Forum”, a “Health Care Conference”, and an “International Conference of Neuroscience” (i.e. they have booked one room which they are advertising under three different labels).

Tick Appreciation Society says:

October 17, 2016 at 12:13 pm

To be clear, Rickettsia helvetica is NOT a potential cause of Lyme disease, and Willy Burgdorfer rejected it as the possible Lyme agent soon after discovering and testing Borrelia burgdorferi. R. helvetica also has never been found in North America and does not infect the blacklegged tick, responsible for Lyme disease transmission, under experimental conditions.

Steven McKinney says:

October 17, 2016 at 8:12 pm

“The Problem with P-values:

The arbitrary cut-off currently used is inadequate, argues David Colquhoun.”

Oh dear, here we go again. The compilers of the Weekend Reads have a penchant for posting links to dodgy discussions of statistical methodology. At a time when many problems in scientific discourse and publishing need fixing, confused discussions of statistical methodologies only serve to delay progress in patching up these problems.

DC: “The aim of science is to establish facts, as accurately as possible.”

Indeed – a good starting premise. Statistics has proven useful not because it tells the truth, but because we can know our error rates when we apply statistical methods to data and interpret the findings with appropriate statistical philosophical competence. It is this latter step that evades so many.

DC: “Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. ”

This is an erroneous argument, unfortunately one that is frequently stated.

From: The ASA’s Statement on p-Values: Context, Process, and Purpose

“The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold.”

“2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”

Small p-values indicate that the observed data were unlikely to have been generated by the data distribution posited in the null hypothesis. Statements such as the one by this author are often found in treatises proclaiming the uselessness of frequentist or error statistical methods and the benevolence of the only possible alternative, Bayesian based methods. My Spidey-senses are piqued.

DC: “All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. ”

No, you must also decide how large of a departure from the null hypothesis is meaningful. Small p-values can always be obtained by collecting larger sets of data. Eventually a large set of data will yield a small p-value, even for departures from the null so small as to be of no scientific, or biological, or medical relevance.
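
The point about sample size can be illustrated with a minimal sketch. It assumes a one-sample z-test with known standard deviation and an observed mean shift of 0.01 standard deviations; both the test and the numbers are hypothetical choices made for illustration, not anything taken from the article under discussion.

```python
from math import erfc, sqrt


def two_sided_z_p_value(effect, sd, n):
    """Two-sided p-value for a one-sample z-test whose observed mean equals `effect`."""
    z = effect / (sd / sqrt(n))
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# Hypothetical effect: 0.01 standard deviations, held fixed while n grows.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_z_p_value(effect=0.01, sd=1.0, n=n))
```

The effect size never changes, yet the p-value falls from above 0.9 at n = 100 to a vanishingly small number at n = 1,000,000, which is why the size of a meaningful departure has to be decided separately.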

DC: “The problem is that the p-value gives the right answer to the wrong question. ”

Actually the problem is loose statements about p-values.

DC: “But the dichotomy between ‘significant’ and ‘not significant’ is absurd. There’s obviously very little difference between the implication of a p-value of 4.7 per cent and of 5.3 per cent, yet the former has come to be regarded as success and the latter as failure. And ‘success’ will get your work published, even in the most prestigious journals.”

Dichotomies are forced upon us in all kinds of circumstances. Do I take that pill for my medical condition? Do I buy an airplane ticket and climb on board that contraption or not? Do we build that bridge or not?

We have to make absurd dichotomous decisions all the time. Statistics affords us the opportunity to assess how often we will make erroneous decisions based on the available data. Regardless of whether the error rate is set at 0.05, or 0.001, or 0.000001, we have to make some decision as to acceptable risk rates, and get on with a decision. We can’t guarantee that every bridge built will never fall down. But we can perform engineering analyses that will allow us to have confidence that only one in a thousand, or one in a million bridges will fall down. The lower the type I error rate, the more expensive the bridge will be, so we have to make tradeoffs.

DC: “P-values of less than 5 per cent have come to be called ‘statistically significant’, a term that’s ubiquitous in the biomedical literature, and is now used to suggest that an effect is real, not just chance.”

The term is ubiquitous. If and when it is interpreted to mean that an effect is real, such interpretations need to be countered. When people erroneously interpret the outcome of a useful methodology, the right approach is to correct the erroneous interpretation, not decry the methodology as useless.

DC: “The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since. ”

Ah – here it is. With a good thrashing of frequentist or error statistical methodology stated, some other methodology is suggested, with the underlying implication that the other methodology stated solves all the ills associated with the initial thoroughly whipped methodology. This is unfortunately a fallacious line of reasoning that all too many readers fall for.

DC: “Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s [sic] theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’. ”

Ah – another dichotomy. (Never mind the subtle distinction that the Sun and Earth actually both orbit a common central focal point.)

Now when people are convinced, as so many were centuries ago, that the Sun goes around the Earth, what Bayesian prior would be accepted as a reasonable starting prior? Most people would only accept a prior with probability 1 for “the Sun goes around the Earth” so where would a Bayesian analysis get us? There’s a certain arrogance to assigning probabilities to outcomes before any observations have been made. It’s one thing to set up a prior based on previously observed data, then update our position via convoluting the prior with more current data. But to snatch priors out of thin air in the absence of prior data does little to get us out of such situations.

If one examines data on the position of stars, our moon and the other planets, the data are in discordance with the hypothesis that the Sun is closer to the centre of our solar system to a far lesser degree than their discordance with the hypothesis that the Earth is closer to the centre of our solar system. No Bayesian prior required to settle the issue. One would have to entertain a variety of prior distributions to clearly demonstrate that the Sun is closer to the centre using Bayesian methods. If minds are closed enough so that the degenerate prior distribution signifying an Earth-centric system is the only one tolerated, as a certain church attempted to impose centuries ago, then Bayesian methods aren’t going to give any better of an answer.
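
A small sketch of why a degenerate prior cannot be moved by data, using Bayes' theorem for two competing hypotheses. The likelihood values below are hypothetical and chosen only to represent observations that strongly favour the alternative.

```python
def posterior(prior, likelihood_h, likelihood_alt):
    """Posterior probability of hypothesis H after one observation (Bayes' theorem)."""
    numerator = prior * likelihood_h
    return numerator / (numerator + (1 - prior) * likelihood_alt)


# Hypothetical likelihoods: the data are 900 times more probable under the alternative.
data_under_h, data_under_alt = 0.001, 0.9

print(posterior(1.0, data_under_h, data_under_alt))   # prior of exactly 1 stays at 1
print(posterior(0.99, data_under_h, data_under_alt))  # even a 0.99 prior drops sharply
```

With a prior of exactly 1 the alternative's term is multiplied by zero, so no observation can change the answer; any prior short of certainty can be updated.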

The author will here skip any discussion about which flavour of Bayesian methodologies is the correct one – Objective Bayes, Subjective Bayes, or some other? Any number of prior probability distributions can be posited at the outset of a Bayesian statistical analysis, and the outcome will be affected by the choice of prior. But that won’t be discussed here, because that’s not the point of this article.

Now it’s time for another fallacious discussion:

DC: “An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. ”

Indeed if one is testing 1000 drugs one should set an overall error rate for the whole exercise. Doing so will reduce the threshold for declaring a rejection of the null hypothesis for a given drug. If testing 1000 drugs and desiring an overall 5% type I error rate, the smallest p-value observed should be less than 0.05/1000 = 0.00005, the commonly used and highly useful Bonferroni correction. Benjamini and Hochberg have developed excellent methodologies allowing us to correct the other 999 p-values so that we maintain an overall error rate of 5% or 1% or whatever rate we deem acceptable, as we must accept some error rate in this uncertain universe.
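
The two corrections named above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not anyone's reference code: the five p-values are hypothetical. Note that the Benjamini-Hochberg step-up procedure controls the false discovery rate (the expected share of rejections that are false), whereas Bonferroni controls the chance of making any false rejection at all.

```python
def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its stepped threshold.
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


print(bonferroni_threshold(0.05, 1000))  # the 0.05/1000 threshold from the text

p_values = [0.00001, 0.0004, 0.025, 0.047, 0.2]  # hypothetical p-values
limit = bonferroni_threshold(0.05, len(p_values))
print([i for i, p in enumerate(p_values) if p <= limit])  # Bonferroni rejections
print(benjamini_hochberg(p_values, q=0.05))               # BH rejections
```

In this made-up example Bonferroni rejects the two smallest p-values and Benjamini-Hochberg rejects the three smallest; the p-value of 0.047 survives neither correction.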

So the “disaster” that is labeled here is not a property of frequentist / error statistical methodology, but rather a description of the interpretive capabilities of the evaluator of said body of evidence.

DC: “What is to be done? For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of P < 0.05 that's almost universal in biomedical sciences is entirely arbitrary – and, as we've seen, it's quite inadequate as evidence for a real effect."

Shall we also abandon the well-worn term 'anti-biotic'? Sometimes there's a reason that a term is well-worn, the term is appropriate and highly useful. The real problem here is conflating 'statistical significance' with 'scientific relevance'. That's the problem that needs to be addressed.

Of course an error rate of 5% is arbitrary. The fix here is to have better discussion about what rates of erroneous decisions can be tolerated in this or that circumstance, not throw out an entirely useful methodology because someone doesn't know how to use it properly. We don't ban automobiles because someone has a crash. We look for ways to educate automobile users so they will have fewer crashes, and we look for improvements to the automobile.

The improvement here for frequentist / error statistical methodology is to insist that scientific discussions include evaluation of how large an effect needs to be, to be of scientific or medical or biological value, and to describe error rates accurately by performing multiple corrections methodologies appropriately.

Journal editors banishing negative findings and other such problems are aptly described in John Ioannidis' excellent discussions of this topic of current importance. The problem is not with a well-fleshed-out analysis methodology, but with improper application and interpretation of results therefrom. Fallacious arguments pitting one statistical methodology against another will not solve this problem, merely shift it about. There's nothing magical about Bayesian methods that solve all our problems – they are just another useful tool in a statistical toolbox, and can be misapplied as readily as any other.

David Colquhoun says:

October 18, 2016 at 5:29 am

Oh dear. I fear that Steven McKinney’s criticisms represent the sort of carping that gets statisticians a bad name. I guess that anyone who tries to steer a line between frequentists and bayesians can expect a bashing from both sides.

For a start, it’s odd that he should choose to criticise a popular magazine article, rather than the paper on which it’s based: http://rsos.royalsocietypublishing.org/content/1/3/140216

“The author will here skip any discussion about which flavour of Bayesian methodologies is the correct one – Objective Bayes, Subjective Bayes, or some other?” It was a magazine article for the public, not a mathematical dissertation!

McKinney says:

DC: “Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. ”

This is an erroneous argument, unfortunately one that is frequently stated.

From: The ASA’s Statement on p-Values: Context, Process, and Purpose

“The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold.”

Apart from the self-evident bit about assumptions holding, the ASA statement and mine say essentially the same thing. Why describe my version as erroneous?

My 1000 drug example is described as “fallacious”. It was an attempt to give an example in which, in principle, a normal frequentist interpretation could be given to the prior probability, by analogy with the prevalence of the condition in screening tests. Surely it’s obvious that the reason I say “in a single test” is to rule out the problem of multiple comparisons. I allude to Benjamini & Hochberg in the original paper, but the whole object of the paper was to discuss the problems of interpretation of P values before we even get to multiple comparisons. I’m sorry if that was not clear to Steven McKinney. It seems to have been obvious to most people.

McKinney’s criticisms read as though I was trying to abolish P values which, of course, I’m not. It would be more helpful if he said what he thought is wrong about the simulations which give rise to my statements about false positive rates. As it is, his little rant sounds exactly like the sort of internecine warfare among statisticians that has given rise to the present unfortunate situation.

I recall talking to a statistician at a recent meeting of the Royal Statistical Society. He was involved in analysis of clinical trials. I asked why he allowed the paper to make claims of discoveries based on P values close to 0.05. His answer was that if he didn’t allow that he’d lose his job and the clinician would hire a more compliant statistician. That lies at the heart of the problem.

Steven McKinney says:

October 19, 2016 at 12:05 am

Dr. Colquhoun,

I chose to criticise the writings at web URL aeon.co, and not rsos.royalsocietypublishing.com, because the aeon URL is the one provided in the Weekend Reads above. I see no link here, nor on the aeon webpage, to the Royal Society, so how is it odd that I am commenting on a linked web page, and not an unlinked one? Odd rejoinder indeed. But I’ll critique the Royal Society paper as well.

Statisticians do indeed get a bad name when they describe disparate scenarios. If we as statisticians can’t even get this stuff right, then how will the rest of the scientific community get it right? I respectfully request that you modify your writing style by dropping statements crafted for shock and surprise value, in order to help bring an end to all this confusion.

An example is your opening sentence in the Royal Society version.

“If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.”

Such a sentence is provocative, and of course opens an article with a “pop”. But what on earth does it mean? It conflates multiple concepts in a way that will mislead non-statisticians into thinking that a type I error rate of 5% is really 30% at least. How is this helpful in the context of our current conundrums?

The second sentence adds more conflation to the confusion.

“If, as is often the case, experiments are underpowered, you will be wrong most of the time”.

The type I error rate is fixed, regardless of the sample size and the associated type II error rate, upon which power is based. Power has nothing to do with the type I error rate (save that the type I error rate is the power under the null hypothesis). Underpowered studies have large type II error rates, nothing to do with stated type I error rates.

If you use p = 0.05 to suggest you have made a discovery, then under the scenario that no discovery was there to be made, you will be wrong 5% of the time. That’s what a type I error means, and it is entirely accurate under the conditions to which it applies.

Your own Royal Society article demonstrates this fact. Your Figure 3 shows results of 100,000 simulated t-tests, when the null hypothesis is true. Your Figure 3 caption even states “(b) The distribution of the 100,000 p-values. As expected, 5% of the tests give (false) positives (p <= 0.05), but the distribution is flat (uniform)." Proven by your own simulation – if you use p = 0.05 to suggest that you have made a discovery, you will be wrong exactly 5% of the time, under the stated condition of the null hypothesis. Not more than 30% of the time, as your opening sentence implies (conflating false discovery rate with type I error rate).

I then read in section 5 "So there is the same number of p-values between 0.55 and 0.6, and in every other interval of the same width. This means that p-values are not at all reproducible: all values of p are equally likely."

To state that p-values are not at all reproducible is entirely misleading. It's odd to state this after running a simulation that shows how rock-solidly p-values perform, yielding exactly 5% of them less than 0.05 under the null hypothesis. It is this phenomenon, not the individual values, that is entirely reproducible, and the very reason that statistical methods have proved so valuable when properly applied and interpreted. Indeed, if this phenomenon was not entirely reproducible, why would your most useful software be of any value? (I do indeed view your software as most valuable – readers can indeed use it to reproduce these phenomena and learn how to properly assess them, given proper guidance.)

The fact that the p-values have a Uniform distribution under the null hypothesis is entirely well known in statistical distributional theory. So it is odd that your sentences describing the distribution of p-values under the null hypothesis use the word "but".
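
The behaviour described here can be reproduced with a short simulation. This is a sketch under assumptions: it swaps in a two-group z-test with known unit variance rather than the t-tests used in the Royal Society paper, and the group size of 16, the 20,000 repetitions, and the seed are arbitrary illustrative choices.

```python
import random
from math import erfc, sqrt


def simulate_null_p_values(n_tests, n_per_group, seed=0):
    """Simulate two-sided p-values for two-group comparisons when the null is true."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        # Both groups come from the same N(0, 1) population, so there is no real effect.
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        # z-test with known unit variance: the standard error of the difference is sqrt(2 / n).
        z = (mean_a - mean_b) / sqrt(2 / n_per_group)
        p_values.append(erfc(abs(z) / sqrt(2)))
    return p_values


def fraction_between(p_values, low, high):
    """Share of p-values falling in the half-open interval [low, high)."""
    return sum(low <= p < high for p in p_values) / len(p_values)


p_values = simulate_null_p_values(n_tests=20_000, n_per_group=16)
print("p < 0.05:", fraction_between(p_values, 0.0, 0.05))
print("0.55 <= p < 0.60:", fraction_between(p_values, 0.55, 0.60))
```

Both printed fractions should land close to 0.05: individual p-values scatter uniformly, while the long-run rejection rate under the null stays fixed at the chosen level.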

There finally in Section 6 is the sentence "The false discovery rate is 36%, not 5%." 5% was never claimed to be the false discovery rate. 5% is the rate of falsely rejecting the null hypothesis when in fact it is true. That is the type I error rate, not the false discovery rate. So that is what I think is wrong with your statements about "false positive rates". You conflate false discovery rates with false rejection rates, and refer to them as false positive rates in your reply. Lack of clarity leads to all this confusion. I'll continue with my "little rants" as long as people keep writing unclear articles that falsely portray statistical phenomena.
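
The distinction between the two rates can be made concrete with a little arithmetic. The sketch below assumes a prior of 10% real effects (the figure from the 1,000-drug example) and a power of 80%; the power value is an assumption introduced here because it reproduces the quoted 36%, and the 20% power in the second call is a hypothetical underpowered case.

```python
def false_discovery_fraction(prior, alpha, power):
    """Share of all rejections that are false, given how often real effects occur."""
    false_rejections = alpha * (1 - prior)  # true nulls that are rejected anyway
    true_rejections = power * prior         # real effects that are detected
    return false_rejections / (false_rejections + true_rejections)


alpha = 0.05  # type I error rate: P(reject | null true), unaffected by prior or power

print(false_discovery_fraction(prior=0.10, alpha=alpha, power=0.80))  # about 0.36
print(false_discovery_fraction(prior=0.10, alpha=alpha, power=0.20))  # about 0.69
```

The type I error rate stays at 5% in both calls; what changes with the prior and the power is the share of rejections that turn out to be false, which is a different conditional probability.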

If you want to help people truly understand these entirely reproducible phenomena, you need to describe them accurately, and not conflate concepts to yield arousing sound bites. That is my issue with your articles. Such writings will not make these concepts "obvious to most people".

Steven McKinney says:

October 19, 2016 at 2:47 am

Dr. Colquhoun

You are correct that I misinterpreted your statement “Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. ” This is an accurate wording in concordance with the ASA statement. Likelihood is indeed the primary measure of statistical compatibility. My apologies for that misinterpretation.

David Colquhoun says:

October 19, 2016 at 9:40 am

Thanks for admitting at least one mistake in your rather acerbic criticisms.

You say that you saw no links in the Aeon piece to the original paper. It is actually linked in no fewer than eight different places, so perhaps you should look again.

I think that the paper makes it abundantly clear that its aim was to distinguish clearly between the Type 1 error rate and the false positive rate. The latter is clearly defined as the fraction of all “positive” tests that are false positives. It’s also described as the complement of the PPV (a term that I find less self-explanatory than false positive rate and so avoid).

Some of the links in the Aeon piece lead directly to section 10 of the RS paper where I justify the opening statement, namely that if you observe P=0.047 in a single test, and claim on that basis to have made a discovery, you’ll be wrong 26% of the time (rounded to one significant figure, 30%, in the abstract). If you think that result is wrong, please say why.

(In the RS paper I called that the false discovery rate, though I noted in a comment that false positive rate might be a preferable term because FDR is used in the multiple comparison world and I wanted it to be clear that I wasn’t talking about multiple comparisons.)

You say that “You conflate false discovery rates with false rejection rates, and refer to them as false positive rates”. The whole purpose of the paper was to distinguish false positive rates (which is what experimenters want) from type 1 error rates (which are often mistakenly taken to be false positive rates). I’m sorry if this was not clear to you. The paper has been out for almost two years now. It’s had over 140,000 full-text views, and 21,000 pdf downloads. So far you are the only person who has complained that it was not clear.

1. Steven McKinney says:
  
  November 22, 2016 at 12:53 am
  
  I’m not the only person who has deconstructed the statements in this paper. Other statisticians considerably above my pay grade have also weighed in:
  
  https://errorstatistics.com/2015/12/12/stephen-senn-the-pathetic-p-value-guest-post-3/
  
David Colquhoun says:

October 19, 2016 at 2:09 pm

I should, of course, have said at least 26% of the time.

Steven McKinney says:

October 20, 2016 at 3:58 am

Dr. Colquhoun,

My labeling of the 1000 drug example as fallacious is motivated thusly:

Any serious attempt to assess a collection of 1000 different drugs will of course necessitate using multiple comparisons methods. A single p-value of 0.047 is not an appropriate piece of evidence in this scenario. That’s why I call this example fallacious. To bash a single p-value with a scenario requiring handling 1000 p-values and assessing them appropriately so as to maintain an overall type I error rate of 5%, or 1%, or 0.1% is a fallacious argument with which to lay mockery upon a useful statistic. I also find it odd to back this discussion with a reference to your Royal Society paper in which you declare “I should clarify that this paper is not about multiple comparisons”.

The problem is the common belief – not the performance of an entirely sound statistic when used appropriately.

So the logical train of thought conveyed in the aeon article is

– People commonly misuse a method designed to assess a single event, in a willy nilly fashion across multiple events

– Therefore “it is time to pull the plug”, we should “abandon the well-worn term ‘statistically significant'”.

I strongly disagree.

Rather, it is time to redouble our efforts to educate those people who commonly do these things by conveying the appropriate concepts to them in understandable terms, rather than shocking soundbites.

Conflating a significance level with a p-value with a false discovery rate isn’t going to get us there.

As you state in your aeon article, Fisher said that p=0.05 was a low standard of significance and that a scientific fact should be regarded as experimentally established only if repeating the experiment rarely fails to give this level of significance. The fact that today a single experiment that gives p=0.045 will get a discovery published in the most glamorous journals is no reason to denigrate a useful statistic and advocate for its abandonment (as Trafimow has done at that brilliant journal Basic and Applied Social Psychology) or suppression of words necessary in describing the basic paradigm to which p-values belong (as you do in the title and body of this article).

Rather it is time for scientists of good will to collectively publicly shame the most glamorous journals, and decry the bad business practices into which said journals have devolved. Journals used to be produced by learned societies, to distribute the writings of their talented members. Now they are glossy brochures designed to return double digit profits. Such journals need to be shunned, and well meaning scientists need to return to the good practices described by Fisher.

There’s nothing wrong with chasing up an exploratory finding in a preliminary experiment with a reasonably small p-value combined with an effect size of biological or medical or other scientific relevance. Initial follow-up experiments should be conducted and described, and their statistical significance assessed. Then further validating experiments, preferably in other labs, should be undertaken, so that indeed the scientific community can assess whether the experiment rarely fails to give a statistically significant and scientifically relevant measure. Journals should insist upon such evidence in submitted papers. The scientific community should reward cooperative efforts that yield such repeated evidence, not reward fast findings by individuals, exploratory in nature only, and portrayed with pretty pictures of little relevance.

So this is what we need to discuss, rather than repeating silly statements by some physicist in Birmingham. Fisher’s feats at the turn of the last century are nothing short of remarkable. He almost single-handedly constructed all of modern statistics, figuring out the mathematics that accurately portrayed concepts that dozens of other scientists were struggling with and describing poorly. To describe that effort as a machine for turning baloney into breakthroughs shows how little Robert Matthews understands about the value of statistics, properly applied.

A significance level of 0.05 (one in twenty) is not entirely arbitrary. Significance levels in that territory were reasoned out a hundred years ago. Few people felt that flipping a coin was a reasonable way to decide on the validity of scientific phenomena – that’s one in two odds. One in three, one in five, even one in ten – just not a convincing level of odds for most serious evaluators of scientific phenomena. One in twenty? Now we’re getting somewhere. One in a hundred was also commonly entertained. This happened in an era where the number of experiments was small. People did not have the means to crunch numbers for tens of thousands of experiments at one time. So a significance level of one in twenty, or one in a hundred was not arbitrary, but rather agreed upon as the gateway to the arena of compelling evidence by many people involved in the scientific effort.

Of course we need to advance our methodology in our current era, where it is possible to assess thousands of experiments quickly, and we have the computing infrastructure to crunch all those numbers. That requires multiple comparisons methodologies, and better education for scientists about such methodologies. This is something I care about deeply, hence my sharp, harsh, biting, acrid and scathing little rants. P-values aren’t the problem. Improper interpretation and description of them is.

Reply
Steven McKinney says:

October 21, 2016 at 12:17 am

Dr. Colquhoun

I’ve been using some R code similar to that which you make available from your Royal Society paper linked in the aeon article. (I’ll be happy to send the code to anyone who would like to check it over).

For the 10000 drugs example, I used your 90/10 ratio, so that 90 percent of the drugs are placebos (null hypothesis is true) and 10 percent are drugs with a useful effect (alternative hypothesis is true).

I simulated 10000 trials with 17 people per drug (so 34 people per trial), with an unpaired t-test to compare the data in the two groups. This gives me 80 percent power for the useful drugs, as you describe in your paper. (16 subjects gave me less than 80% power.) For each of the 10000 experiments, I reject the null hypothesis at the 5 percent significance level.

I get rates very close to those discussed in your paper, not surprising, because statistics are reproducible in the aggregate even if randomness makes individual trials somewhat unique.

From 9000 placebos I get 454 false positive statistically significant test results (5 percent type I error rate, as advertised).

From 1000 useful drugs I get 817 true positive statistically significant test results (80 percent power, as advertised).

So as you put it in your paper, “I make a fool of myself 454/(454+817) = 35.7 percent of the time”.
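
The arithmetic behind that figure can be sketched in a few lines of Python. This is an illustration only, not the R code referred to above: the function name is hypothetical, and the inputs are the counts and rates quoted in this comment.

```python
def false_discovery_proportion(false_pos, true_pos):
    """Share of 'significant' results that come from true null hypotheses."""
    return false_pos / (false_pos + true_pos)

# Counts reported from the simulation described above.
observed = false_discovery_proportion(454, 817)

# Counts expected from the stated design: 9000 placebos tested at the
# 5 percent level and 1000 useful drugs tested with 80 percent power.
expected_false_pos = 9000 * 0.05
expected_true_pos = 1000 * 0.80
expected = false_discovery_proportion(expected_false_pos, expected_true_pos)

print(f"observed: {observed:.1%}")
print(f"expected: {expected:.1%}")
```

The observed proportion sits close to the expected one because both are driven by the same three inputs: the share of true nulls, the significance level, and the power.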

I then used the Benjamini-Hochberg procedure to adjust all the 10000 p-values. Now,

from 9000 placebos I get 19 false positives (19 adjusted p-values less than the advertised significance level of 0.05 from the 9000 placebo experiments)

from 1000 useful drugs I get 355 true positives (355 adjusted p-values less than the advertised significance level of 0.05 from the 1000 useful drugs experiments).

Thus “I make a fool of myself 19/(19+355) = 5.08 percent of the time”, exactly the type 1 error rate I wanted. I didn’t have to change my significance level to 0.001 from 0.05 (the Benjamini-Hochberg procedure essentially did that for me).
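
For readers who want to see what the adjustment does, here is a minimal standard-library Python sketch of Benjamini-Hochberg adjusted p-values. The procedure is designed to control the false discovery rate. The function name `bh_adjust` and the five example p-values are hypothetical, chosen only for illustration; the calculation is intended to mirror what R's `p.adjust(method = "BH")` computes, and it is not the code used for the 10000-drug simulation.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        # Scale by n / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values from five tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = bh_adjust(raw)

raw_hits = sum(p < 0.05 for p in raw)
adjusted_hits = sum(q < 0.05 for q in adjusted)
print(adjusted)
print(raw_hits, adjusted_hits)
```

In this toy example four raw p-values fall below 0.05 but only two adjusted values do, which is the same tightening effect described above on a much smaller scale.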

P-values work beautifully when they are handled correctly. The problem is not p-values, the problem is people who do not process and interpret them properly. When properly handled, we can correctly control the error rates of our statistical procedures with remarkable and knowable precision. In this example, I was able to pick out a goodly number of the efficacious drugs, while keeping my type I error rate at the advertised 5 percent level.

Science is difficult, and in trying to learn how the universe and everything in it works, people will make mistakes. I don’t consider them foolish when they make mistakes, if they make an honest effort to disclose the rate at which I can expect errors in their findings. Even Einstein didn’t get it right all the time, and I’m not going to call him a fool for making a mistake. I admire the courage that scientists display, able to pick themselves up after failure and carry on.

Statistics has been such a useful tool in the arsenal of the capable scientist, allowing progress to happen by understanding the rate at which errors are made, and allowing scientists to plan proper studies to determine answers in arenas of the unknown.

Long live the p-value, and proclamations of statistical significance, when properly carried out.

Reply
1. David Colquhoun says:
  
  May 9, 2024 at 6:47 pm
  
  I only just noticed that I never replied to your last comment.
  
  Since then, I did the math and the results agree exactly with my simulations; this was published in 2017:
  https://royalsocietypublishing.org/doi/10.1098/rsos.171085
  and in 2019 I developed the ideas a bit further: https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622
  I now realise a bit better the assumptions I was making in the 2014 paper. First, the high false positive risk (FPR) occurs when testing a (near) point null hypothesis. The Sellke & Berger (1987) minimum BF approach gives a similar answer to mine: for p=0.05 they suggest FPR = 0.29 (when prior odds are 1), compared with 0.26 in my approach. Of course the FPR will increase a lot if the hypothesis is implausible (prior odds on H1 less than 1).
  
  Given the low success rate with candidate drugs in several fields (like Alzheimer’s and cancer), the prior odds on H1 might well be less than one. This may be part of the reason why so many not-very-effective treatments reach the market.
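
The 0.29 figure and its dependence on the prior odds can be reproduced with a short Python sketch. It assumes that the minimum Bayes factor meant here is the -e·p·ln(p) calibration; that reading is an assumption, adopted because it reproduces the quoted value, and the function names are hypothetical. Because it uses a lower bound on the Bayes factor in favour of H0, the result is a minimum false positive risk.

```python
import math

def min_bayes_factor_h0(p):
    """Lower bound -e * p * ln(p) on the Bayes factor in favour of H0."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def false_positive_risk(p, prior_odds_h1=1.0):
    """Minimum posterior probability of H0 given the observed p-value."""
    # Posterior odds of H0 = Bayes factor for H0 * prior odds of H0.
    post_odds_h0 = min_bayes_factor_h0(p) / prior_odds_h1
    return post_odds_h0 / (1 + post_odds_h0)

fpr_even = false_positive_risk(0.05)                            # prior odds on H1 = 1
fpr_implausible = false_positive_risk(0.05, prior_odds_h1=0.1)  # H1 unlikely a priori

print(round(fpr_even, 2))
print(round(fpr_implausible, 2))
```

Lowering the prior odds on H1 from 1 to 0.1, an arbitrary value chosen for illustration, raises the minimum false positive risk substantially, which is the point made above about implausible hypotheses.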
  
  Reply
