Doing research is hard. Getting statistically significant results is hard. Making sure the results you obtain reflect reality is even harder. In this week’s Science, Eric Loken at the University of Connecticut and Andrew Gelman at Columbia University debunk some common myths about the use of statistics in research — and argue that, in many cases, the use of traditional statistics does more harm than good in human sciences research.
Retraction Watch: Your article focuses on the “noise” that’s present in research studies. What is “noise” and how is it created during an experiment?
Andrew Gelman: Noise is random error that interferes with our ability to observe a clear signal. It can take many forms, including sampling variability from small samples, unexplained error from unmeasured factors, or measurement error from poor instruments for the things you do want to measure. In everyday life we take measurement for granted – a pound of onions is a pound of onions. But in science, and maybe especially social science, we observe phenomena that vary from person to person, that are affected by multiple factors, and that aren’t transparent to measure (things like attitudes, dispositions, abilities). So our observations are much more variable.
Noise is all the variation that you don’t happen to be currently interested in. In psychology experiments, noise typically includes measurement error (for example, ask the same person the same question on two different days, and you can get two different answers, something that’s been well known in social science for many decades) and also variation among people.
RW: In your article, you “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger.” What do you mean by that?
AG: We blogged about the “What does not kill my statistical significance makes it stronger” fallacy here. As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. And we also know that noisy data and small sample sizes make statistical significance even harder to attain. So if you do get statistical significance under such inauspicious conditions, it’s tempting to think of this as even stronger evidence that you’ve found something real. This reasoning is erroneous, however. Statistically speaking, a statistically significant result obtained under highly noisy conditions is more likely to be an overestimate and can even be in the wrong direction. In short: a finding from a low-noise study can be informative, while a finding at the same significance level from a high-noise study is likely to be little more than . . . noise.
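This claim is easy to check by simulation. The sketch below (not from the interview; the effect size, noise level, and sample size are illustrative assumptions) repeatedly runs a small, noisy study of a tiny true effect and looks only at the runs that reach significance — those estimates are severely exaggerated, and some even have the wrong sign:

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.1   # assumed small true effect
NOISE_SD = 1.0      # assumed high noise
N = 30              # assumed small sample
SIMS = 10000

sig_estimates = []
for _ in range(SIMS):
    sample = [random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    if abs(mean / se) > 1.96:          # "statistically significant"
        sig_estimates.append(mean)

# Among significant results: how inflated is the estimate, and how
# often does it point the wrong way?
exaggeration = statistics.mean(abs(m) for m in sig_estimates) / TRUE_EFFECT
wrong_sign = sum(m < 0 for m in sig_estimates) / len(sig_estimates)
print(f"significant in {len(sig_estimates) / SIMS:.1%} of runs")
print(f"average |estimate| is {exaggeration:.1f}x the true effect")
print(f"{wrong_sign:.1%} of significant results have the wrong sign")
```

With these settings, a significant result must clear a threshold several times larger than the true effect, so conditioning on significance guarantees overestimation — exactly the Type M (magnitude) and Type S (sign) errors Gelman describes.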
RW: Which fields of research are most affected by this assumption, and the influence of noise?
AG: The human sciences feature lots of variation among people, and difficulty of accurate measurements. So psychology, education, and also much of political science, economics, and sociology can have big issues with variation and measurement error. Not always — social science also deals in aggregates — but when you get to individual data, it’s easy for researchers to be fooled by noise — especially when they’re coming to their data with a research agenda, with the goal of finding something statistically significant that can get published.
We’re not experts in medical research but, from what we’ve heard, noise is a problem there too. The workings of the human body might not differ so much from person to person, but when effects are small and measurement is variable, researchers have to be careful. Any example where the outcome is binary — life or death, or recovery from disease or not — will be tough, because yes/no data are inherently variable when there’s no in-between state to measure.
A recent example from the news was the PACE study of treatments for chronic fatigue syndrome: there’s been lots of controversy about outcome measurements, statistical significance, and specific choices made in data processing and data analysis — but at the fundamental level this is a difficult problem because measures of success are noisy and are connected only weakly to the treatments and to researchers’ understanding of the disease or condition.
RW: How do your arguments fit into discussions of replications — ie, the ongoing struggle to address why it’s so difficult to replicate previous findings?
AG: When a result comes from little more than noise mining, it’s not likely to show up in a preregistered replication. I support the idea of replication if for no other reason than the potential for replication can keep researchers honest. Consider the strategy employed by some researchers of twisting their data this way and that in order to find a “p less than .05” result which, when draped in a catchy theory, can get published in a top journal and then get publicized on NPR, Gladwell, TED talks, etc. The threat of replication changes the cost-benefit calculation for this research strategy. The short- and medium-term benefits (publication, publicity, jobs for students) are still there, but there’s now the medium-term risk that someone will try to replicate and fail. And the more publicity your study gets, the more likely someone will notice and try that replication. That’s what happened with “power pose.” And, long-term, enough failed replications and not too many people outside the National Academy of Sciences and your publisher’s publicity department are going to think what you’re doing is even science.
That said, in many cases we are loath to recommend pre-registered replication. This is for two reasons: First, some studies look like pure noise. What’s the point of replicating a study that is, for statistical reasons, dead on arrival? Better to just move on. Second, suppose someone is studying something for which there is an underlying effect, but his or her measurements are so noisy, or the underlying phenomenon is so variable, that it is essentially undetectable given the existing research design. In that case, we think the appropriate solution is not to run the replication, which is unlikely to produce anything interesting (even if the replication is a “success” in having a statistically significant result, that result itself is likely to be a non-replicable fluke). It’s also not a good idea to run an experiment with much larger sample size (yes, this will reduce variance but it won’t get rid of bias in research design, for example when data-gatherers or coders know what they are looking for). The best solution is to step back and rethink the study design with a focus on control of variation.
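The point about sample size deserves emphasis: a bigger N shrinks the error bars but does nothing about systematic bias in the design. A minimal sketch (my own illustration, with an assumed constant bias standing in for, say, coders who know the hypothesis) shows both effects:

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.0
BIAS = 0.2   # assumed systematic design bias (e.g., unblinded coders)

def biased_study(n):
    """Mean of n measurements contaminated by a constant bias."""
    return statistics.mean(random.gauss(TRUE_EFFECT + BIAS, 1.0)
                           for _ in range(n))

small = [biased_study(30) for _ in range(1000)]     # small-N studies
large = [biased_study(3000) for _ in range(1000)]   # 100x larger N

print(f"small-N: mean {statistics.mean(small):.2f}, "
      f"sd {statistics.stdev(small):.2f}")
print(f"large-N: mean {statistics.mean(large):.2f}, "
      f"sd {statistics.stdev(large):.2f}")
# The spread shrinks dramatically with N, but both sets of estimates
# cluster around BIAS, not around the true effect of zero.
```

The large-N studies are far more precise — and precisely wrong. That is why the recommended fix is redesigning for variance control and better measurement, not simply collecting more of the same data.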
RW: Anything else you’d like to add?
AG: In many ways, we think traditional statistics, with its context-free focus on distributions and inferences and tests, has been counterproductive to research in the human sciences. Here’s the problem: A researcher does a small-N study with noisy measurements, in a setting with high variation. That’s not because the researcher’s a bad guy; there are good reasons for these choices: Small-N is faster, cheaper, and less of a burden on participants; noisy measurements are what happen if you take measurements on people and you’re not really really careful; and high variation is just the way things are for most outcomes of interest. So, the researcher does this study and, through careful analysis (what we might call p-hacking or the garden of forking paths), gets a statistically significant result. The natural attitude is then that noise was not such a problem; after all, the standard error was low enough that the observed result was detected. Thus, retroactively, the researcher decides that the study was just fine. Then, when it does not replicate, there’s lots of scrambling and desperate explanations. But the problem, the original sin as it were, was the high noise level. It turns out that the attainment of statistical significance cannot and should not be taken as retroactive evidence that a study’s design was efficient for research purposes. And that’s where the “What does not kill my statistical significance makes it stronger” fallacy comes back in.
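The garden of forking paths is also easy to demonstrate numerically. In this sketch (my illustration, with assumed numbers of outcomes and subjects), every measured outcome is pure noise, yet a researcher who checks several outcomes and reports whichever one “worked” finds p < .05 far more often than the nominal 5%:

```python
import random
import statistics
from math import erf, sqrt

random.seed(42)

def p_value(sample):
    """Two-sided p-value for mean = 0, normal approximation."""
    z = statistics.mean(sample) / (statistics.stdev(sample) / sqrt(len(sample)))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

SIMS, N, OUTCOMES = 2000, 25, 10   # assumed: 10 outcome measures per study
hacked = 0
for _ in range(SIMS):
    # Pure noise: no real effect on any of the ten outcomes.
    ps = [p_value([random.gauss(0, 1) for _ in range(N)])
          for _ in range(OUTCOMES)]
    if min(ps) < 0.05:             # report whichever outcome "worked"
        hacked += 1

print(f"at least one p < .05 in {hacked / SIMS:.0%} of pure-noise studies")
# Expect well above the nominal 5% (around 1 - 0.95**10, i.e. ~40%).
```

No single analysis is fraudulent here; the inflation comes entirely from the freedom to choose among comparisons after seeing the data — which is why significance obtained this way says nothing about the soundness of the design.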