## Headline-grabbing *Science* paper questioned by critics

When zoologists at the University of Oxford published findings in *Science* last year suggesting ducklings can learn to identify shapes and colors without training (unlike other animals), the news media was entranced.

However, critics of the study have published a pair of papers questioning the findings, saying the reported preferences could well have arisen by chance alone. Still, the critics told us they don’t believe the paper should be retracted.

If a duckling is shown an image, can it pick out another from a set that has the same shape or color? Antone Martinho III and Alex Kacelnik say yes. In one experiment, 32 out of 47 ducklings preferred pairs of shapes they were originally shown. In the second experiment, 45 out of 66 ducklings preferred the original color. The findings caught the attention of many media outlets, including the *New York Times*, *The Atlantic*, and *BuzzFeed*.
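For readers who want to see the arithmetic, here is a minimal sketch (in Python; not the authors’ actual code) of the kind of test behind such claims: an exact two-tailed binomial test on the reported counts, under the simplifying assumption that each duckling is one independent trial, which is the very assumption the critics go on to dispute.

```python
from math import comb

def binom_two_tailed(successes, n):
    """Exact two-tailed binomial test against chance (p = 0.5):
    double the smaller one-sided tail, capped at 1."""
    upper = sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n
    lower = sum(comb(n, k) for k in range(0, successes + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))

print(binom_two_tailed(32, 47))  # shape experiment: 32 of 47 ducklings
print(binom_two_tailed(45, 66))  # color experiment: 45 of 66 ducklings
```

On these counts the test returns p-values below the conventional 0.05 threshold; the dispute below is over whether treating each duckling as one coin flip is the right way to analyze the data.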

Martinho told us:

We estimated statistically the probability with which our results could be expected by chance. That probability is extremely low, well beyond what is required normally in experimental science… One of our critics has even been quoted as saying to the press that some ducklings had random preferences. True, that’s the point of statistics! A minority even showed the opposite preference, and this is still fine, that’s why we estimated probabilities.

However, two separate research teams reanalyzed the data and came up with different conclusions.

Jan Langbein at the Leibniz Institute for Farm Animal Biology in Germany — who co-authored a technical note in *Science* with Birger Puppe at the University of Rostock, Germany — told us:

If you had free choice and you [took] five pieces of black chocolate and six pieces of white chocolate, [the paper] would argue you have a distinct preference for the white chocolate. Which is not true because with the next piece you [might] choose the black chocolate again.

Langbein and Puppe used another statistical test, the binomial test, and found that the conclusions only held up for shapes, not colors, which might make evolutionary sense. Langbein told us:

As ducklings hatch at any time of day and night, one can conclude that imprinting not only occurs during daytime but also when brightness is low and color is not a salient stimulus for learning about the mother and siblings.

He added:

Our critique does not warrant a retraction of the data, but a new interpretation.

Another researcher argued that the entire study rested on a discredited statistical method. Jean-Michel Hupé at the Université Toulouse argued that p-values below 0.05 do not demonstrate significance; they only indicate a surprising result.

The usual practice for about 60 years is to consider that if your observation is surprising, then the null hypothesis is probably wrong. But how “probably wrong”, you don’t know, and you don’t know it whatever the p-value… p-values are pretty useless to make any inference on models or parameters of models, and this has been very well known for decades (even though I, like all my colleagues, were taught to use p-values). Since last year, this is also official with the American Statistical Association publishing a statement about it. The usage of p-value is no longer controversial. It’s just wrong.

Instead, Hupé argued for confidence intervals, which give a range of values consistent with the observed data. Even confidence intervals have issues, he said in an email, but:

practically, you know that you have **about** 95% chance that the true value is within your 95% CI…The CI allows you to interpret your data, the p-value does not.
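As a concrete illustration of interval estimation (a sketch only; Hupé’s published reanalysis used his own methods), here is a 95% Wilson score interval for the shape-experiment count of 32 ducklings out of 47, again under the simplifying assumption of independent trials:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(32, 47)  # shape experiment: 32 of 47 ducklings
print(f"[{lo:.0%}, {hi:.0%}]")
```

The interval spans roughly 54% to 80%, which conveys the uncertainty in a way a bare p-value does not.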

Indeed, Hupé argued that his reanalysis showed few firm conclusions could be drawn from the duckling data:

What if Martinho and Kacelnik had presented their data the way I suggest? Their study would have looked like a promising pilot study, describing a clever paradigm. That’s definitely worth publishing. I doubt however that Science editors would have considered it to be worth publishing in their high-impact journal. But that’s their problem, not ours. They are responsible for promoting good stories rather than humble facts. My hope is that readers will understand better statistics after reading my comment.

Still, Hupé agreed the paper should not be retracted:

Retraction is certainly not warranted (unless you decide to retract most papers, which based their conclusions on p-values), and I hope that my technical comment is enough as a cautionary notice.

A representative of *Science* agreed, telling us the journal

is not considering an Editorial Expression of Concern, nor a Retraction, for this paper.

The published exchanges allow for transparent debate about the data and conclusions, the spokesperson said, and the *Science* editorial team believed any technical issues were constructively addressed by the researchers.

Meanwhile, Martinho and Kacelnik — who also published a response to the criticisms — are moving forward with replicating their initial observations.

Martinho told us:

Both our group and other experimentalists have conducted more experiments testing the same and related ideas. Our practice is to report results through the appropriate, peer-reviewed media, when they are fully analysed. I can only say that results so far are reassuringly strong.


---

The initial data analysis is sloppy, but to be honest I find the re-analysis to be sloppier: taking arbitrary thresholds for the preference of individual ducklings seems like just an extra step that removes power from the initial test. A random effects model seems a lot more principled and higher powered here.

And… what is the conclusion from applying the random effects model?

Dear Toby and Patrick, I found the time to perform random effect analyses. Here are a couple of them, for comparison with the original analysis and mine.

(1) I start with the analysis that most convinced Martinho and Kacelnik, on 117 ducklings: “combining both results, […] two-tailed binomial test, P < 0.0001”. The model for the mixed (random) effect analysis, in R, is `glmer(cbind(Imprinted.concept, Novel.concept) ~ (1 | Bird), data = allMartinhoData, binomial(link = "logit"))`. P = 0.02, 95% CI for the odds ratio = [1.1 to 3.3]. The result is therefore strictly similar to my first analysis (also done by Langbein and Puppe), using a criterion of p < 0.05 (binomial test) to classify each duckling (“P = 0.01, 95% CI = [53 to 75]% chance”, i.e. [1.1 to 3] for the odds ratio). The slightly “less significant” P-value may be surprising for a more powerful analysis, but it corresponds to my observation that “the noisier the data, the more it deviates from randomness”.

(2) The main argument was the generalization of the preference for the imprinted pair over different kinds of objects (different shapes and colors). The crucial analysis was therefore the test of group effects, not performed by Martinho and Kacelnik. For the first experiment, I had reported that “group independence was unlikely (chi-square test, P = 0.022)”. Here, the more powerful random effect analysis makes the result much clearer. When testing the four groups of ducklings with `glmer(cbind(Imprinted.concept, Novel.concept) ~ group + (1 | Bird), data = Shapes, binomial)`, we obtain F(3, 28.4) = 9.5, p = 0.00002. Differences also emerge among the 10 color groups, but not as clearly, because of the small number of ducklings tested in each group, and the model without groups is in fact better (Akaike information criterion).

(3) Discussion: the random effect analysis proposed by Patrick Mineault therefore questions the original claims even more than my own analyses did.
But is it really the best, “more principled” analysis? Like all analyses, the random effect analysis rests on assumptions. The main one is the independence of measures. Martinho and Kacelnik considered in their response that this assumption (treating “the number of approaches as independent observations”) was not correct, and I fully agree, since a duckling’s behavior depends on its previous approaches (the test being done in the “sensitive period”). That’s why it made more sense to try to classify each duckling based on its global behavior, perhaps also using qualitative observations, and then perform statistics only at the second level.

The main problem is that Martinho and Kacelnik did not provide convincing arguments that their classification method was accurate. They used only what they called, improperly, a “sign test”, which is simply counting the difference in approaches. I had illustrated why such a criterion is problematic, since it treats similarly a bird “like S14 […] that followed one pair four times and the other three times” and a bird like “S04 [that] followed the imprinted pattern 44 times and the novel one 19 times”. Assuming independence (which is not correct in the case of imprinting, but still a reasonable null hypothesis if the duckling is not imprinting on anything), “the probability of observing by chance a difference at least as large as this one is 1” in the first case and 0.002 in the second.

Statisticians cannot tell Martinho and Kacelnik how to classify ducklings. What I tried to do is simply show how the conclusions depend on different arbitrary classification criteria. In any case, the different performance levels across groups remain the main issue, which Martinho and Kacelnik could have noticed even using their own classification criterion.
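The S14/S04 contrast described above is easy to reproduce; here is a small sketch (in Python, assuming the independence null hypothesis described in the comment):

```python
from math import comb

def binom_two_tailed(successes, n):
    """Exact two-tailed binomial test against chance (p = 0.5)."""
    upper = sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n
    lower = sum(comb(n, k) for k in range(0, successes + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))

# S14: followed one pair 4 times, the other 3 times (7 approaches total)
print(binom_two_tailed(4, 7))
# S04: followed the imprinted pattern 44 times, the novel one 19 times
print(binom_two_tailed(44, 63))
```

Under this null, 4-of-7 gives p = 1 while 44-of-63 gives a p-value of roughly 0.002, matching the figures quoted above.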

Oh, I forgot: the result of the analysis for the second experiment (colors) with the random effect analysis is also similar to the one with the 0.05 criterion: p = 0.23, 95% CI for the odds ratio = [0.74 to 3.34] (see my Table I: p = 0.20, CI = [45 to 74]%).

Thank you Jean-Michel for taking the time to perform the extra analysis and for posting a summary here. Very nice to see the discussion going on.

Yes indeed, you’re right. Did you try it? I also thought about trying a random effect analysis, but my point is that it would not make any major difference (but don’t hesitate to prove me wrong, I’m happy to learn), as long as you present confidence intervals, of course. This was what I wanted to stress: don’t look for any magic, unique “perfect” way of analyzing the data. Even with the unfortunate choice of arbitrary thresholds made by the authors, showing CIs (and interpreting them correctly) would have changed the message quite dramatically. I found it more pedagogical to keep the analysis simple: you don’t need to master complicated statistical methods to analyze your data better.

Also, even though I did not want to insist on that in my comment, the way CIs shift systematically when changing the criterion (“the noisier the data, the more it deviates from randomness”) does not make much sense. This suggests that something else may be wrong with the data set or analysis. You would not see that with a single random effect analysis. That’s why I find it informative to explore the data with different criteria, and to show all the possible outcomes. Importantly, that way you refuse to conclude, you refuse to tell stories, which has been very detrimental to scientific progress, imho.

The comment regarding confidence intervals in this post is not correct. It is absolutely correct to point out that the p-value does not give the probability that the results occurred by chance. But it is equally incorrect to interpret the 95% CI as having a 95% chance of including the true effect size. Consider the situation when a study is conducted twice, using identical methods. There is a 5% chance the first study will be an outlier such that the 95% CI does not include the true mean. It is also possible for the second study to be an outlier in the opposite direction. You will then have two 95% CIs that do not overlap. It is obviously nonsense to believe that both of these intervals have a 95% chance of including the true mean (the probabilities of two mutually exclusive hypotheses cannot add up to more than 1).

The way to describe the meaning of the confidence interval is this: if a study were repeated many times and a 95% CI calculated for each of the data sets, approximately 95% of those CIs would include the true population mean. (For any one CI, there is either a 100% chance or a 0% chance that it includes the true mean. It either does or it does not.)
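That long-run reading is easy to check by simulation. A minimal sketch (with illustrative values, unrelated to the duckling data): draw many samples from a known proportion, compute a standard normal-approximation CI for each, and count how often the interval captures the truth.

```python
import math
import random

random.seed(42)

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

true_p = 0.6       # the known "true" proportion in this simulation
n, reps = 200, 2000
covered = 0
for _ in range(reps):
    successes = sum(random.random() < true_p for _ in range(n))
    lo, hi = wald_ci(successes, n)
    covered += lo <= true_p <= hi

print(f"coverage: {covered / reps:.3f}")  # close to, though rarely exactly, 0.95
```

Any one interval either contains `true_p` or it does not; the 95% figure describes the long-run hit rate across repetitions, exactly as stated above.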

I of course agree with the correct description of the CI (I had recalled it in my *Science* comment). In this post, I had included **about**. I had developed this in my response to Trevor: “Mathematically, [a CI] is almost as useless as a p-value, since you have no way to know if your particular CI included the true value, and you cannot even compute the probability that the true value is included in your particular CI”. “Bayesian statistics allow you to consider all the possible priors – they are called flat or non-informative priors. Kruschke is a good advocate of those statistics, and, indeed, they allow you to compute ‘highest density intervals’ (HDI) of likelihood, meaning that you may compute rigorously the 95% likelihood that the ‘true’ parameter is within a given range. There is a piece of good news, however, for people who don’t want to learn Kruschke’s Bayesian statistics. For ‘well-behaved’ statistics (like binomial or normal distributions), confidence intervals are very similar to HDIs.” I had then concluded: “if you have no information about the priors and you’re using well-behaved statistics, then you know that your HDI would be about the same. So, in fact, not mathematically but practically, you know that you have **about** a 95% chance that the true value is within your 95% CI.”

You like it better that way?

But in fact, the description of a CI I prefer is this one: “The only use I know for a confidence interval is to have confidence in it” (Smithson 2003, quoting Savage 1962).

Unlike p-values, CIs do reflect the data quite well and are therefore useful to scientists.

Well-put: I like your expanded explanation.