Retraction Watch

Tracking retractions as a window into the scientific process

Headline-grabbing Science paper questioned by critics

When zoologists at the University of Oxford published findings in Science last year suggesting ducklings can learn to identify shapes and colors without training (unlike other animals), the news media was entranced.

However, critics of the study have published a pair of papers questioning the findings, saying the data likely stem from chance alone. Still, the critics told us they don’t believe the findings should be retracted.

If a duckling is shown an image, can it pick out another from a set that has the same shape or color? Antone Martinho III and Alex Kacelnik say yes. In one experiment, 32 out of 47 ducklings preferred pairs of shapes they were originally shown. In the second experiment, 45 out of 66 ducklings preferred the original color. The findings caught the attention of many media outlets, including the New York Times, The Atlantic, and BuzzFeed.

Martinho told us:

We estimated statistically the probability with which our results could be expected by chance. That probability is extremely low, well beyond what is required normally in experimental science…One of our critics has even been quoted as saying to the press that some ducklings had random preferences. True, that’s the point of statistics! A minority even showed the opposite preference, and this is still fine, that’s why we estimated probabilities.

However, two separate research teams reanalyzed the data and came up with different conclusions.

Jan Langbein at the Leibniz Institute for Farm Animal Biology in Germany — who co-authored a technical note in Science with Birger Puppe at the University of Rostock, Germany — told us:

If you had free choice and you [took] five pieces of black chocolate and six pieces of white chocolate, [the paper] would argue you have a distinct preference for the white chocolate. Which is not true because with the next piece you [might] choose the black chocolate again.

Langbein and Puppe used a different statistical test, the binomial test, and found that the conclusions held up only for shapes, not colors, which might make evolutionary sense. Langbein told us:

As ducklings hatch at any time of day and night, one can conclude that imprinting not only occurs during daytime but also when brightness is low and color is not a salient stimulus for learning about the mother and siblings. 

He added:

Our critique does not warrant a retraction of the data, but a new interpretation.
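To make the chocolate analogy concrete, here is a minimal sketch of what an exact binomial test computes. It is an illustration only, not a reconstruction of Langbein and Puppe's analysis; the function name `binomial_two_sided_p` and the 10-of-11 contrast case are assumptions introduced for this example.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: the total probability of every
    outcome that is no more likely than the observed count k under the null."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


# Langbein's analogy: 6 white pieces out of 11, null hypothesis of no preference.
chocolate_p = binomial_two_sided_p(6, 11)

# Hypothetical contrast (not from the study): 10 white pieces out of 11.
lopsided_p = binomial_two_sided_p(10, 11)

print(f"6 of 11:  p = {chocolate_p:.3f}")
print(f"10 of 11: p = {lopsided_p:.3f}")
```

A 6-to-5 split gives a p-value of 1.000, meaning it is entirely consistent with having no preference, while the hypothetical 10-to-1 split gives about 0.012.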

Another researcher argued that the entire study was based on a defunct statistical method. Jean-Michel Hupé at the Université Toulouse argued that a p-value below 0.05 does not establish that an effect is real, only that the result would be surprising if the null hypothesis were true.

The usual practice for about 60 years is to consider that if your observation is surprising, then the null hypothesis is probably wrong. But how “probably wrong”, you don’t know, and you don’t know it whatever the p-value….p-values are pretty useless to make any inference on models or parameters of models, and this has been very well known for decades (even though I, like all my colleagues, were taught to use p-values). Since last year, this is also official with the American Statistical Association publishing a statement about it. The usage of p-value is no longer controversial. It’s just wrong.

Instead, Hupé argued that researchers should report confidence intervals, which give a range of plausible values for the quantity being estimated. Even confidence intervals have issues, he said in an email, but:

practically, you know that you have **about** 95% chance that the true value is within your 95% CI…The CI allows you to interpret your data, the p-value does not.
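As a rough illustration of what such an interval looks like, the sketch below computes a 95% interval for the proportion of ducklings that preferred the original stimulus, using the counts reported above. The Wilson score method and the name `wilson_interval` are assumptions chosen for this example; this is not necessarily the method Hupé used in his reanalysis.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts as reported in the article: ducklings preferring the original stimulus.
shape_ci = wilson_interval(32, 47)
color_ci = wilson_interval(45, 66)

print(f"shape: {shape_ci[0]:.2f} to {shape_ci[1]:.2f}")
print(f"color: {color_ci[0]:.2f} to {color_ci[1]:.2f}")
```

With samples of this size, both intervals are more than 0.2 wide, which gives a sense of how imprecisely the underlying proportion is pinned down.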

Indeed, Hupé argued that his reanalysis showed little could be concluded from the duckling data:

What if Martinho and Kacelnik had presented their data the way I suggest? Their study would have looked like a promising pilot study, describing a clever paradigm. That’s definitely worth publishing. I doubt however that Science editors would have considered it to be worth publishing in their high-impact journal. But that’s their problem, not ours. They are responsible for promoting good stories rather than humble facts. My hope is that readers will understand better statistics after reading my comment.

Still, Hupé agreed the paper should not be retracted:

Retraction is certainly not warranted (unless you decide to retract most papers, which based their conclusions on p-values), and I hope that my technical comment is enough as a cautionary notice.

A representative of Science agreed, telling us the journal

is not considering an Editorial Expression of Concern, nor a Retraction, for this paper.

The published exchanges allow for transparent debate about the data and conclusions, the spokesperson said, and the Science editorial team believed any technical issues were constructively addressed by the researchers.

Meanwhile, Martinho and Kacelnik, who also published a response to the criticisms, are moving forward with replicating their initial observations.

Martinho told us:

Both our group and other experimentalists have conducted more experiments testing the same and related ideas. Our practice is to report results through the appropriate, peer-reviewed media, when they are fully analysed. I can only say that results so far are reassuringly strong.

Written by trevorlstokes

March 13th, 2017 at 11:30 am

  • Patrick Mineault March 15, 2017 at 1:23 am

    The initial data analysis is sloppy, but to be honest I find the re-analysis to be sloppier — take arbitrary thresholds for the preference of individual ducklings? Seems to me like that’s just an extra step to remove some power from the initial test. A random effects model seems a lot more principled and higher powered here.

    • Toby March 15, 2017 at 9:07 am

      And……what is the conclusion from applying the random effects model?

  • Tim McCulloch March 16, 2017 at 11:46 pm

    The comment regarding confidence intervals in this post is not correct. It is absolutely correct to point out that the p-value does not give the probability that the results occurred by chance. But it is equally incorrect to interpret the 95% CI as having a 95% chance of including the true effect size. Consider the situation when a study is conducted twice, using identical methods. There is a 5% chance the first study will be an outlier such that the 95% CI does not include the true mean. It is also possible for the second study to be an outlier in the opposite direction. You will then have two 95% CI’s that do not overlap. It is obviously a nonsense to believe that both these intervals have a 95% chance of including the true mean (the probabilities of two mutually exclusive hypotheses cannot add up to more than 1).

    The way to describe the meaning of the confidence interval is this: If a study was repeated many times and a 95% CI calculated for each of the data sets, approximately 95% of these CI’s would include the true population mean. (For any one CI, there is either a 100% chance or a zero% chance it includes the true mean. It either does or it does not.)
