A psychology journal is retracting a 2015 paper that attracted press coverage by suggesting women’s hormone levels drive their desire to be attractive, after a colleague alerted the last author to flaws in the statistical analysis.

The paper, published online in November, found women prefer to wear makeup when there is more testosterone present in their saliva. The findings were picked up by various media including *Psychology Today* (“Feeling hormonal? Slap on the makeup”), and even made it onto reddit.com.

However, upon discovering a problem in the analysis of the data, the authors realized that the central finding didn’t hold up, according to *Psychological Science*’s interim editor, Stephen Lindsay:

Last month, senior author Benedict Jones notified the Associate Editor who had served as Action Editor on this manuscript, explaining that he (Jones) had heard from another psychologist that there was a flaw in the central statistical analysis. Jones, having looked into the matter, agreed that there was indeed a flaw and that the central finding did not hold up when the flaw was corrected and therefore asked the AE to arrange for retraction. The AE relayed this information to me along with his judgment that the author was correct, and I judged that retraction was appropriate and worked with Jones and APS staff on a retraction statement.

Lindsay provided more details about how the problem was discovered:

The error had to do with a fine point in the specification of a linear mixed model analysis. The authors included in the supplemental online material all of the details of the model so in principle we should have caught it. I understand that the person who detected the error did so because he re-analyzed the data, which Fisher et al. had posted on [the Open Science Framework] OSF, as part of a statistics course.

Lindsay gave us a heads up about the retraction before it happened, along with an explanation of what went wrong, so he (and the authors of the paper) definitely belongs in our “doing the right thing” category. As he told us:

We are continuing to work on ways to detect these sorts of errors before they are published. But when that fails we do our best to take corrective action and get the word out quickly.

Last author Benedict Jones at the University of Glasgow confirmed this account to us:

Yes, we asked the journal to retract the article after a colleague had alerted us to a problem with the analysis. We had made the data and our analysis files available through OSF.

The colleague drew our attention to a 2013 paper (Barr et al., 2013 Journal of Memory and Language) that shows the type of analysis we reported can inflate the Type 1 error rate. Reanalyses that addressed this issue following recommendations in the 2013 paper did not show the key effect reported in our original article.

Here’s a link to the paper Jones mentions.

(Disclosure: Our parent organization, The Center For Scientific Integrity (CSI), is partnering with The Center For Open Science (COS) to create our retraction database on the Open Science Framework, or OSF.)

Here’s more from the abstract of the to-be-retracted paper, “Women’s Preference for Attractive Makeup Tracks Changes in Their Salivary Testosterone,” which has not yet been cited, according to Thomson Scientific’s Web of Knowledge:

We found that women’s preference for attractive makeup increases when their salivary testosterone levels are high. The relationship between testosterone level and preference for attractive makeup was independent of estradiol level, progesterone level, and estradiol-to-progesterone ratio. These results suggest that testosterone may contribute to changes in women’s motivation to wear attractive makeup and, potentially, their motivation to appear attractive in general.

Update 2/5/16 10:33 a.m. eastern: The retraction has been posted. It reads:

At the request of the authors, the following article has been retracted by the Editor and publishers of Psychological Science:

Fisher, C. I., Hahn, A. C., DeBruine, L. M., & Jones, B. C. (2015). Women’s preference for attractive makeup tracks changes in their salivary testosterone. Psychological Science, 26, 1958–1964. doi:10.1177/0956797615609900

The authors of this article have notified the Editor as follows:

Our article reported linear mixed models showing interactive effects of testosterone level and perceived makeup attractiveness on women’s makeup preferences. These models did not include random slopes for the term perceived makeup attractiveness, and we have now learned that the Type 1 error rate can be inflated when by-subject random slopes are not included (Barr, Levy, Scheepers, & Tily, 2013). Because the interactions were not significant in reanalyses that addressed this issue, we are retracting this article from the journal.

It also includes a note from the editor at the bottom:

Editor’s Note: I would like to add an explicit statement that there is every indication that this retraction is entirely due to an honest mistake on the part of the authors.


In such cases, it would be useful to get very specific information about the exact nature of the “linear mixed model” misspecification. There are 3-4 that I can think of:

1. incorrect covariance structure (use of compound symmetry instead of unspecified might lead to inflation in Type I error)

2. incorrect specification of dependent variable error type (failure to define as binary or ordered categorical)

3. failure to properly identify dependencies between observations

4. failure to properly define random and fixed terms

Being a little more specific for more technically minded readers would be helpful.

We’ve asked Jones for more details on the nature of the error, and he responded:

“The error was not including random slopes for the term ‘makeup attractiveness’ in the cross-classified model.

The Barr et al. (2013) paper demonstrates that not including these random slopes inflates the Type 1 error rate.

We do explain the nature of the error (and cite the Barr et al., 2013 paper) in the retraction statement.”

I agree with you, Paul. Without further details it is quite hard to say where the flaw was.

Also, I want to draw attention to the fact that the Barr et al. (2013) article and its recommendations have been criticized by Bates et al. (2015), who argue for a more parsimonious random structure instead: http://arxiv.org/abs/1506.04967.

I would find it problematic to use the Barr et al. paper as the sole argument for retracting an article that did not conform to these guidelines. But since I don’t know the details of the whole thing, it is hard to tell if the “flaw” was more trivial than just not adding, for example, random slopes to the model.

Judging by the reference to Barr et al. (2013), it sounds like the problem was (4): they didn’t specify a “maximal” random effect structure, i.e., each factor that was observed at multiple levels for a participant or item was not included as a random slope in that random group. I hope they looked to see whether random slopes were necessary in this particular case before retraction.

I’m one of the authors of the Barr et al. (2013) article, and I’d like to respond to the questions raised about the appropriateness of the random slopes in this case. Thanks to the progressive data sharing practices of the authors of the retracted article, which I commend, it is easy to answer these questions by working with the original data and analysis code (https://osf.io/ucn6q/).

In this case, the theoretically critical predictor in the regression analysis is the interaction between “attractiveness” (the third-party independently rated attractiveness of the makeup used on a face stimulus) and testosterone levels in influencing a participant’s rating of their preference for that face stimulus. (The shape of the interaction was that the effect of stimulus attractiveness on preference ratings was larger when testosterone levels were higher.) Because each participant rated faces over several sessions and their testosterone level varied across sessions, if there is inter-participant variation in the effect of this interaction on preference ratings, it could affect the strength of statistical evidence for the overall direction of the reported interaction. To get a sense of the potential effect on statistical conclusions: there are 21,250 observations in the complete dataset, but only 85 participants. If there is inter-participant variability in the interaction effect, then you effectively have 85 independent observations in your test for robustness of the interaction across individuals; by ignoring it, you are effectively acting as if you have 21,250 independent observations. The potential anti-conservativity of ignoring this kind of inter-participant variation (technically, the “random by-participant slope for the attractiveness-testosterone interaction”) is recognized even among the critics of the Barr et al. 2013 paper.
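The 85-versus-21,250 point can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not a reanalysis of the Fisher et al. data: it uses a balanced +1/-1 predictor, 50 observations per participant instead of 250 (to keep it quick), invented values for the slope and noise standard deviations, and a plain pooled regression as a stand-in for a model without by-participant slopes. The population-average slope is set to exactly zero, so every "significant" result is a false positive.

```python
import math
import random
from statistics import mean, stdev


def simulate_null_dataset(rng, n_participants, n_obs, slope_sd, noise_sd):
    """Simulate one dataset whose population-average slope is zero.

    Returns (t_pooled, t_summary): the t statistic from a pooled regression
    that ignores by-participant slope variation, and the t statistic from a
    one-sample t-test on the per-participant slopes.
    n_obs must be even so that the +1/-1 predictor is balanced.
    """
    n_total = n_participants * n_obs
    sum_y = sum_yy = sum_xy = 0.0
    participant_slopes = []
    for _ in range(n_participants):
        own_slope = rng.gauss(0.0, slope_sd)  # differs by person, mean zero
        part_xy = 0.0
        for k in range(n_obs):
            x = 1.0 if k % 2 == 0 else -1.0   # balanced +1/-1 predictor
            y = own_slope * x + rng.gauss(0.0, noise_sd)
            sum_y += y
            sum_yy += y * y
            part_xy += x * y
        sum_xy += part_xy
        # With balanced +1/-1 coding the least-squares slope is mean(x * y).
        participant_slopes.append(part_xy / n_obs)

    # Pooled regression: treats all n_total rows as independent.
    pooled_slope = sum_xy / n_total
    rss = sum_yy - sum_y ** 2 / n_total - sum_xy ** 2 / n_total
    pooled_se = math.sqrt(rss / (n_total - 2) / n_total)
    t_pooled = pooled_slope / pooled_se

    # Summary analysis: one slope per participant, so 85 data points.
    t_summary = mean(participant_slopes) / (
        stdev(participant_slopes) / math.sqrt(n_participants)
    )
    return t_pooled, t_summary


def false_positive_rates(n_sims=300, n_participants=85, n_obs=50,
                         slope_sd=0.3, noise_sd=1.0, seed=1):
    """Share of null datasets declared significant by each analysis."""
    rng = random.Random(seed)
    crit_pooled = 1.96    # normal approximation (very large df)
    crit_summary = 1.989  # two-sided 5% t critical value for 84 df
    pooled_hits = summary_hits = 0
    for _ in range(n_sims):
        t_pooled, t_summary = simulate_null_dataset(
            rng, n_participants, n_obs, slope_sd, noise_sd)
        pooled_hits += abs(t_pooled) > crit_pooled
        summary_hits += abs(t_summary) > crit_summary
    return pooled_hits / n_sims, summary_hits / n_sims


pooled_rate, summary_rate = false_positive_rates()
print(f"pooled analysis false-positive rate:  {pooled_rate:.3f}")
print(f"per-participant analysis rate:        {summary_rate:.3f}")
```

Under these invented settings the pooled analysis should reject the true null far more often than the nominal 5%, while the per-participant analysis should stay near it. How large the gap is in any real dataset depends on how much the slopes actually vary between participants.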

The disagreements raised by Bates et al. 2015 fundamentally regard how to approach the question of when to put the theoretically key random slopes in your model. Bates et al. advocate a “data-driven” approach in which you subject the inclusion of a random slope to a statistical test, whereas we (Barr et al.) advocate a “design-driven” approach in which you assume the presence of theoretically key random slopes for grouping factors in your data (e.g., participants) that you have good prior reason to believe might vary in their sensitivity to the manipulation in question. In this particular case, there is massive statistical evidence for the presence of the random slope, so I don’t think there would be any disagreement as to the appropriateness of including them (though you are of course welcome to check with the authors of the Bates et al. paper).

To add further quantitative details: a linear mixed effects model including a by-participants random slope for the attractiveness-testosterone interaction estimates its “fixed effect” (the overall population trend) to have size & direction +0.001809, and its “random slope” to have standard deviation 0.008547. So even if you were to assume that the overall population trend is real (which there is not evidence for at the p<0.05 level), the dataset shows that inter-participant variation in effect is much larger than the average effect.
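To put the two reported numbers side by side, here is a short arithmetic sketch. The fixed effect and the random-slope standard deviation are the values quoted above; the normal distribution for participant-specific effects is the usual mixed-model assumption, taken here at face value.

```python
from statistics import NormalDist

fixed_effect = 0.001809  # reported population-level interaction estimate
slope_sd = 0.008547      # reported SD of the by-participant random slope

# How many times larger the between-person spread is than the average effect.
ratio = slope_sd / fixed_effect

# Assumption: participant-specific effects ~ Normal(fixed_effect, slope_sd).
participant_effects = NormalDist(mu=fixed_effect, sigma=slope_sd)
share_negative = participant_effects.cdf(0.0)

print(f"slope SD / fixed effect: {ratio:.2f}")
print(f"implied share of participants with a negative effect: {share_negative:.2f}")
```

The spread is roughly 4.7 times the average effect, and under the normality assumption roughly four in ten participants would be expected to show an effect in the opposite direction.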

Roger, thank you for your clarifications and for providing a better context to this discussion.

About the anti-conservativeness, I am not fully convinced. Well, I agree that adding random slopes is more conservative and I see your point about the ratio between the slope variance and the fixed effect. But I am not sure that not adding the random slope is equivalent to having 21250 INDEPENDENT observations. Shouldn’t the specification of the random intercept itself account for the dependence among the observations to some extent? The lme4 output recognizes that there are 85 code groups; isn’t that a sign that the clustering of the data has been taken into account in some way? Also, I ran a simulation, and with the first analysis proposed by Fisher et al. the rate of significant tests for that interaction was .049, which suggests that the Type I error was appropriately controlled.

I tried to analyze the data myself, and even just adding the random slopes for testosterone within each participant and for attractiveness within each participant leads to a serious convergence failure, which is deemed a good reason for simplifying the model (see Bolker et al., 2009) even among the proponents of the maximal random effect structure: https://hlplab.wordpress.com/2011/06/25/more-on-random-slopes/

Anyway, regardless of the details, the most relevant question to me is: how confident are you in believing that not adding the random slope effect should be deemed as a good reason for a RETRACTION? From my point of view, a retraction is justified when there is something completely wrong in the data, while here we are talking about subtle differences in the output (when adding the random slopes the t value for the interaction was 1.57, not significant but still in that direction).

There are thousands of papers out there with much sloppier statistics, and nobody would dare to ask for a retraction. To give you an example, in fMRI there are plenty of papers where the cluster-level inference is made using ridiculously lenient thresholds that mathematically allow for a huge rate of spurious results, given the intrinsic and extrinsic spatial correlation in this kind of data. Should we ask for a retraction? Or could we just move on and adjust our methods along the way?

Hi Marco,

Thank you for these good questions and points, which I’ll respond to one by one.

1) Doesn’t the random intercept by itself account for the dependence between observations collected from the same participant?

Answer: Yes, the by-participants random intercept introduces a conditional dependence among observations from the same participant. However, the random intercept is not the right kind of dependence among these observations to appropriately downweight multiple observations from the same participant/condition combination. The random intercept says that participants have variable across-the-board differences, or idiosyncrasies, in their offset to the response variable (before trial-level noise is added), but these inter-participant idiosyncrasies do not vary with experimental condition. To make this concrete: this dataset contains 250 observations per participant. For simplicity, let us say that roughly half of them occur in either a high-attractiveness+high-testosterone or low-attractiveness+low-testosterone condition (we could call these the interaction+, or I+, conditions) and the other half of them occur in either a high-attractiveness+low-testosterone or low-attractiveness+high-testosterone condition, which we can call the interaction-, or I-, conditions. Let us say that participant P has a strong tendency for high preference ratings in the I+ condition and low preference ratings in the I- condition. The random-intercept model has no ability to attribute this tendency to an idiosyncrasy of P, because any participant-specific adjustments to the predicted mean response are insensitive to experimental condition. Thus for the random-intercepts model, participant P’s data provides effectively 250 observations worth of “evidence” for a fixed effect of the attractiveness-testosterone interaction. 
The random-slopes model has the ability to attribute this interaction-sensitive pattern to an idiosyncratic property of participant P, so that these 250 observations boil down to *one* observation (weighted in its importance by the clarity of the within-P interaction effect) worth of evidence for whether there is an overall population trend (a fixed effect) for the interaction effect.

2) I found convergence failure with the authors’ data and models when I tried to add more random slopes; thus the removal of the random slopes to achieve model convergence (with R’s lme4) is justified.

Answer: with respect to the specific dataset and models: add the term

(0 + test.c:att.c | code)

to line 49 of the authors’ code, so that it reads

m1<-lmer(pref ~ est.c * att.c + prog.c * att.c + test.c * att.c + etop.c * att.c + (1 | code/session) + (0 + test.c:att.c | code) + (1 | face) + (1 | code:face), data=makeup, REML=FALSE)

and in lines 39-43, change scale=F to scale=T so that the predictor variables are on a common scale. The resulting model will converge just fine. For the theoretically key predictor (the attractiveness-testosterone interaction, or test.c:att.c), the by-participants part of the maximal random-effects structure for participants crucially requires a random slope, because the interaction term varies within participants.

More generally, I would argue that it is usually a mistake to remove the random slope for a theoretically key predictor from your model simply on the basis that your model has failed to converge overall, if that random slope is included in the maximal model justified by the design. The reason is simple: there are many reasons that a model might fail to converge, and the random slope for your theoretically key predictor is not likely to be an inevitable culprit.

I would not actually recommend the plan of action outlined in part (1) of the blog post you've linked to, but note that even there the author specifically goes out of his way to say that following that plan of action

"does not mean that you can go around and say that higher random slope terms don’t matter and that your results would hold if you included those."

3) how confident are you in believing that not adding the random slope effect should be deemed as a good reason for a RETRACTION[?]

Answer: I made no comment regarding whether the discovery regarding the weakness of statistical evidence for the fixed-effects interaction justifies a retraction of the article. I was simply trying to help clarify the nature of the flaw in the statistical analysis.

Whether to retract the paper given this discovery regarding the statistics is a more complex question. That being said, it seems to me that the data do not give firm evidence for the central stated point of the paper, that "women’s preference for attractive makeup increases when their salivary testosterone levels are high". However, there are other conclusions that the data seem to provide extremely solid support for (though I should offer the caveat that I haven't explored the data anywhere near exhaustively, so maybe there are confounds that I'm not appreciating). For example, as I stated before, the data offer massive statistical evidence that participants' preference for attractive makeup fluctuates as their testosterone level fluctuates, but that the nature of this fluctuation varies from person to person. Thus, while the data don't firmly support the central stated conclusion of the paper, they may turn out to firmly support other interesting scientific conclusions. Does this justify a retraction? I don't know, but I'm inclined to say "maybe yes"; the data could of course be republished in a different paper with a different central stated conclusion. What do YOU think should be done in this case?

4) There are many other papers, e.g. in fMRI analyses, with sloppier statistics, and nobody's asking for retractions.

Answer: I'm not sure you're right about this. Read, for example, the abstract in this paper:

http://pps.sagepub.com/content/4/3/291.short

It complains that the authors of the commented-on paper are asking for retractions!

1) Thank you, Roger, you have been very clear. I now completely understand your explanation about the impact of the random slopes on the dependence among the observations.

But I still have a couple of considerations on this. Even though there is little doubt that random-slopes models tend to be more conservative, in some simulations I ran on the Fisher et al. data, it does not seem that having random intercepts only yields an inflation of the Type I error (or only minimally: .054 significant results). In a similar simulation on data with categorical predictors and repeated measures I found that an RM ANOVA and a random-intercept-only model yield very similar results (.046 significant results when H0 was true). So, even if it is true that in most cases the random-slopes model is more conservative, the intercept-only model does not seem to perform so badly, at least in these very basic simulations of mine. But of course you have explored this issue much more, and probably my simulation was too simplistic and did not really cover all the possible data patterns.

2) Speaking of the actual Fisher et al. analyses, when adding “(0 + test.c:att.c | code)”, I get a convergence failure: “Model is nearly unidentifiable: very large eigenvalue”. That random effect is also the one that shows the smallest standard deviation (0.008) compared to the others, so it seems that it can be removed according to Jaeger’s blog.

Furthermore, I assume that if you add a random slope with an interaction you are supposed to add the main effects as well, right? But in that case I guess that the convergence failure would be even more problematic.

However, even when following your suggestion, the t value for the fixed effect is 1.51. Not significant but still there. On top of that, the estradiol:attractiveness interaction is still significant and this effect is correlated with the testosterone:att, so they probably have a similar (but reversed) functional meaning.

There is another issue with the Fisher analysis that I wanted to address, namely how they arrived at the most complex model. I tried using the step function from lmerTest to select the optimal model and then adding random effects to see which ones would still allow convergence, and the result is the following: pref ~ est.c + att.c + test.c + (1 | code/session) + (1 | face) + (1 | code:face) + (0 + att.c | code) + est.c:att.c + att.c:test.c. In this model the original effect has a t value of 1.76, which corresponds to a p value of .08. Unless we are frequentist Neyman-Pearson hooligans, to me a p=.08 means that their conclusions still hold, especially considering that the model includes random effects. And this is especially true considering the other negatively correlated effect. Probably with a better model selection (and possibly data reduction) they could come to a more compelling result.

Concerning your consideration on convergence, it is true that Jaeger says that this “does not mean that you can go around and say that higher random slope terms don’t matter.” But isn’t that true of ANY variable in any model? There is no such thing as the best or true model, only good ones given the data. If we had more data we could probably have better models, and including random slopes would not be a problem. Having more data would probably have allowed us to include other fixed and random variables (why not a trial variable, for instance). In some sense our models are always (and always will be) misspecified.

3) Given the considerations that emerged when discussing points 1) and 2), I would definitely recommend NOT retracting the paper.

Having looked at the data, their conclusions still look supported, unless we fall into a deluded vision of the world where p > .05 means that an effect does not exist at all. But even if the results with the random slope turned out to be worse than that, there is another, more profound reason why I believe that they should not have retracted the paper: they did not conceal any information, and an informed reader could judge for him/herself whether the data are strong enough.

And this relates to the point 4): when I see an fMRI paper with brain-behavior taken from non independent ROIs, or with cluster-level inference justified by the only fact of having 10 voxels (without bothering about using the random-field-theory FWE correction approach provided by SPM BY DEFAULT), I simply don’t buy it, because I know that these results are very likely to be spurious (considering the huge intrinsic and extrinsic spatial smoothness of fmri data). Nevertheless, I don’t feel the urge to ask for retraction as far as I am provided with all the info to judge that paper and decide to cite it or not, or to be inspired by that paper to run my own experiment.

Here the authors provided all the info, even the raw data were made available. To me this retraction is completely unjustified and unfair, given that out there there are a lot of papers whose credibility is undermined by more serious QRPs.
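A side note on the simulated rejection rates of .054 and .046 mentioned in point 1) above: an estimated rate carries Monte Carlo error of its own, and how much depends on the number of simulated datasets, which the comment does not state. The sketch below uses the ordinary normal-approximation interval for a proportion; the run counts of 1,000 and 10,000 are purely hypothetical.

```python
import math


def monte_carlo_interval(rate, n_sims, z=1.96):
    """Approximate 95% interval for a rejection rate estimated from n_sims runs."""
    se = math.sqrt(rate * (1.0 - rate) / n_sims)
    return rate - z * se, rate + z * se


# The observed rates are taken from the comment; the run counts are assumptions.
for n_sims in (1000, 10000):
    for observed_rate in (0.054, 0.046):
        low, high = monte_carlo_interval(observed_rate, n_sims)
        covers_nominal = low < 0.05 < high
        print(f"n_sims={n_sims:>6}  rate={observed_rate:.3f}  "
              f"interval=({low:.4f}, {high:.4f})  includes .05: {covers_nominal}")
```

With run counts in this range, the interval around either rate would still include .05, so such rates alone could not show inflation or rule it out; what matters is whether the simulated data actually contain between-participant slope variation.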

Concerning “There are many other papers, e.g. in fMRI analyses, with sloppier statistics, and nobody’s asking for retractions.”

I think it is true insofar as it concerns the correct analysis of clustered data. Aarts et al (disclaimer: the et al includes cvdolan…) investigated this in

http://www.nature.com/neuro/journal/v17/n4/full/nn.3648.html

and found that clustered (nested) data are common, but the required multilevel modeling is often not used.

“To assess the prevalence of nested data and the ensuing problem of inflated type I error rate in neuroscience, we scrutinized all molecular, cellular and developmental neuroscience research articles published in five renowned journals (Science, Nature, Cell, Nature Neuroscience and every month’s first issue of Neuron) in 2012 and the first six months of 2013. Unfortunately, precise evaluation of the prevalence of nesting in the literature is hampered by incomplete reporting: not all studies report whether multiple measurements were taken from each research object and, if so, how many. Still, at least 53% of the 314 examined articles clearly concerned nested data, of which 44% specifically reported the number of observations per cluster with a minimum of five observations per cluster (that is, for robust multilevel analysis a minimum of five observations per cluster is required11, 12). The median number of observations per cluster, as reported in literature, was 13 (Fig. 1a), yet conventional analysis methods were used in all of these reports.”
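For a rough sense of what ignoring nesting costs, the textbook design-effect formula, 1 + (m - 1) × ICC, can be applied to the median cluster size of 13 quoted above. This is only a sketch: the intraclass correlations and the number of clusters below are hypothetical, and the formula covers clustering in the outcome's level (random intercepts), not the random-slope issue discussed earlier in this thread.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from treating clustered observations as independent."""
    return 1.0 + (cluster_size - 1) * icc


cluster_size = 13  # median observations per cluster, from the quoted passage
n_clusters = 10    # hypothetical number of research objects

for icc in (0.1, 0.3, 0.5):  # hypothetical intraclass correlations
    deff = design_effect(cluster_size, icc)
    effective_n = n_clusters * cluster_size / deff
    print(f"ICC={icc:.1f}  design effect={deff:.1f}  "
          f"effective N={effective_n:.1f} of {n_clusters * cluster_size}")
```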

(This is a reply to Marco’s post from February 9 at 4:39pm.)

Hi Marco,

Thanks for your additional comments. I’m going to respond only to the ones about this specific analysis and not to the question of whether a retraction is warranted, which as I said before is a more complex issue that involves some more subjective aspects of scientific judgment, and I think we can be free to disagree on.

Regarding your simulations in (1), I can’t comment without more information about the specific modeling details in your simulations.

Regarding your (2), several remarks:

On the “Model is nearly unidentifiable: very large eigenvalue” message: I believe this is a warning, not a convergence failure, right? You may have also noticed that the warning suggested you rescale your variables. This is why I said to change scale=F to scale=T in lines 39-43. Once you do that, there is no warning, or at least not with my version of R and lme4.

You can’t evaluate the standard deviation of the random slope without also taking into account the scale of the predictor variable (in this case the scales of both test.c and att.c, since the predictor in question is the interaction between these two). And at any rate, the Jaeger blog post was not in the first place recommending evaluating the relative importance of random slopes according to their standard deviations, but rather through stepwise comparisons of model likelihoods.
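The scale dependence is easy to check numerically. The sketch below uses simulated data and plain per-participant least-squares slopes rather than a fitted mixed model, so it only illustrates the arithmetic: multiplying a predictor by some factor divides every slope, and hence the standard deviation of the slopes, by that factor. All numbers, including the factor of 100, are invented.

```python
import random
from statistics import stdev


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (with an intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
scale_factor = 100.0  # hypothetical change of units for the predictor

slopes_raw, slopes_rescaled = [], []
for _ in range(30):                      # 30 simulated participants
    true_slope = rng.gauss(0.5, 0.2)     # each participant has an own slope
    xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
    ys = [true_slope * x + rng.gauss(0.0, 1.0) for x in xs]
    slopes_raw.append(ols_slope(xs, ys))
    slopes_rescaled.append(ols_slope([x * scale_factor for x in xs], ys))

sd_raw = stdev(slopes_raw)
sd_rescaled = stdev(slopes_rescaled)
print(f"SD of slopes, raw predictor:      {sd_raw:.5f}")
print(f"SD of slopes, rescaled predictor: {sd_rescaled:.7f}")
print(f"ratio: {sd_raw / sd_rescaled:.1f}")
```

The same data yield a slope standard deviation that looks "small" or "large" depending purely on the units of the predictor, which is why comparing raw standard deviations across random effects says little.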

Beyond these corrections, let me emphasize that the relative standard deviations of the different random effects in the model are a red herring as regards the issue of controlling anti-conservativity, since the by-factor-F random slope for predictor X plays a special role — one not necessarily played by other random effects — in assessing the strength of evidence for the fixed effect of X when X varies within F.

On whether adding a random slope for an interaction means you must also add random slopes for the main effects: actually, this is a misconception. If all the constituent predictors of the interaction vary strictly within-grouping, it is the random slope for the interaction and the interaction alone that is necessary to control Type I error. My co-author Dale Barr has a nice, short article demonstrating this point:

http://journal.frontiersin.org/article/10.3389/fpsyg.2013.00328/full

Hi Roger,

Thank you again for your further clarification. It turned out as a splendid tutorial on LMEMs thanks to your expertise on the matter.

You are right, Roger, my observation on the standard deviation was pointless without scaling and, no, it is not something suggested by Jaeger in his blog. I must have read it somewhere else (I am pretty sure) but, of course, this method requires the standardization of the variables; otherwise the standard deviations refer to completely different scales.

And, yes, standardizing the variables led the model to converge nicely, and the original interaction now has a t=1.61. And I still find it hard to claim that the effect of interest of the study “disappeared”, as it still has a t=1.61.
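For readers who want to translate the t values quoted in this thread (1.51, 1.57, 1.61 and 1.76) into approximate p values, here is a minimal sketch. It assumes a standard normal reference distribution, a common large-sample approximation when a mixed model reports t values without degrees of freedom; a different degrees-of-freedom method would give slightly different numbers.

```python
from statistics import NormalDist


def two_sided_p(t_value):
    """Two-sided p value for a t statistic, using a normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))


# t values as quoted by commenters in this thread
for t_value in (1.51, 1.57, 1.61, 1.76):
    print(f"t = {t_value:.2f}  ->  p ~ {two_sided_p(t_value):.3f}")
```

Under this approximation t = 1.76 gives a p value of about .08, matching the figure quoted earlier in the thread, and the other three t values all give p values above .10.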

All my other considerations on retraction remain, but I acknowledge that there might be different definitions of what is worth retracting. But a definition such as the one underlying this specific decision would imply the retraction of a huge number of published articles.

PS: I can send you my simulation code (which is based on the Gelman and Hill Chapter on power estimation) by email if it is OK with you: it would be terrific to have your feedback on that (if you have time, of course).

Marco, given that we’re already mid-conversation and have thus established some common ground, I’d be glad to comment on your simulation code if you send it to me by email — go right ahead!

More generally, I’d encourage you and other readers to take advantage of the R-sig-ME mailing list to ask future questions about mixed-effects models. This forum covers R’s lme4 and other packages as well such as MCMCglmm, and also is a great forum for discussion of mixed-effects models more generally. In that forum, the benefits of the discussion become available to the general public, and you will generally get the eyes of many more experts onto your questions and code (potentially including experts who disagree with me!).

Thank you for your extremely clear (to me) explanation!

Hi Roger,

could you provide me the model and code where you got these numbers (from your comment):

random slope for the attractiveness-testosterone interaction estimates its “fixed effect” (the overall population trend) to have size & direction +0.001809, and its “random slope” to have standard deviation 0.008547

I would like to replicate them myself. I too found your comment extremely useful and would like to learn more. Thank you!

The accuracy of standard errors depends on the statistical model being correctly specified. But correct specification is usually unattainable as it requires a “true model”, which is hard to come by in psychological research. Presumably most statistical tests are incorrect to some degree. So that poses a dilemma. The randomness of the slopes in this study must have been considerable to change the outcome of the analyses (assuming these concerned fixed effects).

BTW: there must be thousands of results in psychology based on fixed effects ANOVAs, in which random effects were not included, even though random effects are often plausible (e.g., why would a given experimental manipulation ever have exactly the same effect on all individuals undergoing the manipulation?). So if the failure to include random effects is a basis for retraction…..

Dear Roger,

first of all I want to thank you, Marco and all the others, for this very interesting conversation, which helped me a lot in understanding mixed linear models better.

Regarding the need to model the random slope of the interaction I would have an additional question about individual differences.

Let’s say that I am measuring individual differences in my sample, which in the case of the testosterone study might be, for example, narcissism (it just came to my mind at random), and let’s say that I expect narcissism to explain the variance of my double interaction (in other words, let’s say that my hypothesis is that there is a triple interaction narcissism:attract:testost).

In this case I actually expect that my participants will behave differently according to the I+, I- conditions, and I want to model this variance using a continuous covariate (narcissism).

Is it still appropriate in this case to put the random slope of the double interaction in my model? I guess that this will depend on how much my covariate varies across subjects, so let’s hypothesize that there are no participants with the same narcissism score: should I still put the random slope of the double interaction in this case?

Hi Maria,

Good question. The first point I would make is that adding an interaction with narcissism (more generally, with a third covariate, which we’ll call N) to the attraction:testosterone interaction (call this A:I) makes the interpretation of A:I dependent on the distribution of N. In particular, the meaning of A:I becomes the value that A:I takes on when N=0. So additive transformations of N will change A:I. If you still want to interpret the A:I coefficient (whether it’s significantly non-zero, which direction it goes in, and so forth), you want to make sure to represent N appropriately. This might mean centering N on the population mean, or on the population median, or on something else, depending on the details of the matter.
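The point about additive transformations can be made concrete with a few lines of arithmetic. In this sketch every number is hypothetical: the two coefficients are invented, as are the raw N scores. It shows that shifting N changes the A:I coefficient while leaving the model's implied A:I slope for every individual untouched.

```python
from statistics import mean

# Hypothetical coefficients from a model containing A:I and N:A:I terms.
coef_ai = 0.002     # A:I slope when N = 0 on the raw scale
coef_nai = 0.0005   # change in the A:I slope per unit of N
scores = [12, 15, 20, 22, 31]  # hypothetical raw N scores


def ai_slope(coef_at_zero, coef_per_unit, n_value):
    """A:I slope implied by the model for a given value of N."""
    return coef_at_zero + coef_per_unit * n_value


# Re-express the model with N centred on its mean: the A:I coefficient
# becomes the A:I slope at the mean of N instead of at N = 0.
n_mean = mean(scores)
coef_ai_centred = ai_slope(coef_ai, coef_nai, n_mean)

# Both parameterisations imply the same A:I slope for every person.
for n in scores:
    raw = ai_slope(coef_ai, coef_nai, n)
    centred = ai_slope(coef_ai_centred, coef_nai, n - n_mean)
    assert abs(raw - centred) < 1e-12

print(f"A:I coefficient with raw N:     {coef_ai}")
print(f"A:I coefficient with centred N: {coef_ai_centred}")
```

The two coefficients answer different questions (the A:I effect at N = 0 versus at the average N), so a significance test on the A:I term is only meaningful once the zero point of N has been chosen deliberately.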

Assuming you’ve solved the problem of keeping A:I interpretable in the presence of the N:A:I interaction: unless you have very strong reason to believe that there will be no additional variability in the A:I interaction across participants, you still need to include the by-participants random slope for A:I. And for a psychological study it is unlikely that there will be no other variability for the A:I interaction, if for no other reason than that your measurement of N surely has some degree of noise.

Best & I hope this helps!

Roger