“The Chrysalis Effect: How Ugly Initial Results Metamorphosize Into Beautiful Articles”

The headline of this post is the title of a fascinating new paper in the Journal of Management suggesting that if the road to publication is paved with good intentions, it may also be paved with bad scientific practice.

Ernest Hugh O’Boyle and colleagues tracked 142 management and applied psychology PhD theses to publication, and looked for various questionable research practices — they abbreviate those “QRPs” — such as deleting or adding data after hypothesis tests, selectively adding or deleting variables, and adding or deleting hypotheses themselves.

Their findings?

Our primary finding is that from dissertation to journal article, the ratio of supported to unsupported hypotheses more than doubled (0.82 to 1.00 versus 1.94 to 1.00). The rise in predictive accuracy resulted from the dropping of statistically nonsignificant hypotheses, the addition of statistically significant hypotheses, the reversing of predicted direction of hypotheses, and alterations to data. We conclude with recommendations to help mitigate the problem of an unrepresentative literature that we label the “Chrysalis Effect.”

Specifically, they found:

Of the 1,978 hypotheses contained in the dissertations (i.e., dropped and common hypotheses), 889 (44.9%) were statistically significant. That less than half of the hypotheses contained in a dissertation are supported with statistical significance is troubling, but more troubling is that 645 of the 978 (65.9%) hypotheses published in the journal articles (i.e., added and common hypotheses) were statistically significant. This is a 21.0% inflation of statistically significant results and corresponds to more than a doubling of the ratio of supported to unsupported hypotheses from 0.82:1 in the dissertations to 1.94:1 in the journal articles. To our knowledge, this is the first direct documentation of the prevalence, severity, and effect of QRPs in management research, and on the basis of these findings, we conclude that the published literature, at least as it relates to those early research efforts by junior faculty, is overstating its predictive accuracy by a substantial margin. Thus, we find evidence to support a Chrysalis Effect in management and applied psychology.
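The ratios quoted above follow directly from the counts in the passage; here is a quick sanity check (the counts come from the paper, the code is just arithmetic):

```python
# Counts quoted from O'Boyle and colleagues' findings
diss_total, diss_sig = 1978, 889   # hypotheses in dissertations
pub_total, pub_sig = 978, 645      # hypotheses in journal articles

diss_ratio = diss_sig / (diss_total - diss_sig)  # supported : unsupported
pub_ratio = pub_sig / (pub_total - pub_sig)

print(round(diss_ratio, 2))  # 0.82
print(round(pub_ratio, 2))   # 1.94
print(round(pub_sig / pub_total - diss_sig / diss_total, 3))  # 0.21 (the 21-point inflation)
```

So the "more than doubled" claim checks out: 0.82:1 in dissertations versus 1.94:1 in the published articles.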

The authors are not naive:

Despite increased awareness of what QRPs are and the damage they cause (e.g., Bedeian et al., 2010; Martinson, Anderson, & De Vries, 2005), QRPs persist. We contend that this is because as a field, we reward QRPs, and we are embedded within a culture that reduces the likelihood of their detection. As such, QRP reductions are unlikely to occur by placing the onus of best practices on the individual researcher. Asking researchers to forego immediate, extrinsic rewards in order to serve the higher ideals of fair play and professional ethics is a noble request, but one that is unlikely to manifest into real change.

They offer a few solutions, including an “honor code” in which all of the authors — not just the corresponding author — “affirm that they did not engage in any of the specific QRPs discussed here” when they submit a manuscript.

The authors also sound a warning:

If we cannot self-police by establishing and enforcing best practices, then those external stakeholders that provide funding (e.g., state governments, federal grant agencies, private individuals and organizations) may reduce or withdraw their support.

You can hear O’Boyle discuss the findings in this podcast.


20 thoughts on ““The Chrysalis Effect: How Ugly Initial Results Metamorphosize Into Beautiful Articles””

    1. Hear hear! In randomised trials authors still alter endpoints etc., and journals still allow late registration, but at least readers can get a handle on some of the QRPs.

    2. What kind of vigilante justice can we enact for people who sign the pledge and then violate it? Bayesian internment camps…

      I still wonder whether the general reluctance toward full preregistration owes to the cognitive dissonance of recognizing that the practices that got you where you are might be flawed, or simply to self-interest.

    3. Science is not medicine. Every study cannot be approached with the rigor that a clinical trial demands. Would we make any progress if every single hypothesis that was kicked around were fully and exhaustively vetted, with sufficient power to be somewhat conclusive?

  1. “The rise in predictive accuracy resulted from the dropping of statistically nonsignificant hypotheses, the addition of statistically significant hypotheses, the reversing of predicted direction of hypotheses, and alterations to data.”

    Of these so-called “QRPs”, only alteration of the data is research misconduct.

    I don’t see any problem with rewriting your hypothesis if the data show something other than what you expected. In other words, papers should be written in logical order, not chronological order. If your data support hypothesis X, there is no need to specify that you were actually trying to prove hypothesis Y.

      1. Why? The data remains the same. The statistics remain the same. The underlying physical reality behind the data remains the same. Why not present your work in the most clear and direct manner possible? Why go through possibly convoluted explanations of what you thought the result should be but wasn’t?

        1. The statistics do not remain the same. If each time you collect your data you are choosing only the post-hoc hypotheses that you can reject at your arbitrary level of alpha, you are underestimating your type I error. If you perform 20 comparisons in your search for significance at alpha = 0.05 and only report the hypothesis that yielded the publishable p-value, then was the null hypothesis you rejected truly one that you could reject?
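The inflation this commenter describes is easy to simulate. A minimal sketch, assuming 20 independent tests of true null hypotheses at alpha = 0.05 (the numbers are illustrative, matching the commenter's example):

```python
import random

random.seed(1)

def any_significant(n_tests=20, alpha=0.05, n_sims=100_000):
    """Fraction of simulated studies in which at least one of n_tests
    true-null tests comes out 'significant' by chance alone."""
    hits = 0
    for _ in range(n_sims):
        # Under the null hypothesis, each p-value is uniform on [0, 1]
        if any(random.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_sims

print(any_significant())  # ≈ 1 - 0.95**20 ≈ 0.64
```

Roughly 64% of such studies will yield at least one "publishable" p-value even when every null hypothesis is true, far above the nominal 5% error rate, which is exactly the underestimated type I error the commenter points to.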

        2. There are several problems with framing post-hoc hypotheses as a-priori hypotheses. For example, by “inventing” or “modifying” theories based on significant results in the data, you “fit” your theory to the data. When you do this, it is not clear whether you are “overfitting”, that is, whether you are just explaining significant noise in your data (i.e., a false positive in terms of null-hypothesis significance testing). I am not against doing post-hoc tests, and I think this is a valuable tool for theory-building, but theory that has been modified or informed by data needs to be tested with an independent (i.e., new) dataset.

          Let me be a bit polemical here to make the point more obvious. Would you feel the same confidence in a theory given the following two scenarios? Scenario 1: I tell you that my theory was specified before I ran an experiment and that I expected the distinct effect I actually observed in the data. Scenario 2: I tell you I had no theory at all when I ran the study and invented some theory after I saw a significant relation for some variable or condition in the experiment. What if I pretend that the theory in the second scenario was an a-priori theory? Would you feel tricked?

          See also the following reference for a more detailed discussion:

          Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217.

          1. The way I think of it is: how much of an achievement is it, on the part of the data, to satisfy a post-hoc hypothesis? Not much of one – any data can satisfy a hypothesis framed specifically to match those data.

            But it is an achievement for some data to satisfy a hypothesis ‘which they had no part in making’.

  2. I’m just amazed at the comments in that article. “Of the 1,978 hypotheses contained in the dissertations (i.e., dropped and common hypotheses), 889 (44.9%) were statistically significant. That less than half of the hypotheses contained in a dissertation are supported with statistical significance is troubling,” Troubling? That’s not troubling in the slightest. The point of the Ph.D. is to show research competence. That is, process. Results are less important. I am totally NOT troubled. I would be FAR more troubled if more were significant.

    1. I was surprised that so many PhD hypotheses were statistically significant. Then again, PhD theses can have a Chrysalis Effect too.

    2. Agreed – with that comment the authors are contributing to the very problem they elsewhere decry.

      It is never a bad thing if the data show that a hypothesis turns out to be wrong – so long as the data are accurate. All kinds of hypotheses turn out to be wrong, and that doesn’t mean they’re ‘bad’ hypotheses, still less that they shouldn’t have been tested in the first place.

      To imply, as the authors do, that a hypothesis should be confirmed is to imply that results should be positive, and that’s exactly why ‘The Chrysalis Effect’ happens!

  3. Given the high number of hypotheses in these 142 theses (13.9 each on average), it is not surprising that only 45% of the 1,978 in total, or 889, were supported with statistically significant results. This is still an average of 6.2 supported hypotheses per thesis, which looks good if you didn’t know that another 8 per thesis on average were not supported. [And do both management and psychology average 14 hypotheses per thesis, or does one average even more?]

    Most surprising to me is that only 645 of the 889 supported hypotheses (about 72%) were published, along with 333 of the unsupported, or about 31% of them, which is more than I expected.

    I’m curious to know how many of the 142 authors tried to get all their hypotheses published but had some rejected in review, and how many submitted only those hypotheses they most wanted to get published. And were the management or the psychology PhDs more likely, on average, to publish their unsupported hypotheses?

  4. A thesis that tests more than one hypothesis? I guess it depends on the discipline. The authors seem to have got themselves into a twist. I would be alarmed at the number of hypotheses that passed the significance test; it seems rather high. That the hypotheses that did not pass such a test were dropped from publications relates as much to the drive for “positive” and “glam” results as to anything else.
    A more pertinent analysis was one done in France (I think), where the distribution of p values in a discipline was found to be bimodal, with peaks at 0.05 and 0.01 (or something similar), clearly showing evidence of selection pressure.

    The solution is to have open data.
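The selection pressure that p-value analysis detected can be illustrated with a toy simulation (all numbers here are illustrative assumptions, not taken from that study): run many underpowered studies of a small true effect, "publish" only those with p < 0.05, and the published record piles up just below the threshold.

```python
import math
import random

random.seed(0)

def p_value(z):
    # Two-sided p-value for a z statistic (normal approximation)
    return math.erfc(abs(z) / math.sqrt(2))

# Many underpowered studies of a small true effect: z statistics drawn
# around a noncentrality of 1.5, giving roughly 32% power at alpha = 0.05
all_p = [p_value(random.gauss(1.5, 1)) for _ in range(50_000)]

published = [p for p in all_p if p < 0.05]  # the selection filter

# The filter concentrates the published p-values just below the cutoff
frac_near_cutoff = sum(0.01 < p < 0.05 for p in published) / len(published)
print(round(len(published) / len(all_p), 2))  # ≈ 0.32 (the power)
print(round(frac_near_cutoff, 2))
```

More than half of the "published" p-values land in the narrow 0.01–0.05 band, even though that band covers only a sliver of the full distribution, which is the kind of peak near the threshold the commenter describes. Open data would let readers see the unfiltered distribution.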

  5. I don’t get this at all. It is rarely possible for a journal article to include all the results from a dissertation. Therefore selection is necessary; and if selection is necessary, it makes sense to select the results that seem most important. Failure to reach significance is rarely an important result; often it only means that the study did not have sufficient power. Failure to reach significance is not evidence, it is absence of evidence. And absence of evidence is not evidence of absence.
