A paper in *Contraception* that purported to show serious flaws in an earlier study of abortion laws and maternal health has been retracted, after the authors of the original study found what were apparently significant flaws in the study doing the debunking.

That’s the short version of this story. The longer version involves years of back-and-forth, accusations of conflict of interest and poor research practice, and lawyers for at least two parties. Be warned: We have an unusual amount of information to quote from here that’s worth following.

As the editor of *Contraception*, Carolyn Westhoff, put it:

I got to make everybody angry.

**‘An accusatory or ad hominem tone’**

The story begins four years ago, when *BMJ Open* published “Abortion legislation, maternal healthcare, fertility, female literacy, sanitation, violence against women and maternal deaths: a natural experiment in 32 Mexican states,” by Elard Koch and his colleagues. That paper was accompanied by a press release from the MELISA Institute titled “Study Finds Better Maternal Health With Less Permissive Abortion Laws.”

Koch, founder and director of the MELISA Institute, which says it studies “determinants of maternal, embryonic and fetal health from an epidemiological and biological perspective,” told Retraction Watch that

the purpose of our study published in BMJ Open was not to argue for or against abortion laws, but to examine its association with the [maternal mortality rate] MMR in Mexican states after controlling for a number of factors known to impact maternal health at the population level. Our findings support the null hypothesis: after controlling for different combinations of these factors (we used a panel of 24 regression models), there was no evidence supporting an independent association (whatever positive or negative) between Mexican abortion legislation and MMR was found in our study.

The MELISA Institute does not seem to be well-known among researchers who study abortion policy; three such researchers contacted by Retraction Watch had not heard of the organization before reading about these two papers.

The saga began in earnest in August 2016, when *Contraception* published “Maintaining rigor in research: flaws in a recent study and a reanalysis of the relationship between state abortion laws and maternal mortality in Mexico,” by Blair Darney, of Oregon Health & Science University, and colleagues. The authors conclude:

We support a recent call to improve abortion data and research by adhering to three criteria: transparency, acknowledging the limitations of data and contextualizing results. Koch and colleagues fail at all three and do not help us understand the relationship between decriminalization of or access to safe abortion and women’s health.

That paper, Koch said,

aimed to refute and apparently discredit our BMJ Open paper based on the data from Mexico using an accusatory or ad hominem tone. We infer the specific purpose to discredit our research from the original research proposal funded by a $250,000 grant to Dr Darney by the Society of Family Planning. The SFP proposal states that they ‘have failed to respond to anti-abortion “junk science,” which influences policy in the region”. Sincerely, I don’t know how our study is ‘influencing’ policies in the region and of course I don’t consider our work as “junk science”.

When Koch and his colleagues reviewed the paper, they found an error.

According to the result presented in the Table 2 of the paper, there was a decline in maternal mortality associated with 31 states outside Mexico City, with restricted access to abortion (beta = -22.49) but the authors interpreted the result in opposite way, that is, associated to a decline of MMR in Mexico City with wide access to abortion.

When Koch and his colleagues tried to replicate the findings, they confirmed their original reading of the paper, and they submitted a manuscript describing their replication to Contraception in May 2018.

Our replication study confirmed that the negative regression coefficient (beta = -22.49) presented in the Contraception paper was basically correct, and we ruled out any typographical error for this result. In addition, we detected other serious methodological flaws and omissions in the paper. Overall, we found evidence supporting a potential case of research misconduct. The ad hominem accusations of a lack of transparency and false conclusions they stated for our study were unsupported and untrue.

**An insufficient correction**

On August 10, Westhoff told Koch by email that there was in fact an error in the Darney et al paper.

Dr. Darney has confirmed this, and my current plan is for her to submit an erratum to the journal. This erratum will acknowledge that this error was identified by a careful reader. I will let you know when we receive an erratum. If I find this to be satisfactory, then we will publish the erratum, but I do not see the need to publish an additional full-length paper on the topic.

Koch wrote back on August 13 to say that an erratum would not be enough:

The number and magnitude of the analytical errors detected in our replication study includes a serious misinterpretation of a pivotal result. This misinterpretation essentially invalidates the assumptions and conclusions from Darney et al‘s paper. We feel that this fundamental error cannot be corrected through an erratum, because unfortunately, it is neither minor nor simple.

In fact, it completely changes the conclusion of the paper. In addition, we remain concerned about the tone of Darney et al’s paper. The authors questioned the integrity of our BMJ Open publication, and their accusations are unsupported in light of their own study’s methodological flaws. An erratum cannot address this, or the damage to our research reputation, which has already occurred, since this article appears along with our other work during Medline searches. In our opinion the only acceptable and proportional solution in this case is that the authors proceed with a RETRACTION of their article. If they do not retract the paper then the editors of Contraception should do so. In addition, we believe this retraction should be accompanied at least with a reply or editorial comment pointing out the large issues leading to retraction.

It is of some concern that the manuscript was led and submitted by a current member of the editorial board of Contraception, and the research itself was supported for a grant from the same institution funding the journal. However, it is an important and valuable first step to know that Dr Darney recognizes an error in their article (at this moment we don’t know the opinion of her colleagues). We need emphasize that there are a number of elements, which might potentially lead to a request for an investigation by the Office of Research Integrity and perhaps by COPE as well. In this context, we would sincerely prefer to work from the assumption that honest mistakes were made in writing this article, that these errors were not detected during an independent and fair peer review and publication process by Contraception, and that there were no other motives. An immediate retraction would be a clear signal of good faith and acknowledgement by the authors in the right direction, avoiding the burden of additional proofs.

On August 30, Andrea Bocelli, a publisher at Elsevier, wrote back to Koch:

Contraception is wholly owned by Elsevier Inc. Elsevier and our Editors are members of the Committee on Publication Ethics (COPE). Elsevier is not involved in the editorial decisions of the Journal and neither are our affiliated societies.

Dr. Darney is currently on the Board of Contraception, but she was not at the time her paper was accepted. Even so, it is important to note that work submitted to the Journal by Editors or Board members is assessed using the same criteria as that applied to all Contraception submissions.

I understand you feel an erratum is not sufficient. Dr. Darney did acknowledge an error in one of her tables and supplied a correction. She feels this correction does not change the overall conclusion of her paper. In order to review your allegations we are seeking a neutral review by an independent third party. Dr. Westhoff and I thank you for your patience while this investigation is being conducted.

Koch and his co-authors wrote back on September 8, detailing their concerns. On October 31, Westhoff emailed Koch to say that the journal had reversed its decision:

We have evaluated Dr. Darney’s 2017 paper, and decided that the paper requires retraction.

Sometime between then and December 4 — it is unclear when, as Elsevier did not add a date to the retraction notice — the paper was retracted:

This article has been retracted at the request of the Editor-in-Chief and Authors.

The authors recently discovered an error that affected the results in their article on the relationship between state-level maternal mortality in Mexico and state-level abortion legislation. In Table 2 the beta-coefficient for abortion legislation was calculated as -22.49 and erroneously interpreted as +22.49. This error affects several of the paper’s conclusions, and thus the editor and authors have jointly made the decision to retract the paper.

The authors would like to express their sincere regret at the errors in their initial report.

That retraction notice didn’t satisfy Koch and his colleagues. By this time, they had hired Paul Thaler, of Cohen Seglias, a law firm in Washington, DC. Thaler is perhaps best known for representing researchers who are accused of scientific misconduct. On December 4, Thaler wrote a letter to *Contraception* suggesting a different retraction notice:

This article has been retracted: please see Elsevier Policy on Article Withdrawal (https://www.elsevier.com/about/our-business/policies/article-withdrawal).

This article has been retracted at the request of the Editor-in-Chief and Authors.

The article purported to replicate, reanalyze and provide a critical review of a previous study on the relationship between state-level maternal mortality ratio (MMR per 100,000 live births) in Mexico and state-level abortion legislation [Koch E, Chireau M, Pliego F, Stanford J, Haddad S, Calhoun B, Aracena P, Bravo M, Gatica S, Thorp J. Abortion legislation, maternal healthcare, fertility, female literacy, sanitation, violence against women and maternal deaths: a natural experiment in 32 Mexican states. BMJ Open 2015;5(2):e006013]. The authors of the BMJ Open article conducted a thorough replication study of the now retracted paper, and they submitted their findings to the editor-in-chief of Contraception. An independent and neutral statistical review commissioned by the editors corroborated several methodological flaws, including a serious misinterpretation in the beta coefficient for abortion legislation in Table 2 of the now retracted article. This coefficient was calculated as -22.49 decline in the MMR associated to the 31 states with restricted access to abortion, but the authors erroneously interpreted this result as associated to Mexico City with wide access to abortion. This and other major errors affect several of the paper’s conclusions, and thus the editor and authors have jointly made the decision to retract the paper.

The authors would like to express their sincere regret at the errors in their initial report. The retraction of the article removes the basis the authors relied on for criticizing the BMJ Open study.

Elsevier rejected that suggestion. In a December 21 email, associate general counsel Jessica Alexander wrote:

In terms of the timeliness of the journal’s investigation process, as the wording of your proposed retraction notice acknowledges, the journal conducted an independent and neutral statistical review. This required the involvement of an independent external statistician as well as editorial review, which inevitably takes some time. The Editor took this issue seriously and the goal was to undertake a thorough review. This resulted in the decision to retract the article.

As set out in the Retraction Guidelines from the Committee on Publication Ethics, the purpose of retractions is not to punish authors but rather to correct the literature. The notice should mention the reasons and basis for retraction and who is retracting the article. This notice includes a description of the error that is the basis of the retraction and also includes a clear apology from the authors. It conforms to Elsevier’s standards and it alerts readers to the error. I am afraid that we do not therefore agree that the retraction notice requires amendment.

The editor of the journal has already declined your clients’ request to publish a response to the Darney article. Elsevier has a policy of editorial independence and we support this decision of the editor.

Koch tells Retraction Watch that he and his colleagues are considering filing complaints with the ORI and COPE.

Asked for comment, Darney referred us to the journal, and to OHSU’s communications office. An OHSU spokesperson said that Darney and the university declined comment.

**A first for journal editor**

While Contraception has had retractions before, Westhoff, who has been editor for five years, said this was the first she had handled. (There was a temporary removal in 2016, which Elsevier does not classify as a retraction.)

I struggled, never having written one before, with what should go into a retraction notice. I think the goal is to correct the record. And I think it did that.

If the authors are still angry, that’s a different level. I don’t think it’s my job to assuage their anger, which may not be possible.

Westhoff suspects that her decision not to publish the Koch group’s rebuttal to the now-retracted paper — a rebuttal that seemed to be moot, once the paper was retracted — was “infuriating.” But the claims of a conflict of interest don’t hold water, she said. Yes, she was once president of the society that funded Darney’s analysis, but the board of directors doesn’t have anything to do with such funding decisions, Westhoff said. And Darney only joined the editorial board of the journal after the now-retracted paper was published.

They’re trying to make some connections there that are at best really tenuous.

At the end of the day, said Westhoff,

The author made an error, and we made an error, in that the peer reviewers missed it. It won’t be the last time.

She added:

In retrospect, there was some inflammatory language in the paper that I might have suggested changing. There are a lot of papers and I do miss things. The peer reviewers are still unaware of all of this. I think there were errors here but they were honest ones.

It seems that hardly anyone noticed the papers, which Westhoff said she found a relief.

We did look at citations and social media and we saw essentially no evidence that either of the papers got any traction. I think the only person who cited [the Darney et al paper] was the author herself.


“material health”?

Fixed the typo, thanks.

In the first sentence, “maternal health”, not “material health”, right?

See previous comment and reply.

Despite the complainants clearly having no legal standing to dictate the language of a retraction notice here, I think this highlights the difficult position publishers are put in by these situations. One can easily imagine if the retraction notice read more like the “suggestion,” there’d be an attorney happy to take the other side’s money to write a (similarly toothless) nastygram explaining why that was unfair to the other authors. Journals are scared to death of getting sued — for the big houses, legal settlements may be the only thing that really threatens their gravy train in the short term — and given the zealousness of people to call in the attorneys, it almost seems understandable why they’d so often opt to say as little as possible.

Focus on statistically significant findings in this paper is misguided and irrelevant, as this study is based on all available data, not a random subsample that would change if the random subsample was independently reacquired.

The paper states

“Civil registration of vital statistics in Mexico follows international standards, has been regarded as virtually complete by the WHO, and has been included in List A—with good attribution of causes of death—along with 64 other countries.”

and that

“The United Mexican States (Mexico) is a federal republic comprising 32 federal territories (31 states and the Federal District, referred to as ‘states’ henceforth)”

with census data collected from all states over 10 years

“This study presents the results of a population-based natural experiment examining factors associated with maternal mortality in the 32 Mexican states during a period spanning 10 years (2002–2011).”

The whole idea of statistical confidence intervals and p-values applies to subsamples of census data. When we can only sample a few hundred or thousand sample points out of a much larger population, the issue becomes “how much would our results vary if we repeated our random draw and did the study again”?

This is not an issue with census data. All of the data is at hand, and there is no uncertainty. If these authors went back and collected the data from all 32 territories across those 10 years, they would get the same data. It is census data.

Thus model fits provide exact values. A model parameter value calculated from the census data is exact. If we went back and obtained the census data again, and refit the model, we would get the same coefficient value from the data. Thus p-values concerning parameter values in the paper such as displayed in Tables 2, 3, 4 and others, and confidence intervals as displayed in Tables 6, 7, 8 and others, have little if any meaning. The point of 95% confidence intervals is that on repeated draws of a sample from the whole population, 95% of those confidence intervals will contain the true value, if only we knew the true value should we have all the data. Well, here we do have all the data. If we obtained the data again, the calculated parameter values would be the same. They are not estimates, they are the real number from the entirety of the census data.
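The repeated-draw reading of a 95% confidence interval described above can be checked with a small simulation. This is only a sketch under stated assumptions: the population is synthetic (10,000 made-up values, not the Mexican data), the function name `coverage` is hypothetical, and the interval uses the usual normal approximation of mean ± 1.96 standard errors.

```python
import random
import statistics


def coverage(population, n, trials, seed=0):
    """Fraction of repeated samples whose 95% interval contains the true mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    hits = 0
    for _ in range(trials):
        # Draw n units without replacement, as in a survey subsample.
        sample = rng.sample(population, n)
        mean = statistics.fmean(sample)
        std_err = statistics.stdev(sample) / n ** 0.5
        # Normal-approximation interval: mean +/- 1.96 standard errors.
        if mean - 1.96 * std_err <= true_mean <= mean + 1.96 * std_err:
            hits += 1
    return hits / trials


# Hypothetical population, invented purely for illustration.
pop_rng = random.Random(1)
population = [pop_rng.gauss(40.0, 10.0) for _ in range(10_000)]

cov = coverage(population, n=50, trials=2000)
print(f"Share of intervals containing the true mean: {cov:.3f}")
```

With a sample of 50 out of 10,000, the share of intervals that cover the true mean should land near 95%, which is the property the comment describes.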

Statements concerning which variables are statistically significant and which are not are a misleading distraction here. To claim that “No statistically independent effect was observed for abortion legislation, constitutional amendment or other covariates” is misleading, because this is census data. The measured effects are what they are. We can debate the social consequences of the size of the various calculated values, but their statistical significance is irrelevant, as the results are based on the entire universe of data.

One of the striking trends seen in the census data, see Figure 1 and supplementary tables 10 to 15, is that “The group of states with less permissive abortion legislation showed apparently stable trends for MMR, MMRAO and iAMR during the decade analysed. The group of states with more permissive abortion legislation displayed decreasing trends for MMR, MMRAO and iAMR, narrowing the gap between the two groups by 2011, but still exhibited statistically significant differences (eg, MMR of 40.9 vs 33.5 per 100 000 live births for more permissive vs less permissive states, Z=3.04, rate ratio=0.82, p=0.002).” (Ignore the p-value, the rate ratio and Z value are exact, calculated from all available data.)

The states with more permissive abortion legislation showed the most improvement in maternal mortality measures, though this is not discussed much elsewhere and not mentioned in the summaries of findings. Of course, changes in other associated covariates are important to understand as well: did availability of clean water or other positive health measures change more in the states with more permissive abortion legislation?

Rather than obsessing on p-values in an analysis of census data, the authors should have assessed the social and health consequences of the differences in calculated parameter values. That’s where their significance lies, because they are exact values, and given population census data, true calculations of number of lives saved, rates of change and so on are exactly knowable.

Steven McKinney, I think this is a basic misunderstanding. You can always ask if you would get the same results if you sampled the same quantities during a different period. And the answer is very clearly that no, you are not likely to get the same results, even assuming all other parameters used to characterize the system are the same. In that sense the data are not complete, they are really just samples, and statistics is completely justified.

The situation is analogous to the question of how many soldiers in the Prussian army die every year from being kicked by a horse. The number is (was) known exactly. It was nevertheless one of the earliest popularly known applications of Poisson statistics.

Klavs Hansen, the misunderstanding is with the authors of the 32 Mexican States paper, and your characterization of the Prussian army example. While the number of deaths by horse kick from year to year may be well approximated by a Poisson distribution, once all counts of horse kick deaths for a given period are tallied, that is a known fixed quantity. You would not put a confidence interval around the total number of counted deaths from horse kicks or the mean number of kicks over the past 10 years with census data.

One can certainly use the characteristics of the Poisson distribution and the prior known quantities to estimate the likely range of expected deaths for future years, but having a census of all horse kick deaths for past years, why would one put a 95% confidence interval around a known quantity? So yes the authors could have used the 10 years of census data and a proposed model to predict future maternal mortality quantities, and placed uncertainty bands around such predictions, but for model fits within the 10 years of census data, the computed quantities are exact. Obtaining the census data again will yield the same quantities, there is no uncertainty involved with the census data and quantities computed therefrom.

From the study strengths and limitations summary: “In this study, relying on virtually complete, official vital statistics data, Mexican states with less permissive abortion legislation displayed lower maternal and abortion mortality ratios than states with more permissive legislation during a 10-year study period.” The study focuses on the 10 year window within which they have census data. Findings are therefore exact. One can debate the implications of the observed difference in mortality rates and counts, whether one or ten or a thousand more or less events has this or that social or health consequence, but the measured counts have no statistical uncertainty about them.

This is an issue addressed in the statistical field of Finite Population Sampling. In a finite population of size N, with a sample of size n less than or equal to N, the term (1-n/N) is generally referred to as the “finite population correction factor” as it appears repeatedly in equations such as the variance of a mean statistic.

If Yhat = (y1 + y2 + ... + yn)/n is the estimate of the population mean and sigmahat is the corresponding estimate of the standard deviation, with n less than N, then Var(Yhat) = (sigmahat x sigmahat)/n x (1 – n/N)

When we sample n = a few hundred or a few thousand people for an election poll or other study, and the population size N is millions, then n is much less than N, the finite population correction factor is so close to 1 that it changes almost nothing, and we can reasonably assume that the population size is infinite. But when n gets to be a sizeable percentage of N, and indeed when n is equal to N, the finite population correction factor becomes very important indeed.
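To make the arithmetic concrete, here is a minimal sketch of the variance formula quoted above, Var(Yhat) = sigmahat²/n × (1 – n/N). The function name `var_of_mean` and the numbers fed into it (a standard deviation of 10 and a population of one million) are assumptions chosen for illustration, not values from the paper.

```python
import math


def var_of_mean(sigma_hat, n, N):
    """Variance of a sample mean with the finite population correction."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    fpc = 1 - n / N  # finite population correction factor
    return sigma_hat ** 2 / n * fpc


sigma_hat = 10.0   # hypothetical standard deviation
N = 1_000_000      # hypothetical population size

for n in (1_000, 500_000, 1_000_000):
    fpc = 1 - n / N
    var = var_of_mean(sigma_hat, n, N)
    print(f"n = {n:>9,}  correction factor = {fpc:.3f}  Var(Yhat) = {var:.6f}")
```

For a poll-sized sample the factor is about 0.999 and barely matters; at n = N it is zero, so under this formula the variance of the mean vanishes.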

Yes, Steven McKinney, I would put a confidence interval around a number of deaths by horsekick if I wanted to know the true average value. That’s what’s interesting, and that’s what you use statistics for.

If you don’t understand this, consider an experiment where you have an exact number of counts in a detector in all the 64 k channels. Happens every day for a lot of us. Is there no numerical, computable, statistical uncertainty about the conclusions you can draw from that data set because you know all channel counts exactly?

If you count the number of moons orbiting planet Earth, what confidence interval will you put around that?

If you count the number of chairs in your office, what confidence interval will you put around that?

I’m not sure what the counts coming from your detector represent. Perhaps you can provide some more detail. If you use this detector to measure something now, and then you measure it again in a few minutes/days, do you get exactly the same numbers?

If I have a detector, and channel 1 always gives me a value of 0, what confidence interval will I put around that?

Statistical uncertainty arises when repeated attempts at measurement of the same phenomenon yield different answers. If you turn on a light source and shine a light at the moon, and measure the time until you detect reflected light returning, you will get different values each time you do so. If you poll 1000 random people asking whom they will vote for, then poll another 1000 random people, you will get different vote tallies. When repeated measurements yield different answers, statistical confidence intervals and p-values become useful.

The point is that we can know some things with certainty.

The number of soldiers who died from horse kicks was tallied, the number of deaths over 20 years for the “G” corps (row 1 of his table) reads thusly

0 2 2 1 0 0 1 1 0 3 0 2 1 0 0 1 0 1 0 1

16 deaths over a 20 year period. If you look up the numbers, you will also see 16 deaths over the 20 year period. This is census data, it is known exactly for G corps. 16 deaths over 20 years yields an average of 0.80 deaths per year over those 20 years. There is nothing uncertain about this, it is just a known average obtained from census data.

A histogram of those numbers will be well approximated by a Poisson density, that’s a useful thing if you want to understand expected numbers of deaths outside the census year data. But there is nothing uncertain about the average annual number of deaths during the census years.
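As a rough check of that claim, the sketch below tabulates the G corps counts listed above and compares them with what a Poisson distribution with the same mean would predict. The variable and function names are mine, and taking the observed average as the Poisson rate is an assumption made for illustration.

```python
import math
from collections import Counter

# G corps horse-kick deaths per year, copied from the row quoted above.
g_corps = [0, 2, 2, 1, 0, 0, 1, 1, 0, 3, 0, 2, 1, 0, 0, 1, 0, 1, 0, 1]

years = len(g_corps)
total = sum(g_corps)
rate = total / years  # observed average deaths per year


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


observed = Counter(g_corps)
print(f"{total} deaths over {years} years, average {rate:.2f} per year")
for k in range(max(g_corps) + 1):
    expected = years * poisson_pmf(k, rate)
    print(f"{k} deaths: observed in {observed[k]} years, Poisson expects {expected:.2f}")
```

The tally and the average are fixed facts about these 20 years; the Poisson column is a model-based expectation, which is where the two sides of this thread part ways.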

The number of soldiers who died in XV corps reads thusly

0 1 0 0 0 0 0 1 0 1 1 0 0 0 2 2 0 0 0 0

8 deaths over a 20 year period.

If I want to know the difference in horse kick fatalities between G corps and XV corps the difference is

G corps – XV corps = 16 – 8 = 8

There’s no confidence interval to put around that, and no p-value for any test that G corps deaths differ from XV corps deaths in this time period. This is census data of all deaths, and G corps suffered more deaths than XV corps in this 20 year window. We know this for certain, as we have all the data.

Similarly, when we tally the vote during an election, we do not put a confidence interval around the result. The count is a census count, we know the votes cast exactly. The candidate with the most votes wins outright. During the pre-election season, we can conduct polls of a few hundred or thousand people, and repeated such polls will yield different tallies, so we put confidence intervals around the measured differences. You hear poll results announced as “within three percentage points nineteen times out of twenty”, but nobody says that after the votes are tallied on election day.

There is a reason that a branch of statistics known as Finite Population Sampling exists, as I described previously. When our sample size approaches the size of the entire population, statistical uncertainty begins to wane at a rate expressed generally by the (1 – n/N) factor I discussed before. This is indeed the case for the assessment performed by Koch et al. since they have census data for Mexico for the ten year period of their study. n = N so the finite population correction factor is zero. If they or any other group collect the census data again and refit a model, they will get the same numbers. They will all get the same coefficient values in their fitted regression equations. They might argue that the differences in maternal mortality rates due to the different Mexican states’ abortion legislation are of no social relevance, but the differences do not have any statistical uncertainty.

Steven McKinney, your claim that “this study is based on all available data, not a random subsample that would change if the random subsample was independently reacquired.” is, I suspect, based on a misinterpretation of the point of the study. As I understand it, the question behind the study was “how do certain policy decisions affect maternal health?”. As such, the population and time in question are a sampling of all of humanity and all of human history.