As we reported earlier this spring, the UK journal *Anaesthesia* published a remarkable statistical analysis of the work of Yoshitaka Fujii, the Japanese anesthesiologist who has been accused of fabricating his results for years — and who, we’re led to believe, may soon wind up with the record for retractions, at a number north of 190.

Fujii has responded to the journal with an equally startling (for different reasons, of course) rebuttal. We received permission from Steve Yentis, *Anaesthesia*’s editor, to reprint the letter in its entirety. We present it here, and strongly recommend that readers take a look at the journal’s website to read the piece that prompted Fujii’s response:

I seriously read the Special Article by Dr Carlisle [1]. As is well known, Dr. Carlisle is interested in the area of peri-operative medicine [2]. Similarly, I am interested in this area and have made efforts to improve the postoperative outcomes of surgical patients. Additionally, we have provided information on diaphragm muscle dysfunction and its improvement in animal studies. However, this article by Carlisle can obviously be very damaging to me and I want to answer it seriously, but I am not a statistician. I can only offer a few elements of rebuttal at this point.

Postoperative nausea and vomiting (PONV) remains a common complication for surgical patients. In addition to patients’ discomfort, the physical act of vomiting may increase the risk of aspiration, wound dehiscence, and delayed recovery and discharge times [3]. For the management of PONV in high-risk patients, we have evaluated the efficacy and safety of antiemetics, including serotonin receptor antagonists, droperidol, metoclopramide and others, as first reported by us in 1994 [4]. Factors affecting PONV include patients’ characteristics, surgical procedure, anaesthetic technique and postoperative care [3]. Patient-related factors associated with increased PONV include age, female sex, obesity, a history of motion sickness and/or previous PONV, and menstruation. Increasing age during adulthood is associated with a decreased incidence of PONV. Considering these factors, most reports by us have excluded patients aged over 60 years, those who were obese, those with a history of motion sickness and/or previous PONV, and those who were menstruating. Being different from European and American nations, most Japanese people are middle-sized. Consequently, patients’ characteristics would be comparable in our series of clinical investigations. In addition, middle-aged Japanese women suffer from specific diseases, such as uterine myoma, breast cancer and goitre. Difference in diet, level of stress, etc can certainly produce a bizarre distribution of data specific to Japanese people. We cannot select the patients of our studies as broadly as we would want to.

As described in Kranke et al.’s letter and my response [5], granisetron, classified as a serotonin receptor antagonist, lacks the sedative, dysphoric and extrapyramidal symptoms associated with non-serotonin receptor antagonists. It is known that mild headache is one of the adverse effects in patients receiving granisetron. As mentioned in our published articles, trained nurses asked the patients about their conditions postoperatively. According to these results, in our manuscripts, its incidence was verified as approximately 10%. The researchers asked the patients if they experienced headache, dizziness and drowsiness, with only two possible answers (yes/no). This assessment might have caused the identical results regarding the incidence of postoperative adverse events. When analysing the degree of headache in detail, different results may have been obtained.

The diaphragm is the most important muscle in the respiratory pump. Since publishing our first laboratory report [6], we have studied the effects of several drugs, such as phospodiesterase-3 inhibitors, calcium channel blockades, benzodiazepines, and others, on diaphragmatic contractility in animals. All measurements (including haemodynamics, blood gas tensions, trans-diaphragmatic pressure and integrated activity of the diaphragm) and analyses of data obtained from the experiments were performed by myself and colleagues (co-authors), and this can be proved by them.

I understand that the tests by Dr. Carlisle are designed to uncover statistical anomalies based on very few assumptions about the data. I am not qualified to counter specific allegations concerning the ‘central limit theorem’ and its applicability in our case. As I said, our data sample is very special, but I do not have the skills to examine in detail if it has an impact on Carlisle’s analyses.

Finally, since the critical report against me by Kranke et al. was published in 2000, I have greatly suffered. Nevertheless, I have continued my clinical and laboratory studies with great care. In addition, there has been confusion concerning the ethical procedures at Ushiku Aiwa General Hospital where I did clinical research. This hospital did not have a formal institutional ethics committee, and therefore I sought and obtained the approval of the Vice-Chairman. Later, while at Toho University School of Medicine, I was unfairly blamed for Ushiku’s informal procedures. As a result of a lack of ethical approval, I received the advice of the university authorities and left Toho University.

The only thing I can say is that we performed the tests over years with full honesty and integrity. Additionally, I did not write these articles alone, and some of data were collected by others as well.

Now, we’ll freely admit that we aren’t stats gurus either. But a few things jump out at us about the letter.

The first is that Fujii seems to be engaging in a bit of misdirection here. At the heart of his defense is the argument that his study populations might be markedly different from those in other countries, to the degree that they could “produce a bizarre distribution of data.”

But, as Yentis and Carlisle point out in their own rebuttal, that’s irrelevant.

We thank Dr Fujii for his letter which, unfortunately, does not address the fundamental basis of the analysis of his work [1]. As has been explained [2], the distribution of means sampled from any population of continuous measurements, no matter how bizarre the original distribution of measurements, is always normal/Gaussian (see Fig. 4, reference [2]). Furthermore, the alleles that contribute to individual characteristics behave according to fundamental laws of nature and thus apply to all populations – including the Japanese – however distinct they may be [2, 3].

The statistical principles underlying the analysis [1] are literally universal. Apart from genetics, they apply to the behaviour of tiny particles (e.g. mass-velocity of atoms) and galaxies (e.g. Doppler shifts), and to analyses of the extremes of time (e.g. the speed of light and the slowest radioactive decay). An exception to these mathematical principles would shake the basis of most of modern scientific knowledge and understanding.

Then there’s Fujii’s claim that he “greatly suffered” as a result of the 2000 letter by Kranke et al., which was published in *Anesthesia & Analgesia*. As we have reported, that letter argued that Fujii’s data, and in particular the reported side effects in his trials, were too clean to be, well, clean: “Incredibly nice,” in the authors’ words.

Perhaps that complaint is true. But a Medline search of Fujii’s name for papers published between 2001 and 2012 turned up at least 37 articles on randomized trials alone, or an average of more than three a year over that period. We suppose that might be considered a great hardship, especially when one is used to cranking out five or 10 times that many papers a year, but it strikes us as a somewhat more reasonable output.

Finally, there’s a lawyerly point to make. Fujii may well have “performed the tests over years with full honesty and integrity.” But that doesn’t necessarily mean he reported the results that way. Just saying.

Denial of all charges doesn’t add much in itself – both those guilty and innocent of misconduct would deny any fault.

The attempt to justify the carrying out of these studies without ethical approval of any form reduces credibility. There are also the beginnings of the defence that others are responsible for the problems, which is also suggestive of misconduct.

However the big questions are:

1. Are these various institutions going to hold him to account and actually investigate these papers in a meaningful way? This would involve examination of the original ‘data’ and determination of whether the papers are fabricated or falsified. Somehow I find this prospect extremely unlikely. At the least, given the response above, the institutions are likely to be faced with obfuscations and diversions and quite possibly legal opposition; have the institutions done the right things so far under Japanese employment law? Easy to imagine that they haven’t or can’t – and this could make the allegation of misconduct almost impossible to investigate.

2. If the institutions don’t enforce a robust process, what will the anaesthetic journals do? Will they retract the papers as they have threatened to do?

By analogy, in the Gopal Kundu / NCCS case the paper in JBC obviously involved fraudulent pasting of gels, but the employer managed to find no evidence via an external committee. So technically Kundu was never found guilty of misconduct, despite the paper clearly involving misconduct. JBC correctly retracted the paper.

Here, however, we’re talking about 190 papers, not one – it will be fascinating to see how this goes.

“An exception to these mathematical principles would shake the basis of most of modern scientific knowledge and understanding.”

And this would be a bad thing? Why exactly? Did Newton’s work not “shake the basis of modern scientific knowledge”? What about Copernicus’, Einstein’s, Darwin’s? While I do not claim Fujii is another Einstein, progress in science is made by noticing anomalies, not facts confirming the established dogma.

Maybe Fujii should initiate a collaboration with Erik “theory of everything” Andrulis?

Chirality, you miss the point. Discovering something outrageously and radically new would of course be a good thing. But Occam’s razor can very usefully be applied here. Results are found that according to our current understanding of statistics are likely to occur by chance one time in 10^33. To put that into some sort of context, if you carried out that particular trial a million times per second, the universe could evolve from the Big Bang to its current age two billion times before you got the claimed result. So what’s more likely – that our fundamental understanding of supposedly universal principles of mathematics is wrong, or that the claimed result was fraudulent?
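The “two billion times” framing can be checked with quick arithmetic (assumed figure, not from the comment: a universe age of roughly 13.8 billion years):

```python
# Rough check of the comment's framing; the universe-age figure (~13.8 billion
# years) is an assumption supplied here, not stated in the comment itself.
trials = 1e33                    # odds quoted: 1 in 10^33
rate = 1e6                       # a million trials per second
seconds_needed = trials / rate   # 1e27 seconds of continuous trials

universe_age_s = 13.8e9 * 365.25 * 24 * 3600   # ~4.4e17 seconds
print(seconds_needed / universe_age_s)          # ~2.3e9: about two billion universe lifetimes
```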

My comment had nothing to do with merits, or lack thereof, of Fujii’s work. Like him I am no statistician. I only disagreed with Yentis and Carlisle’s sentiment that science should conform to a dogma. Somehow, the statement did not sound right to me.

Yentis and Carlisle lay out two alternatives: 1) Everything we know about sampling, statistics, and natural variability is wrong, elucidated just by chance through an analysis of Fujii’s clinical trial data, or 2) Fujii did not truthfully report his results.

I wouldn’t read much more into it.

I do not think any such sentiment was present or intended. When you’ve got papers by one single researcher contradicting principles which are applied in all branches of science, it’s hardly being reactionary to deduce that the researcher’s results are suspect, and not the whole edifice of statistics.

There are currently 4 living US presidents. What are the odds that choosing 4 Americans, at random, would generate the false result that every American is a US president? 1 in 3×10^32. Is it really possible that Fujii’s results are 3 times less likely than a random selection of 4 Americans selecting only presidents? I don’t think so. My guess is that Fujii’s papers presented means and standard deviations on non-Gaussian data, and p values calculated using those inaccurate sample characterizations yield humorous but highly exaggerated (and perhaps misleading) p values. A recent news article in *Science* (paywall, abstract here: http://www.sciencemag.org/content/331/6015/272.summary) pointed out that p values in medicine rarely stand up to re-evaluation.
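The back-of-envelope number in that comment is easy to reproduce (the population figure of roughly 310 million Americans is an assumption, consistent with the 2012 date of the post):

```python
# Reproduce the commenter's estimate; the population figure (~310 million)
# is an assumption, not stated in the comment.
population = 310e6
living_presidents = 4

p_one = living_presidents / population   # chance one random American is a living president
p_all_four = p_one ** 4                  # four independent random picks
print(p_all_four)                        # ~2.8e-32, i.e. roughly 1 in 3x10^32
```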

Fujii points out that all measurements and analyses were performed by himself and colleagues (co-authors), “and this can be proved by them”. I wonder: has anyone actually asked them?

Pete, rather than share your guess at what Fujii’s data was like, it would be preferable to read Carlisle’s paper. As I understand it, he developed a novel method for analysing the demographic data in scientific papers. Essentially, he found that the distribution between the supposedly randomised groups of simple characteristics such as height, gender, weight, etc, could not possibly have occurred by chance. (Well, he found they were incredibly unlikely to have occurred by chance.) It is absolutely true that p values in medicine, and other areas of science, are grossly over-interpreted. For example, a p value of 0.02 suggests very weak evidence indeed. But we shouldn’t throw out the baby with the bath water. When the p value becomes vanishingly small, it does indicate convincing evidence.

From the rebuttal by Yentis and Carlisle: “the distribution of means sampled from any population of continuous measurements, no matter how bizarre the original distribution of measurements, is always normal/Gaussian”. This is blatantly untrue. While the central limit theorem (no need to put that in quotation marks) does state that the sample mean from any population distribution (given certain regularity conditions) is normally distributed, this is true only as the sample size approaches infinity (i.e., asymptotically). In finite samples, this result may provide a useful approximation, but it is by no means “true” (in fact, I would argue that it is false with probability one, but I digress).

@A. Student: Study harder. The fact that we only achieve perfection as the sample size approaches infinity doesn’t at all rule out its use in cases like these. To see a nice simulation, try this demo from Rice University: http://opl.apa.org/contributions/Rice/rvls_sim/stat_sim/sampling_dist/index.html

@zbicyclist, while what you say is mostly true, @A. Student is right. Yentis & Carlisle’s quote is an oversimplification (or, if you are more strict, wrong). It holds only for independent (usually identically distributed) random variables with finite means and variances (under certain regularity conditions), and the sum (or mean) has to be taken over a sufficiently large number of these random variables to be approximately normal. This implies:

a) If the rv are not independent, the CLT does not necessarily hold

b) If the rv are not identically distributed, the CLT needs additional conditions to hold

c) If the distribution has nonfinite mean or variance (e.g. a Cauchy distribution), the CLT does not hold

d) For any finite number of the rv, the distribution will only be approximately normal; it is exactly normal only in the infinite limit

By CLT I mean the classic central limit theorem. There are other CLTs that refer to weak convergence to an attractor distribution, which need not be normal (e.g. to an alpha-stable distribution for power laws).
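For readers who want to see the classic CLT in action at the kind of sample sizes clinical trials use, here is a minimal simulation (ours, not from the thread): means of samples drawn from a heavily skewed exponential distribution come out approximately normal even at n = 50.

```python
import random
import statistics

# Minimal CLT demo: means of samples from Exp(1), a strongly skewed distribution.
random.seed(42)
n, reps = 50, 20_000
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

# CLT prediction for Exp(1): the sample means have mean 1 and sd 1/sqrt(n).
mu = statistics.fmean(means)
sd = statistics.stdev(means)
print(round(mu, 2), round(sd, 2))   # close to 1 and 1/sqrt(50) ≈ 0.14

# Exp(1) itself has skewness 2; the distribution of sample means is far more
# symmetric (theory for the means: roughly 2/sqrt(n) ≈ 0.28).
skew = statistics.fmean(((m - mu) / sd) ** 3 for m in means)
print(round(skew, 2))
```

The residual skewness illustrates @A. Student’s point: the approximation is good but not exact at finite n, shrinking only as the sample size grows.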

I like your choice of names! However, Gosset might disagree with your conclusion.

The central limit theorem states that means will follow a normal distribution. It doesn’t matter if there are 2 samples, 100 samples, or millions of samples. For any number of samples, the deviation from normality, measured by Chi Squared, can be tested against the chi squared distribution for the number of samples. The result is the fraction of chi squared values equal or greater than the measured value. Obviously fewer samples results in greater deviation from normality. However, that deviation is captured by the chi squared distribution for the number of samples.

You say “blatantly untrue.” One cannot know what is “true” or “false” in science, and Yentis and Carlisle do not claim otherwise. What they can and do state is that the deviation of the Fujii results from normality, tested using the chi squared distribution for the number of samples, is extraordinarily unlikely.
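As a sketch of the kind of test being described (ours, not Carlisle’s actual procedure), one can bin a set of observed sample means into equal-probability bins under the normal distribution the CLT predicts for them, and compute a chi-squared goodness-of-fit statistic; honest data should sit comfortably within the chi-squared distribution’s typical range.

```python
import math
import random

# Sketch only (not Carlisle's method): chi-squared goodness-of-fit of sample
# means against the normal distribution the CLT predicts for them.
random.seed(7)
n, reps, k = 40, 5_000, 10   # sample size, number of means, number of bins

# Honest data: means of Uniform(0,1) samples. CLT prediction: Normal(0.5, 1/sqrt(12n)).
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]
mu, sigma = 0.5, 1 / math.sqrt(12 * n)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Drop each mean into one of k equal-probability bins under the predicted normal.
counts = [0] * k
for m in means:
    u = norm_cdf((m - mu) / sigma)
    counts[min(int(u * k), k - 1)] += 1

expected = reps / k
chi2 = sum((c - expected) ** 2 / expected for c in counts)
print(round(chi2, 1))   # for honest data, typically near k-1 = 9; the 5%
                        # critical value for 9 df is about 16.9 (standard tables)
```

Fabricated demographics would instead pile up in too few bins (or spread too evenly), driving the statistic far out into the tail, which is the sense in which Carlisle’s p values become vanishingly small.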