Two mega-corrections for Anil Potti in the Journal of Clinical Oncology

Anil Potti can add two corrections to his less-and-less impressive publication record. The mega-corrections — part of what we are close to being ready to call a trend in errata notices — in the Journal of Clinical Oncology (JCO) are, however, quite impressive, each with at least a dozen points.

One of the corrections, for a paper cited 15 times, according to Thomson Scientific’s Web of Knowledge, basically removes all references to chemotherapy sensitivity:

The September 1, 2009 article by Anguiano et al, entitled, “Gene Expression Profiles of Tumor Biology Provide a Novel Approach to Prognosis and May Guide the Selection of Therapeutic Targets in Multiple Myeloma” (J Clin Oncol 27:4197-4203, 2009), contained material that requires correction, in light of the retraction of an article by Potti et al (Nat Med 12:1294-1300, 2006) that was referenced in this article.

1) The corresponding author was given as Anil Potti, MD. It should now be: Sascha A. Tuchman, MD, Division of Medical Oncology, Duke University Medical Center, Morris Building 25155, Box 3872, Durham, NC 27710; e-mail: [email protected].

2) In the Abstract, the last sentence of the Purpose section was given as: “We performed gene expression profiling (GEP) with microarray data to better dissect the molecular phenotypes, sensitivity to particular chemotherapeutic agents, and prognoses of these diseases.”

It should now be: “We performed gene expression profiling (GEP) with microarray data to better dissect the molecular phenotypes and prognoses of these diseases.”

3) In the Abstract, the last sentence of the Methods section should be omitted.

4) In the Abstract, the last sentence of the Results section was given as: “These clusters differentiated themselves based on predictions for prognosis and chemotherapy sensitivity (eg, in ISS stage I, one cluster was characterized by increased CIN, cyclophosphamide resistance, and a poor prognosis).”

It should now be: “These clusters differentiated themselves based on predictions for prognosis (eg, in ISS stage I, one cluster was characterized by increased CIN and a poor prognosis).”

5) In the Methods section, under Development of Signatures, the last paragraph should be omitted.

6) In the Methods section, the heading of the last subsection was given as: “Application of Oncogenic Pathway–, Tumor Biology–, and Chemotherapy-Sensitive Signatures.”

It should now be: “Application of Oncogenic Pathway– and Tumor Biology–Signatures.”

7) In the Results section, the last subsection, entitled “Conventional Cytotoxic Chemotherapeutic Sensitivity Patterns in ISS Risk Cohorts,” should be omitted.

8) In the Discussion section, the last two sentences of the fourth paragraph should be omitted.

9) In the Discussion section, the third through last sentences of the second-to-last paragraph should be omitted.

10) In the Discussion section, the first sentence of the last paragraph was given as: “We view the work described here as a platform for the future development of highly refined genomic prognostic models and chemotherapy-sensitivity predictors for all agents that have activity in treating MM, not only conventional, cytotoxic agents.”

It should now be: “We view the work described here as a platform for the future development of highly refined genomic prognostic models.”

11) In the References section, references 46 and 48-55 should be omitted.

12) In the Appendix, Figure A2 and Figure A5 should be omitted.

The online version has been corrected in departure from the print. JCO has obtained written assurance from the corresponding author that there are no concerns with respect to the remaining data and conclusions stated in this article. JCO shared the decision to publish this Author Correction Note with institutional representatives at Duke University, who are in agreement with the decision. The authors apologize for the errors.

The other, for a study cited nine times, goes as far as to change the title of the paper:

The November 20, 2009 article by Rao et al, entitled, “Age-Specific Differences in Oncogenic Pathway Dysregulation and Anthracycline Sensitivity in Patients With Acute Myeloid Leukemia” (J Clin Oncol 27:5580-5586, 2009), contained material that requires correction, in light of the retraction of an article by Potti et al (Nat Med 12:1294-1300, 2006) that was referenced in this article.

1) The title should now read “Age-Specific Differences in Oncogenic Pathway Dysregulation in Patients With Acute Myeloid Leukemia.”

2) In the Abstract, Patients and Methods section, the last sentence was given as: “Gene expression analysis was conducted utilizing gene set enrichment analysis, and by applying previously defined and tested signature profiles reflecting dysregulation of oncogenic signaling pathways, altered tumor environment, and signatures of chemotherapy sensitivity.”

Whereas it should now be: “Gene expression analysis was conducted utilizing gene set enrichment analysis, and by applying previously defined and tested signature profiles reflecting dysregulation of oncogenic signaling pathways and altered tumor environment.”

3) In the Abstract, Results section, the third sentence should be omitted.

4) Also, in the Results section of the Abstract, the last two sentences were given as: “Hierarchical clustering revealed that younger AML patients in cluster 2 had clinically worse survival, with high RAS, Src, and TNF pathway activation and in turn were less sensitive to anthracycline compared with patients in cluster 1. However, among elderly patients with AML, those in cluster 1 also demonstrated high RAS, Src, and TNF pathway activation but this did not translate into differences in survival or anthracycline sensitivity.”

Whereas they should now be: “Hierarchical clustering revealed that younger patients with AML in cluster 2 had clinically worse survival, with high RAS, Src, and TNF pathway activation compared with patients in cluster 1. However, among elderly patients with AML, those in cluster 1 also demonstrated high RAS, Src, and TNF pathway activation but this did not translate into differences in survival.”

5) In the Abstract, Conclusion section, the first sentence was given as: “AML in the elderly represents a distinct biologic entity characterized by unique patterns of deregulated signaling pathway variations that contributes to poor survival and anthracycline resistance.”

Whereas it should now be: “AML in the elderly represents a distinct biologic entity characterized by unique patterns of deregulated signaling pathway variations that contributes to poor survival.”

6) In the Patients and Methods section, the heading of the third subsection was given as “Oncogenic Pathway and Chemotherapy Sensitivity Analyses,” whereas it should now read “Oncogenic Pathway Analyses.”

7) Also, in the same subsection, the first sentence of the first paragraph was given as: “Previously described signatures of oncogenic pathway dysregulation (eg, RAS, PI3K, Src, Beta-catenin, Myc, and E2F), cancer biology, and tumor microenvironment (eg, wound healing [WH] as a measure of angiogenesis, epigenetic stem-cell signature [EPI], and TNF), and chemosensitivity (adriamycin) were applied to clinically annotated microarray data using MatLab Software, version 7.0.4 (MathWorks, El Segundo, CA).9-12”

Whereas it should now be: “Previously described signatures of oncogenic pathway dysregulation (eg, RAS, PI3K, Src, Beta-catenin, Myc, and E2F), cancer biology, and tumor microenvironment (eg, wound healing [WH] as a measure of angiogenesis, epigenetic stem-cell signature [EPI], and TNF) were applied to clinically annotated microarray data using MatLab Software, version 7.0.4 (MathWorks, El Segundo, CA).9-12”

8) Also, in the same subsection, the second sentence of the first paragraph and the entire last paragraph should both be omitted.

9) In the Results section, the subsection “Chemotherapy Sensitivity Patterns in AML” should be omitted.

10) In the Discussion section, the fourth sentence of the second paragraph was given as: “Elderly patients with AML have an increased probability of RAS, Src, and TNF pathway activation when compared with their younger counterparts, and this may in part explain decreased sensitivity to anthracycline.”

Whereas it should now be: “Elderly patients with AML have an increased probability of RAS, Src, and TNF pathway activation when compared with their younger counterparts.”

11) In the Discussion section, the last sentence of the third paragraph should be omitted.

12) In the Discussion section, the last paragraph was given as: “Finally, although host-related factors and performance status also play an important role in the prognosis of AML, we hope that this study has been able to dissect the biology of AML as a function of age with regard to the underlying molecular events and insensitivity to anthracyclines.”

Whereas it should now be: “Finally, although host-related factors and performance status also play an important role in the prognosis of AML, we hope that this study has been able to dissect the biology of AML as a function of age with regard to the underlying molecular events.”

13) In the References section, references 12 and 13 should be omitted.

14) In the Appendix, Figure A4 should be omitted.

The online version has been corrected in departure from the print. JCO has obtained written assurance from the corresponding author that there are no concerns with respect to the remaining data and conclusions stated in this article. JCO shared the decision to publish this Author Correction Note with institutional representatives at Duke University, who are in agreement with the decision. The authors apologize for the errors.

We’re all for complete transparency, and the revised papers are quite clear about what’s been changed, but are we the only ones wondering about the threshold for retracting this work? Potti has had nine retractions, the most recent of which was also in the JCO.

For more on the ongoing Potti story, watch this Sunday’s edition of 60 Minutes.

Hat tip: David Hardman

74 thoughts on “Two mega-corrections for Anil Potti in the Journal of Clinical Oncology”

  1. Shame on JCO! This seems more an effort to save face than to correct the record. There are 12 corrections to the Anguiano et al paper, and the entire notion of tumor genotyping to predict prognosis seems to have been excised. It is absurd to pretend that this paper does not deserve to be retracted.

    The present paper is fatally tainted. Better to retract, correct, and republish without Potti on the byline, so that people can have some confidence that corrections are done with. I would NEVER cite this paper.

    1. Unfortunately, it seems very difficult to boycott the JCO itself… After all, many good papers warrant publication here and thus deserve to be cited. However, just imagine: a top journal receiving 0 citations over two years would be sanctioned with a shameful impact factor of 0! Brrrrr!!!
      Anyway, with or without a boycott, I’m afraid the reputation of JCO is definitively damaged by the publication of such errata.

  2. With the extent and nature of corrections reported, the original paper is changed completely – the title in one case, the corresponding author in another one. Yet the journal did not consider it fit for retraction? Maybe the rest of the findings are accurate, but then why not retract these and then have them resubmit the corrected ones? Let it go through the review process to assess the manuscript!!
    This seems very odd.

  3. This is astonishing…. I’m not sure what to make of it.

    The “face value” interpretation is that the data presented in the corrected paper is sound but that the interpretations should be modified to remove reference to unsound data on anthracycline sensitivity/resistance and adriamycin chemosensitivity, arising from a retracted paper.

    In an ideal world this seems quite a satisfactory outcome, particularly as the electronic versions of these papers have apparently been “corrected” to account for erroneous interpretations arising from the flawed and retracted paper. The scientific record is intact…

    So why does this leave a nasty taste in the mouth? I think it’s because this episode suggests that it’s OK to cut corners, fiddle the data a little, and publish bullcrap because one can go back into the literature and correct “this and that” later on. Additionally we’re reliant on Dr. Potti’s (or is it Dr. Tuchman?) assurance that that data was flawed…but this data is just fine, even ‘though experience should inform us that Dr. Potti’s assurances (or are they Dr. Tuchman’s?) may not be as reliable as we might like.

    In this circumstance we’d really like an independent assessment of what bits of Potti’s work are sound and which aren’t..

  4. Kudos to JCO for recognizing that while Anil Potti’s chemosensitivity signatures were irreproducible, work published in the manuscript by other authors is reproducible and of value. Kudos also to the team that is trying to correct the record – this is a complex case where many will suffer because of the alleged fraud of Anil Potti. It is important to distinguish Anil Potti’s allegedly fraudulent data from work of others that is completely reproducible. The scientific community would not be served best by “throwing the baby out with the bath water”.

    1. It seems to me an open question as to whether there’s a baby in this bath water. Parsing what is and isn’t presently defensible does not address the primary issue of lost trust. The reputation of all co-investigators has been tarnished and now we don’t know whether to trust the “correction.” Is another shoe ready to drop?

      Another thing to think about; would either paper have been published in the first place without those parts of the research that have now been recanted? In other words, are we left with a trivial paper, now that the exciting ideas have been retracted?

      In any case, thanks for being open about your affiliation…. That took some courage.

      1. “Another thing to think about; would either paper have been published in the first place without those parts of the research that have now been recanted? In other words, are we left with a trivial paper, now that the exciting ideas have been retracted?”

        This is exactly what I have been wondering. This is outside my expertise, and therefore I am not qualified to give a sound opinion… but from my experience publishing my own work, getting papers into top journals only occurs because of the entirety of the data PLUS the implications of the data/results/conclusions. These errata pull out a big part of the conclusions. I suspect the papers in their now-revised form would not have been sufficient for publication in this journal. (But again, that suspicion comes from a relatively uninformed position.)

    2. In this case the bathwater stinks so badly the baby is going to be covered with a permanent stink. Best for any baby that is worth raising to just be reborn.

    3. Without a clear separation and REANALYSIS of the data, it is not possible to separate infant from liquid. The Duke investigators who wish to do so would be advised to redo the research, determine if live infants are present, and republish if that is the case, otherwise allow the retractions to proceed. If I were Nevins, I would retract the entire mess, reanalyse, and publish what remains.

  5. There has been no evidence or even claims of misconduct for anyone other than Anil Potti. We are scientists, and base our conclusions on facts and evidence. In the absence of any evidence, you cannot assume guilt. In fact, I would argue that one would have to assume innocence. This goes for both the scientists involved in the group, as well as their research. To assume guilt by association is unscientific and inappropriate.

    1. I agree with these comments in theory. But as my PhD advisor often said, “we only look for our lost keys below the lit lamppost.” Has Duke investigated all contributors to these manuscripts? All involved biostatisticians? (I don’t know the answer to this question; it is asked honestly.) Thus, it may not be justified to say that no one else is implicated, if no one else has been examined with care.

      It was my understanding that Potti suggested that a few rows of an Excel spreadsheet had been accidentally shifted. How did this happen (presumably once), and then contribute to errors in so many different studies?? Again, this isn’t my field, so I do not understand how the data was collected… but was a single data collection truly parcelled into 12 or more different publications? Seems a very dangerous way to do work – especially if a few transposed numbers could have this type of effect.

      1. This becomes a major issue in modern biology. Large data sets exist that most of the authors have no clue about. Mistakes in these can only be uncovered by the people who made the mistakes. Most do not have the expertise to do this.

    2. Unfortunately for Duke, Potti led a team. The team did the work, and Potti was part of the team. I DO NOT BELIEVE FOR A SINGLE MINUTE THAT POTTI WAS THE ACTUAL DATA MANIPULATOR. Someone else did the data manipulation. Who was that?

  6. As far as I know, aside from his “mea culpa by proxy” in the memo issued by Dr Willard there has been no official admission of sole responsibility from Potti.

    All of the retractions refer to the data as not being reproducible but provide no information about why this is or who is responsible.

    The only information available that provides an explanation for some of these retractions comes from Baggerly and Coombes but again this doesn’t identify who was responsible.

    I’m afraid this issue will not be resolved until Duke or the NIH/ORI issues a final report.

  7. If I understand correctly, the problem was with the data analysis and interpretation. Is Potti a bioinformatics person? Did he do the bioinformatics analysis himself? I doubt it. Therefore, he must have instructed bioinformatics experts to perform the analysis in a particular manner. There are quite a few common names in all the papers. If someone else has done the analysis of the data then they have also been aware of this since the beginning. These are just assumptions but they must have covered this during the investigation.

    1. You do understand correctly, Ressci, the problem was with the data analysis and interpretation. To single out Potti as the fall guy here has unfortunate consequences, because the underlying problems with the analytical technique are not being adequately addressed – sadly, picking out and hammering on a fall guy is a popular paradigm in dealing with complex problems.

      An important principle in statistics is that a model fitted to one set of data should yield reasonably accurate predictions if that same model is assessed on additional sets of data that were never used in any initial model fitting exercises. This principle was repeatedly violated in analysis after analysis by groups of researchers at Duke. Their analysis always involved fitting models to ALL of their data and deriving new variables from that analysis to which they gave fancy names (see references to “supergenes”, then “metagenes”, then “gene expression signatures” – the name kept changing but they are the same in terms of their statistical modeling origins. They even applied for patents on these “metagenes” though patents were not granted). After deriving these variables from the whole data set, the data set was then split into two halves: one called the “training” set and the other the “validation” (or “testing”) set. Some additional model building on the training set was performed, but the final models from the training set included the “metagene” variables. Model performance was then assessed on the validation set, and results discussed as if the validation data had not been used in the model fitting exercise. This is the problem with dozens of analyses in paper after paper from the Duke genomics groups. Of course the models performed well on the validation data sets – the validation data set had been used in the early model building exercises.

      This is why no one can properly assess the validity of the Duke methods without all of the computer code used to analyze the data, a point which Baggerly and Coombes have repeatedly hammered home. A few sentences in a space-restricted publication such as a Nature or JCO journal article can never fully reveal exactly how the data were analyzed. Baggerly and Coombes did obtain computer code from Duke, which they have extensively analyzed and documented. In their document “ovca07.pdf” they state:

      “The scores computed in Bild et al are based on a different approach, which proceeds as follows. (a) A matrix giving the expression values and train/test status for both the training and the testing data is supplied to a fitting routine. (b) The set of genes to explore is selected based on the training data alone, using two-sample t-tests and expression values on the log scale. The submatrix involving only these genes is extracted. (c) Rows of the selected submatrix are centered and scaled, and a singular value decomposition is applied to the result. Note that these steps are applied to both the training and the test data at the same time. (d) The weights associated with the first few singular vectors are then used with the training data to build a logistic regression model, returning a probability of the pathway being “high”. This logistic regression model is then used to compute scores and weights for the test samples as well. We chose to not employ the above approach, because we are uneasy with step (c), which allows values in the test data to affect the coefficients of the predictive model. Using the scores that we compute, we do not see a large story here.”

      Given the size of these genomic data sets, it is a known statistical conundrum that patterns will be found, even in completely random data, if you look around long enough. That’s all Potti had to do with this modeling machinery – look around for a while until a spurious pattern showed up, quickly capture that apparent pattern, write it up with technically correct jargon and publish. The bioinformaticians and statisticians involved in these studies should have stopped this kind of analysis, but did not. They too have some explaining to do. If you look at the authors of all the retracted (and “corrected”) papers, and look at the Duke statistics, biostatistics, bioinformatics and genomics faculty lists, you can readily come up with a list of personnel who should have done their statistical duties with more professionalism.

      My apologies for the long reply here, but that’s the problem underlying this whole fiasco – the issues are somewhat complex. If you are still reading this reply, wipe the glaze from your eyes, shake your head, stand up and stretch, and let’s go on shall we? Baggerly and Coombes recognized that all this jargon and discussion does confuse people and turn them off, so have concentrated their discussions on the simple errors they found in spreadsheets of data – much easier for people to grasp than the known statistical issue of overfitting a model to a set of data and then showing that the model performs well on the same set of data.

      Note that it was Anil Potti’s false claim of a Rhodes scholarship that started this whole cascade of investigations and cancellations of clinical trials, not the fact that Baggerly and Coombes had been reviewing the statistical issues for years. Eyes glazed over until the fraudulent scholarship claim allowed the scapegoat paradigm to unfold. This is a shame. Instead, proper statistical evaluation of the Duke techniques, as has been started by Baggerly and Coombes, needs to be completed so that we have a sound scientific basis by which to judge the potentially spurious nature of the findings in these JCO and dozens of other as-yet unretracted papers.

      When the Institute of Medicine first announced its investigation, I summarized my analysis of the statistical methods underlying these papers in a letter of concern to the committee (you can obtain my analysis in PAF Document 19.pdf obtainable from The Cancer Letter documents page, for the Jan 7, 2011 edition). Duke Investigator states in his/her reply above that “We are scientists, and base our conclusions on facts and evidence.” This is not true – there are unsubstantiated claims in the statistical papers that all these retracted papers cite that I documented in my letter to the Institute of Medicine. Until serious statistical analyses of the Duke methods are completed, Duke Investigator’s claim remains false. These retracted papers (and many other unretracted papers) are based on as-yet unproven assertions.
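
      To make the leakage issue concrete, here is a minimal sketch in R on simulated data. It is not a reconstruction of any Duke analysis; it is the textbook version of the problem, in which gene selection and principal-component “metagene” scores are computed before the train/validation split is honored, set against the honest workflow on the same pure-noise matrix:

      ```r
      # Simulated illustration only: 2000 "genes" of pure noise and random labels.
      # The leaky workflow reuses all samples (gene selection + PCA) before the
      # split is honored; the honest workflow touches the training half only.
      set.seed(1)
      n <- 100; p <- 2000
      x <- matrix(rnorm(n * p), nrow = n)       # noise "expression" matrix
      y <- rbinom(n, 1, 0.5)                    # random class labels
      train <- sort(sample(n, n / 2))
      test  <- setdiff(seq_len(n), train)

      tstat <- function(mat, lab)               # per-gene two-sample t statistic
        apply(mat, 2, function(g) t.test(g[lab == 1], g[lab == 0])$statistic)

      ## Leaky: genes picked and PC "metagenes" computed on ALL samples
      sel  <- order(abs(tstat(x, y)), decreasing = TRUE)[1:50]
      pcs  <- prcomp(x[, sel], center = TRUE, scale. = TRUE)$x[, 1:3]
      fit  <- glm(y[train] ~ pcs[train, ], family = binomial)
      pred <- as.numeric(cbind(1, pcs[test, ]) %*% coef(fit) > 0)
      mean(pred == y[test])     # typically well above 0.5, on pure noise

      ## Honest: selection, PCA and regression all see the training half only
      sel2  <- order(abs(tstat(x[train, ], y[train])), decreasing = TRUE)[1:50]
      pca2  <- prcomp(x[train, sel2], center = TRUE, scale. = TRUE)
      fit2  <- glm(y[train] ~ pca2$x[, 1:3], family = binomial)
      newpc <- predict(pca2, newdata = x[test, sel2])[, 1:3]
      pred2 <- as.numeric(cbind(1, newpc) %*% coef(fit2) > 0)
      mean(pred2 == y[test])    # hovers around 0.5, as it should
      ```

      The gap between the two accuracy figures is the optimism that comes from letting held-out samples leak into the model-building steps. This sketch leaks through gene selection, the simplest way to show the effect; the ovca07.pdf excerpt above describes leakage through the centering, scaling, and singular value decomposition step instead.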

      The 60 Minutes episode this weekend will be interesting, but I really hope that people stop merely focusing on one guy who shamelessly lied on grant and job applications and start focusing on the bigger picture – an analytical paradigm that is guaranteed to find something interesting-looking in just about any large data set. This is why Ressci is correct – the issue is data analysis and interpretation, and it is more than just errors in spreadsheets.

      1. Thanks Steven for your expert opinion. This is what I was thinking all along as this case has unfolded. I have probably raised this issue earlier on another posting on Potti. I clearly understand the details of the analysis – we do microarray studies as well. One can easily get “signature genes” depending on the way the analysis is done. If you look at the co-authors’ expertise, you can easily identify the people who would have performed the bioinformatics and statistics. What about the genome-wide association studies? Are they clean?

      2. I agree, the modern analytical paradigm can extract anything out of a data set and few are qualified to determine if this is a central issue of multiple hypothesis testing, is well grounded in statistical rigor, or is all smoke.

      3. I tend to agree with you that here Dr. Anil Potti has become the fall guy. He did misrepresent some facts on his CV so it is possible that misrepresentation went into the publications too. However as a biologist with little knowledge of stats and bioinformatics it is hard for me to completely follow even what you have tried to explain. I think it would have been the same for Dr. Potti who was primarily a clinician and possibly did not have a deep understanding of the data processing. He was likely to have relied on the information he got from the bioinformatics and statistics guys and therefore they were possibly as much involved in the mistakes as Dr. Potti, unless he gave them wrong data to analyze or misrepresented their analysis. The former seems unlikely as the people from MD Anderson found errors in the data made available, and in the latter case, why did the people responsible for data analysis not see and object earlier?

        Why are no questions being raised about the people who conducted an earlier inquiry and found no evidence of anything being wrong? Doesn’t it imply that they were acting in collusion and are thereby equally responsible for the sufferings of the patients?

      4. Dr. McKinney,
        You write:
        “An important principle in statistics is that a model fitted to one set of data should yield reasonably accurate predictions if that same model is assessed on additional sets of data that were never used in any initial model fitting exercises.”
        And you follow this up with two posts about how “this principle was repeatedly violated in analysis after analysis”.
        Unfortunately, the “principle” that you dogmatically assert, while it sounds reasonable, is completely incorrect. In fact, there is a vast literature now on how to use the “validation” set to make better classifiers by refining the decision boundary. Of course, it is not proper to look at the class labels, but nobody is suggesting that that’s what’s going on in this case.
        As an example, let’s say we do a blind experiment where you provide me the validation data, but hold back the class labels. I can build a model and give you predictions that you compare with your class labels. What difference does it make how I have used your validation data? If you have hidden the class labels from me, isn’t the point of the exercise that I can use your validation data to make a prediction?
        You have confused the validation data set with the class labels. As a statistician, you should know this, so I have to conclude that you must have ulterior motives. While I’m not here to defend the Potti work, your post is irresponsible and potentially libelous. Shame on you.

      5. @Bad Horse

        Oh, my! An expert with an opinion and such an interesting name.

        My PubMed search of (Bad Horse[Author]) yielded “No items found.” Interesting.

        If you had bothered to read the Baggerly and Coombes reports, which I quote right there… look up there… see? right there!!! I’ll quote it again, right here >>> “(c) Rows of the selected submatrix are centered and scaled, and a singular value decomposition is applied to the result. Note that these steps are applied to both the training and the test data at the same time.” <<< Here, horsie, can I lead you right to the water??? Please re-read the vast literature on "validation". Good sources will tell you that your validation data should NOT be used in ANY training exercise model fitting.

        Bad Horse indeed.

      6. Dr. McKinney,

        It is not in dispute that Baggerly and Coombes reported that Potti has combined his training and validation sets. Also, as far as I am aware, Potti has never denied that this was the case. So nobody is arguing with you on this point. That you continue to beat this dead horse makes this horse MAD!

        Where you are wrong is the idea that the validation data must be kept strictly separate from the training data. Just because you call it a statistical principle doesn’t make it one. Yes, it’s a no-no to look at the answers. Only a naughty boy or girl would do that. But to look at the available data? That’s just smart statistics, and not at all indicative of someone who is not a “good source.” As an example, you should be aware of:

        “Supervised learning from incomplete data via an EM approach”. Pay special attention to Section 3: Learning from incomplete data, equations 12-13, which provides a ready-made formula for how to mix your “validation” data into your model. Lest you think that this is not a “good source,” it was published in NIPS 1994 by Michael Jordan. No, not that one, the one that is a member of the NAS, NAE, and was recently elected to the AAAS. The baddest horse of them all.

        Maybe you could argue that Potti’s works were missing a validation with a held out or blinded test set. That might be a fair criticism. But the statistical methods that Potti abused were not at fault. There’s nothing in the methodology that prevents it from being used in a rigorous blinded study. To blame the statistics as somehow being wrong because it was used improperly is just plain irresponsible.

        1. Bad Horse, your condescending tone is really not appreciated. You keep hiding behind a pseudonym, so we can’t even judge whether you have a reason to feel superior to the rest of us.

          In fact, the whole Bad Horse persona seems juvenile….

      7. I agree, Dr. Steen, the Bad Horse persona and the schoolyard bullying techniques are juvenile. Perhaps I’ll just call him/her “Larry”.

        Larry – I have never beaten a horse, dead or alive. Once again, pointing a finger at Potti is a scape-goat paradigm distraction. Potti did not take a software suite that performed proper validation set holdout and subsequent honest error-rate evaluation, and then re-jigger it to combine the training and validation sets. Potti was provided with an analytic suite that combined training and validation data, and subsequently performed misleading error-rate evaluation on that same validation data, data indeed used to fit a portion of the model under evaluation. I have not seen any statistical study demonstrating that model error rates estimated using data involved in the model building are equivalent to model error rates estimated using data that was not used in any way whatsoever in the model building. Perhaps you can provide such a reference. I provide the following, from the book “The Elements of Statistical Learning” by Hastie, Tibshirani and Friedman, statisticians at Stanford who in my estimation are some of the “baddest”, though definitely not horses. Pay special attention to chapter 7, section 7.2 to these statistical principles called so not by me, but by statisticians so good that they have positions at Stanford:

        “It is important to note that there are in fact two separate goals that we might have in mind:
        [Model selection: estimating the performance of different models in order to choose the (approximate) best one.]
        [Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.]
        If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts; a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a ‘vault,’ and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with the smallest test set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.”

        The substantial underestimation of true test error rates was demonstrated in the analyses performed by Baggerly and Coombes, so it remains unclear just how “smart” these statistics are. It’s difficult to argue that we are not in a data-rich situation.
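
        For concreteness, a minimal sketch in R of the three-way protocol described in that passage, with the test set coming out of the “vault” exactly once. The data, model family, and candidate sizes are entirely simulated and have nothing to do with the Duke studies:

        ```r
        # Simulated sketch of the train / validation / test protocol quoted above;
        # everything here is invented for illustration.
        set.seed(2)
        n <- 300; p <- 20
        x <- matrix(rnorm(n * p), nrow = n)
        y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))   # signal in the first two columns
        grp <- sample(rep(c("train", "valid", "test"), each = n / 3))

        fit_k <- function(k)                         # candidate models of growing size
          glm(y[grp == "train"] ~ x[grp == "train", 1:k, drop = FALSE],
              family = binomial)
        err_k <- function(fit, k, part) {            # misclassification rate on one part
          score <- cbind(1, x[grp == part, 1:k, drop = FALSE]) %*% coef(fit)
          mean((score > 0) != y[grp == part])
        }

        # Model selection: candidates are compared on the validation set only
        valid_err <- sapply(1:6, function(k) err_k(fit_k(k), k, "valid"))
        k_best    <- which.min(valid_err)

        # Model assessment: the test set comes out of the "vault" exactly once,
        # after the chosen model is locked down
        err_k(fit_k(k_best), k_best, "test")
        ```

        Reporting the winning validation error as the headline figure would be exactly the optimism the passage warns about; only the final test-set evaluation gives an honest estimate.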

        You yourself correctly state “Maybe you could argue that Potti’s works were missing a validation with a held out or blinded test set. That might be a fair criticism.” This is all I ever have been asking for – an appropriate assessment of this methodology’s error rates on held-out test sets – documentable held-out test sets, with full disclosure of computer code, not the kind of “blinded” study retracted from the Lancet. “There’s nothing in the methodology that prevents it from being used in a rigorous blinded study.” So why doesn’t anyone so invested in this methodology do so?

        I have never stated that the statistics “are wrong”. Look up there – right up there – I’ll say it again right here >>> “These retracted papers (and many other unretracted papers) are based on as-yet unproven assertions.” <<<

        In the twelve years that this methodology has been pushed out, surely with all the money available at Duke someone could have run the simulation studies on random data and on structured data to evaluate the model error rates under both scenarios. This is the problem. I don't know if the methodology and its published error rates are right or wrong, because these important assessment studies have not been performed and published, with the exception of the assessment performed by Baggerly and Coombes. Without such published and verified evaluation studies, using this methodology in medical studies involving humans is irresponsible and unethical.

        I will not stop discussing this issue until proper evaluations are provided, and failing that, until papers using this and any other unproven methodology stop appearing in the scientific literature.

        PS: Michael Jordan's paper has no discussion of training and validation set methods – just discussion of how to run a particular frequentist-based (kooky, I know) supervised learning method when the available data has missing values. How this relates to the Duke fiasco is still a bit of a mystery to me. I don't see discussion of missing data from gene chips, indeed I have a hard time finding the word "missing" in any of the papers. All I could find was "Missing values for grade were assigned to a separate category to avoid a decrease in the sample size in the logistic regression analysis." in the retracted Lancet paper. Why didn't they use Michael Jordan's method? It would have been a slam dunk.

        If I was a reviewer for Jordan's paper (a step not necessary to have a paper published in a book of conference proceedings as Jordan's paper was) I would have requested that he amend the sentence in section 4.1 "For a given input, SLSE picks the Gaussian with the highest posterior . . ." Issue one: the highest posterior ?? Issue two: posterior “entities” such as posterior distributions and posterior probabilities are found in Bayesian-based methods, not frequentist-based methods. I know, kooky . . .

    2. Exactly and perfectly correct. POTTI DID NOT DO THE CRAPPY INCOMPETENT DATA MANIPULATION. HE IS A PHYSICIAN. Who did the screw-up?

  8. Excellent point Chris. It now seems to be acceptable science to cut corners, fiddle data, and publish ‘bullcrap’, and then go back into the literature and correct “this and that” later on. Often the cutting corners and fiddling data are associated because in a weak experimental system judicious removal / massaging of data points can completely change the result. As R. Grant Steen says, such papers would simply be unacceptable if the data were honestly presented at the outset.

  9. Dr. McKinney,

    Thank you so much for writing this thorough explanation. I believe you are completely correct. The misapplication of statistical validation techniques has much greater scope than the Potti papers. Unfortunately Duke has only reviewed papers that include Dr Potti as an author (at least at the time of the last IOM meeting). The co-authors are the people who are doing the review and it is not in their best interest to find any significant error that was not attributable to Dr. Potti. If you only review papers that are authored by someone who has already been caught, and the reviewers are the co-authors, then I don’t think it is surprising that all reported errors are attributed to Dr. Potti. It does not take a statistician to see that is a biased experiment.

    Dr. Potti is an M.D.; he does not know R, and all of the signatures in the clinical trials were programmed in R. That was done by a bioinformatician. It is clear from the external review of the signatures (PAF3) that a statistician was involved. The independent reviewers stated, “Only by examining the R code from Barry were we able to uncover the true methods used and thus we were able to replicate the approach independently…”

    It takes a team to run clinical trials. Dr. Potti was briefly on the cover page of one trial and never on the cover page of the others. This can be seen in the protocols released by the IOM.

    1. So if the same statisticians and bionformatics people are involved in the corrected paper – are “revised” publications trustworthy?

      1. Exactly. And, was Potti’s name removed from these two “corrected papers”, acknowledging/asserting/alleging that his faulty work is no longer a contribution to these findings? Or is he retained as an author??

    2. Again, part of the problem is the greater and greater reliance on specialized knowledge that only a minor subset of authors understand but which composes the key fulcrum upon which the central conclusions rest. If you do not understand R, you automatically do not understand the signatures. I see this kind of problem becoming more and more common; it is inevitable. Welcome to the future.

    3. I don’t know of a rule that says M.D.’s are not allowed to study and use R or know about statistics (or prepare data-sets to be used by an algorithm that they did not write). I do not assume they are too stupid to do it either, since I know of counter-examples. I am not saying I think Potti was likely the author of the original versions of the code, or that he invented the trick to extract principal components from the combined data set rather than the training data only (though that is not impossible).

      Horse’s observation is interesting. No need to tone-troll. I admit I would never use the test set in any way except to test, and am used to it not even being available when building models, in the most rigorous studies. That doing so gives smaller estimated error rates does not seem like a proof that they are biased, but it does seem like cheating.

      1. Also @SteveM

        As to the “trick” of using PCA between training/test sets, I think we can be almost entirely sure that he didn’t come up with this method or the idea of combining it with logistic regression. There were other authors on the Nature Medicine paper who do have this expertise.

        It really comes down to whether “normalizing” or “batch correcting” genomic data between training and test sets increases predictive power of the statistical procedure. If so, it should be used because that means that a certain percentage of patient cases will be more accurately predicted. It goes without saying that it’s not valid to compare error rates between statistical procedures that normalize between the input variables of training/test sets and those that don’t.

        But there are many “real world” situations in clinical bioinformatics where normalization/batch correction is absolutely called for. One recent example that we have had to deal with: Affymetrix recently changed their prep kits for RNA extraction to a new technology, and the new tech (though more efficient) causes substantial differences in the readout from microarrays, though it may in fact be more accurate. Given these differences, we are faced with the necessity of batch correction to account for the differences in sample prep.

        Some batch-correction methods are reviewed here (we use ComBat):
        http://www.plosone.org/article/info:doi/10.1371/journal.pone.0017238#pone.0017238.s008
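
        (For readers new to this, a minimal, location-only sketch of batch adjustment, namely per-gene mean-centering within each batch, follows. It is not ComBat itself, which additionally borrows strength across genes to stabilize per-batch location and scale estimates; the matrix and batch names are invented for illustration.)

        ```r
        # Per-gene mean-centering within each batch: the simplest batch adjustment.
        # 'expr' is a hypothetical genes x samples matrix and 'batch' labels each
        # sample's batch; both names are made up for this sketch.
        center_by_batch <- function(expr, batch) {
          stopifnot(ncol(expr) == length(batch))
          for (b in unique(batch)) {
            cols <- which(batch == b)
            expr[, cols] <- expr[, cols] - rowMeans(expr[, cols, drop = FALSE])
          }
          expr
        }

        # Toy usage: the "new prep" batch carries a constant shift
        set.seed(3)
        expr  <- matrix(rnorm(20 * 10), nrow = 20)   # 20 genes x 10 samples
        batch <- rep(c("old_prep", "new_prep"), each = 5)
        expr[, batch == "new_prep"] <- expr[, batch == "new_prep"] + 2
        adj <- center_by_batch(expr, batch)
        tapply(colMeans(adj), batch, mean)           # both batch means now at zero
        ```

        ComBat, from the Bioconductor sva package, is the empirical-Bayes refinement of this idea, pooling information across genes so that small batches do not get over-corrected.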

        The far bigger issue for me in the Potti case is not how Principal Components Analysis was used (which is legitimate in my opinion, although I would agree it needed to be more thoroughly vetted before being used in clinical trials), but rather that in-vitro drug signatures were apparently being used to guide clinical trials without extensive pre-clinical testing (ie mouse models).

        BTW, any opinions that I am expressing are entirely my own and I certainly wouldn’t try to speak for any of my colleagues at my institution or for my institution in general.

      2. nci_researcher,
        I think McKinney below (Feb 28 and 29, and me too, but shorter) makes it more clear. In my words it’s like this:
        We readers want to know how well a classifier does on a closed-box set of data, since this is the situation faced in the real world except for fairly artificial special problems. Readers expected that was what was being estimated and reported. Compared to what we wanted (McKinney calls them “true”, and for a completely closed-box test set, I guess they would be “true”), the estimated error rates are optimistic. I’m ready to learn better if that’s not quite right.

        I think the bigger problem is when you can’t tell exactly what happened. When that’s the case nobody knows whether it is the calculation they wanted or not. I don’t think it’s that hard to make it clear. I don’t think it’s this paper but in one of them principal components are explained in gothic detail and as fancy words as possible – which we knew already if we cared, or could learn from Wikipedia. It felt like just showing off cool equations, perhaps to lull the reader. But when we finally get to the part where we really wanted to know what was done, and can’t learn that from any other source, it was very short, and impenetrable.
        -Rork Kuick.

        PS: as I tried to say further down, the batch modeling (or correction) most folks do on Affy arrays probably isn’t “real-world”, where I am thinking of actual clinical use. We mean-centered data from different institutions in the Shedden Lung paper (PMID:18641660) for example, but if patients were walking in the door one at a time we couldn’t do that. I say we permit ourselves and others this “cheat” because we are sympathetic that the assays used aren’t that hardened for field use, and are imagining data from better assays that would actually not display such batch effects. So it’s “if our assays were a bit more robust like those we imagine being actually used in clinical decision making, this is how well we would do approximately”. Maybe your situation is different than the one I’m familiar with though.

      3. rork, I definitely understand the “closed box” paradigm for clinical studies. Clinical trials need to be run so that people can compare results between studies, and that becomes a problem if some study is using a statistical procedure that uses genomic data from the test set (but I assume not the case labels). I was just pointing out that statistical procedures that use genomic data from the test set have error rates that are not necessarily “artificially inflated”, but in reality might just be higher performing and producing more accurate results. But given the conservative nature of statistics in clinical trials, it seems unlikely that the “closed box” paradigm will go away any time soon, even if it would benefit patients. The same thing could be said about the use of Bayesian statistics in clinical trials – it might be a better way to do things, but people are resistant to it because it’s new.

        > I don’t think it’s this paper but in one of them principal components are explained in gothic detail and as fancy words as possible
        Not sure if you mean the Metagene Projection paper from PNAS? If so, that’s MIT for you. They like to use big words whenever possible, including and especially when they don’t need to. 😉 I know because I used to be there. It’s actually very similar to the technique from the Duke group except it uses Non-Negative-Matrix factorization. They claim it is “complementary” to the Duke technique to try to make it seem unique and useful to the paper referees and editors so that they would seem worthy of a PNAS publication.

        If you meant the COXEN paper, then I’ll just say that I didn’t have any trouble writing R code to implement it. I didn’t find the explanation of the algorithm difficult to understand, although it was hidden in the supplementary data and I think possibly removed when they got the patent issued.

        I guess our real-world example of batch correction is having to deal with microarray data from the TCGA project that comes from a 96-well AFFX plate that produces some strange looking histograms – definitely not normally distributed, and mean-centering wouldn’t allow us to use it. To reconcile the TCGA microarray data with our own (so that we could use the TCGA classification scheme for our cancer type), we needed to batch correct it. Same thing for the new AFFX RNA prep method – mean centering doesn’t work well enough to compensate for it.

      4. nci_researcher

        Could you please explain how the Duke methodology or related Bayesian techniques would work in a clinical trial, or in the clinic in general?

        To date, publications from the Duke group involve amassing a data set, obtaining “metagenes” from the whole data set, splitting the data in half, fitting more models including the metagenes to the first half of the data, then showing how well those final models work on the second half of the data.

        In a clinical trial, what is the equivalent of the second half of the data? Do you have to wait until you have dozens of patients so you can recalculate the metagenes and refit the subsequent models?

        Typically in a clinical setting a new patient walks in the door and needs treatment now. A treatment plan needs to be made now, with patient data not previously available. There is no time to wait for other patients to accrue.

        So how does the demonstration provided by the Duke group in paper after paper work in real life clinical settings or trials, where all data from future patients is not available prior to the patient presentation? What is the training set, and what is the validation set, in the clinical or trial setting?

        My problem with the published papers (those not subject to erroneously aligned input data files etc.) is that the error rates they report were derived from a framework different from the clinical framework in which the methodology will be applied.

      5. No reply here from nci_researcher . . .

        I didn’t think there would be.

        Now the Institute of Medicine has released its report, with clear guidelines stating that model assessment should be performed with a locked-down model on lock-box data never used in the model building process.

        There’s nothing in the report saying Bayesian methods can not be used – there’s no resistance to Bayesian-based methods – just resistance to improperly assessed models whether they are frequentist-based or Bayesian-based or otherwise.

  10. The corrected myeloma paper in JCO mentions that there are no conflicts of interest:

    “AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST: The author(s) indicated no potential conflicts of interest.”

    http://jco.ascopubs.org/content/27/25/4197.full

    In fact, some of the other coauthors do have undisclosed conflicts – which have been pointed out in the past. These conflicts have not been disclosed in the majority of papers published by this group.

    http://onlinelibrary.wiley.com/doi/10.1002/cncr.25917/full

    Waheed S, Shaughnessy JD, van Rhee F, Alsayed Y, Nair B, Anaissie E, Szymonifka J, Hoering A, Crowley J, Barlogie B. International Staging System and Metaphase Cytogenetic Abnormalities in the Era of Gene Expression Profiling Data in Multiple Myeloma Treated With Total Therapy 2 and 3 Protocols. Cancer doi: 10.1002/cncr.25535.

    Questions were raised by a reader about the financial disclosures provided by some of the authors of the above paper. The published disclosures were as follows:

    “Supported in part by Program project grant CA55819 from the National Cancer Institute, Bethesda, Maryland.”

    In response to the concerns raised by the reader, the journal followed up with the authors and asked if there were any additional disclosures that should have been included prior to publication. The authors disclosed the following information:

    Dr. Shaughnessy holds patents, or has submitted patent applications, on the use of gene expression profiling in cancer medicine, use of FISH for chromosome 1q21 as a cancer diagnostic, and targeting DKK1 as a cancer and bone anabolic therapy. Dr. Shaughnessy receives royalties related to patent licenses from Genzyme and Novartis. He has advised Celgene, Genzyme, Millennium, and Novartis, and has received speaking honoraria from Celgene, Array BioPharma, Centocor Ortho Biotech, Genzyme, Millennium, and Novartis.

    Dr. Barlogie has received research funding from Celgene and Novartis. He is a consultant to Celgene and Genzyme, and has received speaking honoraria from Celgene and Millennium. Dr. Barlogie is a co-inventor on patents and patent applications related to use of gene expression profiling in cancer medicine.

    The editors of the journal have reviewed this additional information and determined that although the disclosures are relevant to the publication, they would not have altered the decision to accept the paper. We apologize for the lack of transparency caused by the authors’ failure to disclose this information.

  11. http://www.inforum.com/event/article/id/350629/

    Published February 12, 2012, 11:53 PM
    Disgraced cancer researcher with ties to UND, Fargo subject of ’60 Minutes’ report

    By: Patrick Springer, INFORUM

    [Photo: Dr. Anil Potti. Credit: Duke Photography]

    FARGO – A cancer researcher with ties to the University of North Dakota School of Medicine was the subject of a CBS “60 Minutes” report called “Deception at Duke.”

    Dr. Anil Potti has resigned from Duke University and faces an investigation for scientific misconduct following conclusions that he “manipulated data” involving what once appeared to be breakthrough cancer research, according to the CBS report, which aired Sunday.

    At Duke, Potti was regarded by colleagues as a rising star with a reputation for modesty and diligence. He was born in India and received a college degree in his native country in 1995.

    His specialty was developing personalized treatments for patients with lung cancer.

    “Very bright, very smart individual, very capable,” Dr. Joseph Nevins, who directed a lab at Duke and chose Potti as a protégé, told “60 Minutes.” “He was a very close colleague to many, many people.”

    Other cancer researchers, including some at the National Cancer Institute, spotted problems with Potti’s research data. Acting on those suspicions, Nevins investigated and concluded that problems with Potti’s data were not the result of error, but of deliberate fabrication, according to the report by Scott Pelley.

    According to his biography, Potti received training in internal medicine at UND’s Fargo campus. He then served as an assistant professor at UND for three years before receiving a research fellowship at Duke starting in 2003.

    In 2000, while at UND, Potti received a $72,500 research grant from Dakota Medical Foundation to help him hunt for a gene implicated in aggressive breast cancers that spread to other parts of the body.

    For that one-year project, he planned to recruit 600 cancer patients from Fargo-Moorhead, Grand Forks and Bismarck.

    In 2003, before leaving for Duke, Potti and a fellow researcher at UND were recipients of the Arnold P. Gold Foundation’s Humanism in Medicine Award at UND.

    The award is given to recognize compassion and sensitivity in the delivery of care to patients and their families.

    Duke suspended Potti’s research trials. Nine patients have filed suit. Potti, now working as a cancer doctor in South Carolina, told CBS he was “not aware that false or ‘improper’ information had been included” in his research.

    A UND spokesman was not immediately available for comment Sunday night.


        1. http://www.cbsnews.com/8301-18560_162-57376073/deception-at-duke/

          …..Scott Pelley: Was the idea here that this would change the way we thought about treating cancer?

          (Dr. Rob) Califf: Well, you’ve never seen such excitement at an institution, and it’s understandable.

          It wasn’t just Duke that was excited. A hundred and twelve patients signed up for the revolutionary therapy. Hope was fading for Juliet Jacobs when she learned about it. She had Stage IV lung cancer. And this would be her last chance.

          Walter Jacobs: She was my best friend, but that’s kind of cliche. She’s, she’s somebody who after 49 and a half years, I was still madly in love with.

          She and her husband Walter were looking into experimental treatments. They had to choose carefully because there was only time for one.

          Scott Pelley: When you met Dr. Potti, what did you think?

          Jacobs: We felt that he was going to give us a chance. He was… He was very encouraging.

          For a patient with no time, Dr. Potti’s research promised the right drug, right now.

          Pelley: Fair to say Potti was a rising star at Duke? ……

  12. http://www.ncbi.nlm.nih.gov/pubmed/advanced for PubMed Advanced Search Builder

    The search Potti, Anil[Author] yields 112 results

    RETRACTIONS – ONLY 5 OF 112.

    Potti, Anil[Full Author Name] AND Retraction[Title] Results: 5

    1.Retraction: Acharya CR, et al. Gene expression signatures, clinicopathological features, and individualized therapy in breast cancer. JAMA. 2008;299(13):1574-1587.

    Acharya CR, Hsu DS, Anders CK, Anguiano A, Salter KH, Walters KS, Redman RC, Tuchman SA, Moylan CA, Mukherjee S, Barry WT, Dressman HK, Ginsburg GS, Marcom KP, Garman KS, Lyman GH, Nevins JR, Potti A.

    JAMA. 2012 Feb 1;307(5):453. Epub 2012 Jan 6. No abstract available.

    PMID:22228686[PubMed – indexed for MEDLINE]

    2. Proc Natl Acad Sci U S A. 2011 Oct 18;108(42):17569. Epub 2011 Oct 3.
    Retraction for Garman et al: A genomic approach to colon cancer risk stratification yields biologic insights into therapeutic opportunities.
    Garman KS, Acharya CR, Edelman E, Grade M, Gaedcke J, Sud S, Barry W, Diehl AM, Provenzale D, Ginsburg GS, Ghadimi BM, Ried T, Nevins JR, Mukherjee S, Hsu D, Potti A.
    Retraction of
    Garman KS, Acharya CR, Edelman E, Grade M, Gaedcke J, Sud S, Barry W, Diehl AM, Provenzale D, Ginsburg GS, Ghadimi BM, Ried T, Nevins JR, Mukherjee S, Hsu D, Potti A. Proc Natl Acad Sci U S A. 2008 Dec 9;105(49):19432-7.
    PMID:21969600[PubMed – indexed for MEDLINE] PMCID: PMC3198325

    3. N Engl J Med. 2011 Mar 24;364(12):1176. Epub 2011 Mar 2.
    Retraction: A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med 2006;355:570-80.
    Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, Koontz J, Kratzke R, Watson MA, Kelley M, Ginsburg GS, West M, Harpole DH Jr, Nevins JR.
    Source: Chapel Hill, NC, USA.

    Abstract
    To the Editor: We would like to retract our article, “A Genomic Strategy to Refine Prognosis in Early-Stage Non-Small-Cell Lung Cancer,”(1) which was published in the Journal on August 10, 2006. Using a sample set from a study by the American College of Surgeons Oncology Group (ACOSOG) and a collection of samples from a study by the Cancer and Leukemia Group B (CALGB), we have tried and failed to reproduce results supporting the validation of the lung metagene model described in the article. We deeply regret the effect of this action on the work of other investigators.

    Retraction of
    Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, Koontz J, Kratzke R, Watson MA, Kelley M, Ginsburg GS, West M, Harpole DH Jr, Nevins JR. N Engl J Med. 2006 Aug 10;355(6):570-80.
    PMID:21366430[PubMed – indexed for MEDLINE] Free Article

    4. Retraction–Validation of gene signatures that predict the response of breast cancer to neoadjuvant chemotherapy: a substudy of the EORTC 10994/BIG 00-01 clinical trial.

    Bonnefoi H, Potti A, Delorenzi M, Mauriac L, Campone M, Tubiana-Hulin M, Petit T, Rouanet P, Jassem J, Blot E, Becette V, Farmer P, André S, Acharya CR, Mukherjee S, Cameron D, Bergh J, Nevins JR, Iggo RD.

    Lancet Oncol. 2011 Feb;12(2):116. No abstract available.

    PMID:21277543[PubMed – indexed for MEDLINE]

    5. Retraction: Genomic signatures to guide the use of chemotherapeutics.

    Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J, Cottrill H, Kelley MJ, Petersen R, Harpole D, Marks J, Berchuck A, Ginsburg GS, Febbo P, Lancaster J, Nevins JR.

    Nat Med. 2011 Jan;17(1):135. No abstract available.

    PMID:21217686[PubMed – indexed for MEDLINE]

  13. http://www2.nbc17.com/news/2012/feb/10/nbc-17-investigates-fda-findings-duke-audit-ar-1917827/

    NBC-17 Investigates FDA findings in Duke audit

    By: Charlotte Huffman | NBC17.com 1205 Front St., Raleigh, NC 27609, 919-836-1717

    Published: February 10, 2012 Updated: February 10, 2012 – 10:51 PM

    NBC-17 Investigates has uncovered the findings of the FDA’s year-long audit of Duke University Health System.

    The U.S. Food and Drug Administration began inspecting Duke’s Institutional Review Board after a former Duke researcher and oncologist, Dr. Anil Potti, was forced to resign.

    Before his resignation, Potti admitted he embellished his resume to obtain a grant from the American Cancer Society to help fund clinical trials.

    The trials were first made public in 2006, when Duke Medicine’s News and Communications Office released a public statement saying the model developed by Duke researchers, including Potti, had “promising results” and that they had “initiated a landmark multi-center clinical trial.”

    The trials, conducted at Duke between 2007 and 2010, assigned patients with early-stage non-small cell lung cancer to treatments based on now-discredited gene expression patterns that Potti and his mentor, Dr. Joseph Nevins, claimed they had identified in tumor cells.

    A dozen cancer patients, and the families of patients who are no longer living, have filed a lawsuit against Duke University, Duke University Health System, administrators, researchers and physicians, alleging that they knew or should have known about Potti’s questionable research.

    In the 73-page complaint filed in Durham Superior Court, the plaintiffs allege that the defendants allowed the cancer trials to continue while ignoring public warnings from outsiders. Those warnings included alerts from two scientists at M.D. Anderson Cancer Center in Houston that the underlying research behind the trials was faulty and could put patients at risk by exposing them to ineffective or dangerous treatments.

    In addition, the complaint alleges that in 2006, an investigator from the National Cancer Institute alerted Duke, Nevins and Potti to flawed science published in the 2006 article.

    In 2010, Duke canceled the trials and informed participants that the treatments they received may have done little or nothing to stop tumor growth.

    In January 2011 the FDA began inspecting the Institutional Review Board (IRB) at Duke University Health System. According to the audit, Duke’s IRB has the authority to approve, disapprove or require modification to research studies as well as suspend or discontinue research if the IRB believes the study is not being conducted according to requirements or if “human subjects are questioned or protocol compliance is an issue.”

    An FDA representative, Erica Jefferson, told NBC-17 that models like the one being used in the clinical trials require FDA approval of an Investigational Device Exemption (IDE) prior to beginning a clinical trial.

    “In general, clinical trials of in vitro diagnostics that direct patient treatment for serious disease as part of the trial, such as the one at issue here, are significant risk studies that require FDA approval of an IDE,” said Jefferson.

    Duke neither obtained an IDE nor submitted pre-clinical data in an investigational application before beginning the trials, as the FDA requires.

    The audit found “no significant deficiencies” in IRB conduct.

    “However, it was not possible for FDA to assess the allegations made by MD Anderson about the algorithm supporting Dr. Potti’s genetic profiling test. If the required IDE had been submitted, FDA could have requested and reviewed this data before the study began,” Jefferson told NBC-17.

    During the on-site audit, FDA investigators interviewed Dr. John Falletta, Senior Chair for Duke IRB.

    According to the audit, Falletta told the FDA “the IRB now realizes that it was probably wrong to assume everything was OK to proceed.”

    In regard to the model that was central to the clinical trials, Falletta told the FDA, “The IRB realizes now that the device does pose significant risk and that an IDE should have been filed.”

    Currently, all three clinical studies have been closed.

    Duke declined NBC-17 requests for an on-camera interview and did not answer the specific questions submitted via email but sent NBC-17 a statement saying that Duke has fully cooperated with the FDA and “…we have identified this as an area in which we will make significant improvements moving forward.”

    For several weeks, NBC-17 has reached out to the attorney representing the cancer patients, but requests to interview them have not yet been granted.

    1. Yes, it was OK and superficially accurate.

      They made one important mistake. They continue to use the INCORRECT term “data fabrication” or “data falsification”. This is an issue with data handling.

      After the program, I checked with Keith Baggerly. He is in general agreement with the notion that this is NOT an issue with “data fabrication”.

      1. Paul Thompson:

        “OK and superficially accurate”? “Mistake”? “INCORRECT”? I am amazed you can say any of those things with any degree of conviction!

        The program has been produced by people who are bright and responsible – and certainly worried about the legal implications of accusing somebody of fraud.

        Also, I don’t think dropping the Baggerly name makes your statement any more credible.

      2. Why would the lab head actually agree with the “data fabrication” accusation and say that the evidence was pretty clearcut?

  14. Yes, I watched the 60 minutes piece. Here’s something more to consider:

    Health, Stem Cells, and Technology
    Updates on health care, including stem cells, regenerative medicine, systems biology, skin care, age management, and new technologies

    http://healthstemcellstechnology.blogspot.com/2012/02/university-of-texas-scientists-expose.html?showComment=1329121962758#c596944970019017804

    Sunday, February 12, 2012

    University Of Texas Scientists Expose Fraud By Physicians At Duke University School Of Medicine

    Kudos to Dr. Kevin R Coombes, Ph.D., Dr. Jing Wang, Ph.D., and Dr. Keith A Baggerly, Ph.D. for exposing the fraudulent work of Anil Potti, M.B.B.S and Joe Nevins, M.D., two Duke University physicians who falsified their data in the area of personalized medicine for cancer indications.

    The Nevins and Potti team had emerged as false pioneers of personalized medicine in 2006, when Nature Medicine published their paper claiming that microarray analysis of patient tumors could be used to predict response to chemotherapy.

    However, two biostatisticians at the MD Anderson Cancer Center attempted to verify this work when oncologists asked whether microarray analysis could be used in the clinic. Keith Baggerly and Kevin Coombes, the statisticians, found a series of errors, including mislabeling and an “off-by-one” error, where gene probe identifiers were mismatched with the names of genes.

    Also, Paul Goldberg of The Cancer Letter reported on an investigation into Duke cancer researcher Anil Potti, and claims made that he was a Rhodes Scholar – in Australia. Potti was not a Rhodes scholar, and the misrepresentation was made on grant applications to NIH and the American Cancer Society.

    Further, Anil Potti is apparently not an M.D., but instead possesses a bachelors degree, an M.B.B.S. degree from India. Therefore Anil Potti is a phsyician (sic), but not a doctor. Society must understand the difference between doctors and physicians..they are not the same. Doctors have obtained the highest degree possible and use their mind to think through problems, as such doctors can be either Ph.D.s or M.D.s. Not all physicians have obtained an M.D., and therefore should not be called doctors. Anil Potti was not a doctor, rather he was simply a bachelors degree trained physician, another words he was a technician.

    Our system must not pay observance to poorly trained physicians masquerading as doctors.

    Posted by Dr. Greg Maguire at 7:44 PM

    ________

    Dr. Greg Maguire
    San Diego, California, United States
    Dr. Maguire has authored over 100 publications, was an NIH Fogarty Fellow and medical school professor, has numerous patents, founded two biotech companies and two non-profit organizations to support stem cells and neuroscience. He is Co-Founder and CEO of BioRegenerative Sciences, Inc. of San Diego, CA. USA email: [email protected]

    1. Dr. Greg Maguire’s comments “Further, Anil Potti is apparently not an MD …. physicians masquerading as doctors” stem from his lack of knowledge about the education systems in other countries. If only he had bothered to check out “In India, Britain, Ireland, and many Commonwealth nations, the medical degree is instead the MBBS i.e. Bachelor of Medicine, Bachelor of Surgery (MBChB, BM BCh, MB BCh, MBBS, BMBS, BMed, BM) and is considered equivalent to the MD and DO degrees in the U.S. system” (http://en.wikipedia.org/wiki/Doctor_of_Medicine – yes, even though wiki may not be an authority on the subject, it can give you basic information). In addition, if Dr. Anil’s CV is correct, then he did complete the residency program and also a fellowship program as per the US system and was certified by the US medical board. So to say that he was a poorly trained physician masquerading as a doctor is incorrect.

      In India, 4 years of undergraduate college is NOT required for admission to the Medical degree program. Students are eligible to apply and seek admission to the medical degree program after completing 10+2 years of school education and hence it is not surprising that Dr. Potti had no ‘graduate’ training.

      Before spreading information, it would have been appropriate for Gene Nelson to have verified its authenticity. If someone has made comments without any real knowledge of the subject, then reproducing them serves no purpose.

      1. I’ve worked overseas in those systems and I know the MBBS is not equivalent to the M.D. degree. Further, I have colleagues from Ireland and India who are physicians in the States but do not call themselves doctor, and, as an example, the UC system does not consider those with an MBBS to be doctors. A number of my colleagues with an MBBS physician degree have been designated doctor, but only based on their also having attained a Ph.D. degree. Yes, Wiki provides basic info, but it is sometimes basically wrong.

      2. Drgregmaguire – I don’t see whether one is called ‘doctor’ or ‘physician’ as being a relevant issue here. In several countries, PhDs are not called doctors. Even in the States, when you fill in passenger information, ‘Dr’ is mostly used to designate a medical professional, or what you would call a ‘physician’. My PCP is only an ‘MD’, yet I refer to her as ‘doctor’, and so does the rest of the staff at the medical facility. To the best of my knowledge, based on my interactions with professionals from the medical field, an MBBS from India is considered equivalent to the MD degree in the US.

        The fact is that his medical credentials were considered to be equivalent by the relevant authorities in the US and he completed all the other requisite training (residency and fellowship) as mandated by the US medical system according to the information provided in his CV (unless those are lies and you have some inside information). So to say “…another words he was a technician” is not appropriate.

        It may be true that he is responsible for publications that are not factually correct and that involved research misconduct. When we are pointing fingers at someone for being wrong, we need to make sure that we are correct about what we say.

  15. Anil Potti earned what is called a “doctor” title in India with 4.5 years of study for his bachelor’s degree and 13 months of “residency.” His Residency Application for the University of North Dakota (UND) shows that he had no graduate training. He was born in 1972. Various biographies show (or don’t show) that Anil was a “Rhodes Scholar” in 1995 or 1996. His UND residency began in 1996 … or was it 1997…. and he completed it in 2001.

    1. In India, the MBBS (Bachelor of Medicine and Bachelor of Surgery) is awarded after 4.5 years of medical school and 1 year of internship – and is the equivalent of “MD” here. Three additional years of residency/fellowship result in the award of an “MD” degree – which denotes specialization in a broad or narrow subspecialty. An Indian physician in India could have an MBBS degree, or both an MBBS and an MD.

      It is common (and in my opinion accurate) for Indian physicians who are MBBS and obtain appropriate credentials here to change the “MBBS” to “MD”. However, if they went back to India, it would be inaccurate for them to call themselves “MD”.

      However, I must disagree with Dr Maguire’s statement posted above:

      “Further, Anil Potti is apparently not an M.D., but instead possesses a bachelors degree, an M.B.B.S. degree from India. Therefore Anil Potti is a phsyician (sic), but not a doctor. Society must understand the difference between doctors and physicians..they are not the same. Doctors have obtained the highest degree possible and use their mind to think through problems, as such doctors can be either Ph.D.s or M.D.s. Not all physicians have obtained an M.D., and therefore should not be called doctors. Anil Potti was not a doctor, rather he was simply a bachelors degree trained physician, another words he was a technician.

      Our system must not pay observance to poorly trained physicians masquerading as doctors.”

  16. I think Anil Potti’s work is perfectly fine. His co-authors should be flattered to be associated with his stellar work. Shame on others to put this great scientist down.

    By the way, I would like to somewhat correct the above statement. The sixth word of the first sentence should read ‘is not’ instead of ‘is’. The fifth word of the middle sentence should read ‘infuriated’ instead of ‘flattered’. The penultimate word of the penultimate sentence should read ‘less-than-stellar’ instead of ‘stellar’. The last sentence should be removed and the whole text should be translated to Swahili and back to English via Yiddish. The exact meaning of the final version either should or shouldn’t be treated with a pinch of salt.

  17. Bad Horse is right. It’s not statistically improper to include the test set (input variables) in refining decision boundaries in the training set, as long as the class labels in the test set are not used. This was one criticism from Baggerly and Combs that doesn’t hold water. Here is another paper that does it:

    http://www.ncbi.nlm.nih.gov/pubmed/17666531

    1. There appears to be an epidemic of embarrassment about this analytical paradigm. People are embarrassed to post their names when making statements. I’m rather curious as to why this is the case.

      If “NCI Researcher” really is an NCI researcher, the taxpayers aren’t getting their money’s worth out of that researcher. That’s a shame, because now more than ever, with healthcare costs rising at unsustainable rates, evidence-based scientific reviews of healthcare practices are essential, and federal institutions have a vital role to play. Fortunately, from the documents released during the Institute of Medicine review of the Duke fiasco, I know that far more competent researchers such as Dr. Lisa McShane are also at the NCI, providing taxpayers with excellent biostatistical services and opinions.

      Just because another group employed a similar strategy to that undertaken at Duke does not make both strategies correct.

      Let’s take a look at this latest proffered piece of evidence.

      The link provided directs to a paper titled “A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery” in which the authors present “a generic algorithm we term ‘coexpression extrapolation’ (COXEN)”.

      In the paper the authors state “Detailed descriptions of the COXEN algorithm and its slightly different implementations for those three applications are in Materials and Methods and supporting information (SI) Materials and Methods.”

      My favourite portion of the detailed description reads “Uij and Vij are the correlation coefficients between probes i and j in the NCI-60 and BLA-40, respectively. Then, rc(j) is defined as where and are the mean correlation coefficients of the row k correlation coefficient vectors for the NCI-60 and BLA-40.” Nothing like a detailed description so we can sort out the important details. I was, however, able to find a Word document description in document “journal.pone.0030550.s007.doc” as part of the supplemental materials to a newer 2012 paper very recently published by the COXEN authors titled “Multi-Gene Expression Predictors of Single Drug Responses to Adjuvant Chemotherapy in Ovarian Carcinoma: Predicting Platinum Resistance”.

      The similarities of the Virginia group to the Duke group in their algorithm development and marketing are indeed striking. As the astute “NCI Researcher” points out, the Virginia group also split their data into two subsets, develop part of their model on a “training” subset, another part on the “test” subset, and then assess the predictive capabilities of the model on the “test” subset – a strategy remarkably similar to the Duke strategy now receiving so much attention. Whether the Virginia group really do generate honest error rate estimates on true hold-out data kept in a lock box and never used in any model development remains to be seen, once data and computer code are released by the Virginia group. Very strong claims that the data used in performing the model error rate statistics had never been used in the model fitting are made, as were made by the Duke group – claims that can only be verified by review of the data and computer algorithm code, and claims that contradict the algorithm descriptions.

      Similar also to the Duke situation is the lack of availability of the data and computer code implementing the COXEN algorithm. The Virginia webpage listing COXEN, http://geossdev.med.virginia.edu/research/software/softwareetc.html, also discusses open source Bioconductor code from this group, so they should be able to provide COXEN in the open source R statistical programming language. In 2008 after publishing the paper referenced by “NCI Researcher”, a biomedical company named “Key Genomics” began appearing on the business wire to market the Virginia technology. The listed website “www.keygenomics.com” is now stale. A biomedical start-up marketing company and unpublished algorithm code – again more similarities to the Duke story.

      The Virginia COXEN group even cite Duke papers in their papers – the 2012 PLoS ONE paper described above cites the recently retracted Dressman et al JCO paper twice (references 9 and 20). Such careful attention to detail is something I look for in developers of life-saving biomedical algorithms.

      What is statistically and scientifically improper is to assert that a method has a high accuracy rate, when the published accuracy rates do not come from data held-out in a lockbox and never made available during the model fitting exercises. A model that will be used in medical settings will be used to make decisions for newly diagnosed patients, for whom data was not previously available. This is why such models must be shown to work adequately on data held in a lockbox and never used in any way in the model development. How much water can be held by a criticism from Baggerly and Combs [sic] will become more and more evident as this scientific vetting process continues. To offer as proof of the adequacy of an algorithm published 11 years ago, another algorithm of a similar nature published 3 years ago and not yet vetted, is to misunderstand the nature of statistical rigour and scientific reproducibility. I look forward to more scientific publications by Bad Horse and to appearances by Bad Horse at scientific meetings, perhaps even a collaborative effort by Bad Horse and NCI Researcher, but not, I hope, appearances on Sunday night television shows.
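      To make the lockbox idea concrete, here is a minimal Python sketch using scikit-learn on simulated null data; the split sizes, the 20-gene filter and the classifier are illustrative choices of mine, not anyone's published pipeline. The point is only that every data-dependent step is fit on the development set, and the held-out set is scored exactly once.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Simulated null data standing in for an expression matrix (samples x genes)
# and a binary response; with random labels an honest estimate should sit
# near 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))
y = rng.integers(0, 2, size=120)

# The lockbox is carved off once, before any feature selection or model fitting.
X_dev, X_lockbox, y_dev, y_lockbox = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Every data-dependent step (gene filtering and the classifier) is fit on the
# development set only.
model = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000))
model.fit(X_dev, y_dev)

# One honest look at the held-out data, taken exactly once.
print("lockbox accuracy:", round(model.score(X_lockbox, y_lockbox), 2))
```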

      1. Hi Steve,

        Thanks for your note, and I imagine nobody is using their real names here so that we can have an open and frank discussion without becoming professional flamebait –as nicely illustrated by your somewhat abusive response.

        The COXEN algorithm from the Virginia group is actually very different algorithmically from that used by the Duke group and is not difficult to understand. (FYI, I have no connection to the Duke or Virginia groups or the company you mentioned.) Unlike the previous algorithm, the COXEN algorithm doesn’t use PCA or any factor analysis at all; genes are not collapsed to “metagenes”. Rather, this is a variation on a “correlation of correlations” method sometimes called the Integrative Correlation Coefficient. The Pearson correlation coefficient between each particular gene A in the training set and all other genes in the training set, across all conditions, is first computed. The same procedure is repeated for gene A in the test set. The two vectors of correlation coefficients for gene A in the training set and gene A in the test set are then subjected to another correlation analysis, which in this case appears to be closer to a covariance than a true correlation coefficient. Each “correlation of correlations” for each gene is then assigned a p-value through an empirical cumulative distribution function via a permutation test (i.e., by selecting same-sized training and test sets drawn randomly). The p-value for each gene then describes how gene A is “behaving” (relative to other genes in the signature) in the training set relative to the test set. Genes that are “behaving differently” (those with p-values above threshold) are removed from the signature, and the remaining genes in the signature are then used to train the model using the case labels in the training set. The case labels in the test set are not used; this is essentially a feature selection/normalization procedure.
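        In rough code terms, that kind of correlation-of-correlations filter might look like the following Python/NumPy sketch; the function names, the permutation scheme and the 0.05 cutoff are illustrative assumptions on my part, not the published COXEN implementation.

```python
import numpy as np

def gene_concordance(r_a, r_b):
    """For each gene, correlate its vector of correlations with all other
    genes in one data set against the same vector in the other data set."""
    n_genes = r_a.shape[0]
    scores = np.empty(n_genes)
    for g in range(n_genes):
        x = np.delete(r_a[g], g)          # drop the self-correlation of 1.0
        y = np.delete(r_b[g], g)
        scores[g] = np.corrcoef(x, y)[0, 1]
    return scores

def concordant_gene_mask(train_expr, test_expr, n_perm=200, alpha=0.05, seed=0):
    """Rough correlation-of-correlations gene filter.

    train_expr, test_expr: (n_samples, n_genes) arrays with the same genes in
    the same column order.  Only expression values are used; class labels are
    never touched.  Returns a boolean mask of genes whose co-expression
    pattern is concordant between the two sets.
    """
    rng = np.random.default_rng(seed)
    observed = gene_concordance(np.corrcoef(train_expr, rowvar=False),
                                np.corrcoef(test_expr, rowvar=False))

    # Permutation null: pool all samples, re-split at random into pseudo
    # train/test sets of the original sizes, and recompute the scores.
    pooled = np.vstack([train_expr, test_expr])
    n_train = train_expr.shape[0]
    null = np.empty((n_perm, pooled.shape[1]))
    for p in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        null[p] = gene_concordance(
            np.corrcoef(pooled[idx[:n_train]], rowvar=False),
            np.corrcoef(pooled[idx[n_train:]], rowvar=False))

    # Empirical p-value per gene (one possible reading of the thresholding
    # direction): genes "behaving differently" (large p) are dropped.
    pvals = (null >= observed).mean(axis=0)
    return pvals < alpha
```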

        You can find another PNAS paper with methodology even more similar to that of the Duke group from a recent effort by our colleagues at the Broad Institute:

        http://www.pnas.org/content/104/14/5959

        The idea of normalization and batch-correction between the input variables (in this case microarray data) is becoming critical in modern genomics. We know that even small details, such as which operator processes each microarray, can have significant effects on the readout of global gene expression, and the problem will not go away with RNA-seq and next-gen sequencing, trust me. So while case labels in the test set have to remain in a “lockbox”, genomic data really can’t be. In fact, the most popular algorithm for Affymetrix microarray normalization (RMA) uses quantile normalization on both the training set and the test set simultaneously, and is almost always used this way before any model generation is done; probably thousands of papers have been published using this kind of normalization, which would violate your strict interpretation of keeping the entire test set in a “Lockbox.”
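        To be concrete about what pooled quantile normalization does, here is a rough Python/NumPy sketch; it illustrates the idea only, it is not the RMA implementation itself (RMA also does background correction and probe summarization), and the helper name is made up.

```python
import numpy as np

def quantile_normalize(expr):
    """Plain quantile normalization of a probes-x-arrays expression matrix.

    Every column (array) is forced onto the same target distribution: the
    mean of the sorted columns.  Because that target depends on all of the
    columns supplied, normalizing training and test arrays together lets the
    test data influence the values the model is trained on.
    """
    sorted_cols = np.sort(expr, axis=0)
    reference = sorted_cols.mean(axis=1)          # shared target distribution
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
    normalized = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        normalized[:, j] = reference[ranks[:, j]]
    return normalized

# The pattern under discussion: train and test arrays pooled before normalization.
# pooled = quantile_normalize(np.hstack([train_arrays, test_arrays]))
```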

      2. Hi NCI Researcher,

        “Thanks for your note, and I imagine nobody is using their real names here so that we can have an open and frank discussion without becoming professional flamebait –as nicely illustrated by your somewhat abusive response.”

        Lovely flame. Thank you. As to how people can have open and frank debates under the complete cloak of anonymity, I remain confused. Under that hypothesis, the swag at scientific meetings should include a hood and gown, completely covering the attendee from head to toe, so that open and frank discussions might ensue. The scientific meetings I have been to present people in the open, with recognizable faces, and name tags showing us all their true identity. I thought the subsequent meetings were then frank and open, but may well be mistaken. As a self-identified scientist, I remain open to being further educated, and changing my mind when new and useful ideas and methods come along. I would appreciate it if you could inform me about the scientific meetings you have attended where open and frank discussions have ensued under the cloak of complete anonymity. I myself have used my true identity in all these discussions, and no one has had any problem in contacting me directly to discuss these issues. I recommend that you give it a try.

        Thank you for the reference to the “Metagene projection” paper by Tamayo et al. It is an interesting read indeed. It presents apparently useful ideas for exploring multi-gene associations with subject phenotypes, and for reducing dimensionality. I have written to the corresponding author to obtain the open-source R software that they state to be available in the paper.

        The reason the open source code is so important is that the number of steps in such complex analyses is large, and such analyses are extremely difficult to summarize in a five-page journal publication that must also describe the motivation, present the analysis results, and so on. Even in this paper, typos quickly present issues with understanding the methodology. The first paragraph of the Results section discusses n_sub_M genes, then the next paragraph discusses N_sub_M genes. Is it lowercase n or capital N? The letter M used as a subscript is not defined anywhere that I can find. Then an italicized M is used to denote a matrix. Given such a brief description in a journal article subject to typos, I could not possibly write up a routine to reproduce the analysis. You cannot switch the upper and lower case letters N and n in computer code – the resultant code will do something, but it will not be accurate and useful. Thus, in this modern era of large genomic data, algorithm concepts can truly only be shared via functioning computer code. Computers are dumb (but very fast), and they only do exactly what we tell them to do, so algorithm code that works is the best way to share algorithms in today’s genomic scientific culture.

        As to the degree of similarity of this methodology to the Duke methodology, the authors themselves state “There is complementary work of Huang (40) and Bild (41), which is conceptually similar to ours in the sense of combining dimensionality reduction and classification models, but has distinct objectives. Their main goal is to provide an exquisitely specific predictor of pathway activation, which has been experimentally characterized by the overexpression of a single gene. In contrast, our goal is to model global transcriptional states, rather than specific pathways, and to use them to describe an entire range of biological behavior, e.g., different morphologies, lineages, etc. Thus, the specific methodologies and techniques we use are also quite different.” The authors seem to think their methodology is “quite different” from the Duke methodology.

        The fact that issues such as “which operator processes each microarray can have significant effects on the readout” points to a major limitation in current genomic assay technology, and in as much as I can trust someone who will not identify him or her self, apparently next-gen technology is no better. How then are these technologies ready to use in experiments on human beings? How can algorithms based on the output of such untrustworthy assays be ready for use in clinical trials? This was a major part of the problem with the Duke fiasco – less-than-honest portrayals about the adequacy of the technology and associated algorithms for assessing important aspects of patient treatment such as choice of medicine for cancer treatment. The rush to bury such assays under the cloak of “proprietary trade secrets” within a biotech startup company, thus seeking to avoid further scrutiny of these problems you outline, should be of much concern to all healthcare customers. I was not able to find any biotech startup advertising the Metagene Projection methodology.

        When I go in for a blood test, my blood is assayed under some controlled conditions specified in a protocol, which I presume and hope was developed and thoroughly tested and verified in reasonably-sized medical studies. The values I see on the test printout are then matched to pre-determined ranges. Those ranges are not refined, there is no “refining decision boundaries” involved, the boundaries are established and false negative and false positive rates for the test are known. Furthermore, my lab data was in a “lockbox”, completely unavailable to the developers of the medical blood test assay algorithm. That lockbox is the future. The decision boundaries for my data are not refined, based on my data. Current algorithms for human cancer treatment include for example the HER2 immunohistochemistry and FISH assay, in a protocol published by the American Society of Clinical Oncology and the College of American Pathologists. They even run a CAP Laboratory Accreditation Program to maintain as much uniformity in processing across all accredited labs. Conditions are set, having been worked out through years of careful study.

        How then are results from genomic assays to be used in medical decision making? If I go into the lab and give a tissue sample, and the tissue is run on a gene chip, how will that be “normalized”? Is my gene chip data going to be added to the data from those thousands of papers and all of it renormalized? Is my data going to be normalized with the other gene chips done in the lab today? If I miss my lab appointment today, will it then be normalized with tomorrow’s chips, giving me a different readout? This is a major problem for this technology. Baggerly and Coombes identified just such a problem in the Duke investigations – a batch effect which associated strongly with model predictions.

        Just because thousands of papers do this doesn’t make it statistically defensible. The fact that thousands of papers present overly-optimistic statements about model fit error rates is very well documented by now, by such researchers as John Ioannidis and Richard Simon. So yes, indeed, thousands of papers do violate the lockbox principle. That really should stop. The findings in those papers have certainly helped guide researchers through the vast dimensions of the genomic landscape in an early genomic exploratory exercise, but there have been plenty of excursions down blind alleys in that journey. Until we can develop more stable genomic assays that will allow a freshly-assayed sample to be compared to historical models, or run through locked-down algorithms, with consistent results given repeated sampling of the same patient, the application of findings in those papers in experiments on human beings remains problematic. These are the points you and other anonymous posters repeatedly refuse to discuss.

      3. Despite the difficulties I had in understanding the methodology in the Tamayo et al. “Metagene projection” paper, the authors were very cordial and helpful in providing me with complete computer code and data. Within 48 hours of my contacting them, they had provided me with computer code and data and I was able to run it and produce the output in their paper. With computer code in hand, I can truly understand their methodology despite the space limitations of the paper referenced above by nci_researcher.

        Hopefully researchers at Duke will adopt this policy and start providing computer code and data with their publications, as also recommended by the recently released Institute of Medicine Omics report. If the Duke Translational Medicine Quality Framework initiative is to amount to anything, they will work with their biostatistical teams to stop repeatedly reporting overly-optimistic model fit assessments on “validation” data used in a portion of the model fitting exercise, or at a minimum where this is unavoidable, to at least honestly report that the results were obtained using data involved in the model fitting exercise and are thus likely to be overly optimistic.

    2. On performing some additional research on the COXEN algorithm, I came across an excellent example of a decent algorithm assessment study. The study is reported in the paper “Prospective Comparison of Clinical and Genomic Multivariate Predictors of Response to Neoadjuvant Chemotherapy in Breast Cancer”. The author list includes two of the principal developers of COXEN, Jae K. Lee and Dan Theodorescu, and also Keith Baggerly. In this study, four different algorithms for prediction of response were assessed, on a cohort of 100 breast cancer patients whose data had not been used in the development of any of the four model algorithms. Bad Horse and NCI Researcher should review this paper and adopt some of its good practices. This study represents exactly the sort of honest error rate assessment using a true test set never used in the model fitting that I have been describing in other posts. The COXEN developers are to be commended for publishing true out-of-sample error rates associated with their algorithm, an improvement over the assessment reported in the paper cited by NCI Researcher above.

      Not surprisingly, when assessed on a true out-of-sample test set the authors report “calibration was less good than previously reported” (clinical nomogram algorithm), and “[error rate] values are similar but generally lower than observed in the previous small validation study” (DLDA30 algorithm). This is why repeated assessment of algorithms on data from a lockbox is important. Assessment is all too frequently overly-optimistic when performed on data used in the model fitting procedures. The in-vivo COXEN-based algorithm performed as well as two of the other algorithms, with all three performing better than chance (though the in-vitro COXEN-based algorithm did not perform better than chance).

      Once again, kudos to the COXEN developers for working with Keith Baggerly and running a reasonable lockbox data assessment of their methodology. This is how assessments should be done, and though the initial publication of the COXEN algorithm and attendant biotech start-up company had several disturbing parallels to the Duke fiasco, the COXEN developers avoided going down that bad road, and have here provided an excellent example of how assessments should be done.

      1. After reading these last two or three McKinney comments, I agree that it is because we want to see how a model performs on truly new data that we should not peek at the test set at all. Maybe it is less a matter of true and false, and more a matter of what the reader wants to know. In these cases every reader wants the same thing, I think.

        I myself am guilty of mean centering training and test sets when I think they will have large batch effects (and I have anxiety about the two sets not having the same spectrum of patients, so that I might be making things worse rather than better). In Affy data from separate institutions there will be such effects. It’s a problem of not having well-calibrated or very-reproducible assays. I imagine the cloud-centering as something like simulating what better (PCR) data would act like. I always admit it of course, but don’t actively point out that it is a bit of a cheat (the nerds know anyway). This is in situations where you admit your classifier and assays are not the ones you would use out in the field – I think that’s why we excuse it.

        RMA does have the problem NCI_researcher mentions, and a corollary is that if you remove a sample from the analysis, and don’t admit that and cough up its raw data, nobody can get the same numbers you did. Other algorithms don’t have these problems. Right now I don’t think quantile-normalization is a problem like that. I often do it to make the quantiles agree with those of a set standard (not using RMA). I can do that a year from now for a new array I’ve never seen. If the standard is built off statistics computed from training and test sets it violates closed-box, but is anyone very concerned that it wins much of an unfair advantage? I think the winnings are likely small, but that is not a demonstration.
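        Roughly, that frozen-standard version looks like this in Python/NumPy (a minimal sketch; the function names are placeholders rather than any published package, and a real pipeline would also handle background correction and probe summarization):

```python
import numpy as np

def build_reference(train_arrays):
    """Frozen standard: the mean of the sorted training columns (probes x arrays)."""
    return np.sort(train_arrays, axis=0).mean(axis=1)

def normalize_to_reference(new_array, reference):
    """Map one new array onto the frozen reference quantiles.

    Only the new array's own ranks and the stored reference are used, so an
    array that shows up a year later gets exactly the same treatment and the
    test data never feeds back into the standard.
    """
    ranks = np.argsort(np.argsort(new_array))
    return reference[ranks]

# reference = build_reference(train_arrays)   # computed once, from training data only
# new_normed = normalize_to_reference(new_sample, reference)
```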

        I agree with “somewhat abusive”, and find it unnecessary. Very nicely done otherwise.

      2. Oh, I’m Rork Kuick from U of Michigan.
        I think some people saying certain things may deserve anonymity though. For instance if they want to tell stories of lab culture they have seen, and don’t want to make it too easy to tell what lab they mean. I might want to do that in future, but maybe now feel I don’t have the luxury.

        Maybe I should divulge that I am a co-author on some David Beer papers, one where the Potti group used the data and we’ve never understood what they did, in case that might bias me some against the Duke group.
        I thank NCI_researcher and Mad Horse for speaking up though, even if they didn’t give their names.

      3. I continue to find it fascinating that people hiding behind the facade of anonymity are so comfortable pointing fingers. I am glad to see Rork Kuick self-identify today after the anonymous post. (I am now reading with interest your ConceptGen paper.) My first post on Retraction Watch is above, dated February 10. Abusive? Really? And the reply from “Bad Horse” stated “You have confused the validation data set with the class labels. As a statistician, you should know this, so I have to conclude that you must have ulterior motives. While I’m not here to defend the Potti work, your post is irresponsible and potentially libelous. Shame on you.” Now that is abusive. Faced with such an accusation, a threat really, I will continue to speak out on this issue, especially with anonymous posters making unproven assertions. The abuse in this whole unfortunate affair was done to dozens of cancer patients who were prematurely entered into clinical trials, many of whom are now involved in lawsuits against personnel involved in setting up those trials, and to the many cancer victims who will not get adequate treatment as the resources that could have been used for their care were sadly squandered over the many years that this unproven methodology was funded. That needs to stop until this methodology is assessed in proper studies such as the study the COXEN developers conducted that I discussed above.

        As a statistician I have never confused a validation set with a set of class labels. Such confusion apparently was not avoided in the analyses involved with the many retracted papers listed above.

        Rork – which observation of Horse’s is interesting, and how? Certainly there is no need to tone-troll, though there is every need to discuss inappropriate application of unproven methodology. That is not tone-trolling, that is discussion of the substantive issues.

        I also know of no rule saying an MD is not allowed to study and use R or know about statistics – I have also worked with such MDs (I’m doing so right now, and the last MD I worked with received a Guggenheim fellowship in part because of his computing expertise). There is little doubt that Potti was not the author of original versions of the code. One stunning revelation from the Institute of Medicine investigation (see PAF Document 13.pdf, page 1) was that the Duke group claimed that they no longer had the source code for their TreeProfiler program due to a server malfunction and stated “we just do not have the resources to regenerate the source code for the TreeProfiler program. . .”

        You state “If the standard is built off statistics computed from training and test sets it violates closed-box, but is anyone very concerned that it wins much of an unfair advantage?” We would all be DELIGHTED if the methodology won an unfair advantage – this is exactly why other cancer research groups including Baggerly and Coombes at MD Anderson set out to reproduce the methodology. Cancer is a horrible disease, no one would care how much of an unfair advantage any technique had, if it helped sick people identify better therapies. “nci_researcher” posits that “error rates from statistical procedures that use genomic data from the test set have error rates that are not necessarily “artificially inflated”, but in reality might just be higher performing and producing more accurate results”. This needs to be established with proof, and certainly could be evaluated in a clinical study, as demonstrated by Baggerly and the COXEN developers. No one would care if the algorithm for a successful cancer treatment was Bayesian, frequentist or otherwise if it worked.

        So what might a trial testing such methodology look like?

        Arm 1: Validation Arm: Collect new patient data. Refit predictive models with new data and old. Make treatment predictions for new patients.
        Arm 2: Test Arm: Collect new patient data. Run new patient data through locked-down algorithm, predict treatment for new patients.
        Arm 3: Control Arm: Use current best practice or placebo.
        Evaluation: After running the trial, overall survival rates on the various arms would show whether such methodology really had an advantage.

        The report that the Institute of Medicine will soon release will have reasonable guidelines about what a group will need to do to get such a trial up and running in a reasonable and ethical fashion. I predict that one of the requirements will be to provide source code to a regulatory body such as the NCI for any computer programs involved, so back up those servers.

        “nci_researcher” further states “It goes without saying that it’s not valid to compare error rates between statistical procedures that normalize between the input variables of training/test sets and those that don’t.” Why would it not be valid? This is exactly the sort of error rate comparison performed in the clinical study of Baggerly and the COXEN developers.

        There are sound statistical methods to properly and demonstrably evaluate the performance of any of these models. Threats and hubris do not constitute such methods. More open and honest discussion of reasonable evaluation methods will win out in this ultimately scientific debate, and yield helpful maneuvers for sick people sooner.

      4. I’m not saying I agree with anything Horse wrote, or understand every sentence even, but it did point out that the method didn’t use the sample labels of the test set. That’s all.

        I do think we could benefit from wearing hoods at a conference sometimes. Folks often fail to ask the hard questions or make points about fudging for fear of retaliation. My understanding is that is why reviewers of papers are often not identified. Maybe every statistician in the world that reads paper P is deeply skeptical of whether it contains cheating or not, but nobody speaks out. If I find that a group has used an underhanded method or made it impossible to figure out – and that happens every week – I am unlikely to say so publicly. Bad people benefit. Note that I do realize and worry about the reverse, where an unknown person criticizes a group for some hidden reason. I don’t doubt these issues have been debated to pieces here in the past.
        Maybe numerate scientists need to band together and publish reviews of papers, where every member of the band is known, but the 2 or 3 people writing any particular review are impossible to identify. A bit like Nicolas Bourbaki.

        PS: Most of the credit for ConceptGen doesn’t go to me. I did not program any of it.
        -Rork Kuick

      5. “I’m not saying I agree with anything Horse wrote, or understand every sentence even, but it did point out that the method didn’t use the sample labels of the test set.”

        We certainly expect that gene expression levels or gene copy number associates with potential sensitivity or resistance to a treatment – after all we know that patients with ER positive breast cancers respond well to hormone therapies such as Tamoxifen, and patients with HER2 positive breast cancers respond to Herceptin – so why not look for other gene targets? If we thought that there were no patterns in gene expression or gene copy number data that could tell us anything about disease severity or potential response to drugs, why would we be studying genetic data so fervently?

        So we expect that patterns in gene expression data can tell us something about class labels – in this case the class labels are whether or not a patient will get some anti-cancer benefit from taking a drug.

        So now we examine the gene expression data of the validation set, and we look for patterns in the validation set that are like patterns we see in the training set where we have the class labels.

        Then we show that the model we have built can magnificently show us who will respond to a drug in the validation set, after we have peeked at valuable information in the validation set and “refined the decision boundaries”.

        So just because we don’t know the class labels from the validation set does not protect us from overfitting to some random ridge or valley in the multidimensional dataset that runs from the training data into the validation data – but not, unfortunately, on in to the data for patients yet to be diagnosed.

        We need to find the patterns that occur repeatedly – the generalizable patterns that show up consistently – which is why we need more model evaluation on lock-box test sets. Granting agencies and scientific communities need to pool resources so that studies of a large enough size can be run using the train-validate-test set paradigm that Friedman, Hastie and Tibshirani discuss that I cited above, instead of many small studies that just keep publishing results reflecting local lumps and bumps in small data sets. Granting agencies and journals also need to stop treating replication as a secondary endpoint of little interest. We need to place greater value on replication and establish mechanisms to allow more validation of potentially useful methods, so we can eliminate the poor ones sooner and not litter the literature with dozens of papers demonstrating apparently accurate predictive models that have never truly been vetted and replicated.
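        In code, the train-validate-test paradigm can be sketched as follows (Python with scikit-learn on simulated null data; the split sizes, candidate gene counts and classifier are arbitrary illustrative choices, not a recommended protocol): the validation set is spent on model selection, and the lockbox test set is scored exactly once at the end.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Simulated null data standing in for an expression matrix and response labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(180, 500))
y = rng.integers(0, 2, size=180)

# Three-way split: the test set goes into the lockbox until the very end.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# Model selection (here, how many genes to keep) is decided on the validation set.
best_k, best_val = None, -np.inf
for k in (10, 20, 50, 100):
    candidate = make_pipeline(SelectKBest(f_classif, k=k),
                              LogisticRegression(max_iter=1000)).fit(X_train, y_train)
    val_acc = candidate.score(X_val, y_val)
    if val_acc > best_val:
        best_k, best_val = k, val_acc

# Refit the chosen configuration and spend the lockbox test set exactly once.
final = make_pipeline(SelectKBest(f_classif, k=best_k),
                      LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("validation accuracy (used for model selection):", round(best_val, 2))
print("test accuracy (reported):", round(final.score(X_test, y_test), 2))
```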

    1. This seems to be the current trend … show concern but do nothing. I fail to understand why authors are allowed to do new experiments to prove what they had originally claimed, rather than being asked to provide the original data from the 2-n experiments that were done, which presumably got misrepresented while making the figures.

  18. Additional journal review and guideline papers just published, summarizing issues from the large IOM report of findings and providing new guidelines for OMICS researchers:

    Nature 502, 317–320 (17 October 2013)

    Criteria for the use of omics-based predictors in clinical trials

    Lisa M. McShane et al. doi:10.1038/nature12564

    http://www.nature.com/nature/journal/v502/n7471/full/nature12564.html

    BMC Medicine 2013, 11:221

    OMICS-based personalized oncology: if it is worth doing, it is worth doing well!

    Daniel F Hayes doi:10.1186/1741-7015-11-221

    http://www.biomedcentral.com/1741-7015/11/221

    BMC Medicine 2013, 11:220

    Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration

    Lisa M McShane et al. doi:10.1186/1741-7015-11-220

    http://www.biomedcentral.com/1741-7015/11/220
