The authors of a paper on brain genetics published online in June in the Proceedings of the National Academy of Sciences (PNAS) are retracting it for “a potential confound relating to statistical inference.”
Here’s the notice for “Identification of gene ontologies linked to prefrontal–hippocampal functional coupling in the human brain”:
The authors wish to note the following: “In this paper we report an association of the ‘synapse organization and biogenesis’ gene set with a neuroimaging phenotype, using gene set enrichment methodology. The methods and results of the paper, as described, have been conducted after consultation with experts in the field and support this conclusion. However, a potential confound relating to statistical inference has been brought to our attention that arises from the fact that several clustered genes, all of which are included in this gene set, have been tagged by the same SNP. This problem, which concerns only a small fraction of our tested gene sets (unfortunately including our top finding), belongs to a known category of potential pitfalls in gene set association analyses, and we are sorry that this problem was not detected earlier. Our reanalyses suggest that if adjustments for this confound are applied, the results for our top finding no longer reach experiment-wide significance. Therefore, we feel that the presented findings are not currently sufficiently robust to provide definitive support for the conclusions of our paper, and that an extensive reanalysis of the data is required. The authors have therefore unanimously decided to retract this paper at this time.”
The Scientist, which first reported the retraction, has more details on the back story:
The Scientist first learned of possible problems with this analysis when the paper was under embargo prior to publication. At that time, The Scientist contacted Paul Pavlidis, a professor of psychiatry at the University of British Columbia who was not connected to the work, for comment on the paper. He pointed out a potential methodological flaw that could invalidate its conclusions. After considering the authors’ analyses, Pavlidis reached out to Meyer-Lindenberg’s team to discuss the statistical issues he perceived.
And:
Elizabeth Thomas, who studies the molecular mechanisms of neurological disorders at The Scripps Research Institute in La Jolla, California, and was not involved in the work, noted that the GO [gene ontology] annotations used in the study were outdated. “GOs change every few months, and it’s unfortunate for researchers that rely on a certain set of annotations. It makes you wonder whether the papers published in the past five to 10 years are still relevant,” said Thomas. “This retraction raises the issue of how many papers may have falsely reported gene associations because of the constantly evolving changes in gene assemblies and boundaries. That’s really alarming to me.”
As a cell biologist, and not a geneticist, I am confused and startled by Elizabeth Thomas’s statement. Can someone from the genetics field please explain what this means for the reliability of genetics approaches, when papers are constantly churned out announcing the “discovery” of new sets of genes associated with a particular disease or phenotype?
Gene Set Enrichment Analysis (GSEA) is a tool used in microarray analysis, typically when one is struggling to find significance with more traditional approaches, so in general one is starting from a fairly weak foundation to begin with. It clusters genes (or SNPs, in this case) into pathways BEFORE performing statistical tests on the pathways themselves, thereby increasing power by essentially reducing the number of multiple comparisons. Critically, a gene or SNP that is not significant on its own can appear significant if the pathway it belongs to is “enriched” with other genes that are changing. The standard approach is to run the analysis on all your genes first and then perform an unbiased pathway analysis on the significantly altered genes for mining purposes. In this case, it seems they picked up a pathway that was artificially enriched because a single SNP was counted multiple times across different genes, which is apparently a known issue with this specific type of analysis.
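To make the confound concrete, here is a minimal, hypothetical sketch of how double-counting a single SNP can manufacture enrichment. It uses a simple hypergeometric over-representation test as a stand-in for the paper’s actual SNP-based enrichment procedure, and all of the numbers (gene counts, set sizes, hit counts) are made up for illustration:

```python
# Minimal sketch: how one SNP tagging several clustered genes can inflate
# a gene-set enrichment p-value. All numbers here are hypothetical.
from scipy.stats import hypergeom

M = 20000  # total genes tested genome-wide (hypothetical)
n = 100    # genes annotated to the gene set of interest (hypothetical)

def enrichment_p(hits_in_set, hits_total):
    """P(X >= hits_in_set) under the hypergeometric null: drawing
    hits_total significant genes from M, of which n belong to the set."""
    return hypergeom.sf(hits_in_set - 1, M, n, hits_total)

# Honest counting: one significant SNP near a gene cluster counts as
# one gene-level hit in the set.
print(enrichment_p(hits_in_set=1, hits_total=50))  # ~0.22, unremarkable

# Confounded counting: the same SNP tags five clustered genes that are
# all annotated to the set, so it is counted as five independent hits.
print(enrichment_p(hits_in_set=5, hits_total=54))  # ~1e-5, looks "enriched"
```

Note that no new evidence has appeared between the two calls: the same association signal has simply been counted five times, which is exactly the kind of artifact the retraction notice describes.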
But in my opinion, GSEA is a weak technique that is usually only reached for when one is desperate and all other approaches have failed. I’ve done it myself, but never published the data because it’s hard to put much faith in it. It can work, but it should be done with extreme caution and a lot of post-hoc validation. I suspect much of the array data out there analyzed this way back in the day is unreliable.
Dave, many thanks for your explanation. So if I have invested lots of time, money, and manpower into a genetics project where my expectations of significance were not fulfilled, I can use the GSEA approach to milk the data until something comes out and, with some luck, publish it in PNAS?
Essentially, it is about multiple testing, including interactions. Like it or not, Meyer-Lindenberg made his career with this; check his earlier papers.