Weekend reads: Bad peer reviews; crimes against science; misconduct at Oxford

The week at Retraction Watch featured an exclusive about a prominent heart researcher being dismissed, and a look at signs that a paper’s authorship was bought. Here’s what was happening elsewhere:

Like Retraction Watch? Consider making a tax-deductible contribution to support our growth. You can also follow us on Twitter, like us on Facebook, add us to your RSS reader, sign up on our homepage for an email every time there’s a new post, or subscribe to our daily digest. Click here to review our Comments Policy. For a sneak peek at what we’re working on, click here.

5 thoughts on “Weekend reads: Bad peer reviews; crimes against science; misconduct at Oxford”

  1. Matt Hodgkinson is setting the bar very high for Hindawi. Here’s a challenge for Hindawi: can it run a plagiarism check on all of the papers it published before it started using commercial plagiarism-detection software?

  2. Various statements and quotations in Jeffrey Mervis’s articles about research misconduct in Science prompt my comment that the NSF Office of Inspector General website contains significant documents for cases that result in action by the agency, up to and including debarment. In particular, the central report of investigation is found there. It is redacted as required. However, the report sections that describe what was done, what the apparent intent was, and what the impact was are certainly of interest to those trying to get an overview of these issues in research misconduct. The main report also includes a summary of the grantee’s inquiry, investigation, and actions. NSF’s actions are almost always directed at the individual, and NSF does not require that specific consequences be imposed by the grantee.

  3. Oops, I think they did it again! The compilers of the Weekend Reads have linked to yet another dodgy paper showing some complicated-looking statistics, but with inappropriate reasoning and thus inappropriate conclusions about how to fix the statistical evidence presented in scientific writing.

    I commented last weekend on the faulty lines of reasoning displayed in the David Colquhoun paper linked in the Weekend Reads.

    The link this weekend, to an accepted manuscript in the Journal of the American Statistical Association by Johnson, Payne, Wang, Asher and Mandal of Texas A&M (“A new paper argues the lack of reproducibility in psychological science means a higher threshold is needed for what constitutes a scientific discovery”), reads quite similarly, save that these authors employ a Bayesian set of models.

    This is informative, as one can see here the type of analysis David Colquhoun may have had in mind when he alluded to Bayesian alternatives in his paper.

    But whether one employs frequentist or Bayesian mathematical methods, the same faulty findings emerge when the underlying statistical logic is not properly applied.

    This team reaches a conclusion remarkably similar to Colquhoun’s, suggesting that the p-value cutoff be lowered from 0.05 to 0.001.

    Statistical significance alone is insufficient evidence upon which to base a philosophically sound statistical finding of a discovery of scientific relevance. These authors and Colquhoun do not focus sufficiently on the additional essential requirements of determining an effect size of scientific relevance, and a sample size that is likely to detect such an effect size. I see no discussion in this article of how many papers from the psychology field soundly reason about effect sizes of scientific merit, nor how many demonstrated that they had sufficient data to reliably detect such meaningful effects.
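
    To make this point concrete, here is a minimal Python sketch (my own illustration, not taken from either paper) of why significance alone says nothing about relevance: with a huge sample, a difference far too small to matter scientifically still clears even a 0.001 threshold. The effect size and sample size below are made-up values.

        # Hypothetical illustration: a "significant" p-value attached to an
        # effect too small to matter. The true difference (0.02 standard
        # deviations) and the group size (1,000,000) are assumed values.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)
        control = rng.normal(loc=0.00, scale=1.0, size=1_000_000)
        treated = rng.normal(loc=0.02, scale=1.0, size=1_000_000)

        t_stat, p_value = stats.ttest_ind(treated, control)
        cohens_d = (treated.mean() - control.mean()) / np.sqrt(
            (treated.var(ddof=1) + control.var(ddof=1)) / 2)
        print(f"p = {p_value:.1e}, Cohen's d = {cohens_d:.3f}")
        # p comfortably clears even a 0.001 cutoff, yet the effect is ~0.02 sd;
        # whether that matters is a scientific judgement, not a statistical one.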

    This problem will not be fixed by merely reducing the size of the p-value deemed to be acceptable. Interpreting p-values is always context-dependent. A single p-value in a phase III clinical trial, with a priori specification of type I and type II error rates, a description of a meaningful effect size, and a demonstration of sufficient sample size, is entirely meaningful. A single p-value from a gene chip yielding data on 20,000 genes should not be interpreted alone; multiple-comparison methods must be employed to properly assess findings across so many results and maintain overall type I and type II error rates.
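
    As a minimal sketch of what such a correction looks like in practice (the simulated p-values and gene counts below are made up purely to show the mechanics), one can adjust all 20,000 results together rather than reading any single p-value in isolation:

        # Hypothetical illustration: adjusting 20,000 gene-level p-values with
        # a Benjamini-Hochberg false-discovery-rate correction.
        import numpy as np
        from statsmodels.stats.multitest import multipletests

        rng = np.random.default_rng(0)
        p_null = rng.uniform(size=19_900)           # genes with no real effect
        p_signal = rng.beta(0.1, 10, size=100)      # a few genuine effects
        pvals = np.concatenate([p_null, p_signal])

        naive_hits = (pvals < 0.05).sum()           # ~1,000 "hits" arise by chance alone
        reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
        print(naive_hits, reject.sum())             # the corrected count is far smaller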

    The fix to this problem is insisting on a demonstration of what size of effect has scientific meaning, and on power calculations that show how much data is needed to reliably detect a difference of that scientifically relevant size. This is the ugly crux of the problem. We need fewer experiments, of larger sizes, to figure out what is going on scientifically and reach sound conclusions.
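
    As a sketch of that power-calculation step (the minimally relevant effect size of 0.4 standard deviations, the 80% power target and the alpha levels are assumed values, chosen only for illustration), one can solve for the sample size a study would actually need:

        # Hypothetical illustration: decide what effect size is scientifically
        # meaningful, then solve for the per-group sample size needed to detect
        # it reliably at each candidate significance threshold.
        from statsmodels.stats.power import TTestIndPower

        analysis = TTestIndPower()
        for alpha in (0.05, 0.001):
            n = analysis.solve_power(effect_size=0.4, power=0.8, alpha=alpha,
                                     ratio=1.0, alternative='two-sided')
            print(f"alpha = {alpha}: about {n:.0f} subjects per group")
        # Roughly 100 per group at alpha = 0.05 versus roughly 215 at 0.001:
        # a stricter threshold only helps if studies are also planned to be
        # substantially larger.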

    Without a clear demonstration of what a scientifically relevant difference is, and that sufficient data were available to detect such a difference in a high percentage of experiments, the presented findings are merely suggestive, exploratory findings. Unfortunately, journals seeking to return double-digit profits have lowered their standards so that more studies can be published, and the result is a slew of studies, exploratory and suggestive only, presented as exciting new discoveries. This has been demonstrably successful for journal profitability, but demonstrably disastrous at yielding a corpus of useful scientific findings.

    There is no excuse these days for limiting the length of articles. Any journal that imposes word and page limits should also offer unlimited supplementary space online, so that all of the elements I describe above and below can be demonstrated in full, without page-length or word-count limitations. No “Brief Communication” or “Letter” should stand on its own. Such brevia are essentially the modern abstract, and many journals have neither provided the resources to allow researchers to present their findings in full nor demanded that they do so. Given the increased complexity of the multi-centre collaborative efforts needed to solve more complex scientific phenomena, we must have more space to fully describe scientifically sound findings.

    From the article’s conclusions: “More generally, however, editorial policies and funding criteria must adapt to higher standards for discovery. Reviewers must be encouraged to accept manuscripts on the basis of the quality of the experiments conducted, the report of outcome data, and the importance of the hypotheses tested, rather than simply on whether the experimenter was able to generate a test statistic that achieved statistical significance.”

    This concluding statement by the Texas A&M team is a sound one, though its aim will not be achieved merely by insisting that p-values below 0.001 become the new threshold.

    Higher standards of discovery include a solid argument describing an effect size of scientific relevance, and power calculations showing the minimal sample size needed to reliably detect such differences of scientific relevance. Longer articles are needed to fully describe such efforts.

    An outline of a properly presented scientific finding would include the preliminary experiments that suggested a finding; the use of those preliminary experiments to establish an effect size of scientific relevance and the sample size needed to reliably detect such a difference (high power); and then the final experiment or experiments that clearly show detection of a meaningfully sized effect, or repeatedly fail to do so. All attempted experiments should be described, so that the experiments’ success rate can be assessed; and when experiments fail to reliably detect a meaningful difference, such studies should be welcomed somewhere, so that others can see the failed efforts and not waste more time in those pursuits.

    When journals or online archives and databases begin logging all of these descriptions, we will have a much improved evidence base upon which to sort out scientific phenomena. But merely lowering the “p-value cutoff” is not going to get us there.
