Anita Bandrowski, a neuroscientist at the University of California, San Diego, works on tools to improve the transparency and reproducibility of scientific methods. (Her work on Research Resource Identifiers, or RRIDs, has been previously featured on Retraction Watch.) This week, Bandrowski and colleagues — including Amanda Capes-Davis, who chairs the International Cell Line Authentication Committee — published a paper in eLife that seeks to determine whether these tools are actually influencing the behavior of scientists, in this case by reducing the number of potentially erroneous cell lines used in published studies.
Such issues may affect thousands of papers. Among more than 300,000 cell line names in more than 150,000 articles, Bandrowski and her colleagues “estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list,” suggesting “that the use of RRIDs is associated with a lower reported use of problematic cell lines.”
Retraction Watch spoke with Bandrowski about the role of these tools in the larger movement to improve transparency and reproducibility in science, and whether meta-scientific text-mining approaches will gain traction in the research community.
Retraction Watch (RW): Your study presents RRID as a behavioral “nudge,” beyond its primary goal of standardizing method reporting. What other nudges can you envision to prevent misuse of cell lines in scientific research?
Anita Bandrowski (AB): We note that while a nudge seems already quite effective, more than 50% difference, we do envision that the future will have tools that will warn authors a little more explicitly about cell lines before they publish. We envision that text mining will allow authors’ text to be interpreted and will enable warnings perhaps when they are writing.
RW: You found scientists often used potentially misidentified or contaminated cell lines, despite an ability to check for this online while writing their manuscript. Isn’t it better to identify problems with your cell line before you perform the experiment, rather than at the manuscript stage?
AB: Absolutely! However, we should all eat well and exercise, but just because we know what to do does not mean we will do it. With RRIDs, when they are implemented by journals, we at least have one last check before signing off on a manuscript that will be part of the permanent scientific literature.
RW: Would you say the problem is more information barriers or that too many studies are carried out without a prospective plan? Would “registered reports” also address this issue?
AB: I think that registered reports are great. However, the problem is that when authors are registering their reports, there is no means to make sure that cell lines are checked by the author. RRIDs in registered reports would be quite useful, but if there is no enforcement, then only the most diligent authors would be alerted to problems ... and they are probably the least likely to need the nudge.
RW: Your study is limited by the ability of a natural language processor to automatically detect the names of cell lines in a large scientific corpus. Will this kind of methodology quickly hit its limits in what kind of conclusions can be drawn, or do you think there is a lot of progress to be made using “big data” in meta-science?
AB: I really don’t know what big data will enable; certainly many computer scientists will continue to do very interesting things. For us, the technique of text mining was a means to introspect on what we are doing as a field of study, and so the key here was the problem. As RRIDs are adopted more widely, many people will likely look again at the assertion that RRIDs and the warnings they carry have some impact on author behavior. I also think that there is nothing particularly new in our text mining methods, save the particular application of these rather powerful tools to a set of biologically interesting entities, which have thus far been overlooked, along with the rest of the methods section, by the text mining community.
RW: Can you discuss how you addressed confounding in this study? Journals and authors that utilize the latest meta-scientific methodologies (e.g., RRID) may represent a different population than those who don’t. Will confounding be a significant issue in future studies along these lines?
AB: We spent significant time trying to address confounds in the paper. Basically, we think that having a sample size in the thousands to hundreds of thousands and an effect size of more than 50% makes it highly unlikely that we are letting ourselves be fooled with statistics. Furthermore, while it is possible that authors of eLife or Neuron papers are more diligent than most, the more journals that join the RRIDs initiative and ask all authors for RRIDs, or enforce RRIDs, the less likely it is that the RRID authors are different from the general population. So enforcement of RRIDs by journals counteracts the most plausible confounds. Of course, I would love to see what happens in the next 5 or 10 years: will journals that enforce RRIDs continue to have half the rate of cell line problems?
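The effect size of "more than 50%" can be checked against the two rates quoted from the paper: 8.6% of cell lines across the full corpus were on the problematic list, versus 3.3% in the papers that included RRIDs. A minimal sketch of that arithmetic follows, assuming the effect size is the relative reduction between those two rates. The function and variable names are illustrative and do not come from the paper.

```python
def relative_reduction(baseline_rate: float, comparison_rate: float) -> float:
    """Fraction by which comparison_rate is lower than baseline_rate."""
    return (baseline_rate - comparison_rate) / baseline_rate


# Rates quoted from the eLife paper (as reported in this article).
overall_problematic_rate = 0.086  # all cell lines found in the corpus
rrid_problematic_rate = 0.033     # cell lines in the 634 papers with RRIDs

reduction = relative_reduction(overall_problematic_rate, rrid_problematic_rate)
print(f"Relative reduction: {reduction:.1%}")  # prints "Relative reduction: 61.6%"
```

This is a comparison of two observed rates, not a causal estimate, which is why the confounding question matters.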
RW: Do you think we should have randomized trials for these types of behavioral interventions in scientific methodology, or is this too impractical?
AB: I would love to have a randomized trial, but this would be rather difficult to do given that cell line use is a relatively rare event. It might be rather impractical with smaller journals, where a relatively small number of papers containing cell lines is published every year. I would imagine that getting a sample size of several thousand papers would be quite difficult. The power of text mining allows us to monitor a relatively rare event across a wide range of literature.
RW: How much does variation in open access licensing affect this study and similar studies? Is lack of permission to do “text mining” a problem? How can scientific publishing better enable these kinds of corpus studies?
AB: The thing that makes this study possible, and many other text mining studies, is the gold open access set of literature, meaning the ~2 million biomedical papers that are licensed as available for text mining. Without this corpus, the cost of doing this type of study would simply make it impractical. Just imagine the case where each paper would need to simply be downloaded by a student who makes $10/hour and has no benefits. Let’s say the student could download 100 papers per hour (I am assuming a very efficient student, but the math is easy), so to analyze a corpus of 2 million papers, the cost to a project would be $200,000 just to download the data, but not to do anything with it. Now, if we wanted that data to be in a standard XML format (JATS), which is machine readable, that would likely cost at least 100 times as much, and still you have not built any algorithm yet; you have only gathered the raw data to begin your work. So the work of the publishers, which provide this content in a structured manner, and the National Library of Medicine in preparing and providing access to the information is critical to the field of text analysis. Their work in structuring publications has allowed us to understand how an intervention, the RRID, is affecting the quality of cell lines used. I wonder what sorts of things we will be able to understand as more data becomes available to machines.
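Bandrowski's back-of-the-envelope estimate can be written out step by step. The sketch below uses only the figures she gives (2 million papers, 100 papers per hour, $10 per hour, and a factor of at least 100 for conversion to JATS). The constant names are illustrative, and the JATS figure is a lower bound under her stated assumption, not a measured cost.

```python
CORPUS_SIZE = 2_000_000      # papers licensed for text mining (approximate)
PAPERS_PER_HOUR = 100        # assumed manual download rate
HOURLY_WAGE_USD = 10         # assumed student wage, no benefits
JATS_COST_MULTIPLIER = 100   # "at least 100 times as much" for structured XML

# Hours of manual labor needed just to download the corpus.
download_hours = CORPUS_SIZE // PAPERS_PER_HOUR

# Cost of downloading alone, before any analysis.
download_cost_usd = download_hours * HOURLY_WAGE_USD

# Lower bound on the cost of also converting everything to JATS XML.
structured_cost_usd = download_cost_usd * JATS_COST_MULTIPLIER

print(f"Download hours:  {download_hours:,}")        # 20,000
print(f"Download cost:   ${download_cost_usd:,}")    # $200,000
print(f"JATS lower bound: ${structured_cost_usd:,}")  # $20,000,000
```

Under these assumptions, the structured corpus alone would cost on the order of $20 million to assemble by hand.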