A group of researchers in Canada and India have lost a paper on vaccine hesitancy and Covid-19 because they didn’t have the proper license to mine a database of news articles used in the study.
The paper, “Tracking COVID-19 vaccine hesitancy and logistical challenges: A machine learning approach,” was published in PLOS ONE on June 2. Led by Shantanu Dutta, of the Telfer School of Management at the University of Ottawa, the researchers set out to:
systematically track ‘vaccine hesitancy’ and ‘logistical challenges’ associated with the Covid-19 vaccines, in the USA. To that effect, we use news articles from reputed media sources and create dictionaries to estimate different aspects of vaccine hesitancy and logistical challenges.
Their analysis, which incorporated machine learning and language processing to mine data from a repository of articles owned by Factiva — a Dow Jones property — found that:
over time, as vaccine developers complete different phase trials and get approval for their respective vaccines, the number of vaccine related news articles increases sharply. Accordingly, we also see a sharp increase in vaccine hesitancy related topics in news articles. However, in January 2021, there has been a decrease in the vaccine hesitancy score, which will give some relief to the health administrators and regulators. Our findings further show that as we get closer to the breakthrough of effective Covid-19 vaccines, new logistical challenges continue to rise, even in recent months.
That relief, if it ever existed, was fleeting — and so was the paper. As Dutta explained, although the University of Ottawa library subscribes to Factiva, the service doesn’t permit data mining. Here’s the retraction notice:
After this article was published, concerns came to light regarding data use permissions. The authors obtained news articles for this study on Factiva. While the authors represented to PLOS that they had legitimate permissions to access the articles, concerns were noted post-publication that the authors’ data mining of news articles on Factiva did not comply with the terms of the University of Ottawa’s license with Factiva. Therefore, the authors retract this article.
The authors requested that PLOS remove the article from online publication, and informed PLOS that the University of Ottawa library and representatives of Factiva (Dow Jones) had both requested this action. Following an internal assessment of this case, PLOS agreed to remove the article from the PLOS ONE website at the time of retraction.
All authors agreed with retraction.
For Dutta and her colleagues, the episode has been “a very unfortunate event.” The lesson for other researchers, she said, is:
we should not automatically assume that we have text mining right for the contents we can access through library or other legitimate sources. We need to verify this.
Dow Jones did not respond to a request for comment.
Sorry if I don’t understand what may be a basic concept, but how does “data mining” differ from doing word or boolean searches in a database in order to retrieve articles, and then writing about the retrieved articles? I assume there is no problem with doing the latter.
Data mining means analyzing large corpora of text programmatically, and there are a number of popular Python libraries for it. There is a huge amount of interest in it, but library database licenses generally don’t allow downloading the huge number of articles needed. I’m surprised they were able to build their corpus, as Factiva has put in controls to stop scripts and will cut off access site-wide if you manage to get around that.
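To make the contrast with a plain keyword search concrete, here is a minimal sketch in Python. It is not the authors' method. The word list, the three sample "articles," and the function names are all made up for illustration, and it uses only the standard library. A boolean search just returns the matching articles for a person to read, while the mining step computes a score across every article in the corpus.

```python
import re
from collections import defaultdict

# Hypothetical hesitancy dictionary; the terms are illustrative only.
HESITANCY_TERMS = {"hesitancy", "hesitant", "refuse", "distrust", "skeptical"}

# Tiny made-up corpus standing in for a large set of news articles.
ARTICLES = [
    {"month": "2020-12", "text": "Many remain hesitant and some refuse the vaccine."},
    {"month": "2020-12", "text": "Distrust of the vaccine grows as approval nears."},
    {"month": "2021-01", "text": "Clinics report strong demand for the vaccine."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def boolean_search(articles, keyword):
    """Ordinary retrieval: return the articles that contain the keyword."""
    return [a for a in articles if keyword.lower() in tokenize(a["text"])]


def hesitancy_score(text, terms=HESITANCY_TERMS):
    """Share of words in the text that appear in the dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in terms) / len(tokens)


def monthly_scores(articles):
    """Mining step: average the per-article score within each month."""
    by_month = defaultdict(list)
    for article in articles:
        by_month[article["month"]].append(hesitancy_score(article["text"]))
    return {month: sum(s) / len(s) for month, s in by_month.items()}


if __name__ == "__main__":
    print(len(boolean_search(ARTICLES, "vaccine")), "articles mention 'vaccine'")
    for month, score in sorted(monthly_scores(ARTICLES).items()):
        print(month, round(score, 4))
```

The second step only becomes meaningful with thousands of articles, which is why it requires bulk downloading rather than reading search results one at a time.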
What exactly does this type of licensing restriction imply? You can view individual items of data but are prohibited from doing any correlations across the items?
Good point. I’d like to know too.
Sounds like some type of data “plagiarism.” You can view the info but cannot use the data for comparison? Please explain?
So sad to see researchers fold like this. They didn’t even suggest that this is an insane state of affairs while complying. These types of agreements with publishing companies are detrimental to critical analysis and clear thinking on important issues that affect us all.
This is pretty close to the issue that recently came out with NYU researchers analyzing Facebook advertising data. I hate Facebook and all they stand for, but in that case, as in this one, the terms of service are clear about what researchers can and cannot do with their proprietary data. While it would be nice for them to extend these policies for researchers, they are not obligated to. This situation is akin to not getting proper IRB approval to run a study with human subjects. The onus is on the researchers to know what is and what is not allowable in their collection of data.