The Open Science Framework (OSF) has pulled a dataset of 70,000 users of the online dating site OkCupid over copyright concerns, according to the study author.
The release of the dataset generated concerns because it made personal information, including personality traits, publicly available.
Emil Kirkegaard, a master’s student at Aarhus University in Denmark, told us that the OSF removed the data from its site after OkCupid filed a claim under the Digital Millennium Copyright Act (DMCA), which requires the host of online content to remove it under certain conditions. Kirkegaard also submitted a paper based on this dataset to the journal he edits, Open Differential Psychology. But with the dataset no longer public, the fate of the paper is subject to “internal discussions,” he told us.
In place of the dataset on OSF, this message now appears:
Unavailable For Legal Reasons
This record has been suspended
Kirkegaard told us more about what happened:
Initially, the dataset was unpassworded. However, OSF requested that we password-protect it. This was done to better enable the debate about the privacy status of such datasets.
Second, the repository is currently unavailable due to a DMCA claim sent by OKCupid. It’s unclear to me which part they claim copyright on. They have not contacted me. OSF is investigating the claim.
Brian Nosek, the executive director of the Center for Open Science, which maintains the OSF, confirmed the data were removed Friday:
We removed the user datafile on Friday following our internal review (the one with the potentially identifying data). OKCupid did file a DMCA claim on Friday evening, and to process that request, we shut down access to the entire project. We are still processing that claim.
The COS had been looking into the dataset even before OkCupid responded, Nosek added:
OSF is like YouTube, Facebook, Instagram, and other places where users post content. We don’t know in advance what people will post. We respond to potential misuse with an investigative process.
When we learned on Wednesday that there was the possibility of users’ identifying information in that file, we initiated an investigation. By Wednesday evening, we had enough preliminary information to request that the person posting remove the user datafile or make it private. He agreed and converted the file to a password-protected version on Thursday morning. We also removed access to prior versions of the file at that point. Then, we took Thursday to conduct a full review and determined that the file should be removed. We confirmed with the poster, and he agreed to have it removed. Later Friday, we received and started to address the DMCA request.
Because the data are no longer available, it’s not clear whether the journal will accept the paper, Kirkegaard noted:
The paper is submitted for review, not published. That’s why it does not appear on the front page of the website. Some newspapers have incorrectly stated that it was published. It was not. However, journal policy is that papers must publish all their data at the time of submission, which is what we did…The review team/editorial board is having internal discussions on the fate of this paper. The purpose of the journals was not to attract this much attention. They are meant as open science alternatives to journals like Intelligence (Elsevier’s closed access journal).
If the journal does not take the paper, we will probably publish it elsewhere. The paper itself should be fairly uncontroversial as none of the findings are new — in fact, they were explicitly chosen as calibration tests for the dataset.
The newspapers Kirkegaard mentions have, indeed, been reporting on the dataset, as it appears to violate a cardinal rule of research by publishing identifiable information about people without their consent. As Vox reported:
The data, collected from November 2014 to March 2015, includes user names, ages, gender, religion, and personality traits, as well as answers to the personal questions the site asks to help match potential mates. The users hail from a few dozen countries around the world.
The dataset’s release sparked widespread concern, with critics objecting to the use of data without users’ permission:
I'm dumbfounded. This is egregious; those so-called researchers needed some data ethics classes.
Just. Shameful. https://t.co/l0HNJz8Zl9
— Angela Bassa (@AngeBassa) May 11, 2016
Even Aarhus University disavowed the project:
3/3 Neither @KirkegaardEmil's research nor his methods are an expression for AU practices. We are on the case and will keep you updated.
— Aarhus Universitet (@AarhusUni) May 13, 2016
Kirkegaard reiterated to us that the university had nothing to do with the project:
This research was done in our spare time and was not related to our studies.
He added that he did not expect such intense criticism about the project:
We did not anticipate any strong reaction, no. We wanted to contribute a nice open dataset to science, we did not want to be famous for it.
When we asked him why he believed he didn’t need permission from users, he sent us a Q&A about the dataset, which says, in response to the question “Are the data public?”:
This depends on the definition used, but in our opinion yes. The profile information of many users can be freely seen from Google. This includes pictures, age, gender, sexual identity and the profile text. To see users’ answers to questions, however, one must have answered the same question. This means that one must be logged in with a user that has answered that question. OKCupid itself clearly states in their terms of service that the information may be public…Furthermore, when users answer a question, they get the option to answer the question privately…Most users do not choose to answer privately. We did not and could not scrape the private answers because they are not possible to see for others.
In response to the question “Why did you publish the usernames?,” the document says:
There were two reasons to do this. First, we forgot to scrape some of the information such as the profile text (a critical oversight!) and if one has the usernames, one can do this at a later point provided the user is still there. Second, the usernames themselves are an interesting topic of research. Usernames play a crucial part in a person’s presentation and so are not randomly chosen. One can thus research what predicts choice of username. For instance, do people who include “hot” in their username see themselves as more attractive? Many users use animals in their names, are people who chose the same animal more similar than people who don’t?
It is possible that the usernames will be removed in a future version of the dataset as one may argue that the two scientific goals above do not outweigh the privacy concern from the usernames being available.
Long-time readers of Retraction Watch will be familiar with claims related to the Digital Millennium Copyright Act. In 2013, ten of our posts about Anil Potti disappeared for two weeks after an organization in India claimed we had violated the DMCA and filed a takedown notice. Soon after, we joined Automattic, the parent company of our blog’s host WordPress, in a lawsuit against the person who filed the fraudulent DMCA takedown notice. You can read our entire complaint here. (We ended up withdrawing the suit one year later, when it became clear it would be impossible to pursue action against the defendant.)
Readers will also be familiar with the OSF — we’re partnering with them to create a database of retractions.