Publicly available data on thousands of OKCupid users pulled over copyright claim

The Open Science Framework (OSF) has pulled a dataset of 70,000 users of the online dating site OkCupid over copyright concerns, according to the study author.

The release of the dataset raised concerns because it made personal information, including personality traits, publicly available.

Emil Kirkegaard, a master’s student at Aarhus University in Denmark, told us that the OSF removed the data from its site after OkCupid filed a claim under the Digital Millennium Copyright Act (DMCA), which requires the host of online content to remove it under certain conditions. Kirkegaard also submitted a paper based on this dataset to the journal he edits, Open Differential Psychology. But with the dataset no longer public, the fate of the paper is subject to “internal discussions,” he told us.

In place of the dataset on OSF, this message now appears:

Unavailable For Legal Reasons

This record has been suspended

Kirkegaard told us more about what happened:

Initially, the dataset was unpassworded. However OSF requested that we password-protect it. This was done to better enable the debate about the privacy status of such datasets.

Second, the repository is currently unavailable due to a DMCA claim sent by OKCupid. It’s unclear to me which part they claim copyright on. They have not contacted me. OSF is investigating the claim.

Brian Nosek, the executive director of the Center for Open Science, which maintains the OSF, confirmed the data were removed Friday:

We removed the user datafile on Friday following our internal review (the one with the potentially identifying data). OKCupid did file a DMCA claim on Friday evening, and to process that request, we shut down access to the entire project.  We are still processing that claim.

The COS had been looking into the dataset even before OkCupid responded, Nosek added:

OSF is like Youtube, Facebook, Instagram, and other places that users post content.  We don’t know in advance what people will post.  We respond to potential misuse with an investigative process.

When we learned on Wednesday that there was the possibility of users identifying information in that file, we initiated an investigation. By Wednesday evening, we had enough preliminary information to request that the person posting remove the user datafile or make it private. He agreed and converted the file to a password protected version on Thursday morning. We also removed access to prior versions of the file at that point. Then, we took Thursday to conduct a full review and determined that the file should be removed. We confirmed with the poster, and he agreed to have it removed. Later Friday, we received and started to address the DMCA request.

Because the data are no longer available, it’s not clear whether the journal will accept the paper, Kirkegaard noted:

The paper is submitted for review, not published. That’s why it does not appear on the front page of the website. Some newspapers have incorrectly stated that it was published. It was not. However, journal policy is that papers must publish all their data at the time of submission, which is what we did…The review team/editorial board is having internal discussions on the fate of this paper. The purpose of the journals was not to attract this much attention. They are meant as open science alternatives to journals like Intelligence (Elsevier’s closed access journal).

If the journal does not take the paper, we will probably publish it elsewhere. The paper itself should be fairly uncontroversial as none of the findings are new — in fact, they were explicitly chosen as calibration tests for the dataset.

The newspapers Kirkegaard mentions have, indeed, been reporting on the dataset, as it appears to violate a cardinal rule of research by publishing identifiable information about people without their consent. As Vox reported:

The data, collected from November 2014 to March 2015, includes user names, ages, gender, religion, and personality traits, as well as answers to the personal questions the site asks to help match potential mates. The users hail from a few dozen countries around the world.

The move sparked a large amount of concern, as critics objected to the use of data without users’ permission:

https://twitter.com/AngeBassa/status/730462575549386752

Even Aarhus University disavowed the project.

Kirkegaard reiterated to us that the university had nothing to do with the project:

This research was done in our spare time and was not related to our studies.

He added that he did not expect such intense criticism about the project:

We did not anticipate any strong reaction, no. We wanted to contribute a nice open dataset to science, we did not want to be famous for it.

When we asked him why he believed he didn’t need permission from users, he sent us a Q&A about the dataset, which says, in response to the question “Are the data public?:”

This depends on the definition used, but in our opinion yes. The profile information of many users can be freely seen from Google. This includes pictures, age, gender, sexual identity and the profile text. To see users’ answers to questions, however, one must have answered the same question. This means that one must be logged in with a user that has answered that question. OKCupid itself clearly states in their terms of service that the information may be public…Furthermore, when users answer a question, they get the option to answer the question privately…Most users do not choose to answer privately. We did not and could not scrape the private answers because they are not possible to see for others.

In response to the question “Why did you publish the usernames?,” the document says:

There were two reasons to do this. First, we forgot to scrape some of the information such as the profile text (a critical oversight!) and if one has the usernames, one can do this at a later point provided the user is still there. Second, the usernames themselves are an interesting topic of research. Usernames play a crucial part in a person’s presentation and so are not randomly chosen. One can thus research what predicts choice of username. For instance, do people who include “hot” in their username see themselves as more attractive? Many users use animals in their names, are people who chose the same animal more similar than people who don’t?

It is possible that the usernames will be removed in a future version of the dataset as one may argue that the two scientific goals above do not outweigh the privacy concern from the usernames being available.

Long-time readers of Retraction Watch will be familiar with claims related to the Digital Millennium Copyright Act. In 2013, ten of our posts about Anil Potti disappeared for two weeks, after an organization in India claimed we had violated the DMCA, and filed a takedown notice. Soon after, we joined Automattic, the parent company of our blog’s host WordPress, against the person who filed the fraudulent DMCA takedown notice. You can read our entire complaint here. (We ended up withdrawing the suit one year later, when it became clear it would be impossible to pursue action against the defendant.)

Readers will also be familiar with the OSF — we’re partnering with them to create a database of retractions.


15 thoughts on “Publicly available data on thousands of OKCupid users pulled over copyright claim”

  1. Although I fully agree that publishing the data is worrisome in light of ethical choices (usernames might be interesting, but hardly proportional to the violation of the “participant”), data is not subject to copyright and the DMCA takedown does not really make sense because of this. In the EU, one might even state that the database currently falls under the database directive and is the copyright of the creators. I wonder how the DMCA claim was formulated, because it might just be an invalid takedown request depending on the formulation.

    Note: the database directive is likely to be removed from law in the upcoming copyright reforms later this year.

    1. By scraping the data using profiles that did not belong to the authors, they essentially were able to recreate a subset of the OKCupid database. While data are not themselves copyrightable, databases are, at least in US law, which would be relevant to OKCupid and the OSF. Essentially, if Kirkegaard’s dataset in some way recapitulates the creativity or selectivity OKCupid used to compile the data (such as, just for speculation, the specific questions that were chosen to be scraped), then the company would have grounds to use a DMCA claim against anyone hosting that data. If they want to go after Kirkegaard individually, they will likely have to find a similar statute in Danish law or prove that in publishing the dataset he violated the privacy rights of Danish citizens, which, I’m told by people who have more experience with academia in Denmark, can be punished quite severely.

  2. An important question is whether their use breached the terms and conditions they would have needed to agree to in order to access the site. Judging from the OkCupid website terms, where the requirement is that data is to be used only for personal use, I expect so.

  3. Did the study protocol go through any ethical review? Or was evidence of ethical review required by the journal?

    The authors are quoted here as saying “This research was done in our spare time and was not related to our studies.” Does this mean they believed it didn’t need any oversight? What is the university’s role when they appear to be claiming a university affiliation?

    1. A few notes from the following link (https://ironholds.org/blog/when-science-goes-bad-consent-data-and-doubling-down-on-the-internet/):
      – The author/creator of the dataset is the editor of the journal that it is published in and has authored half of the papers.

      – The data was created to test a number of hypotheses of dubious scientific validity, mostly pseudo-scientific racism, homophobia and similar content.

      – The data was collected without any effort to provide anonymity and without cooperation from OkCupid, and when it was suggested that the data be made less identifying, critics were dismissed as a “social justice warrior conspiracy”.

      In general this dataset would violate most of the IRB rules presented to social science researchers and the research it is for has no value to the public.

      1. Thanks. I looked at the journal’s website, and it seems there is absolutely no requirement for papers to contain statements about ethical review or even compliance with declarations on ethical conduct of human subjects’ research. The “journal” is therefore well outside the generally accepted standards for ethical conduct in research involving human subjects.

        Interesting too is the response on Twitter from @AarhusUni. They’ve attempted to distance themselves, saying “Notice: @KirkegaardEmil points out himself that he is working on a private initiative and not associated with AU”
        https://twitter.com/AarhusUni/status/731089547648438272

        Yet Kirkegaard lists his AU affiliation in his role as the “journal” editor: http://openpsych.net/ODP/editorial-board/
        I’d be very surprised if he didn’t also use AU resources to conduct the study.

  4. It’s peculiar to see Kirkegaard discussing his journal submission process in this distant way, given that (as I understand it) he’s the editor in chief of that particular journal and also an author on about half its papers.

  5. I don’t believe that this was legal, at least not under the European law that I know. I research in the Netherlands, and the Dutch laws recognize Good Clinical Practice (GCP) standards. This entails that anyone must give informed consent when you *collect* sensitive data, regardless of whether you are doing it anonymously or not. Moreover, the data collected were fairly sensitive, which makes me think that such an undertaking would require approval from some sort of ethics committee.

    None of the people whose data were pulled consented to it, and judging by the university’s statements, nor were the data collected in the context of some university-wide blanket approval. So, if the situation in Denmark is comparable to the Netherlands (and I’d think it is, as both countries acknowledge GCP in their legislation), I would expect that this study was undertaken illegally.

    But then again, I’m not working in Denmark so I might be wrong. Can someone explain to me how the situation in Denmark is in this matter?
