
Scrolling through an online image dataset, Adrian Barnett, a statistician at the Queensland University of Technology in Australia, pointed out a few familiar faces. Sylvester Stallone as Rambo, and then again on the red carpet. “This is just ridiculous,” Barnett said. George Clooney, Angelina Jolie and Daniel Craig all appear more than once, often with the same image. “You can see,” Barnett said, “this is just a comically bad dataset.”
This particular dataset, collected in a folder titled “droopy” and hosted on an open-source repository called Kaggle, underpins a paper published in Scientific Reports – not as a find-the-celebrity game, but as a training set for a predictive clinical model for early detection of strokes.
The paper is the most recent example of a much wider problem that Barnett and his Ph.D. student Alexander Gibson have documented with Kaggle, which is owned by Google and hosts user-uploaded datasets that researchers and machine learning practitioners can use to build predictive models. By examining two other Kaggle datasets on stroke and diabetes, both of which included tabular patient data, Gibson and Barnett traced how the data move through the scientific literature and, in some cases, into clinical use. Their work, described in a preprint posted to medRxiv in February, has already led to several retractions of papers that used these dubious datasets.






