It’s been a busy few months for Brian Wansink, a prominent food researcher at Cornell University. A blog post he wrote in November prompted a huge backlash from readers who accused him of using problematic research methods to produce questionable data, and a group of researchers suggested four of his papers contained 150 inconsistencies. The scientist has since announced he’s asked a non-author to reanalyze the data — a researcher in his own lab. Meanwhile, criticisms continue to mount. We spoke with Wansink about the backlash, and how he hopes to answer his critics’ questions.
Retraction Watch: Why not engage someone outside your lab to revalidate the analysis of the four papers under question?
Brian Wansink: That’s a great question, and we thought a lot about that. In the end, we want to do this as quickly and accurately as possible – get the scripts written up, state the rationale (i.e., why we made particular choices in the original paper), and post it on a public website. Also, because this same researcher will also be deidentifying the data, it’s important to keep everything corralled together until all of this gets done.
But before we post the data and scripts, we also plan on getting some other statisticians to look at the papers and the scripts. These will most likely be stats profs who are at Cornell but not in my lab. We’ve already requested one addition to [the Institutional Review Board (IRB)], so that’s speeding ahead.
But even though someone in my lab is doing the analyses, like I said, we’re going to post the deidentified data, the analysis scripts (as in, how everyone is coded), tables, and log files. That way everyone knows exactly how it’s analyzed and they can rerun it on different stats programs, like SPSS or STATA or SAS, or whatever. It will be open to anyone. I’m also going to use this data for some stats analysis exercises in one of my courses. Yet another reason to get it up as fast as possible – before the course is over.
RW: A number of commenters have raised concerns about the general research approach you took in the four papers. As in – here’s a dataset, let’s try to get some papers out of it. How do you respond to accusations of p-hacking or HARKing?
BW: Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it. From what I understand, that’s one definition of p-hacking. Originally, we were testing a hypothesis – we thought the more expensive the pizza, the more you’d eat. And that was a null result.
But we set up this two-month study so that we could look at a whole bunch of totally unanswered empirical questions that we thought would be interesting for people who like to eat in restaurants. For example, if you’re eating a meal, what part influences how much you like the meal? The first part, the middle part, or the last part? We had no prior hypothesis to think anything would predominate. We didn’t know anybody who had looked at this in a restaurant, so it was a totally empirical question. We asked people to rate the first, middle, and last piece of pizza – for those who ate 3 or more pieces – and asked them to rate the quality of the entire meal. We plotted out the data to find out which piece was most linked to the rating of the overall meal, and saw ‘Oh, it looks like this happens.’ It was total empiricism. This is why we state the purpose of these papers is ‘to explore the answer to x.’ It’s not like testing Prospect Theory or a cognitive dissonance hypothesis. There’s no theoretical precedent, like the Journal of Pizza Quality Research. Not yet.
Field studies aren’t lab studies. They’re so darned involved that, in addition to the main thing you’re testing, we usually try to explore empirical answers to other questions that don’t yet have answers but that might arise in this real-world situation. Like, do guys eat more with women or with other guys? If there’s a provocative answer to one of these, it can be tested in more detail in the lab, if merited. For instance, it could be the first essay in a dissertation, and then it could be followed up with a couple of lab studies to confirm or disconfirm it. In this case, her dissertation went in a different direction once she got back to her own university. As a result, these ended up as single exploratory studies.
These sorts of studies are either first steps, or sometimes they’re real-world demonstrations of existing lab findings. They aren’t intended to be the first and last word about a social science issue. Social science isn’t definitive like chemistry. Like Jim Morrison said, “People are strange.” In a good way.
RW: Cornell has said they think it’s up to investigators to decide if they should release data or not, balancing that against the need for confidentiality. Do you agree with that?
BW: I do agree with that. I think having researcher independence is a good idea, in the spirit of academic freedom. Having said that, this experience is changing how we’re doing things in the lab. Prior to this, we had no mechanisms or conventions in place to easily locate previous datasets that were collected 7 or 9 years ago, let alone give them to somebody given the high standard for IRB confidentiality agreements we’ve been using for 10 years.
Going forward, we’re going to try to make the major datasets we collect – particularly field study data – publicly available around the time we publish a paper. Since some of this research in grocery stores or restaurants is often proprietary, in the past we have signed agreements saying we wouldn’t share sales data with anyone. But moving forward, I think we can loosen those up a bit. Just last week we modified the template agreement letters that subjects sign, so that we will protect their confidentiality but still ask them if we can share some aspects – like age, height, and weight – if they consent. In the past, we promised them we wouldn’t share anything about them. That’s pretty restrictive.
We’ve already changed this with our lab studies, and we’ll be doing something similar with the next field studies we run. I’m sure there’s going to be some learning with this, but I think it will also result in a useful and more general set of guidelines and protocols that other labs can use when they do field studies as part of their research.
RW: Critics have identified a number of numeric inconsistencies in your papers – just this week, an article in Medium pointed out problems in papers other than the four you’ve agreed to re-analyze. How do you respond to allegations that some of the numbers in your papers don’t add up?
BW: Studies in elementary school lunchrooms are different than running a reaction time study on a computer keyboard. Nobody starts a food fight or steals an apple during a reaction time study
In elementary school studies, there were times that the math of what food kids were given and what they ate vs. left behind didn’t add up to a perfectly round number, because this study was done in elementary schools (and based on the well-cited quarter-plate data collection method referenced in that paper). As anybody who can remember school lunches as an 8-year-old knows, some amount of the food ends up on the floor or in pockets. Also, to be conservative, when we state percentage increases or decreases, we usually try to calculate them from the average level of the range and not from the top or bottom.
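(To put that last convention in concrete terms, with made-up numbers rather than anything from the papers: an increase from 20 g to 30 g is 50% when measured from the bottom of the range, 33% when measured from the top, and 40% when measured from the 25 g midpoint, so the midpoint-based figure sits between the two extremes.)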
With regards to the four papers that were originally questioned, we haven’t gotten the final report from the non-coauthor econometrician, the one in our lab, but many of the inconsistencies noted through granularity testing will be due to people skipping survey questions. For instance, you might report that there are 30 people in one condition, but for any given question anywhere from 26-30 might answer it, so it would get flagged by a granularity test. These people were eating lunch and they could skip any question they wanted – maybe they were eating with their best buddy or girlfriend and didn’t want to be distracted. So that explains many of the differences in the base figures some readers had noticed.
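For readers unfamiliar with granularity testing, here is a rough sketch of the idea, in the spirit of the GRIM test: a mean of integer answers can only take certain values for a given sample size, so a reported mean that is impossible at the headline N gets flagged, while a smaller per-question N (because some diners skipped that question) can make the same mean arithmetically possible. The numbers and the helper function below are made up for illustration and are not taken from the papers under discussion.

```python
import math

def grim_consistent(reported_mean, n, decimals=2):
    """True if a mean reported to `decimals` places could arise from
    averaging integer-valued answers over `n` respondents."""
    target = round(reported_mean, decimals)
    # The true total must be an integer near reported_mean * n.
    candidates = (math.floor(reported_mean * n), math.ceil(reported_mean * n))
    return any(round(total / n, decimals) == target for total in candidates)

# Hypothetical example: a mean of 2.25 slices cannot come from 30 integer
# answers, but it can if only 28 of the 30 diners answered that question.
print(grim_consistent(2.25, n=30))  # False -> flagged by a granularity test
print(grim_consistent(2.25, n=28))  # True  -> consistent with a smaller N
```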
Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer marks in between. It was just a silly way to ask the question. That’s the one inconsistency we’ve identified so far, but fortunately, when the coauthors have since looked at these, the conclusions are about the same as when integers are used.
Across all sorts of studies, we’ve had really high replication of our findings by other groups and other studies. This is particularly true with field studies. One reason some of these findings are cited so much is because other researchers find the same types of results. When other people start finding the same things, that moves social science ahead. That’s why replication studies are useful. Still, even replication studies need some toeholds to get started. It’s kind of strange to think of some of your studies as being toeholds, but at least they’ve been useful toeholds.
I’m sympathetic to Brian Wansink’s position. I read in it a statement of how to conduct an exploratory study – namely, to state a broad hypothesis to guide wide-sweeping data collection, followed by a multi-dimensional exploration of the data, then a report of what was learned (and not), and finally a specification of a more precise set of hypotheses to guide future research on the topic. The APA Standards for Educational and Psychological Testing (2014) offers definitions and procedures for assessing the reliability and validity of measurement procedures that may be helpful going forward for Dr. Wansink and colleagues at Cornell University and for researchers who wish to reanalyze or replicate his studies. http://www.apa.org/science/programs/testing/standards.aspx
The Journal of Pizza Quality Research sounds like an excellent venue for salami publications.
Also, I doubt anyone would object to Prof Wansink’s exploratory studies if he actually never published them in the first place and instead spent his time coming up with tighter study designs that would properly test the hypotheses that purportedly came out of those explorations. But of course that would not lead to quick glory, daytime television interviews, and clickbait headlines.
I would be interested to know how Prof Wansink decides which of his exploratory results to present to politicians as policy advice, and which to ignore as soon as the paper is published. For instance, his report that “men eat more in the company of women” implies that over-consumption can be addressed by having gender-segregation in restaurants (or alternating male-only and female-only nights), but Wansink has been focusing on different advice. It’s almost as if he doesn’t take that paper seriously himself.
RW asked:
“As in – here’s a dataset, let’s try to get some papers out of it. How do you respond to accusations of p-hacking or HARKing?”
Wansink responded:
“Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it. From what I understand, that’s one definition of p-hacking.”
Yes it is, just not the one you were asked about.
Does he even understand the accusation? Could someone please show him the XKCD comic with the green jelly beans?
https://www.xkcd.com/882/
Something is unclear to me here.
Alison linked to a blog post by Jordan Anaya, which described some problems in an article that studied the consumption of vegetables in school lunchrooms. Dr. Wansink’s reply includes the following: “there were times that the math of what food kids were given and what they ate vs. left behind didn’t add up to a perfectly round number because this study was done in elementary schools (and based on the well-cited quarter plate data collection method referenced in that paper).” I think that it’s reasonable to conclude, from that reply and the link to Jordan’s blog, that the article in question here is “Attractive names sustain increased vegetable intake in schools” (10.1016/j.ypmed.2012.07.012).
The “Attractive names” article contains this sentence on p. 330, at the end of the Methods section: “Following lunch, the *weight* [emphasis added] of any remaining carrots was subtracted from their starting weight to determine the actual amount eaten.” So apparently the carrots were *weighed*, which strongly implies that the quarter-plate method was *not* used. I don’t understand why Dr. Wansink would claim that it was, or that the “Attractive names” article cites an article about the quarter-plate method (it doesn’t).
Note also that the subtraction of the final weight from the starting weight ought not to introduce any errors that could explain the discrepancy (noted in Jordan Anaya’s blog post) between the weight of carrots “taken” and the sum of the weights “eaten” and “uneaten”. The sentence I quoted above shows how the amount “eaten” was calculated: given the starting weight and the final weight, everything that is no longer on the plate is counted as “eaten,” regardless of whether it was actually eaten, dropped, or thrown.
There are several other things about the method of this study that make very little sense (e.g., why the results present numbers of carrots instead of their weight in grams, or how the initial weight — or even number — of carrots on the plates of 113 students was accurately determined by “surreptitious” observation), but this comment is probably dangerously close to loser length already.
I wrote a brief rebuttal that is in the same vein as Nick Brown’s comment:
https://medium.com/@OmnesRes/cornell-and-the-first-law-of-fooddynamics-cb2ed34d7e7f
I also don’t feel that the mathematical errors I brought up in my previous post were adequately addressed in this interview:
https://medium.com/@OmnesRes/cornells-alternative-statistics-a8de10e57ff
I’m not familiar with the particulars of this case, but it is important to note that not all exploratory studies constitute “p-hacking”. P-hacking involves deception – the researcher slices and dices the data in order to get a desired result. Then, and this is essential, the researcher deceives the public by covering up the nature of the analysis. He/she conceals that agenda and further hides the fact that a variety of null results were thrown out and that only significant results were presented.
However, there is nothing wrong with a researcher honestly engaging in and presenting an exploratory analysis without a single pre-defined hypothesis. As long as these studies are presented honestly, they can provide useful insights and generate useful hypotheses that can later be verified (or debunked) through attempts to replicate, often by other researchers in the field.
Scientific rigor is on a spectrum, running from case-only studies and case-control studies through pre-registered prospective studies to, finally, if possible, the exalted double-blinded randomized clinical trial. And don’t get me wrong, clinical trials are great, but it takes a while to get there. Much work, effort, and many published exploratory analyses are often needed before the necessary knowledge is obtained (or a drug is proposed) and a clinical trial can take place. Without such exploratory efforts, scientists would rarely get to the stage needed to conduct confirmatory clinical trials at all.
“Aimless” and others who deny this kind of thing is p-hacking,
If you do an exploratory study, you need to adjust the null hypothesis to correspond to your methods – in particular, to the fact that you are looking at the data many different ways. The significance tests these people do are useless, since the null hypothesis is being violated by the very study design. The only reason to do these “tests” and display the p-values is… p-hacking. There is zero possibility of a legitimate use.
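To make that point concrete, here is a toy simulation with made-up parameters (20 unplanned outcomes, 30 diners per group, pure noise throughout; nothing here is drawn from the actual papers). Even with no real effects at all, testing every outcome at p < .05 without any correction yields at least one “significant” result in roughly two-thirds of datasets, which is exactly the jelly-bean problem.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_datasets, n_outcomes, n_per_group = 1000, 20, 30

datasets_with_a_hit = 0
for _ in range(n_datasets):
    hits = 0
    for _ in range(n_outcomes):
        a = rng.normal(size=n_per_group)  # group 1: pure noise
        b = rng.normal(size=n_per_group)  # group 2: pure noise, same mean
        if ttest_ind(a, b).pvalue < 0.05:
            hits += 1                     # a spurious "finding"
    datasets_with_a_hit += (hits > 0)

# Expected to be close to 1 - 0.95**20, i.e. about 0.64
print(datasets_with_a_hit / n_datasets)
```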