Jens Förster, the University of Amsterdam social psychologist accused of misconduct, has posted an open letter on his lab’s website in which he denies wrongdoing.
The letter, in English and dated May 11, offers a detailed rebuttal to the investigation’s conclusions. It also offers a rationale for Förster’s decision not to post his data on the Internet. And it’s followed by a briefer letter from Nira Liberman, who identifies herself as a collaborator of Förster’s.
We present the letter in full below:
Dear colleagues, some of you wonder how I am doing, and how I will address the current accusations. You can imagine that I have a lot of work to do, now. There are many letters to write, there are a number of emails, meetings, and phone calls. I also started the moving process. And there is my daily work.
I keep going because of the tremendous support that I experience. This is clearly overwhelming!
The publication of the LOWI report came unexpectedly, so forgive me that I needed some time to write this response. Another reason is that I still hesitate to share certain insights with the public, because I was asked to remain confidential about the investigation. It is hard for me to decide how far I can go to reveal certain reviews or results. This is especially difficult to me because the Netherlands is a foreign country to me and norms differ from my home country. In addition, this week, the official original complaint was posted to some chatrooms. Both papers raise questions, especially about my Förster et al. 2012 paper published in SPPS.
First and foremost let me repeat that I never manipulated data and I never motivated my co-workers to manipulate data. My co-author of the 2012 paper, Markus Denzler, has nothing to do with the data collection or the data analysis. I had invited him to join the publication because he was involved generally in the project.
The original accusation raises a few specific questions about my studies. These concerns are easy to alleviate. Let me now respond to the specific questions and explain the rules and procedures in my labs.
Origin of Studies and Lab-Organization During that Time
The series of experiments was run between 1999 and 2008 in Germany, most of them in Bremen, at Jacobs University; I no longer know the specific dates of single experiments. Many studies were run with a population of university students that is not restricted to psychology students. This is how we usually recruited participants. Sometimes, we also tested guests, students in the classrooms or business people that visited. This explains why the gender distribution deviates from the distribution of Amsterdam psychology students. This distribution closely resembles the one reported in my other papers. Note that I never wrote that the studies were conducted at the UvA; this was an unwarranted assumption by the complainant. Indeed, the SPSS files on the creativity experiments for the 2012 paper include the 390 German answers. This was also explicitly noted by the expert reviewer for the LOWI who re-analyzed the data.
During the 9 years I conducted the studies, I had approximately 150 co-workers (research assistants, interns, volunteers, students, PhDs, colleagues). Note that the LOWI interviewed two research assistants who worked with me at UvA; their reports, however, do not reflect the typical organization at, for example, Bremen, where I had a much larger lab with many more co-workers. However, former co-workers from Bremen invited by the former UvA commission basically confirmed the general procedure described here.
At times I had 15 research assistants and more people (students, interns, volunteers, PhDs, etc.) who would conduct experimental batteries for me. They (those could be different people) entered the data when it was paper and pencil questionnaire data and they would organize computer data into workable summary files (one line per subject, one column per variable). For me to have a better overview of the effects in numerous studies, some would also prepare summary files for me in which multiple experiments would be included. The data files I gave to the LOWI reflect this: To give an example for the SPPS (2012) paper, I had two data files, one including the five experiments that included atypicality ratings as the dependent variable, and one including the seven experiments that included the creativity/analytic tasks. Coworkers analyzed the data, and reported whether the individual studies seemed overall good enough for publication or not. If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.
Note that the acknowledgment sections in the papers are far from complete; this has to do with space limitations and with the fact that, during the long time of running the studies, some names unfortunately got lost. Sometimes I also thanked research assistants who worked with me on similar studies around the time I wrote a paper.
Amount of Studies
The organization of my lab also explains the relatively large number of studies: 120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments (some of them obviously very short, others longer). This gives you 120 X 15 = 1800 participants. If you only need 60 participants this doubles the number of studies. We had 12 computer stations in Bremen, we used to test participants in parallel. We also had many rooms, such as classrooms or lecture halls that could be used for doing paper and pencil studies or studies with laptops. If you organize your lab efficiently, you would need 2-3 weeks to complete this “experimental battery”. We did approximately 30 of such batteries during my time in Bremen and did many more other studies. Sometimes, people were recruited from campus, but most of them were recruited from the larger Bremen area, and sometimes we paid their travel from the city center, because this involved at least half an hour of travel. Sometimes we also had volunteers who helped us without receiving any payment.
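As a rough check on the arithmetic above, the following Python sketch works through the numbers. It assumes that every invited participant completes every experiment in a session, and that a study needing 60 participants uses half of a session's pool. The function name and parameters are illustrative and do not come from the letter.

```python
def battery_capacity(
    participants: int, experiments_per_session: int, n_per_study: int
) -> tuple[int, int]:
    """Return (participant-by-experiment observations, studies those observations can fill)."""
    if min(participants, experiments_per_session, n_per_study) <= 0:
        raise ValueError("all arguments must be positive")
    # Each participant contributes one observation per experiment in the session.
    observations = participants * experiments_per_session
    # A study "uses up" n_per_study observations.
    return observations, observations // n_per_study


print(battery_capacity(120, 15, 120))  # (1800, 15)
print(battery_capacity(120, 15, 60))   # (1800, 30)
```

Under these assumptions, the figure of 1,800 appears to count participant-by-experiment observations rather than distinct people, and halving the sample size per study doubles the number of studies from 15 to 30 per battery.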
None of the Participants Raised Suspicions and Outliers
The complainant also presumes that the participants are psychology students, typically trained in psychological research methods who are often quite experienced as research participants. He finds it unlikely that none of the participants in my studies raised suspicions about the study. Indeed, at the University of Amsterdam (UvA) undergraduates oftentimes know a lot about psychology experiments and some of them might even know or guess some of the hypotheses. However, as noted before, the participants in the studies in question were neither from UvA nor were they entirely psychology students. Furthermore, the purpose of my studies and the underlying hypotheses are oftentimes difficult to detect. For example, a participant who eats granola and is asked to attend to its ingredients is highly unlikely to think that attending to the ingredients made him or her less creative. Note also that the manipulation is done between participants: other participants, in another group eat the granola while attending to its overall gestalt. Participants do not know and do not have any way to know about the other group: they do not know that the variable that is being manipulated is whether the processing of the granola is local versus global. In those circumstances it is impossible to guess the purpose of the study. Moreover, a common practice in social psychological priming studies is to use “cover stories” about the experiments, which present the manipulation and the dependent measure as two unrelated experiments. We usually tell participants that for economic reasons, we test many different hypotheses for many different researchers and labs in our one to three hour lasting experimental sessions. Each part of a study is introduced as independent from the other parts or the other studies. Cover stories are made especially believable by the fact that most of the studies and experimental sessions indeed contain many unrelated experiments that we lump together. 
And in fact, many tasks do not look similar to each other. All this explains, I think, why participants in my studies do not guess the hypothesis. That being said, it is possible that the research assistants who actually run the studies and interview the participants for suspicion, do not count as “suspicion” if a participant voices an irrelevant idea about the nature of the study. For example, it is possible that if a participant says “I think that the study tested gender differences in perception of music” it would be counted as “no suspicion raised” – because this hypothesis would not have led to a systematic bias or artifact in our data.
Similarly, the complainant wonders how it is that the studies did not have any dropouts. Indeed, I did not drop any outliers in any of the studies reported in the paper. What does happen in my lab, as in any lab, is that some participants fail to complete the experiment (e.g., because of computer failure, personal problems, etc.). The partial data of these people is, of course, useless. Typically, I instruct RAs to fill up the conditions to compensate for such data loss. For example, if I aim at 20 participants per condition, I will make sure that these will be 20 full-record participants. I do not report the number of participants who failed to complete the study, not only because of journals’ space limitations, but also because I do not find this information informative: when you exclude extreme cases, for example, it could be informative to write what the results would look like had they not been excluded. But you simply have nothing to say about incomplete data.
Size of Effects
The complainant wonders about the size of the effects. First let me note that I generally prefer to examine effects that are strong and that can easily be replicated in my lab as well as in other labs. There are many effects in psychology that are interesting but weak (because they can be influenced by many intervening variables, are culturally dependent, etc.) – I personally do not like to study effects that replicate only every now and then. So, I focus on those effects that are naturally stable and thus can be further examined.
Second, I do think that theoretically, these effects should be strong. In studying global/local processing, I thought I was investigating basic effects that are less affected by moderating variables. It is a common wisdom in psychology that perceptual processes are less influenced by external variables than, for example, achievement motivation or group and communication processes. All over the world people can look at the big picture or at the details. It is what we call a basic distinction. Perception is always the beginning of more complex psychological processes. We perceive first, and then we think, feel, or act. Moreover, I found the global/local processing distinction exciting because it can be tested with classic choice or reaction time paradigms and because it is related to the neurological processes. I expected the effects to be big, because no complex preconditions have to be met (in contrast to other effects, that occur, for example, only in people that have certain personality traits). Finally, I assume that local (or global) processing styles are needed for analytic (or creative) processing- without them there is no creativity or analytic thought. If I trigger the appropriate processing style versus the antagonistic processing style, then relatively large effects should be expected. Note also, that the same effect can be obtained by different routes, or processes that could be potentially provoked by the experimental manipulation. My favorite one is that there are global versus local systems that are directly related to creativity. However, others suggested that a global processing style triggers more intuitive processing – a factor that is known to increase creativity in its own right. Yet others suggested that global processing leads to more fluid processing, yet a third factor that could produce our effects. Thus, the same manipulation of global (vs. local) processing could in principle trigger at least three processes that may produce the same effect in concert. From this perspective too, I believe that one would expect rather big effects.
Moreover, the sheer replicability of the effects further increased my confidence. I thought that the relatively large number of studies secures against the possibility of artifacts. My confidence explains why I did not question the results nor did I suspect the data. Of course I do thorough checks, but I could not see anything suspicious in the data or the results. Moreover, a large number of studies conducted in other labs found similar effects. The effects seem to (conceptually) replicate in other labs as well.
Dependent Measure of Analytic Task in the 2012 SPPS Paper
The complainant further wonders why performances on analytic tasks in general were so poor for undergraduates and are below chance level. The author probably assumes that because the task is given in a multiple-choice format with five alternatives, there is a 0.2 probability to answer each single question by chance. However, in our experiment, participants had only 4 minutes to do the task. If a participant was stuck on the first question, did not solve it correctly, and did not even attempt questions 2-4 (which happened a lot), then we consider all 4 responses as incorrect, and the participant receives a score of 0. In other words, participants were not forced to just circle an answer for every question, but rather could leave questions unanswered, which we counted as “not solving it” and thus “incorrect”. I think that there is no meaningful way to compute the chance level of answering the question in these studies.
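To illustrate the point about chance level, here is a minimal Python sketch. It assumes four questions with five alternatives each (the figures mentioned above), that every attempted question is answered by a blind guess, and that unattempted questions score zero. The function name is hypothetical and the scoring rule is a simplification of the one described in the letter.

```python
def expected_guess_score(attempted: int, alternatives: int = 5) -> float:
    """Expected number of correct answers when each attempted item is a blind guess.

    Unattempted items are scored as incorrect, so they add nothing.
    """
    if attempted < 0 or alternatives < 1:
        raise ValueError("attempted must be >= 0 and alternatives >= 1")
    return attempted / alternatives


# Expected score for a participant who attempts 0, 1, ..., 4 of the four questions.
for attempted in range(5):
    print(attempted, expected_guess_score(attempted))
```

Under these assumptions the expected score from guessing ranges from 0 (nothing attempted) to 0.8 (all four attempted), so a single "chance level" is only defined once the number of attempted questions is known.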
The LOWI found the statistical analyses by the experts convincing. However, note that after almost 2 years of meticulous investigation, they did not find any concrete or behavioral evidence for data manipulation. The LOWI expert who did the relevant analysis always qualifies his methods, even though he is concerned about odd regularities, too. However, after having described his analysis, he concludes:
“Het is natuurlijk mogelijk dat metingen het waargenomen patroon vertonen.”
--> It is of course possible that the observed pattern was obtained by measurements.
This reviewer simply expresses an opinion that I kept repeating from my first letter to the UvA-commission on: Statistical methods are not error free. The choice of methods determines the results. One statistician wrote to me: “Lottery winners are no fraudsters, even though the likelihood is 1: 14 Millions to win the lottery.”
Even though I understand from the net that many agree with the analyses, I also received emails from statisticians and colleagues criticizing the fact that such analyses are the major basis for this negative judgment.
I even received more concrete advice suggesting that the methods the complainant used are problematic.
To give some examples, international colleagues wonder about the following:
1) They wonder whether the complainant selected the studies he compared my studies with in a way that would help the low likelihoods to come out.
2) They wonder whether the chosen comparison studies are really comparable with my studies. My answer is “no”. I do think that the complainant is comparing “apples with oranges”. This concern has been raised by many in personal emails to me. It concerns a general criticism with a method that made sense a couple of years ago; now many people consider the choice of comparison studies problematic.
3) They are concerned about hypothesis derivation. There are thousands of hypotheses in the world, why did the complainant pick the linearity hypothesis?
4) They complain that no justification whatsoever was provided for the methods used in the analyses, and that alternatives are not discussed (as one would expect from any scientific paper). They also wonder whether the data met the typical requirements for the analyses used.
5) They mentioned that the suspicion is repeatedly raised based on unsupported assumptions: data are simply considered “not characteristic for psychological experiments” without any further justification.
6) They find the likelihood of 1:trillion simply rhetorical.
7) Last but not least, in the expert reviews, only some questionable research practices (QRPs) were examined. Some people wondered whether this list is exhaustive and whether “milder” practices than fraud could have led to the results. Note, however, that I never used QRPs; if they were used, I unfortunately have to assume that co-workers in the experiments used them.
Given that there exist deviating opinions, and that many experts raise concerns, I am concerned that the analyses conducted on my paper need to be examined in more detail before I would retract the 2012 paper. I just do not want to jump to conclusions now. I am even more concerned that this statistical analysis was the main basis to question my academic integrity.
Can I Exclude Any Conceivable Possibility of Data Manipulation?
Let me cite the LOWI reviewer:
“Ik benadruk dat uit de datafiles op geen enkele manier is af te leiden, dat de bovenstaande bewerkingen daadwerkelijk zijn uitgevoerd. Evenmin kan gezegd worden wanneer en door wie deze bewerkingen zouden zijn uitgevoerd.”
--> I emphasize that from the data files one can in no way infer that the above adjustments have actually been done. Nor can it be said when and by whom such adjustments would have been done.
Moreover, asked whether there is behavioral evidence for fraud in the data, the LOWI expert answers:
“Het is onmogelijk, deze vraag met zekerheid te beantwoorden. De data files geven hiertoe geen nieuwe informatie.”
--> It is not possible to answer this question with certainty. The data files do not give new information on this issue.
Let me repeat that I never manipulated data. However, I can also not exclude the possibility that the data has been manipulated by someone involved in the data collection or data processing.
I still doubt it and hesitated to elaborate on this possibility because I found it unfair to blame somebody, if even in this non-specific way. However, since I have not manipulated data, I must say that in principle it could have been done by someone else. Note that I taught my assistants all the standards of properly conducting studies and fully reporting them. I always emphasized that the assistants are not responsible for the results, but only for conducting the study properly, and that I would never accept any “questionable research practices”. However, theoretically, it is possible that somebody worked on the data. It is possible that for example some research assistants want to please their advisors or want to get their approval by providing “good” results; maybe I underestimated such effects. For this project, it was obvious that ideally, the results would show two significant effects (global > control; control > local), so that both experimental groups would differ from the control group. Maybe somebody adjusted data so that they would better fit this hypothesis.
The LOWI expert was informative with respect to the question how this could have been done. S/he said that it is easy to adjust the data, by simply lowering the variance in the control groups (deleting extreme values) or by replacing values in the experimental groups with more extreme values. Both procedures would perhaps bring the data closer to linearity and are easy to do. One may speculate that for example, a co worker might have run more subjects than I requested in each condition and replaced or deleted “deviant” participants. To suggest another possibility, maybe somebody reran control groups or picked control groups out of a pool of control groups that had low variance. Of course this is all speculation and there might be other possibilities that I cannot even imagine or cannot see from this distance. Obviously, I would have never tolerated any behavior such as this, but it is possible that something has been done with the goal in mind of having significant comparisons to the control group, thereby inadvertently arriving at linear patterns.
Theoretically, such manipulation could have affected a series of studies, since, as I described above, we put different studies into summary files in order to see differences, to decide what studies we would need to run next or which procedural adjustments (including different control variables etc.) we would have to make for follow ups. Again, I repeat that this is all speculation, I simply try to imagine how something could have happened to the data, given the lab structure back then.
During the time of investigation I tried to figure out who could have done something inappropriate. However, I had to accept that there is no chance to trace this back; after all, the studies were run more than 7 years ago and I am not even entirely sure when, and I worked with too many people. I also do not want to point to people just because they are for some reason more memorable than others.
Responsibility for Detecting Odd Patterns in my Data
Finally, one point of accusation is:
“3. Though it cannot be established by whom and in what way data have been manipulated, the Executive Board adopts the findings of the LOWI that the authors, and specifically the lead author of the article, can be held responsible. He could or should have known that the results (`samenhangen`) presented in the 2012 paper had been adjusted by a human hand.”
I did not see the unlikely patterns, otherwise I would have not sent these studies to the journals. Why would I take such risk? I thought that they are unproblematic and reflect actual measurements.
Furthermore, in her open letter, Prof. Dr. Nira Liberman (see on this page #2) says explicitly how difficult it is to see the unlikely patterns. I gave her the paper without telling her what might be wrong with it and asked her to find a mistake or an irregularity. She did not find anything. Moreover, the reviewers, the editor and many readers of the paper did not notice the pattern. The expert review also says on this issue:
Het kwantificeren van de mate waarin de getallen in de eerste rij van Tabel A te klein zijn, vereist een meer dan standaard kennis van statistische methoden, zoals aanwezig bij X, maar niet te verwachten bij niet- specialisten in de statistiek.
--> Quantifying the degree to which the numbers in the first row of Table A are too small requires a more than standard knowledge of statistical methods, a knowledge that X has, but that one cannot expect in non-specialists in statistics.
I can only repeat: I did not see anything odd in the pattern.
This is a very lengthy letter and I hope it clarifies how I did the study, and why I believe in the data. Statisticians asked me to send them the data and they will further test whether the analyses used by the expert reviewer and by the complainant are correct. I am also willing to discuss my studies within a scientific setting. Please understand that I cannot visit all chatrooms that currently discuss my research. It would also be simply too much to respond to all questions there and to correct all the mistakes. Many people (also in the press) confuse LOWI reports or even combine several ones; and some postings are simply too personal.
This is also the reason why I will not post the data on the net. I thought about it, but my current experience with “the net” prevents me from doing this. I will share the data with scientists who want to have a look at it and who are willing to share their results with me. But I will not leave it to an anonymous crowd that can post whatever it wants, including incorrect conclusions and insults.
I would like to apologize to everyone that I caused so much trouble with my publication. I hope that in the end we can only learn from this. I definitely learned my lesson and will help to work on new rules and standards that make our discipline better. I would like to go back to work.
Regards, Jens Förster
Here’s Liberman’s letter:
Let me first identify myself as a friend and a collaborator of Jens Förster. If I understand correctly, in addition to the irregular pattern of data, three points played a major role in the national committee’s conclusion against Jens: That he could not provide the raw data, that he claimed that the studies were actually run in Germany a number of years before submission of the papers, and that he did not see the irregular pattern in his results. I think that it would be informative to conduct a survey among researchers on these points before concluding that Jens’ conduct in these regards is indicative of fraud. (In a similar way, it would be useful to survey other fields of science before concluding anything against social psychology or psychology in general.) Let me volunteer my responses to this survey.
Providing raw data
Can I provide the original paper questionnaires of my studies published in the last five years or the original files downloaded from the software that ran the studies (e.g., Qualtrics, Matlab, Direct-Rt) dated with the time they were run? No, I cannot. I asked colleagues around me, they can’t either. Those who think they can would often find out upon actually trying that this is not the case. (Just having huge piles of questionnaires does not mean that you can find things when you need them.) I am fairly certain that I can provide the data compiled into workable data files (e.g., Excel or SPSS data files). Typically, research assistants rather than primary investigators are responsible for downloading files from running stations and/or for coding questionnaires into workable data files. These are the files that Jens provided the investigating committees upon request. It is perhaps time to change the norm, and request that original data files/original questionnaires are saved along with a proof of date for possible future investigations, but this is not how the field has operated. Until a few years ago, researchers in the field cared about not losing information, but they did not necessarily prepare for a criminal investigation.
Publishing old data
Do I sometimes publish data that are a few years old? Yes, I often do. This happens for multiple reasons: because students come and go, and a project that was started by one student is continued by another student a few years later; because some studies do not make sense to me until more data cumulate and the picture becomes clearer; because I have a limited writing capacity and I do not get to write up the data that I have. I asked colleagues around me. This happens to them too.
The published results
Is it so obvious that something is wrong with the data in the three target papers for a person not familiar with the materials of the accusation? I am afraid it is not. That something was wrong never occurred to me before I was exposed to the argument on linearity. Excessive linearity is not something that anybody checks the data for.
Let me emphasize: I read the papers. I taught some of them in my classes. I re-read the three papers after Jens told me that they were the target of accusation (but before I read the details of the accusation), and after I read the “fraud detective” papers by Simonsohn (2013; “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone”), and I still could not see what was wrong. Yes, the effects were big. But this happens, and I could not see anything else.
The commission concluded that Jens should have seen the irregular patterns and thus can be held responsible for the publication of data that includes unlikely patterns. I do not think that anybody can be blamed for not seeing what was remarkable with these data before being exposed to the linearity argument and the analysis in the accusation. Moreover, it seems that the editor, the reviewers, and the many readers and researchers who followed up on this study also did not discover any problems with the results or, if they discovered them, did not regard them as problematic.
And a few more general thoughts: The studies are well cited and some of them have been replicated. The theory and the predictions it makes seem reasonable to me. From personal communication, I know that Jens is ready to take responsibility for re-running the studies and I hope that he gets a position that would allow him to do that. It will take time, but I believe that doing so is very important not only personally for Jens but also for the entire field of psychology. No person and no field are mistake proof. Mistakes are no crimes, however, and they need to be corrected. In my career, somehow anything that happens, good or bad, amounts to more work. So here is, it seems, another big pile of work waiting to be done.
Hat tip: Rolf Degen