Jens Förster, the German social psychologist accused of misconduct while working in the Netherlands, has posted an open letter on his lab’s website in which he denies wrongdoing.
The letter, in English and dated May 11, offers a detailed rebuttal to the investigation’s conclusions. It also explains Förster’s decision not to post his data on the Internet. And it’s followed by a briefer letter from Nira Liberman, who identifies herself as a collaborator of Förster’s.
We present the letter in full below:
Dear colleagues, some of you wonder how I am doing, and how I will address the current accusations. You can imagine that I have a lot of work to do now: there are many letters to write, and a number of emails, meetings, and phone calls. I have also started the moving process. And there is my daily work.
I keep going because of the tremendous support that I experience. This is clearly overwhelming!
The publication of the LOWI report came unexpectedly, so forgive me that I needed some time to write this response. Another reason is that I still hesitate to share certain insights with the public, because I was asked to keep the investigation confidential. It is hard for me to decide how far I can go to reveal certain reviews or results. This is especially difficult for me because the Netherlands is a foreign country to me and norms differ from my home country. In addition, this week the official original complaint was posted to some chatrooms. Both documents raise questions, especially about my Förster et al. 2012 paper published in SPPS.
First and foremost let me repeat that I never manipulated data and I never motivated my co-workers to manipulate data. My co-author of the 2012 paper, Markus Denzler, had nothing to do with the data collection or the data analysis. I had invited him to join the publication because he was involved generally in the project.
The original accusation raises a few specific questions about my studies. These concerns are easy to alleviate. Let me now respond to the specific questions and explain the rules and procedures in my labs.
Origin of the Studies and Lab Organization During That Time
The series of experiments were run 1999–2008 in Germany, most of them Bremen, at Jacobs University; I no longer know the specific dates of individual experiments. Many studies were run with a population of university students that was not restricted to psychology students. This is how we usually recruited participants. Sometimes we also tested guests, students in classrooms, or visiting business people. This explains why the gender distribution deviates from the distribution of Amsterdam psychology students; it closely resembles the one reported in my other papers. Note that I never wrote that the studies were conducted at the UvA; this was an unwarranted assumption by the complainant. Indeed, the SPSS files on the creativity experiments for the 2012 paper include the 390 German answers. This was also explicitly noted by the expert reviewer for the LOWI who re-analyzed the data.
During the 9 years I conducted the studies, I had approximately 150 co-workers (research assistants, interns, volunteers, students, PhDs, colleagues). Note that the LOWI interviewed two research assistants who worked with me at UvA; their reports, however, do not reflect the typical organization at, for example, Bremen, where I had a much larger lab with many more co-workers. However, former co-workers from Bremen invited by the former UvA commission basically confirmed the general procedure described here.
At times I had 15 research assistants and more people (students, interns, volunteers, PhDs, etc.) who would conduct experimental batteries for me. They (these could be different people) entered the data when it was paper and pencil questionnaire data, and they would organize computer data into workable summary files (one line per subject, one column per variable). For me to have a better overview of the effects in numerous studies, some would also prepare summary files for me in which multiple experiments were included. The data files I gave to the LOWI reflect this: to give an example for the SPPS (2012) paper, I had two data files, one including the five experiments that used atypicality ratings as the dependent variable, and one including the seven experiments that used the creativity/analytic tasks. Co-workers analyzed the data and reported whether the individual studies seemed, overall, good enough for publication. If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.
Note that the acknowledgment sections in the papers are far from complete; this has to do with space limitations and with the fact that, over the long period of running the studies, some names unfortunately got lost. Sometimes I also thanked research assistants who worked with me on similar studies around the time I wrote a paper.
Number of Studies
The organization of my lab also explains the relatively large number of studies: 120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments (some of them obviously very short, others longer). This gives you 120 X 15 = 1800 participants. If you only need 60 participants, this doubles the number of studies. We had 12 computer stations in Bremen and used to test participants in parallel. We also had many rooms, such as classrooms or lecture halls, that could be used for paper and pencil studies or studies with laptops. If you organize your lab efficiently, you need 2-3 weeks to complete such an “experimental battery”. We did approximately 30 such batteries during my time in Bremen and many more other studies. Sometimes people were recruited from campus, but most of them were recruited from the larger Bremen area, and sometimes we paid their travel from the city center, because this involved at least half an hour of travel. Sometimes we also had volunteers who helped us without receiving any payment.
Why None of the Participants Raised Suspicions, and Outliers
The complainant also presumes that the participants were psychology students, typically trained in psychological research methods and often quite experienced as research participants. He finds it unlikely that none of the participants in my studies raised suspicions about the study. Indeed, at the University of Amsterdam (UvA) undergraduates oftentimes know a lot about psychology experiments, and some of them might even know or guess some of the hypotheses. However, as noted before, the participants in the studies in question were neither from the UvA nor exclusively psychology students. Furthermore, the purpose of my studies and the underlying hypotheses are oftentimes difficult to detect. For example, a participant who eats granola and is asked to attend to its ingredients is highly unlikely to think that attending to the ingredients made him or her less creative. Note also that the manipulation is done between participants: other participants, in another group, eat the granola while attending to its overall gestalt. Participants do not know, and have no way to know, about the other group: they do not know that the variable being manipulated is whether the processing of the granola is local versus global. In those circumstances it is impossible to guess the purpose of the study. Moreover, a common practice in social psychological priming studies is to use “cover stories” about the experiments, which present the manipulation and the dependent measure as two unrelated experiments. We usually tell participants that, for economic reasons, we test many different hypotheses for many different researchers and labs in our one- to three-hour experimental sessions. Each part of a study is introduced as independent from the other parts or the other studies. Cover stories are made especially believable by the fact that most of the studies and experimental sessions indeed contain many unrelated experiments that we lump together. And in fact, many tasks do not look similar to each other. All this explains, I think, why participants in my studies do not guess the hypothesis. That being said, it is possible that the research assistants who actually run the studies and interview the participants for suspicion do not count it as “suspicion” if a participant voices an irrelevant idea about the nature of the study. For example, it is possible that if a participant says “I think that the study tested gender differences in perception of music”, it would be counted as “no suspicion raised” – because this hypothesis would not have led to a systematic bias or artifact in our data.
Similarly, the complainant wonders how come the studies did not have any dropouts. Indeed, I did not drop any outliers in any of the studies reported in the paper. What does happen in my lab, as in any lab, is that some participants fail to complete the experiment (e.g., because of computer failure, personal problems, etc.). The partial data of these people is, of course, useless. Typically, I instruct RAs to fill up the conditions to compensate for such data loss. For example, if I aimed at 20 participants per condition, I will make sure that these will be 20 full-record participants. I do not report the number of participants who failed to complete the study, not only because of journals’ space limitations, but also because I do not find this information informative: when you exclude extreme cases, for example, it could be informative to write what the results would look like had they not been excluded. But you simply have nothing to say about incomplete data.
Size of Effects
The complainant wonders about the size of the effects. First let me note that I generally prefer to examine effects that are strong and that can easily be replicated in my lab as well as in other labs. There are many effects in psychology that are interesting but weak (because they can be influenced by many intervening variables, are culturally dependent, etc.) – I personally do not like to study effects that replicate only every now and then. So, I focus on those effects that are naturally stable and thus can be further examined.
Second, I do think that theoretically, these effects should be strong. In studying global/local processing, I thought I was investigating basic effects that are less affected by moderating variables. It is common wisdom in psychology that perceptual processes are less influenced by external variables than, for example, achievement motivation or group and communication processes. All over the world, people can look at the big picture or at the details. It is what we call a basic distinction. Perception is always the beginning of more complex psychological processes: we perceive first, and then we think, feel, or act. Moreover, I found the global/local processing distinction exciting because it can be tested with classic choice or reaction-time paradigms and because it is related to neurological processes. I expected the effects to be big, because no complex preconditions have to be met (in contrast to other effects that occur, for example, only in people who have certain personality traits). Finally, I assume that local (or global) processing styles are needed for analytic (or creative) processing – without them there is no creativity or analytic thought. If I trigger the appropriate processing style versus the antagonistic processing style, then relatively large effects should be expected. Note also that the same effect can be obtained by different routes, or processes, that could potentially be provoked by the experimental manipulation. My favorite one is that there are global versus local systems that are directly related to creativity. However, others suggested that a global processing style triggers more intuitive processing – a factor that is known to increase creativity in its own right. Yet others suggested that global processing leads to more fluid processing, a third factor that could produce our effects. Thus, the same manipulation of global (vs. local) processing could in principle trigger at least three processes that may produce the same effect in concert. From this perspective too, I believe that one would expect rather big effects.
Moreover, the sheer replicability of the effects further increased my confidence. I thought that the relatively large number of studies secured against the possibility of artifacts. My confidence explains why I neither questioned the results nor suspected the data. Of course I did thorough checks, but I could not see anything suspicious in the data or the results. Moreover, a large number of studies conducted in other labs found similar effects; the effects seem to replicate (conceptually) in other labs as well.
Dependent Measure of Analytic Task in the 2012 SPPS Paper
The complainant further wonders why performance on the analytic tasks in general was so poor for undergraduates and below chance level. He probably assumes that because the task is given in a multiple-choice format with five alternatives, there is a 0.2 probability of answering each single question correctly by chance. However, in our experiment participants had only 4 minutes to do the task. If a participant was stuck on the first question, did not solve it correctly, and did not even attempt questions 2-4 (which happened a lot), then we considered all 4 responses incorrect and the participant received a score of 0. In other words, participants were not forced to just circle an answer for every question; they could leave questions unanswered, which we counted as “not solving it” and thus “incorrect”. I think that there is no meaningful way to compute the chance level of answering the questions in these studies.
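To make the arithmetic concrete, here is a minimal simulation sketch (not from the letter; the attempt rates are invented for illustration) showing how scoring unattempted items as incorrect pulls the expected guessing score well below the naive 0.2 × 4 = 0.8 baseline:

```python
# Illustrative only: a hypothetical 4-item, 5-alternative task with a time limit,
# where unattempted items are scored as incorrect (as described in the letter).
# The attempt probabilities below are made up for the sake of the example.
import random

def expected_score(n_sim=20_000, n_alternatives=5, attempt_probs=(1.0, 0.7, 0.4, 0.2)):
    """Average score when item i is only reached with probability attempt_probs[i]
    and every attempted item is answered by pure guessing."""
    total = 0
    for _ in range(n_sim):
        for p_attempt in attempt_probs:
            attempted = random.random() < p_attempt
            guessed_right = random.random() < 1 / n_alternatives
            total += int(attempted and guessed_right)
    return total / n_sim

print(expected_score())                          # ~0.46 correct out of 4
print(expected_score(attempt_probs=(1.0,) * 4))  # naive baseline: 0.2 * 4 = 0.8
```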
Statistical Analyses
The LOWI found the statistical analyses by the experts convincing. However, note that after almost 2 years of meticulous investigation, they did not find any concrete or behavioral evidence for data manipulation. The LOWI expert who did the relevant analysis always qualifies his methods, even though he is concerned about odd regularities, too. However, after having described his analysis, he concludes:
“Het is natuurlijk mogelijk dat metingen het waargenomen patroon vertonen.”
—->It is of course possible that the observed pattern was obtained by measurements.
This reviewer simply expresses an opinion that I have kept repeating since my first letter to the UvA commission: statistical methods are not error-free. The choice of methods determines the results. One statistician wrote to me: “Lottery winners are not fraudsters, even though the likelihood of winning the lottery is 1 in 14 million.”
Even though I understand from the net that many agree with the analyses, I have also received emails from statisticians and colleagues criticizing the fact that such analyses are the major basis for this negative judgment.
I even received more concrete advice suggesting that the methods the complainant used are problematic.
To give some examples, international colleagues wonder about the following:
1) They wonder whether the complainant selected the studies he compared my studies with in a way that would make the low likelihoods come out.
2) They wonder whether the chosen comparison studies are really comparable with my studies. My answer is “no”. I do think that the complainant is comparing “apples with oranges”. This concern has been raised by many in personal emails to me. It concerns a general criticism of a method that made sense a couple of years ago; now many people consider the choice of comparison studies problematic.
3) They are concerned about hypothesis derivation. There are thousands of hypotheses in the world, why did the complainant pick the linearity hypothesis?
4) They complain that no justification whatsoever was provided for the methods used in the analyses, and that alternatives are not discussed (as one would expect from any scientific paper). They also wonder whether the data met the typical requirements for the analyses used.
5) They mentioned that the suspicion is repeatedly raised based on unsupported assumptions: data are simply considered “not characteristic for psychological experiments” without any further justification.
6) They find the likelihood of 1:trillion simply rhetorical.
7) Last but not least, in the expert reviews only some QRPs were examined. Some people wondered whether this list is exhaustive and whether “milder” practices than fraud could have led to the results. Note, however, that I never used QRPs; if they were used, I unfortunately have to assume that co-workers in the experiments did so.
Given that deviating opinions exist and that many experts raise concerns, I think the analyses conducted on my paper need to be examined in more detail before I would retract the 2012 paper. I just do not want to jump to conclusions now. I am even more concerned that this statistical analysis was the main basis for questioning my academic integrity.
Can I Exclude Any Conceivable Possibility of Data Manipulation?
Let me cite the LOWI reviewer:
“Ik benadruk dat uit de datafiles op geen enkele manier is af te leiden, dat de bovenstaande bewerkingen daadwerkelijk zijn uitgevoerd. Evenmin kan gezegd worden wanneer en door wie deze bewerkingen zouden zijn uitgevoerd.”
—->I emphasize that from the data files one can in no way infer that the above adjustments were actually carried out. Nor can it be said when and by whom such adjustments would have been carried out.
Moreover, asked, whether there is behavioral evidence for fraud in the data, the LOWI expert answers:
“Het is onmogelijk, deze vraag met zekerheid te beantwoorden. De data files geven hiertoe geen nieuwe informatie.”
—->It is impossible to answer this question with certainty. The data files provide no new information on this issue.
Let me repeat that I never manipulated data. However, I also cannot exclude the possibility that the data were manipulated by someone involved in the data collection or data processing.
I still doubt it and hesitated to elaborate on this possibility because I found it unfair to blame somebody, even in this non-specific way. However, since I have not manipulated data, I must say that in principle it could have been done by someone else. Note that I taught my assistants all the standards of properly conducting studies and fully reporting them. I always emphasized that the assistants are not responsible for the results, but only for conducting the study properly, and that I would never accept any “questionable research practices”. However, theoretically it is possible that somebody worked on the data. It is possible that for example some research assistants want to please their advisors or want to get their approval by providing “good” results; maybe I underestimated such effects. For this project it was obvious that, ideally, the results would show two significant effects (global > control; control > local), so that both experimental groups would differ from the control group. Maybe somebody adjusted data so that they would better fit this hypothesis.
The LOWI expert was informative with respect to the question of how this could have been done. S/he said that it is easy to adjust the data, by simply lowering the variance in the control groups (deleting extreme values) or by replacing values in the experimental groups with more extreme values. Both procedures would perhaps bring the data closer to linearity and are easy to do. One may speculate, for example, that a co-worker ran more subjects than I requested in each condition and replaced or deleted “deviant” participants. To suggest another possibility, maybe somebody reran control groups or picked control groups out of a pool of control groups that had low variance. Of course this is all speculation, and there might be other possibilities that I cannot even imagine or cannot see from this distance. Obviously, I would never have tolerated any behavior like this, but it is possible that something was done with the goal of having significant comparisons to the control group, thereby inadvertently arriving at linear patterns.
Theoretically, such manipulation could have affected a series of studies, since, as I described above, we put different studies into summary files in order to see differences, to decide what studies we would need to run next, or which procedural adjustments (including different control variables, etc.) we would have to make for follow-ups. Again, I repeat that this is all speculation; I simply try to imagine how something could have happened to the data, given the lab structure back then.
During the time of the investigation I tried to figure out who could have done something inappropriate. However, I had to accept that there is no chance of tracing this back; after all, the studies were run more than 7 years ago, I am not even entirely sure when, and I worked with too many people. I also do not want to point to people just because they are for some reason more memorable than others.
Responsibility for Detecting Odd Patterns in my Data
Finally, one point of accusation is:
“3. Though it cannot be established by whom and in what way data have been manipulated, the Executive Board adopts the findings of the LOWI that the authors, and specifically the lead author of the article, can be held responsible. He could or should have known that the results (`samenhangen`) presented in the 2012 paper had been adjusted by a human hand.”
I did not see the unlikely patterns; otherwise I would not have sent these studies to the journals. Why would I take such a risk? I thought that they were unproblematic and reflected actual measurements.
Furthermore, in her open letter, Prof. Dr. Nira Liberman (see #2 on this page) says explicitly how difficult it is to see the unlikely patterns. I gave her the paper without telling her what might be wrong with it and asked her to find a mistake or an irregularity. She did not find anything. Moreover, the reviewers, the editor, and many readers of the paper did not notice the pattern. The expert review also says on this issue:
“Het kwantificeren van de mate waarin de getallen in de eerste rij van Tabel A te klein zijn, vereist een meer dan standaard kennis van statistische methoden, zoals aanwezig bij X, maar niet te verwachten bij niet-specialisten in de statistiek.”
—->Quantifying the degree to which the numbers in the first row of Table A are too small requires more than standard knowledge of statistical methods, knowledge that X has but that cannot be expected of non-specialists in statistics.
I can only repeat: I did not see anything odd in the pattern.
This is a very lengthy letter, and I hope it clarifies how I did the studies and why I believe in the data. Statisticians have asked me to send them the data, and they will further test whether the analyses used by the expert reviewer and by the complainant are correct. I am also willing to discuss my studies within a scientific setting. Please understand that I cannot visit all the chatrooms that currently discuss my research. It would simply be too much to respond to all questions there and to correct all the mistakes. Many people (also in the press) confuse LOWI reports or even combine several of them, and some postings are simply too personal.
This is also the reason why I will not post the data on the net. I thought about it, but my current experience with “the net” prevents me from doing this. I will share the data with scientists who want to have a look at it and who are willing to share their results with me. But I will not leave it to an anonymous crowd that can post whatever it wants, including incorrect conclusions and insults.
I would like to apologize to everyone for the trouble my publication has caused. I hope that in the end we can all learn from this. I definitely learned my lesson and will help to work on new rules and standards that make our discipline better. I would like to go back to work.
Regards, Jens Förster
Here’s Liberman’s letter:
Let me first identify myself as a friend and a collaborator of Jens Förster. If I understand correctly, in addition to the irregular pattern of data, three points played a major role in the national committee’s conclusion against Jens: that he could not provide the raw data, that he claimed that the studies were actually run in Germany a number of years before submission of the papers, and that he did not see the irregular pattern in his results. I think that it would be informative to conduct a survey among researchers on these points before concluding that Jens’ conduct in these regards is indicative of fraud. (In a similar way, it would be useful to survey other fields of science before concluding anything against social psychology or psychology in general.) Let me volunteer my responses to this survey.
Providing raw data
Can I provide the original paper questionnaires of my studies published in the last five years, or the original files downloaded from the software that ran the studies (e.g., Qualtrics, Matlab, DirectRT), dated with the time they were run? No, I cannot. I asked colleagues around me; they can’t either. Those who think they can would often find out upon actually trying that this is not the case. (Just having huge piles of questionnaires does not mean that you can find things when you need them.) I am fairly certain that I can provide the data compiled into workable data files (e.g., Excel or SPSS data files). Typically, research assistants rather than primary investigators are responsible for downloading files from running stations and/or for coding questionnaires into workable data files. These are the files that Jens provided to the investigating committees upon request. It is perhaps time to change the norm and request that original data files/original questionnaires be saved along with a proof of date for possible future investigations, but this is not how the field has operated. Until a few years ago, researchers in the field cared about not losing information, but they did not necessarily prepare for a criminal investigation.
Publishing old data
Do I sometimes publish data that are a few years old? Yes, I often do. This happens for multiple reasons: because students come and go, and a project that was started by one student is continued by another student a few years later; because some studies do not make sense to me until more data cumulate and the picture becomes clearer; because I have a limited writing capacity and I do not get to write up the data that I have. I asked colleagues around me. This happens to them too.
The published results
Is it so obvious that something is wrong with the data in the three target papers for a person not familiar with the materials of the accusation? I am afraid it is not. That something was wrong never occurred to me before I was exposed to the argument on linearity. Excessive linearity is not something that anybody checks the data for.
Let me emphasize: I read the papers. I taught some of them in my classes. I re-read the three papers after Jens told me that they were the target of accusation (but before I read the details of the accusation), and after I read the “fraud detective” papers by Simonsohn (2013; “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone”), and I still could not see what was wrong. Yes, the effects were big. But this happens, and I could not see anything else.
The commission concluded that Jens should have seen the irregular patterns and thus can be held responsible for the publication of data that includes unlikely patterns. I do not think that anybody can be blamed for not seeing what was remarkable with these data before being exposed to the linearity argument and the analysis in the accusation. Moreover, it seems that the editor, the reviewers, and the many readers and researchers who followed-up on this study also did not discover any problems with the results or if they discovered them, did not regard them as problematic.
And a few more general thoughts: The studies are well cited and some of them have been replicated. The theory and the predictions it makes seem reasonable to me. From personal communication, I know that Jens is ready to take responsibility for re-running the studies and I hope that he gets a position that would allow him to do that. It will take time, but I believe that doing so is very important not only personally for Jens but also for the entire field of psychology. No person and no field are mistake proof. Mistakes are no crimes, however, and they need to be corrected. In my career, somehow anything that happens, good or bad, amounts to more work. So here is, it seems, another big pile of work waiting to be done.
N. Liberman
Hat tip: Rolf Degen
From a statistical standpoint, this piece of writing is simply shocking. He states:
“Similarly, the complainant wonders how come the studies did not have any dropouts. Indeed, I did not drop any outliers in any of the studies reported in the paper. What does happen in my lab, as in any lab, is that some participants fail to complete the experiment (e.g., because of computer failure, personal problems, etc.). The partial data of these people is, of course, useless. Typically, I instruct RAs to fill up the conditions to compensate for such data loss. For example, if I aimed at 20 participants per condition, I will make sure that these will be 20 full-record participants. I do not report the number of participants who failed to complete the study, not only because of journals’ space limitations, but also because I do not find this information informative: when you exclude extreme cases, for example, it could be informative to write what the results would look like had they not been excluded. But you simply have nothing to say about incomplete data.”
There are 2 comments which I consider beyond comprehension in 2014.
1) “The partial data of these people is, of course, useless.” Totally false, ignorant, and statistically appallingly stupid.
2) “Typically, I instruct RAs to fill up the conditions to compensate for such data loss.” This utterly destroys the standard psychology fig-leaf of “random selection”.
Many scientists complain about the pervasive intrusion of statisticians in the various substantive fields. This single paragraph is the most powerful argument I have ever read for MANDATORY participation of statisticians, who are NOT beholden to the PI for their employment. Had I been the statistician with this fool, I would NEVER under any circumstances have allowed this garbage to go forward.
Wow, as of 10:50 AM CDST 5/12 I have 3 “Thumbs-down”. I wonder if some of those responders would care to enlighten us as to what aspect of my comment is “thumbs-down”-worthy. Probably the “mandatory” statistician thing.
I found the tone a bit inappropriate. “Shocking”, “beyond comprehension”, “Totally false, ignorant, and statistically appallingly stupid”, “utterly destroys the standard psychology fig-leaf of ‘random selection'”, “fool”, “garbage”. Even if the practices really warranted such strong language (and that can and should be debated), I doubt that we are going to convince psychologists by insulting them. Or is that a consideration too complex for a statistician to follow 😉
Well, I was shocked. Honestly. If this was an accurate report of his experimental practice, it’s appalling.
Consider an experiment. You have 10 data collection points for a person. You start a subject, and the subject finishes 9 of the 10, but not the 10th. What do you do?
Consider the next 18 subjects. You have 10 points. You get 14 of the 18 who complete 9 of the 10. So, at this point, you have run 19 subjects, and collected 4 complete sets of data.
If you are going to analyse those 4, and ignore the 15 incomplete cases, you are committing a fraud. This would never be allowed in a clinical trial. There, you must indicate how many cases start, how many finish, and why they dropped out.
Statistical Observer wrote: “you must indicate how many cases start, how many finish, and why they dropped out.”
How about the statements (see below) in some other papers with Jens Förster as one of the authors?
“Ten participants indicated a lack of familiarity with the TV shows used in the procedure (see below) and thus were excluded from the analyses.”
“Data from four participants were not recorded due to computer error and was thus excluded from the analyses.”
“Two participants were excluded from the analysis because a majority of their lexical decisions were incorrect.”
“One participant had to be excluded because he was not a native German speaker.”
Good practice? An accurate report with enough details (all of them also list how many participants started)?
IMO OK. Agreed?
Those generally would fall within acceptable practice, and would provide acceptable transparency. In the case of the “majority of the lexical decisions were incorrect”, I have often thought about such cases. In some ways excluding them is not appropriate. On the other hand, if the analysis involves “German speakers who were fully acquainted with the language” or something like that, it is appropriate.
Thanks for your reply. The student who was not a native German speaker was part of an experiment carried out in Bremen (Germany) and reported in http://www.sciencedirect.com/science/article/pii/S0022103108001716 During these experiments, the participants needed to listen to all kinds of stories in German. This paper even has a footnote with language issues:
“A complete list of words can be found in the supplementary online materials. Note that due to translation problems, the curse words may seem awkward or old fashioned. In the German version we made sure that the curse words were commonly used by this sample to express intense aggression.”
IMO very detailed and very accurate. Agreed?
The whole point is transparency, and these examples seem quite transparent.
However, I’m still quite unsettled and unsatisfied about the incomplete cases. I am very busy right now preparing for a conference, but as soon as that is done, I will look more into this issue. As I noted, I once was a psychologist. If the state of the field is that such cases are simply ignored, the field is in huge trouble.
Your use of the adjectives “ignorant” and “stupid” and of the nouns “fool” and “garbage” was the reason I awarded you a thumbs-down. If mandatory participation of statisticians in research would imply mandatory exposure to such language in academic discourse, I would prefer to find another job.
Joke time to lighten the mood: I was unaware Mr. T from the A-team held a degree in statistics.
On a more serious note: as far as I’ve been reading, RW is an open and in-depth platform for discussing the dark side of science. Let’s keep it that way. Throwing mud won’t advance our understanding of what happened with Jens Foerster’s work & what needs to be done to fix and prevent things like this.
Well, I’ve been an experimental psychologist for many years, and I’ve never heard of, nor have I ever seen published, an analysis of incomplete data of the sort mentioned in point #1. Furthermore, in the US, at least, if a subject dropped out of your study, you would generally not be permitted to use his or her data, for ethical reasons. I think there may be a confusion between epidemiological studies (or the like) and experimental studies in which incomplete data simply do not allow one to draw the within-subjects comparisons that the experimental logic requires. In some situations, failure to complete the study means one simply doesn’t have relevant data for that subject (e.g., time to complete a task, often used in my field).
No doubt studies should report drop-outs by condition (for between-subject designs), and if these differ between conditions, then one ought to be worried about what this means. That is not an unusual issue in the review process. However, I’m not at all sure that *not* replacing subjects leads to a better result than sticking with an unequal n situation even in this case.
You make two statements which are incorrect:
“Furthermore, in the US, at least, if a subject dropped out of your study, you would generally not be permitted to use his or her data, for ethical reasons.”
That is incorrect. When a person signs a consent to a study, the data up to the point of withdrawal is still under consent. Once consent is withdrawn, data AFTER that point is no longer permitted to be used. Of course, it is not since it is no longer collected.
” I’ve never heard of, nor have I ever seen published, an analysis of incomplete data of the sort mentioned in point #1″
Every clinical trial I have ever seen involves incomplete data. Every single one. Yet, they are all analyzed. For repeated effects.
“However, I’m not at all sure that *not* replacing subjects leads to a better result than sticking with an unequal n situation even in this case.”
There is NO problem in analyzing data with unequal ns. No statistician has believed that since 1986.
Really, your understanding of methods is pre-1986 or 1984. Several key papers published then describe clearly how to analyze data from incomplete studies. In particular, Jennrich and Schluchter (which uses a “personal” covariance matrix for each individual derived from the overall covariance matrix) and Liang and Zeger (which also describes the use of incomplete data) have led to a revision of analysis – a revision which occurred 30 years ago.
SAS, SPSS, R, Stata – all modern multi-level analysis packages allow the analysis of incomplete repeated-measures data. SAS: MIXED, GENMOD, GLIMMIX, NLMIXED – all allow such analyses. I am less familiar with R and SPSS. The methods in MIXED, GLIMMIX, NLMIXED follow Jennrich and Schluchter. The analysis in GENMOD is along the lines of Liang and Zeger.
As someone with the word Psychology on my PhD diploma, I am amazed at the continual lack of advancement in the area. Neither unequal ns nor missing data are bars to analysis in repeated measures. Of course, if most of the data is missing, that’s not good. But spotty missing data is not a problem. I just finished the analysis of a 2×3 (2 group, 3 visit) “split-plot” analysis. Probably 10 % missed the middle time point. That in no way removes them from the overall endpoint, Visit 3. Anyone who withdrew consent would not be included, of course.
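For readers who have not used these tools, here is a minimal sketch of the kind of analysis described above, written in Python with statsmodels rather than the SAS procedures named in the comment; the data, variable names, and missingness pattern are entirely hypothetical:

```python
# A generic sketch (hypothetical data, not from any study discussed here) of a
# repeated-measures analysis with missing observations, using a mixed model
# in the spirit of SAS PROC MIXED. Partially observed subjects are retained.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, n_visits = 60, 3
data = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_visits),
    "visit": np.tile(np.arange(n_visits), n_subj),
    "group": np.repeat(rng.integers(0, 2, n_subj), n_visits),
})
subject_effect = np.repeat(rng.normal(0, 1, n_subj), n_visits)
data["y"] = 0.5 * data["group"] + 0.3 * data["visit"] + subject_effect + rng.normal(0, 1, len(data))

# Suppose ~10% of the middle visit is missing; those subjects are NOT dropped.
missing = (data["visit"] == 1) & (rng.random(len(data)) < 0.10)
observed = data[~missing]

# Random intercept per subject; every subject with at least one visit contributes.
fit = smf.mixedlm("y ~ group + visit", data=observed, groups=observed["subject"]).fit()
print(fit.summary())
```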
In the UK at least a subject may withdraw their consent up to a reasonable time (up to 48 hours) after the experiment has completed and all data associated with that subject must be erased if consent is withdrawn.
Regarding the dropout issue, it’s worth noting that in most of Foerster’s previous lead-authored papers, he explicitly notes both at what university the data were acquired, and how many subjects were omitted from analysis (and why). For example, in Experiment 4 of Foerster, Friedman, & Liberman (2004), we’re told that “Three participants had to be excluded from the analyses, 2 because of experimenter error and 1 because he or she did not want to continue”. Most of the other Experiments have similar notes. These data were acquired at Bremen, presumably by Foerster himself (as the other authors work in other countries), so it doesn’t seem accurate for Foerster to suggest that this kind of thing is utterly uninformative and that he “simply has nothing to say about incomplete data”. Or perhaps he just had a minor methodological epiphany sometime between 2004 and 2012.
Indeed, a note of this kind is important. But excluding a person for incomplete data is not appropriate. If consent is withdrawn, at least in clinical trials, data up to that point can be used. It is certainly possible that other forms of consent exist. If the consent document read “If you wish to withdraw, all data from your participation will be excluded from analysis”, that is a different matter.
Ethics requirements vary.
Generally in the UK participants have the right to withdraw (including their data) up to a cut-off (often after analysis but before publication). On the other hand, one can understand why someone might be cautious about data from a participant who withdraws part-way through (as the default for many ethics committees is to assume that withdrawal from a study also means withdrawal of data). However, it is rare for participants to withdraw in that sense (once in my recollection for my own experiments – and that was before the testing phase).
If you have partial data it is easy enough to analyse using modern methods such as multilevel models (now standard in many fields but apparently not experimental social psychology).
On a practical note there isn’t really any major advantage to including partial data from one or two participants in a study as the loss of power and degree of bias would be minimal – but it is important to report how their data were treated. I have certainly used this information as a reviewer or editor to request further information or alternative analyses.
I find it problematic if Foerster’s earlier work includes details of exclusions etc. but later studies do not.
I am sure that ethics requirements vary. In a clinical trial, a participant consents and is included. The trial itself may last months or years. Thus, a participant may, after 3 years, be tired of the trial, experience life events that interfere, etc, and withdraw. That is very different from the situation in which an experimental subject agrees to participate, and does everything in one session of 25 minutes.
The point is that it is quite conceivable that an ethics committee might require, or be interpreted as requiring, that when a participant withdraws part-way, their partial data are withdrawn as well. This might or might not be a sensible interpretation (depending on the nature of the study and the participants). Regardless, the withdrawn data should be noted in the write-up.
I am surprised by such strong and – in my view – not very professional statements.
It might be that this is considered wrong in some fields and not in others, and one of course needs to debate the standards, etc. However, these dogmatic statements sound as if this were THE TRUTH, which would in itself be a violation of academic/scientific standards, which hold that science might be (and in my eyes is) the best way of pursuing the ultimate goal of finding the TRUTH, while knowing that this goal cannot be reached.
What about research from, let’s say, 30 years ago? Is it useless because it did not follow these standards?
What about many people who reviewed, edited, and read the papers?
I think the tone of the post distracts from its actual content.
The observer first quotes Jens and then says: “2) ‘Typically, I instruct RAs to fill up the conditions to compensate for such data loss.’ This utterly destroys the standard psychology fig-leaf of ‘random selection’.”
What does random selection have to do with this? The author admitted to the use of a convenience sample (as most of us do). The observer might mean “random assignment to conditions.”
To amplify my comments above:
In the area of clinical trials, where I spend my professional time, the reproducibility of the process is a key factor. If you cannot understand the experiment and cannot reproduce it, you should not be considering the conclusions. The sample selection, the inclusion/exclusion criteria, the manner in which participants are consented and managed during the study, the availability of other treatments external to the experimental manipulation, all are highly important in the understanding of the study.
Thus, roughly in 2000, an initiative created the CONSORT standards. These provide quite specific guidelines for the design, analysis, and reporting of clinical trials. In particular, very careful attention is paid to the manner in which subjects proceed through the trial. How many are screened? How many are randomized? Of those in Conditions 1, 2, etc., how many are retained until the primary measurement visit? What are the reasons for loss of subjects? How many adverse events are observed in each condition that can be attributed to study participation?
This letter defines a huge number of highly questionable practices in subject handling, data manipulation, and other key features. So much of science involves the MANAGEMENT of subjects and data, and in many cases (Potti, Baltimore), is THE key feature of the problem. “Filling up the experiment with RAs” – I almost gagged. What COULD he have been thinking? And to think that he actually wrote this out in a letter!! It’s a shocking UNFORCED admission of HIGHLY HIGHLY questionable science.
There are many places to examine the CONSORT effort.
http://www.consort-statement.org/
The CONSORT effort has led to similar efforts in many other areas.
I do not think JF meant that RAs were included in the study – he probably meant that he instructed his RAs to keep running participants until the cell size was as planned.
You may be right. That is one reading. However, I’ll stick with my comment about the incomplete data and dumping.
Still, if one condition were more complex, more exhausting, or otherwise caused more “missing values” than the other conditions, simply adding more and more participants until enough complete datasets exist could introduce a bias, as only the most tenacious participants would end up as valid data sources. Although this does not necessarily have to happen, reporting all dropouts should go without saying.
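A quick, purely illustrative simulation (the numbers and the dropout rule are invented, not taken from any of the studies under discussion) shows how this kind of bias can arise:

```python
# A purely hypothetical simulation of the concern above: dropout that depends on
# the outcome, combined with silently topping up cells with completers, biases
# the condition mean even though the true means are identical.
import numpy as np

rng = np.random.default_rng(0)

def observed_mean(true_mean=0.0, dropout_below=None, cell_size=20):
    """Sample until `cell_size` completers are obtained; scores below
    `dropout_below` 'drop out' and are silently replaced."""
    completers = []
    while len(completers) < cell_size:
        score = rng.normal(true_mean, 1.0)
        if dropout_below is None or score >= dropout_below:
            completers.append(score)
    return float(np.mean(completers))

easy = np.mean([observed_mean() for _ in range(2000)])                    # no dropout
hard = np.mean([observed_mean(dropout_below=-0.5) for _ in range(2000)])  # weak scorers quit
print(f"easy: {easy:.2f}, hard: {hard:.2f}")  # true means are both 0.0; 'hard' comes out ~0.5 too high
```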
@ Just a little correction & nickxdanger,
“Sixty students were recruited (31 women, 29 men; average age = 21.30 years) to participate in a battery of unrelated experiments that lasted approximately 2 hr and for which they received 20 euros. The one-factorial design with the factor priming (love vs. sex vs. control) was realized between participants. Note that each cell had 20 participants with a balanced gender distribution. The main dependent measures were a creative insight and an analytic task. Participants could not see each other and experimenter gender had no effects.”
Quote from http://www.ncbi.nlm.nih.gov/pubmed/19690153
I understand from this quote that participants were recruited to fill a particular cell size with 50% males and 50% females. Please correct me when I am wrong.
– Co-workers (pre-)analyzed the data for Förster, yet they all are apparently not co-authors on the papers (which they should be if that was the case), nor does he even remember their names? That does sound a bit unusual.
– I personally think that storing information about when/where a study was actually conducted is not too much to ask for from a scientist.
– 1:Trillions is not rhetorical, it’s mathematics.
– “120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments. […] This gives you 120 X 15 = 1800 participants”.
So, in order for this argument to make sense, wouldn’t multiple studies in this set have to involve some kind of global/local manipulation? If yes, doesn’t this raise a few other methodological concerns?
– “maybe somebody reran control groups or picked control groups out of a pool of control groups that had low variance”. The people who should have access to such a pool are probably not ordinary research assistants, and probably people Förster knows the names of.
– “Lottery winners are not fraudsters, even though the likelihood of winning the lottery is 1 in 14 million.”
Well, winning the most unlikely-to-win lottery three times in a row does raise some suspicion.
– Also, it is not just the linearity of the conditions, but also the similarity of the F-Values (as the datacolada blog points out) that is problematic.
In relation to propersurgeon above:
– Co-workers who, in addition to their part in data collection, do not participate in the writing of the paper should not be co-authors. So not making assistants co-authors would be fine; not acknowledging them is debatable.
– I find the 120×15=1800 confusing too. I must assume that reporting, say, 600 subjects cannot refer to only 120 subjects that were measured 5 times. However, you could enter 1800 people into 15 unrelated studies by having 15 sessions in each of which 120 subjects participate in all 15 studies in one go; then you could have “120 subjects per session” and “15 studies in 15 sessions” while at the same time achieving “1800 subjects per study”. Possibly.
In relation to nickxdanger previously:
– Not mentioning that certain subjects were excluded, or for what reason, is indeed bad practice. The primary reason being (AFAIK) that the cause for the exclusion could be correlated with the outcome of interest (for instance, if your research question is “do people object to filling out questionnaires?” then you will get a distorted view if you exclude all people who are not willing to participate in your questionnaire study; duh). I am merely a neuroimager, but I am well aware of this potential bias. So reasons for exclusion should at least be reported. Not doing so is naive for a psychology professor. Worse, even if it slipped through, it is naive to retrospectively remain of the opinion that this is not a QRP.
Another QRP that strikes me is “if the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it”. I know that this is a common practice, unfortunately, and I don’t expect Forster to be holier than the pope, but again it seems naive not to qualify that as a QRP even after the fact, when things have escalated. I find that this clinging to “I did not use QRPs”, while at the same time giving various examples of them, makes him much less believable.
Finally, none of those aspects relate to the most serious problem: the unlikeliness of the superlinearity, among others. No explanation is given for that, apart from the claim that we must trust this to be an example of that one-in-a-zillion occurrence that is not entirely impossible in principle, especially after a correction for the number of papers one could apply this analysis to and the number of suspect effects one could test for.
This makes Forster either regretless and naive, or a damn unlucky guy.
“120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments. […] This gives you 120 X 15 = 1800 participants”.
This means that the subjects were exposed to a new experiment on average every 8 minutes. Even if the experiments are totally unrelated, do they not carry over to some extent to the next, as fatigue if nothing else? We also have to remember that the participants were fully debriefed after each experiment, which also takes some time.
Social psychology is an interesting field.
Trillions is rhetorical because it’s based on mathematics which only provides approximations, and those approximations tend to be biased, and that bias “works” in a multiplicative way, i.e., additive bias on a logarithmic scale.
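A back-of-envelope sketch of that point, with invented bias factors, is below; the claim is only that multiplicative inflation adds up on a log scale, not that these particular factors apply here:

```python
# Back-of-envelope illustration (all factors invented): modest multiplicative
# biases in each approximation step compound, i.e., they add on a log scale.
import math

step_biases = [2.0, 3.0, 1.5, 4.0, 2.5]   # hypothetical per-step inflation factors
combined = math.prod(step_biases)          # = 90: nearly two orders of magnitude
print(combined, sum(math.log10(b) for b in step_biases))
```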
One quick comment:
” which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”
That is authorship.
Authorship requires a substantial contribution to the writing of the paper as well, and agreement with the final version of the paper. Involvement in study design and data collection is not enough. Sure, one could offer such contributors the chance to be involved in the writing, and acknowledgement would be appropriate in any case (if not objected to); but not authorship per se.
Your comments about authorship are absolutely false. APA rules are:
“Authorship credit should reflect the individual’s contribution to the study. An author is considered anyone involved with initial research design, data collection and analysis, manuscript drafting, and final approval. However, the following do not necessarily qualify for authorship: providing funding or resources, mentorship, or contributing research but not helping with the publication itself. The primary author assumes responsibility for the publication, making sure that the data is accurate, that all deserving authors have been credited, that all authors have given their approval to the final draft, and handles responses to inquiries after the manuscript is published.”
Logically, the ‘and’ between “drafting” and “final approval’ should be an “or”. If it weren’t, Förster couldn’t be an author on his own paper since he didn’t do the data collection.
I’m not sure about the APA, and the quoted placement of ANDs and ORs is ambiguous, but changing an AND to an OR is not necessary. “Manuscript drafting OR final approval” would be bad, because that means an author could disagree with the paper but still be author because they participated in the drafting.
But anyway, numerous sources are clearer that there are 3 requirements that should each be satisfied (paraphrasing: 1, involvement in study design, or data collection, or analysis; 2, involvement in writing; 3, agreement with the submitted manuscript). Here, for instance, are the previous ICMJE guidelines, which I believe the Committee on Publication Ethics also endorses, as listed by Elsevier: http://www.elsevier.com/editors/ethical-issues , and here are the newer ones with a fourth criterion related to accountability: http://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html
It’s also p-hacking.
Quite revealing. Some observations:
1) If the data were indeed manipulated, then only Dr. Forster himself could have done it. Because the data were collected in an entirely haphazard way in different locations by armies of SAs who were all blind to the study’s purposes, the only alternative explanation is an international conspiracy of student assistants.
2) Interesting: “If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.” That is a fascinating philosophy of science: rerun experiments until they confirm your hypothesis, and discard whatever doesn’t. Popper, anyone?
3) “120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments. […] This gives you 120 X 15 = 1800 participants”. No, Dr. Forster. That gives you 120 participants and a lot of dependency in the data that would have to be accounted for. Is Dr. Forster seriously saying that he reported 120 people doing 15 experiments as 1800 different people?
I reviewed a grant where a researcher had a non-significant elevation following a treatment.
He proposed to repeat the experiment with larger and larger sample sizes until he got statistical significance.
Just for completeness: this may be acceptable, provided the proper sequential tests are performed. Indeed, a fixed alpha would not be proper. E.g.: http://www.ncbi.nlm.nih.gov/pubmed/16817515
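A generic simulation sketch (hypothetical numbers, not a re-analysis of the grant in question) illustrates why a fixed alpha with optional stopping inflates the false-positive rate:

```python
# A generic sketch of why repeatedly testing as data accumulate needs a
# sequential correction: with a fixed alpha of .05 and five interim looks,
# the false-positive rate under the null is far above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def significant_at_some_look(looks=(20, 40, 60, 80, 100), alpha=0.05):
    a = rng.normal(0, 1, max(looks))  # control group, no true effect
    b = rng.normal(0, 1, max(looks))  # treatment group, no true effect
    return any(stats.ttest_ind(a[:n], b[:n]).pvalue < alpha for n in looks)

runs = 5000
rate = sum(significant_at_some_look() for _ in range(runs)) / runs
print(f"false-positive rate with optional stopping: {rate:.3f}")  # typically ~0.13, not 0.05
```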
On point (2), fascinating phil sci indeed. The presumption must be that the hypothesis is correct, and his job is to fix things so that it comes out confirmed.
Jens Förster: “It is hard for me to decide how far I can go to reveal certain reviews or results. This is especially difficult to me because the Netherlands is a foreign country to me and norms differ from my home country.”
Poor guy…. No one over here speaks German, no one over here speaks English, half of the population is illiterate, 90% of the male students are walking around in a longhi and everything is written in Mon. On top of that, we have a lunisolar calendar identical to the one used in Myanmar.
So I can imagine very well that scientists from Germany get totally confused when they move from Germany to The Netherlands.
Well, from the LOWI report, I got the impression that Forster’s apparent failure to “explain” his findings was crucial to determining his guilt. From a legal perspective, this looks like a reversal of the burden of proof and a violation of the procedural norm that “defendants” should be considered innocent until proven guilty. I am not sure if there are some cultural differences…
But LOWI is not a court, and this was not a trial. One might argue that “proof” is needed to find someone guilty – but I ask you what proof in such a case could look like other than catching the researcher red-handed filling out questionnaires. LOWI made the decision not based on such proof, nor on just the fact that Förster could not explain his findings, which I agree does not tell us very much on its own. Rather, LOWI did it because of the *nature* of those findings. And looking at the analyses from the report, in addition to the excellent simulation by Nelson & Simonsohn, ANY explanation other than manipulation would be helpful. So far, no one – neither Förster nor anyone else – has come up with anything other than chance.
That is similar to calling it injustice when the guy with the weapon, motive, and opportunity is convicted because he had no explanation why he had all of these three.
The “proof” is in the analyses that were supplied, along with weaker circumstantial evidence. All that compelling evidence is open to counter-evidence, but if no such counter-evidence is given (or, here, in my opinion, only some of the circumstantial evidence is countered), then only the original evidence remains. And that is what the decision is based on.
Jens Förster on 11 May 2014: “The series of experiments were run 1999–2008 in Germany, most of them Bremen, at Jacobs University.” Jens Förster in the LOWI report (translated from the Dutch): “The position of the Accused. Many experiments that were reported in the … articles were conducted in … and not exclusively in ….”
Jens Förster on 11 May 2014: “I still hesitate to share certain insights with the public. It is hard for me to decide how far I can go to reveal certain reviews or results.”
Please disclose the dots in the above sentence in the LOWI-report.
Jens Förster on 11 May 2014: “Research assistants and more people would conduct experimental batteries for me. They entered the data when it was paper and pencil questionnaire data. They would organize computer data into workable summary files (one line per subject, one column per variable).”
Where are these computerized data (including the digital information already supplied by the participants) with the testing results of these in total at least 2242 undergraduates, broken down by sex, age, date, site, compensation, etc.? In Bremen? Somewhere else in Germany? At UvA? Lost in cyberspace? On floppy disks? Lost? Printed and later on thrown away? Please disclose.
Jens Förster on 11 May 2014: “The series of experiments were run 1999–2008 in Germany, most of them Bremen, at Jacobs University”.
Jens Förster & Markus Denzler in their 2012 paper (N = 690 undergraduates): “Participants were paid 7 Euros or received course credit.”
Please show me papers reporting results of this kind of psychology experiment, conducted in the Bremen area / at Jacobs University and/or elsewhere in Germany, in which it is noted that the participants received 7 euros as compensation.
No. Dr. Forster was not found guilty of fraud. The data were found to be manipulated, and Dr. Forster, as lead investigator, was considered responsible. That is quite a difference.
This kind of cheap and unjustified sarcasm does not fit well with the professionalism this site tries to foster, I feel…
On the “evil RA” explanation.
The line of defense Forster seems to take is what I would like to call the “evil RA” explanation, where a person would intentionally change the summary files.
– It is unclear why a person would do this. Forster writes that an evil RA may “want to please their advisors or want to get their approval by providing “good” results; maybe I underestimated such effects.” That would indeed be their only potential motivation, since they did not get any credit in the acknowledgements, let alone co-authorship (which is questionable conduct in itself).
– The original raw data are not available anymore. The LOWI report mentions they were destroyed in a hard drive crash in 2012. As the data were collected until 2008, that would give at least 3 years (and probably more) during which the raw data and the summary files co-existed. It would seem quite risky for an evil RA to have such inconsistent data around, as these would provide direct evidence of data manipulation, which would be grounds for immediate termination of their contract and possibly other steps.
Yet, so far it may seem that Forster can play the “plausible deniability” card on this front. But there is more:
– Potentially the “killer” argument: the LOWI report states that Forster gave an account of how data collection was performed. LOWI interviewed two RAs and concluded “the accounts provided by the research assistants corroborated the opinion of the Accused [Forster] that the assistants were not aware of the goal and hypotheses of the experiments”. In other words, according to the LOWI report Forster had claimed that RAs were unaware of the goal and hypotheses of the experiments – which seems inconsistent with the “evil RA” explanation.
– A large number – 42 – of experiments seem to be affected. Forster claims that “manipulation could have affected a series of studies, since […] we put different studies into summary files.” As Forster wrote, they used batteries of 120 participants for up to 15 experiments, but as the studies in Forster 2012 were between-subject with no recycling of participants (according to the methods), it would require at least 10 such batteries. According to the “evil RA” explanation this person would have been involved in all these batteries. Indeed, this explanation is refuted by the LOWI: “The LOWI adds the remark that the established, suspicious patterns therefore could not have arisen during data collection, as the LOWI excludes the possibility that the many research assistants, who were involved in the data collection in these experiments, could have created such exceptional statistical relations.”
Altogether, “The LOWI deems the Accused […] responsible for [the conscious adjustment of research data]”. It may well be the case that only the “evil RA” explanation, however implausible it may seem, could potentially save Forster’s career. But given the points above I would be highly surprised if any reasonable neutral observer (i.e. someone other than those who know Forster personally) were convinced by it.
Re your “killer argument”: the LOWI interviewed two RAs from UvA. How is their testimony relevant to what could or could not have happened seven years before that in Bremen?
Quote: “In other words, according to the LOWI report Forster had claimed that RAs were unaware of the goal and hypotheses of the experiments – which seems inconsistent with the “evil RA” explanation.”
It seems also inconsistent with JF’s statement “If the data did not confirm the hypothesis, I talked to people in the lab about …”.
These latest comments by Forster etc. seem very much to obscure the fact that the data are gone and were never given to the investigators. When they refer to data, it is what came from the papers. There also seem to be no printouts of anything else except for some SDs and Ns which didn’t appear in the papers. So he is either a combination of extremely sloppy in looking after research materials and extremely unlucky, or there has been some dishonesty. I can’t help feeling that “manipulation” may not have occurred because it was outright falsification. Maybe echoes of Clinton’s “I did not have sexual relations with that woman”.
Plus the fact that the very papers, deviating from a previously set modus operandi with several coauthors, provide no mention of where, with whom and when the experiments were run, or whether anyone was excluded, and they have no co-authors, except one in one of the three papers, who was not involved with the data. And no alleged participants have appeared in public or in his defence to date.
hi tekija, how about the protocols of all 42 experiments? Anyone any idea where the protocols with all the details of all these 42 experiments have been stored / can be found? Any idea how one can conduct a replication study of any of the 42 experiments when these protocols are not available anymore?
Ken, you write: “These latest comments by Forster etc seem very much to obscure the fact that the data is gone, and was never given to the investigators. When they refer to data it is what came from the papers.”
In my reading, the paper questionnaires were destroyed, but the digitalized data were available to investigators. How would they have conducted the subgroup analyses with only the information taken from the papers?
hi CA, Jens Förster wrote on 11 May 2014: “The series of experiments were run 1999 – 2008 in Germany, most of them Bremen, at Jacobs University; the specific dates of single experiments I do not know anymore.”
Please be aware that I am not a psychologist.
I am a biologist and I work with a large variety of digital data / data files with specific details of individual birds, trapped in the wild and afterwards released with a metal ring with a unique code. These data files often also contain detailed information, e.g., about particular measurements of parts of the body, plumage characteristics, weight, age (often in 2 or 3 classes), behavioural characteristics (e.g., breeding bird), etc. Quite a few of the records in these large data files were collected decades ago.
Invariably, date (accurate to the day) and site (accuracy a bit variable, e.g. up to 10 sq m or up to 5 sq km) are always known, and always for every record (quite a few of these individuals have several encounters). In case these details are lacking and/or inaccurate, this is indicated (accuracy +/- one month, +/- half a year, etc.).
Can any psychologist please explain why Jens Förster is unable to retrieve the precise data on date and site (and age?) of all these 2242 undergraduates?
Excuse me very much, but I don’t understand why a researcher is unable to retrieve this kind of information from his study subjects, in particular because all data were gathered in Germany in 1999-2008 and under -close- supervision by Jens Förster himself.
Biologists also are often working with paper files and/or with note books, in particular when collecting the primary data in the field and under harsh conditions, but all data on site and date (etc.) are entered into a database, excel files (etc.) when calculations are carried out.
Klaas, I think you’re describing only what the biologists you know _say_ they do.
Jerry, I was just describing my own experiences with data files I am using for my own research on wild birds, supplemented by information from other researchers working in the same field.
This does not mean that I am arguing that all biologists in my field work according to these strict rules, and it also does not mean that all these data files are perfect. No way. There also exists a large database with a lot of problems regarding the reliability of the raw data. I have even published a short paper focusing on doubts about the reliability of parts of the data in a subset of this large database.
“If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.” – This to me, is the most troublesome part of his letter, because it is extremely normal practice in psychology today.
confirmation bias imo
I am just amazed that he did not even notice it, it was just so natural to him. Of course, he has a lot of other stuff on his mind at the moment, but….
if we look hard in academia, it’s everywhere.
we shouldn’t be surprised that psychologists today do not really see their own confirmation bias.
the computer was thrust upon many research disciplines as if it was supposed to make a big difference. people forgot how to do things like rigorous statistical analysis. too many toolboxes and “one click” solutions have led us down this path.
it’s unfortunate, but academia itself needs to face the truth: the use of a computer in almost all academic disciplines has not yielded the benefits you’d expect.
then again, it’s not easy to use a computer. maybe computer literacy isn’t as simple as “i’ve used a computer and microsoft office”, but i digress~~
I noticed something strange. Which of the following comes from Foerster’s letter, and which from my April Fool’s post from 2013? (near the very end)
“If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”
“If the experiment does not confirm the hypothesis, it is our fault, and we do it over til it works right. We change the subjects or the questionnaire, we find which responses are too small and must be fixed.”
http://errorstatistics.com/2014/05/10/who-ya-gonna-call-for-statistical-fraudbusting-r-a-fisher-p-values-and-error-statistics-again/comment-page-1/#comment-31636
“If the data don’t fit the theory, so much worse for the theory.”
— David McClelland (1984; Motives, personality, and society: Selected papers)
Sorry, I can’t get too excited about this. If you are convinced about the truth of a hypothesis and your first experiment fails to support your prediction, you give up? Quite honestly, I would think hard, change the conditions and try again. With repeated nonconfirmations, I might change my hypothesis. Perhaps, I may fail to convince the audience, but that’s science as it should be. That’s the critical discourse (sensu Popper) that is often neglected when we talk about QRP and the like.
I’m sorry, but… WHAT?!
“If you are convinced about the truth of a hypothesis…” then you shouldn’t be doing science!
Sure, if your data looks like complete garbage (it’s, I don’t know, unbelievably linear with tiny SDs, for example) then yes you should look closely at the methods to see if you’d made a mistake somewhere.
But if “your first experiment fails to support your prediction” then I’m afraid that’s the answer! By all means look for a fundamental flaw in the methods but just ‘tweaking’ things slightly until you get a p < 0.05 (which will happen by chance soon enough) is absurd.
Is this really what happens in social psychology?!
I agree that holding on to a hypothesis at all costs is not proper science. But one failed experiment is no evidence that the effect does not exist. In fact, this is exactly one of the points that make Förster’s data look suspicious: the fact that all experiments revealed an effect with small sample sizes. As PowerNerd, I think, explained in this thread, even with a large effect there should be experiments with null findings. So there is a reason to re-run studies if there is good theoretical reason to assume there should be an effect. Of course, all those studies should then be reported, and not just the one that eventually provided statistical significance.
Of course, after a series of failed replications, one should stop looking for something that might not be there. The number of experiments needed to find an effect (depending on sample size) could even be determined a priori.
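To put a rough number on the point that null findings are expected, here is a small sketch using statsmodels (the effect size, cell size, and study count are purely illustrative assumptions, not Förster’s actual figures):

from statsmodels.stats.power import TTestIndPower

# Power of a two-sided, two-sample t-test with a large true effect (d = 0.8)
# and 20 participants per cell.
power = TTestIndPower().power(effect_size=0.8, nobs1=20, alpha=0.05,
                              ratio=1.0, alternative='two-sided')
print(f"Power for d = 0.8, n = 20 per group: {power:.2f}")            # about 0.69
print(f"Expected null results in 12 such studies: {12 * (1 - power):.1f}")

So across a dozen honestly run studies of that size, three or four non-significant results would be the norm rather than the exception.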
“Is this really what happens in social psychology?!”
The Levelt report on the Diederik Stapel case indicated that they found this type of behaviour there as well: https://www.commissielevelt.nl/wp-content/uploads_per_blog/commissielevelt/2013/01/finalreportLevelt1.pdf.
Page 48: “One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing to repeat an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tend to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts. Procedures had been used in the great majority of the investigated publications that lead to what is referred to here as verification bias. There are few articles in which all the mentioned violations of proper scientific method were encountered simultaneously. On the other hand, publications were rarely found in which one or more of these strategies was not used”
Thanks P.
@Hellson: “The number of experiments needed to find an effect (depending on sample size) could even be determined a priori.”
Or, alternatively, a power calculation *should* be done a priori!
Where I come from, the ‘number of experiments’ is one, while the power calculation tells you how many subjects to include. Of course at 80–90% power you might miss an effect, but that’s the risk you take. Repeating the experiment at that point will either just confirm a null effect or leave you with one significant and one non-significant result, which is essentially uninterpretable. So unless you can repeat the (identical) experiment so many times that you are sure you’re hitting significance at a rate higher than your power, it’s generally not worth it. You just have to accept that the results are consistent with the null hypothesis and move on.
Anyway, as far as I can interpret it, Forster did not keep repeating the identical experiment to see whether it came out significant more than 80% of the time. He ‘tweaked’ the methods until it came out significant (presumably once!) and then reported that one, before moving on to a related but different experiment where he (presumably) did the same. That’s at least one reason why there were no ‘null’ findings, as you mention.
The more I read the more I can’t help but think that there are different ‘norms’ between scientific disciplines!
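For what it’s worth, here is a minimal sketch of the a priori power calculation mentioned above, assuming a two-sided two-sample t-test and a medium true effect of d = 0.5 (both assumptions are for illustration only):

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size needed to detect d = 0.5 with 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.80,
                                          alpha=0.05, ratio=1.0,
                                          alternative='two-sided')
print(f"Required n per group: {n_per_group:.0f}")   # roughly 64

Which is to say: cells of 20 are nowhere near enough unless the true effect is very large.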
I’m not a social psychologist, but there may be a confusion of order of scale here. By testing whether a construct or some idea is valid, you have to consider this idea as embedded in many sources of unexplained variance, i.e. your effects are always fixed and not random, to some degree.
So even if your experiment is not significant, this might be because the effect is a lot smaller than you postulated (not in terms of depth and effect size, but in terms of how broadly applicable it is — so it’s of a different scale), and it is “probability of idea given constraints A B C” or something similar that will be the effect you’ve postulated. The huge downside is that this also means you can only find out by repeatedly testing, so the chances of spurious findings are increased. This also illustrates the importance of theorizing about why you apply certain constraints to adjust your initial experiment.
Therefore, this requires replication (should be mandatory), and it requires publishing null findings: being completely open about the fact that your initial idea is not supported. Publishing a different study where you test “probability of idea given constraints X Y Z” is useful as well, but will require replication in its own right (it is essentially the same procedure).
So, your idea is no longer of that grander scale you postulated, but it is subdivided into “idea given constraint A B C” and “idea given constraint X Y Z” (which should be taken together at all times), which could even be entirely different constructs altogether.
Therefore, this requires meta-analyzing the portfolio of studies we have now established. Only then can we say something about our initial idea with some degree of certainty.
What does this mean? Scientific progress is actually a lot slower than people seem to want it to be. Hyper-competition is promoting the quick and dirty bandwagon, which is literally counter-productive, but unavoidable in the current climate. I’d like to think that psychology today is a lot different from what it used to be, but we haven’t quite adjusted yet. A lot of the more grand effects (that are not the result of a contingency of spurious findings) have been unveiled, so we are gradually moving to smaller effects, but are still in that mindset where making a great discovery is equal to being awesome (which makes sense, but is a flaw).
What do you think of my rambling? This reasoning seems sound to me, but I don’t want to be caught in a line of thinking that is incorrect, so I am looking forward to insights.
I would propose some system where a group investigating an effect has one big observational study from which a few empirical studies spring, that are then replicated and reported in a meta analysis. This takes many, many years, but will produce actual knowledge.
No.. you update your knowledge of the conditions under which the hypothesis produces accurate predictions, vary the conditions to test another regime and report all your findings, including and especially the failed ones. As long as the experiment was correctly performed and tests an a priori plausible prediction of the hypothesis, its results should be reported and discussed. Otherwise, we get exactly the sort of mess that exists in psychology (and most other empirical fields) today.
Exactly! We don’t need a science police to enforce all kinds of prescriptions. There is no methodologically guaranteed route from data to truth that does not go through people’s minds. Therefore, it is important that we truthfully report what we have done and the results we have obtained. Be assured, a critical audience (reviewers and editors included) that is both conceptually and methodologically informed will give you the appropriate response.
“Be assured, a critical audience (reviewers and editors included) that is both conceptually and methodologically informed will give you the appropriate response.” Sure, that’s exactly the reason why Greg Francis found an excess of significant findings (http://www2.psych.purdue.edu/~gfrancis/Publications/Francis2014PBR.pdf ). 😉
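The logic behind that kind of excess-significance check is simple enough to sketch (the power and study-count values below are illustrative placeholders, not figures from the Francis paper):

# If each of k reported studies has, say, 70% power, the chance that every
# single one comes out significant is small; an unbroken run of significant
# results is therefore itself evidence that something has been left out or
# massaged. Illustrative numbers only.
power_per_study, k = 0.7, 10
p_all_significant = power_per_study ** k
print(f"P(all {k} studies significant | power = {power_per_study}) = {p_all_significant:.3f}")
# About 0.03, i.e. an unlikely outcome for honestly reported studies.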
Maybe we should not myopically focus on alpha levels but make use of our own reason (“sapere aude”). In other words, reducing the influence of inferential statisticians might be a first step in the right direction.
This letter speaks volumes about its author!
The IT department at Jabobs University in Bremen probably has a list over hard drive replacements.
This explanation (raw data only on a hard drive that crashed) in itself smells like QRP. Who, especially as a principal investigator, saves their raw data on a local hard drive, especially without a backup on a server?
Is his raw data really that unimportant to be saved so carelessly? To me, it is only a good approach to take if it is unnecessary data anyway… Then, why take the data in the first place?
I know that this is kind of the standard explanation, but that does not make it any better.
There is also an extensive story in a German newspaper (in German): http://www.sueddeutsche.de/wissen/wissenschaftsbetrug-zu-gut-um-wahr-zu-sein-1.1958613 An English version is behind a paywall at http://www.sciencemag.org/content/344/6184/566.summary
Before this controversy I had never heard of Forster or his research. A colleague alerted me to the early reports on RW. Ironically, having just visited Salem, MA and been reminded of true witch hunts and the difficulty of defending oneself against the charge of being a witch, I was initially sympathetic to Forster’s case. But then I read his first denial letter/email. It seemed weak and did not address the statistical issues. As a statistician, I decided to take my own look at the 2012 SPPS paper, at a time before the original complaint was available and before the analysis on the Data Colada blog was available. I am interested in effect sizes and statistical power. When I read through the article I was struck by the amazing similarity of the graphs across very disparate modalities for manipulating the global vs. local focus. Further, the error bars for a social psychological study with 20 observations per cell were stunningly small. Given that the data are discrete (only scores of 1, 2, 3, 4, 5, 6, and 7 were possible), one can come pretty close to reconstructing the original data from the means and standard errors. For one of the control groups, the only distribution of 20 scores that came close to the reported mean and s.e. was
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4
an amazingly tight cluster of scores. If we change one of the 2’s to a 1, then we have to change a 4 to a 5 to maintain the mean. But then, to maintain the s.e., almost all of the other scores have to be 3’s, as in
1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5
These don’t look like real data to me. But even more implausibly, all the control groups had to have essentially the same tight cluster of scores. This doesn’t happen in the real world.
I then computed the effect sizes for all the studies and they were unusually high for this type of research and unusually consistent. But that is just an impression. The clincher for me was that even for these unusually high effect sizes, the statistical power of a replication with only 20 observations was not very high. Using the mean of the effect sizes as the true effect size, I used a non-central beta distribution to generate the distribution of effect sizes one would expect from conceptual replications. Needless to say, the variation in that distribution was much greater than the variation in the reported effect sizes, and there were a number of instances in which the randomly generated effect size was not statistically significant. In real data, with 20 observations per cell, there should have been some replication failures! P-hacking in the sense of running another study with n = 20/cell until getting not just significant results but p < 0.001 results would have required an implausible number of studies. Hence, my own analyses, done before I saw either the original complaint or the Data Colada blog, convinced me these were not real data. The data had either been faked or strongly manipulated. I don’t see how any statistician could reach a different conclusion.
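For readers who want to see what that kind of reconstruction looks like in practice, here is a rough sketch. It assumes whole-number 1–7 ratings (a simplification; a later correction in this thread notes the scores were averages over four raters), n = 20, and a reported mean of 3.00 with an s.e. of about 0.16, values consistent with the example distribution above; the actual reported statistics may differ.

from itertools import combinations_with_replacement

# Brute-force search over all multisets of 20 whole-number scores from 1 to 7
# for those whose mean and standard error match the (assumed) reported values.
target_mean, target_se, tol = 3.00, 0.16, 0.01
n = 20
matches = []

for scores in combinations_with_replacement(range(1, 8), n):
    m = sum(scores) / n
    var = sum((x - m) ** 2 for x in scores) / (n - 1)
    se = (var / n) ** 0.5
    if abs(m - target_mean) <= tol and abs(se - target_se) <= tol:
        matches.append(scores)

print(f"Candidate score distributions: {len(matches)}")
for s in matches[:5]:
    print(s)

The point is not the exact count of matches but how narrow the range of admissible raw scores becomes once the mean and s.e. are that tight.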
But who did the data manipulation?
Forster makes it clear that his co-author had nothing to do with the data collection or analysis (hmm, why was he a co-author then?), so the co-author couldn’t have been the culprit.
Forster makes it clear the data were collected over a considerable period of time by as many as 150 research assistants. As others have noted, it would have required an elaborate conspiracy across time and people for the research assistants to be doing the faking.
So, who is left as the only plausible culprit?
Quote: „the only distribution of 20 scores that came close to the reported mean and s.e. was 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4”
Do the reported mean and s.e. agree with this distribution, or should, if this were the actual distribution, different mean and s.e. have been reported?
Background to my question: Diederik Stapel had reported impossible percentages for a group size of 32: 44% and 15%. The latter would indicate 2.4 persons. Now 2 persons would have been 13% and 3 persons would have been 18%. Stapel was not officially caught by the discovery of this flawed fraud, because there had also been other clues.
http://www.pepijnvanerp.nl/2012/09/the-stapel-fraud-anniversary-and-the-psychology-of-meat/
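That granularity argument is easy to check mechanically. A small sketch (using the figures mentioned above purely as an illustration): for a given group size, only certain whole-number percentages are attainable at all.

def possible_percentages(n):
    """Whole-number percentages attainable with n participants (rounded)."""
    return sorted({round(100 * k / n) for k in range(n + 1)})

# With n = 32, the attainable values jump in steps of roughly 3 percentage
# points, so a reported whole-number percentage that falls between the steps
# cannot correspond to any whole number of participants.
for reported in (44, 15):
    ok = reported in possible_percentages(32)
    print(f"{reported}% with n = 32: {'attainable' if ok else 'not attainable'}")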
I was mistaken. The raters used 1–7 scales, but there were 4 raters, so each participant’s score was an average, not exactly a whole number between 1 and 7. However, the lack of variation would still constrain the scores to a very small range around the mean. It would be very difficult for a participant in a control condition to have had a score near either 1 or 5. Individual differences in “creativity” would have had to be practically nil to get those means and standard deviations.
Interesting analysis indeed. Just a question: could you please make the R code available?
hi PowerNerd, great piece of work and very easy to understand. Your conclusion: “The data had either been faked or strongly manipulated. I don’t see how any statistician could reach a different conclusion.”
Jens Förster on 11 May 2014: “I also received emails from statisticians and colleagues criticizing the fact that such analyses [= the methods the complainant used (and reviewed by three different experts of UvA and LOWI)] are the major basis for this negative judgment.”
I would like it very much if one or more of these statisticians (no problem when using a nickname) would comment on your conclusion.
Do we have any independent verification of these emails and their contents? Dr Förster has not struck me as the most accurate reciter of documents that can be independently verified.
Isn’t it a lesson from all these fraud-detection enterprises going on currently that real data don’t look the way people expect real data to look? I ask all statisticians planning to use statements of the type “how data should look” to visit a social psychology lab for a few weeks. Even then, statements like this one should be backed up by solid evidence or not be made at all.
Regarding the opening line of the article: while Jens Förster is working in the Netherlands, he is in fact German, not Dutch.
Dear Jens,
I would like to have a chat with you about QRP (= questionable research practices = sloppy science, etc.) in relation to your firm statements “I never used QRP”, “because I did never did something even vaguely related to questionable research practices” and “Note that I taught my assistants all the standards of properly conducting studies and fully reporting them.”
I will start with a quote from Drenth et al (2010, Fostering Research Integrity in Europe, Executive Report, http://www.esf.org/ ):
“Good research practices. 1. All primary and secondary data should be stored in secure and accessible form, documented and archived for a substantial period. It should be placed at the disposal of colleagues. The freedom of researchers to work with and talk to others should be guaranteed.”
So what have you taught all your assistants and all your students, in any of the European countries in which you have carried out experiments, about storing all the primary and all the secondary data of such experiments?
And what will you do when you get an e-mail from a psychologist who wants to conduct a replication of all 42 experiments published in the three papers (Förster & Denzler 2012, Förster 2011 and Förster 2009)? Are you able to send him the whole protocol, so he can soon start with this replication study?
Are you aware of the report ‘Fostering professionalism and integrity in research’ (October 2013, final report of the Taskforce Scientific Integrity of Erasmus University Rotterdam)?
See http://www.rsm.nl/fileadmin/Images_NEW/News_Images/2014/Taskforce_Scientific_Integrity_EUR.pdf
A quote: “Researchers are responsible for storing data and documentation at various moments during a study. The minimum that must be stored consists of both the raw data and the data underlying any submitted or published publication, the project plan, documentation that describes and explains major changes to the earlier plan(s), as well as the submitted version of the publication. (..). It is essential that PhD supervisors and research group leaders are also role models for young researchers. Similarly, commitment and the willingness to share good practices are more important than protocols and covenants. (..). Being a role model requires continuous reflection on one’s own practice as researcher, research leader and supervisor.”
So LOWI has decided that your decision to throw away raw data before you had presented the results (in three different papers) was a violation of the Code of Conduct. LOWI has also decided that you broke the APA rules, which the journals you submitted to follow, requiring that all raw data be kept for at least 5 years after a paper has been published.
Can you please explain why you still hold the opinion that you never ever have conducted QRP? What’s your definition of QRP?
You state: “I can also not exclude the possibility that the data has been manipulated by someone involved in the data collection or data processing.”
Have you already sent an e-mail to the editors of both journals (JEPG and SPPS) with your concerns that something might be wrong with (parts of) the data in these three papers, because an unknown person might have manipulated data? Are you right now working day and night to check all the raw data and find out the truth? Have you also already apologized to the editors of both journals for having to bother them with your problems?
Where is the line between ‘good practice’ and ‘QRP’? What’s your opinion about the behaviour of the scientist Mladen Pavičić, as described in http://retractionwatch.com/2012/11/30/poignancy-in-physics-retraction-for-fatal-error-that-couldnt-be-patched/#more-10910 ? Good practice? Normal? What’s your suggestion when psychology students ask you questions about this behaviour of Mladen Pavičić?
What’s your opinion of psychologists who publish peer-reviewed papers on the behaviour of humans on blogs and who set up experiments to find out more about the behaviour of humans on blogs? See http://scienceblogs.com/pharyngula/files/2014/03/fpsyg-04-00073.pdf for a nice example of such a paper.
Would you recommend that psychology students read this paper, and would you recommend that they conduct such online experiments?
Thanks in advance for a response.
In the SPPS paper Forster writes:
“Participants and design. For each of the 10 main studies, 60 different undergraduate students (number of females in the studies: Study 1: 39; Study 2: 30; Study 3: 29; Study 4: 26; Study 5: 38; Study 6: 32; Study 7: 30; Study 8: 30; Study 9a: 35; and Study 10a: 28) were recruited for a 1-hour experimental session including ‘‘diverse psychological tasks.’’ In Studies 9b (31 females) and 10b (25 females), 45 undergraduates took part. Gender had no effects. Participants were paid 7 Euros or received course credit. All studies were based on a 3 Priming (global, local, and control) between-factorial [sic] design.”
But in his most recent defense he writes:
Many studies were run with a population of university students that is not restricted to psychology students. This is how we usually recruited participants. Sometimes, we also tested guests, students in the classrooms or business people that visited.
He also writes in the most recent defense that
120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments (some of them obviously very short, others longer). This gives you 120 X 15 = 1800 participants.
So in his written denials he has now implicitly admitted that he misrepresented his Methods section about participants. They weren’t necessarily undergraduates recruited for 1-hour experimental sessions in a between-factorial [sic] design; instead they might have been “guests, students in the classrooms or business people that visited”, and if they were undergrads it wasn’t for an hour but for a couple of hours covering a lot of different experiments, averaging 8 minutes per experiment, in a possible within-participants design.
If for no other reason, his paper should be retracted because he has now admitted lying about “Participants and Design” in the Methods section of a published paper.
And this is all ignoring the non-independence of the different experiments. When you run 10 experiments, and they all use the same subjects, isn’t there a huge issue of overall alpha control going on? These are not independent random samples. In this case, they are the same “random sample” or at least one with a lot of overlap. That too sounds dicey.
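Just to quantify the alpha-control worry (treating the 15 tests as independent for simplicity, which with shared participants they are not, and that dependence is part of the problem):

# Family-wise chance of at least one false positive across 15 tests at
# alpha = .05, under the simplifying assumption of independence.
alpha, k = 0.05, 15
familywise = 1 - (1 - alpha) ** k
print(f"P(at least one false positive in {k} tests) = {familywise:.2f}")   # about 0.54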
There is not just the statistical problem, although it certainly would be one like you say. *If* multiple studies like these were really conducted in a row then the data have to be interpreted with extreme care because the same participant might have been exposed to local and global primes in a very brief amount of time. This would certainly create carry-over effects and further biases in all dependent measures after the first study.
Running multiple experiments with the same participants is already problematic for many reasons (even when the order of experiments is randomized). But using the same participants for conceptually similar experiments (with related independent and dependent variables), to me, definitely classifies as a “questionable research practice”.
In the journal article he clearly claims the data were from independent participants with his statement that the design was a “between-factorial [sic] design.” It is only in his recent defense that he seems to suggest all the data were from the same participants. But elsewhere in his various defenses he suggests the data were collected over many years and the various experiments often had to be redesigned to get the hypothesized results; that would suggest a claim of independent data. He is starting to make too many conflicting claims that cannot all be true.
In my reading, the multiple studies in a series were on independent topics.
Jens Förster wrote in his first message (29 April 2014): “The UvA states that it is not clear, who could have manipulated data if this had been done. But UvA thinks that I am still responsible.”
The quote refers to three different papers, Förster (2009), Förster (2011) and Förster & Denzler (2012).
Excuse me very much, but I fail to understand why Jens Förster should not be responsible (anymore) for the entire contents of all three papers (in particular because his co-author declared that he was not involved in anything related to collecting and interpreting (etc.) the raw data).
In my opinion, the author of a peer-reviewed journal paper with just one author has two options:
(1): he is 100% responsible for 100% of the contents.
(2): he can no longer stand 100% behind 100% of the contents, and he immediately sends an e-mail to the EiC of the journal explaining why this is the case and what should be done to warn the readers of his paper that there might be problems with (parts of) the contents.
Any other options?
Förster’s experiments were mostly run in Bremen on unspecified dates. He explains the typical Bremen experiments: “120 participants were typically invited for a session of 2 hours that could include up to 15 different experiments”.
The experiments in the Förster and Denzler article use 60 participants in all experiments, and the participants “were recruited for a 1-hour experimental session including ‘‘diverse psychological tasks.’’ “(page 110). “In all studies, participants were told that they would work on two different, unrelated tasks.” (page 110) If the participants work on only two tasks (the two tasks described in the article), these experiments cannot be part of a typical experimental battery but were run as stand-alone experiments (if the general procedure section of the article is correct).
Someone just alerted me to the paper “Distancing From Experienced Self: How Global-Versus-Local Perception Affects Estimation of Psychological Distance”, written by Nira Liberman and Jens Förster and published in 2009 in Vol. 97 of the Journal of Personality and Social Psychology.
The reported experiments have the same global-control-local manipulations and experimental setup as the work criticized in the whistleblower’s complaint. The behavior of the means in the different conditions is striking. For example, in Table 2 on p. 208, the mean for the local condition equals -17.14 (SD=29.5) and the mean for the global condition equals 17.05 (SD=20.8). Guess what the mean for the control condition is? Right: 0.15 (SD=30.95).
The same pattern appears to be present in several of the other experiments reported in that paper. The paper reports data that were allegedly gathered at the University of Amsterdam, so the structure of the Bremen data acquisition process cannot explain the results.
Maybe Dr. Liberman, as first author, would care to comment?
I just checked the paper. Two out of seven reported global-control-local means show near-perfect linearity, the other 5 do not (in the ego condition).
I also checked the paper: of the seven experiments only 3 were conducted in Amsterdam (not all three at the university). One experiment was conducted in Bremen, and three more have an unspecified origin.
Ah, the same dr Nira Liberman from Tel Aviv who writes in Förster’s defence on his blog http://www.socolab.de/main.php?id=66
PS who couldn’t see anything funny in the Förster and Denzler paper
What are these two last contributions supposed to mean?
I understood that the paper has been reviewed, so a few more people did not see anything “funny” about the paper.
Moreover, in one of the reports it has been stated that the questionable pattern was not easy to detect, right?
Of course, in hindsight it is easy and effortless to judge, but is it informative? Is it fair?
I clearly have my doubts here…..
My experience with the 2012 paper was not hindsight. I was alerted that there had been a critique of that paper related to data authenticity, but I had not yet seen the original complaint or the Data Colada blog before I examined the paper. When I glanced at the paper, the problem jumped off the page at me. Social psychologists, among many other disciplines, do not think enough about statistical power and its implications for replicability. But I am a social psychologist who does. The moniker I use in these posts was given to me and not made up by me. With 20 observations per cell, replications, even with large effect sizes, are very difficult in the sense of even getting a significant result in the replication. The replications were not even exact but conceptual replications using different modalities for manipulating the global/local focus. Nevertheless, the results were stunningly identical: visually in the graphs (even though they were not arranged to highlight the linearity issue, their similarity was striking) and statistically (F’s, p-values, and effect sizes). Statistically, this just cannot happen. I think the estimate of one in a trillion is an overestimate. If you take the means and standard deviations Forster provided as veridical and simulate random normal data, it would take a lifetime before you generated such perfectly consistent data. Anyone with a little understanding of power with cell sizes of 20 would immediately have been suspicious. I’m sorry that the original reviewers and editors did not have that understanding.
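For readers who want to see the size of the variability being described, here is a minimal simulation sketch. The true effect size (d = 0.8) and cell size (n = 20) are illustrative assumptions, not the values reported in the paper.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n, n_sims = 0.8, 20, 10000
observed_d, significant = [], 0

for _ in range(n_sims):
    a = rng.normal(loc=true_d, size=n)   # "treatment" group
    b = rng.normal(loc=0.0, size=n)      # "control" group
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_d.append((a.mean() - b.mean()) / pooled_sd)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        significant += 1

print(f"Observed d across replications: mean {np.mean(observed_d):.2f}, SD {np.std(observed_d):.2f}")
print(f"Share of replications reaching p < .05: {significant / n_sims:.2f}")

With cells of 20, the observed effect sizes scatter with a standard deviation of roughly 0.3 around the true value, and a substantial share of exact replications miss significance altogether, which is why a long series of nearly identical, uniformly significant effects is so implausible.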
I believe that the failure to do power analyses is a root cause of QRP. If you do a study which is adequately powered, you will evaluate a null finding as falsification (the failure to reject the null hypothesis is informative in an adequately powered study). If your study is underpowered but you do not know it, you may end up tinkering with the data (perhaps this data point is an outlier? perhaps I could condition on sex? perhaps I should try a different test? perhaps I should try a different dependent variable? perhaps that RA is not doing his best? perhaps I should add X, Y, or Z as a covariate). Prior power analysis should be mandatory (one may even argue that it is an ethical requirement: an underpowered study is really a waste of resources). So: more power to PowerNerd, and be proud of your moniker.
The pattern is easy to spot. The mean of the control condition lies exactly between the local and global conditions, while the SDs are substantial.
And who is a co-author of Förster on other papers.
See http://www.ncbi.nlm.nih.gov/pubmed/?term=jens+f%C3%B6rster for a limited list of co-authors on other papers of Jens Förster.
From the helpful list that Klaas van Dijk posted, I found another paper first-authored by Denzler (Denzler, Hafner, and Forster, PSPB 2011) for comparison to the Forster & Denzler 2012 paper. It is easy to believe that Denzler did not participate in the reporting of the results in the 2012 paper. The reporting of statistical results in Denzler et al. meets the highest standards in terms of details. And as one would expect for the given sample sizes, not all the studies produced significant results.
“Because our analyses did not yield a significant difference in all studies, we conducted a meta-analysis to assess the overall strength of the obtained effect for a reduced accessibility of aggression-related constructs upon playing a violent computer game in service of a goal. For each study, we calculated Hedges’s g (Hedges & Olkin, 1985) as an effect size estimate comparing the measurement before goal fulfillment with the measurement after goal fulfillment for the conditions that fulfilled a goal: Study 1, g = .295; Study 2, g = .445; and Study 3, g = .983.”
Top-quality. Doing a similar meta-analysis on the results in Forster & Denzler would have revealed the preposterous consistency in effect sizes.
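A rough sketch of what such a check could look like: pool per-study Hedges’s g values and ask whether they agree with each other more closely than sampling error allows (Cochran’s Q far below its degrees of freedom). The g values and sample sizes below are hypothetical placeholders, not figures from either paper.

import numpy as np
from scipy import stats

g = np.array([0.95, 0.97, 0.96, 0.94, 0.98, 0.96])   # hypothetical effect sizes
n1 = n2 = np.full_like(g, 20)                        # hypothetical 20 per cell

# Large-sample variance of Hedges's g for a two-group comparison.
v = (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))
w = 1 / v
g_bar = np.sum(w * g) / np.sum(w)                    # fixed-effect pooled estimate
Q = np.sum(w * (g - g_bar) ** 2)                     # Cochran's Q
df = len(g) - 1

print(f"Pooled g = {g_bar:.2f}, Q = {Q:.3f} on {df} df")
print(f"P(Q <= observed | homogeneity) = {stats.chi2.cdf(Q, df):.5f}")
# A left-tail probability near zero means the studies agree far more closely
# than sampling error alone should allow.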
As a social psychologist, I would just like to remind readers that Foerster and Stapel etc. are not typical social psychologists. The conversations in my department and with my collaborators, like the discussions on this blog, have expressed disbelief at the practices Foerster describes and his cavalier attitude towards these poor research practices.
The view often stated is that it is necessary to indulge in QRP (or worse) in order to become successful in social psychology. Stapel was certainly a high flyer, and Förster was about to be handed a bag of research funds to the tune of five million.
Do you share this view?
I do not. During the time period prior to approximately 2012 or so, would engaging in poor research practices help you be successful? Yes. Could you be successful without them? Yes.*
Plus, things are changing very quickly in social psychology. I’ve been on a few papers now where the paper was rejected because the results seemed too good to be true, where the results of some medium-powered studies were messy and the authors were encouraged to resubmit after running a high-powered replication, or where in the review process the authors were encouraged to be open about their methods and data etc. That is, many reviewers (or at least reviewers on papers that I review) are sensitized to these issues and are actively working to keep poor research practices out of the literature.
I think it is also worth noting that in social psychology there is an implicit split between social psychologists in psychology departments and social psychologists in business/marketing departments. The latter departments often have stricter requirements about where to publish and the prestige of the journals than do psychology departments. It is also often more important for members of business/marketing departments to get media attention (presumably with the help of flashy results). In my mind, it is the business/marketing departments that seem to have the most skewed incentives.
These are just my impressions and anecdotes. But keep in mind that Foerster, Stapel, and the like are also anecdotes and should not be used to indict the entire field – especially when the field has started to make clear and concrete steps for improving the situation (see e.g. new explicit norms at journals)
*This is not to say that there is a group of perfect social psychologists (or biologists, or etc etc)
Another clarification: If I remember correctly, Stapel, Sanna, and Foerster work(ed) in psychology departments. Only Smeesters worked in a marketing department.
That is true. I am not trying to shift the blame, but rather highlighting where I see the misaligned incentives based on my own experiences and the experiences of people that I know. Remember that these fraud cases and my own experiences are not random samples of people from social psychology (or any other discipline) and so by counting the number of frauds or my impressions we aren’t going to accurately represent the current state of affairs.
But since many (not all, and maybe not even most) people commenting seem to paint all social psychologists with the same brush using extreme and anecdotal data, I thought I would chime in with my own.
I honestly don’t get the sense that things are changing in the review process. Do you really claim that the field as a whole, the entrenched structural system, the editors and reviewers, all of a sudden have converged upon a mature vision of statistical power, etc., in 2 years?
I’ve seen and heard recent stories of reviewers accusing authors of QRP’s and HARKing… While in some cases this might be true, it seems that the field is forming a circular firing squad, and that very many people are getting caught in the cross-fire.
What I find most interesting, and which hasn’t been mentioned much on these forums, is the way the review process has typically worked in social psychology. Simply put, editors and reviewers demanded statistical and empirical perfection as a pre-condition of publication, especially in the more prestigious journals (see Giner-Sorolla, 2012, Perspectives on Psychological Science for an eloquent and thoughtful description of this issue). I’m not excusing any action of fraud, but given such unreasonable selection standards, you can see why researchers would be tempted to engage in unacceptable practices in order to get their papers published.
And while I absolutely agree with “not typical” (below) that there are people who do a careful job with their research and take a long time to publish 1 paper in a prestigious outlet, it is very difficult for junior people in the field to do so. Indeed, the selection process for academic positions is so competitive, and so much weight is placed upon the number of (top “quality”) publications, that people who tend to be careful, cautious, and measured (all hallmarks of good science and good scientists) are less competitive on the job market. They therefore are less likely to stick around in the field, serve as reviewers and editors, judge grant applications, etc.
If a real change will happen with the review process and selection procedures (both for publications and for jobs), this will take at least a generation. I personally don’t see the scientific social structure changing, or the field gaining methodological and statistical sophistication, any time soon.
You’re quite right. I just received feedback from a psychology journal in which the editor asks me and my coauthors to (a) drop experimental conditions from the paper that did not yield significant results and (b) drop the only figure in which scatterplots show the true state of affairs in our data. I’m afraid that some editors and many reviewers are not even aware of the discussions we engage in here and within our respective fields.
Just to clarify – this is really just one corner of social psychology. Most of the social psychologists I know don’t do this kind of lab based experimental work.
As a social psychologist, I agree with “Anons” that these “high profile researchers” are not typical social psychologists. I never conducted 2-h “batteries of experiments” myself, nor did I publish papers in very prestigious outlets. People in my lab can spend 4 or 5 years to publish a single paper in JPSP or JEPG. The more Forster attempts to justify his findings, the more doubtful they seem to me. I find it shocking that Forster is now blaming RAs without assuming his own responsibility (even if the RA conspiracy explanation is true, something I cannot buy, he must take responsibility as the lab director). I understand that the temptation to dismiss social psychology as an entire field of study is strong given the recurrence of the Stapel, Smeesters, and Forster scandals. Yet, I believe that people should not paint all social psychologists with the same brush. We, the unskilled and unknown social psychologists, were the first victims of such practices. We simply cannot compete with the 15-experiments-per-paper standard that these guys have created. Of course, we couldn’t reproduce their findings either (e.g., http://www.psychfiledrawer.org/search.php). That said, I do believe that it is still possible to be quite successful without faking data in my field. We often replicate the findings of (Dutch) social psychologists. Indeed, we know which findings are replicable or not. I teach my students to be skeptical about the papers they read (“you only know that it is true when you can successfully replicate it”). To me, the most successful researchers are not the “big stars of the media” but rather the ones who publish highly replicable findings. On a more positive and constructive note, it is good news that (failed and successful) replications can now find a home in some academic journals.
It’s rather easy to harvest an array of other papers co-authored by Jens Förster and/or by some of his co-workers. I am a biologist and I have read parts of these papers, including papers with Prof. Liberman as one of the co-authors. Invariably, such papers present the results of 2-4 experiments.
I get the strong impression that all these experiments were conducted with a very strict protocol, and that many methodological details and pitfalls are presented in these papers, often in a very detailed form. Please tell me if I am wrong, but it is my impression that nothing is wrong with these papers. Sure, the sample size is often quite low, and sure, the participants can tell what they want, but everything has been documented very, very well. Long lists of people who helped with the experiments, grants are listed, even footnotes with side information about preliminary experiments, etc., etc., etc.
All the information in these papers is in strong contrast with the information in the 2012 paper of Jens Förster & Markus Denzler (anyone any idea about him?), and with the information others have told me about the two other papers with only Jens Förster as author (2009 and 2011). On top of that, the additional explanations of Jens Förster are opaque and a huge number of details are still lacking.
I fully agree with the opinion of ‘Anons’ and of ‘not typical’. No problem at all to use a nickname and keep posting.
I would not read too much into the lack of detail. SPPS – as well as PsychScience – specialized in relatively short reports, trading detail for brevity. A little bit like Science, where (in times before online supplements) procedural specificities often were omitted.
I totally agree. The lack of information does not indicate anything and, some time ago, was even the standard in some of the most prestigious outlets such as Science and Nature.
I am also surprised that some anonymous social psychologists now claim that their procedures are completely different (and always have been), without any reliable evidence (apart from “my personal experience”).
I remember vaguely that the Nobel Prize winner Daniel Kahneman wrote an open letter saying that the field of social psychology needs to regain its credibility. This “we are different” attitude is understandable but does not increase credibility.
The Stapel-type cases were clearly unacceptable in social psychology at that time as well.
Ambiguous cases such as the current one (Foerster) are in my eyes something completely different, because data fabrication/fraud has not been proven (and has also not been ruled out completely).
Some (social) psychologists and people from other disciplines apply their (current) standards to a paper that has been accepted before the whole field started to change (as outlined above).
I guess that this is what a social psychologist would call a hindsight-bias phenomenon…..
A link to the Kahneman letter: http://www.decisionsciencenews.com/2012/10/05/kahneman-on-the-storm-of-doubts-surrounding-social-priming-research/
Yes, the lack of methods details does not prove anything in itself, but despite the brief report of methods in Forster & Denzler 2012, Forster in his various defenses has provided contradictory descriptions of the participants, the duration of the tasks, and whether they were between-participant or within-participant studies. In other words, in his defense he has admitted to lying in his brief report of his methods. This is unambiguous.
I do not see how at this point you can say that the current case is ambiguous. The very same statistical methods that we use to support our research conclusions show far, far beyond a reasonable doubt that the data reported cannot be real. What part of the statistical argument do you disagree with? Results like those could not have occurred. The same statistical techniques find that Mendel’s data are too good. The methods are well-established. The data reported are unambiguously fraudulent. Who did it and who is responsible are the only doubts.
Hi PowerNerd, thanks for the link to the Kahneman letter. Daniel Kahneman’s line is very clear.
Statistician Richard Gill of Leiden University (see one of the other threads on RW about Jens Förster) suggested that the Alexander von Humboldt foundation should give the grant of 5 million euro to Jens Förster.
With the condition that Jens should conduct a precise replication of all 42 experiments published in the three papers (Förster & Denzler 2012, Förster 2011, Förster 2009). Anyone over here who wants to comment on this proposal of Richard Gill?
Jens Förster has admitted that he has thrown away (a large part of) the raw data, but what about the protocols of all these 42 experiments? I am not a psychologist, but I assume these protocols have been stored. Am I right?
Within the Netherlands, Erasmus University Rotterdam has also recently published an extensive report with more or less the same proposals as those in Daniel Kahneman's letter. The report of the "Taskforce Scientific Integrity Erasmus University Rotterdam" was published in October 2013. It can be downloaded from http://www.rsm.nl/fileadmin/Images_NEW/News_Images/2014/Taskforce_Scientific_Integrity_EUR.pdf
Highly recommended (only minor parts are in Dutch).
————————————————————————————
Appendix 4: “Letter about scientific integrity from Professor Verbeek to ERIM fellows, members and doctoral students, 04/07/2012 (reference MV/tv 0012.003840).
General recommendations for storing research data
1 Always maintain copies of the original, "raw" research data. In case of paper and pencil questionnaires, this means storing the actual forms. In case of electronic data, it means the original completed electronic forms. In case of qualitative research, it means the original audio files or transcripts of interviews, or field notes. In case of secondary data or data collected by others, it means the originally obtained data (data ownership issues permitting). Thus, while the nature and form of the actual raw data may vary, the basic principle applies that the researcher should be able to convincingly demonstrate that this original version of the raw research data has not yet undergone any selection, purification or transformation steps.
2 We recommend tying the original data to the identity of the research informant/participant, even in the case of confidential data. Confidentiality can be maintained by separately storing a key, controlled by the (lead) researcher. At a minimum, the identity of individual research informants/participants should be recorded (without necessarily relating this to a specific response). The key principle here is that anonymity can be guaranteed (if necessary) with respect to published data, without sacrificing the identity of research participants for the original data collected.
3 The data collection process should be clearly described. This includes the names and roles of the researchers involved and/or the organisations providing the data (such as research agencies). The descriptions should be detailed to the extent that the process can fully be traced back.
4 The data input and analysis procedure should be documented in detail, so that the analysis can be replicated exactly. This includes major analysis steps that may in the end not be reported in further publications, but which have been instrumental in steering the analysis process. All substantial files should be stored, including for instance specific software syntax, diagrams, graphical presentations, etcetera. Again, the names and roles of the researchers involved should be provided.
5 For each crucial data compilation, purification or transformation step, it is recommended that clearly identifiable and described data sets are stored. (Crucial steps transform data such that it is impossible to revert to the rawer data when only the transformed data is available.)
6 All original, “raw” data and the documentation of the data collection and analysis process should be stored for a minimum of five years after publication of the most recent publication using this data. This applies as long as specific professional or journal policies do not require a longer storage period.
7 All (electronic) “raw” data and the documentation of the data collection, input and analysis process should be stored in duplicate. At least one set of data should be stored on a university or external network, with appropriate safeguards regarding anonymity and data ownership. Data collected on paper should be transferred to an electronic medium in its entirety and stored electronically (where possible by scanning the entire documents).
8 In the case of co-authored papers where another person is executing data collection, input and/or analysis, we recommend storing a copy of the raw data yourself as well (confidentiality and data ownership issues permitting), and storing the documentation of data collection, input and data analysis procedures.”
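As a purely illustrative aside on recommendations 1 and 7: a lab could archive a duplicate copy of its raw data together with checksums, so that any later selection, purification or transformation step can be verified against the untouched originals. A minimal sketch (Python; the directory names are hypothetical and this is not part of the Erasmus report):

import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_raw_data(source_dir: str, archive_dir: str) -> Path:
    """Copy every raw-data file to an archive location and write a manifest with
    SHA-256 checksums, so the untouched originals can be verified later."""
    src, dst = Path(source_dir), Path(archive_dir)
    dst.mkdir(parents=True, exist_ok=True)
    manifest = {"archived_at": datetime.now(timezone.utc).isoformat(), "files": {}}
    for f in sorted(src.rglob("*")):
        if f.is_file():
            rel = f.relative_to(src)
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # duplicate copy, cf. recommendation 7
            manifest["files"][str(rel)] = hashlib.sha256(f.read_bytes()).hexdigest()
    manifest_path = dst / "MANIFEST.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path

Run once per study, e.g. archive_raw_data("study_12/raw", "archive/study_12") (hypothetical paths), this leaves a second copy plus a MANIFEST.json that anyone can re-check years later.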
Klaas, interesting points there. Such guidelines should form part of any research institute's policies. Do you know, or can you find out, how Erasmus University Rotterdam defines authorship, and who can be an author on a paper? I ask this because one of his co-authors sounds like a guest author, even according to JF's statement: "My co author of the 2012 paper, Markus Denzler, has nothing to do with the data collection or the data analysis. I had invited him to join the publication because he was involved generally in the project."
JATdS, I am quite sure that Erasmus University Rotterdam will not have very precise rules on who can be an author of a paper.
The Netherlands Code of Conduct for Scientific Practice states: "I.4 Authorship is acknowledged. Rules common to the scientific discipline are observed." I tend to think this means that the rules of the journal must be followed.
The Netherlands Code of Conduct for Scientific Practice applies to all Dutch universities, so also to UvA and Erasmus University. See
http://www.uu.nl/SiteCollectionDocuments/The%20Netherlands%20Code%20of%20Conduct%20for%20Scientific%20Practice%202012.pdf (UU = Utrecht University).
Please be aware that the Smeesters case was most likely the trigger for Erasmus University to install such a "Taskforce Scientific Integrity Erasmus University Rotterdam".
Thanks, Klaas, for that revelation. It is actually fascinating that one of the basic pillars of publishing, authorship, should be determined by the publishers and journals. This makes the system open to biased standardization and possible abuse. To reduce the power of publishers, I think research institutes should evaluate all members of the team behind a paper and decide whether those individuals can or should be authors. Authorship disputes arise when universities have strict ethics but lax or no rules on authorship. In the end, the ones who lose are the authors who are victimized.
Klaas, I really like the idea that Forster could keep the 5 million euros if he used it to try to replicate, under observation, the research reported in those papers.
I am not excited, however, about ideas for adding a lot of rules and expectations for archiving data. Over the course of my career (I’ve recently retired but still keep a hand in), there was a steady increase in the number of non-scientific chores I needed to do before I could do science. Institutional review boards for human participants, accounting forms, space audits, etc., etc. I am wary of adding to those lists. And why do we have those things? A few people abused human participants, a few people used research funds for things they shouldn’t have, a few people hoarded space, etc., etc. And the remedy was to place additional burdens on all of us. I don’t think that is a good thing for science.
Even with all those data reporting and archiving rules, I'm pretty sure I could figure out how to cheat. The evidentiary chain will never be perfectly verifiable. Many data in my field are collected online and stored in electronic files, and editing or even creating those files without any data collection ever having taken place would be easy and very difficult to detect. We have Mendel's "raw" data as archived in his hand-written notebooks, yet statistically his data are too good to be true. We don't know whether he consciously cheated or whether his knowledge of his hypothesis influenced his visual classifications. Either way, requiring the "raw" data is no real barrier to cheating. I find it remarkable that a smart guy like Forster did such a poor job of faking data. The R code we've used to catch him could be turned around to generate realistic-looking data that would not be detectable by statistical means.
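To make the Mendel point concrete, here is a rough sketch of how "too good to be true" can itself be tested (Python rather than R, with invented counts, only to illustrate the logic): sum the chi-square goodness-of-fit statistics over a batch of experiments and ask how often honest sampling would produce a total fit that tight.

import numpy as np

rng = np.random.default_rng(1)

def too_good_to_be_true(observed_counts, expected_ratio=(3, 1), n_sims=10_000):
    """Sum chi-square goodness-of-fit statistics over a batch of experiments and
    estimate how often honest sampling would give a TOTAL chi-square this small.
    A tiny probability means the counts hug the expected ratio more tightly than
    chance alone should allow."""
    probs = np.asarray(expected_ratio, float)
    probs = probs / probs.sum()
    total_chi2 = 0.0
    sim_totals = np.zeros(n_sims)
    for counts in observed_counts:
        counts = np.asarray(counts, float)
        expected = counts.sum() * probs
        total_chi2 += ((counts - expected) ** 2 / expected).sum()
        sim_counts = rng.multinomial(int(counts.sum()), probs, size=n_sims)
        sim_totals += ((sim_counts - expected) ** 2 / expected).sum(axis=1)
    return total_chi2, np.mean(sim_totals <= total_chi2)

# invented counts, each suspiciously close to a perfect 3:1 split
chi2, p = too_good_to_be_true([(601, 199), (452, 148), (902, 298)])
print(f"total chi-square = {chi2:.3f}; P(a total fit this good by chance) = {p:.4f}")

With these invented counts the total chi-square is around 0.06 where honest sampling averages about 3, and a fit that tight arises by chance only a few times in a thousand. And, as said above, the same machinery run in reverse could be used to fabricate data that passes exactly this kind of check.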
There will always be some cheats, but science eventually gets it right. We might worry about the embarrassment to funders that give 5 million euros to researchers, but I don't worry about science. Ideas either get replicated or not, lead to other interesting research or not, lead to substantive improvements in daily life or not.
I was an early adopter of computer methods for collecting data. If someone wants to question my dissertation data, I still have the paper punch tapes on which the data were recorded. If anyone can read them, they can check my data…
PowerNerd, I fully agree with you that you don't need a large number of strict rules about storing raw data and keeping an accurate record of every protocol. Any good researcher will be aware of these requirements and will always work to these standards.
I am not the person deciding what to do right now with the 5 million euro of the Alexander von Humboldt Foundation (public money, as far as I am aware) intended for Jens Förster. There are reasons not to give it to Jens Förster anymore, but there are also reasons to give (parts of) it to him. You are totally right that the second option should only be pursued if Jens Förster is willing to replicate the 42 studies under close observation, and on the condition that the full protocols of all 42 studies are still available.
I would like to end with a recent quote of Dave Fernig (http://ferniglab.wordpress.com/):
“We should note that once again science is becoming a small village. While the days when the world’s entire molecular biology community could meet at Cold Spring Harbor are long gone, the internet and social media bring back important elements of that time. We can all fit into “one room” albeit a virtual one, with many conversations happening simultaneously. This leads to far greater public scrutiny of scientific output (aka papers). While no one can read every paper, someone, somewhere has read your paper, very, very carefully. If there is something they are unhappy about, we will all know about it in due course.”
The Humboldt Foundation will not make any decision about the Jens Förster case for the time being. I asked the Humboldt Press Office if the award will be conferred to Förster and received this response: “We will, if necessary, seek independent expertise and, if necessary, re-submit the nomination to the selection committee in case the prior tests will turn out positive. The selection committee will meet again in October 2014.”
Rolf, thanks for your update. The statement is quite vague, but I understand that there is a delay of at least several months before the Alexander von Humboldt Foundation might make a new decision about the grant. Does this delay imply that Jens Förster is currently also unable to start his new job at Ruhr-Universität Bochum?
I assume that he will not start his new job in Bochum, as the Humboldt Award is supposed to provide "an internationally competitive salary" for the laureate.
Jens Förster on 11 May 2014: “The series of experiments were run 1999–2008 in Germany, most of them Bremen, at Jacobs University. The specific dates of single experiments I do not know anymore. (..). During the 9 years I conducted the studies, I had approximately 150 co-workers. At times I had 15 RA’s who would conduct experimental batteries for me. They entered the data when it was paper and pencil questionnaire data and they would organize computer data into workable summary files (one line per subject, one column per variable).”
Jens Förster on 29 April 2014: “(..) the dumping of questionnaires (that by the way were older than 5 years and were all coded in the existing data files) because I moved to a much smaller office. I regretted this several times in front of the commissions. However, this was suggested by a colleague who knew the Dutch standards with respect to archiving. I have to mention that all this happened before we learned that Diederik Stapel had invented many of his data sets.”
http://www.dgps.de/index.php?id=143&tx_ttnews%5Btt_news%5D=1138&L=0&cHash=a1711fd5ecd8508a289d1caaf727134b
“Jens Förster has been Professor of Psychology at the University of Amsterdam in the Netherlands since 2007. Born in Germany, he first went to Würzburg after his doctorate in Trier and from there, as a postdoctoral researcher, to Columbia University in New York, USA, before returning to Würzburg. This was followed by a temporary professorship in Duisburg, until he returned once more to the University of Würzburg in 2001 as a Heisenberg Fellow. From 2001 to 2007 Jens Förster was a professor at Jacobs University in Bremen.”
So the dumping of the questionnaires took place between 2007 and 2011 when Jens Förster was working at UvA. At that time, the questionnaires were older than 5 years.
2011 – 5 = 2006 or earlier
2010 – 5 = 2005 or earlier
2009 – 5 = 2004 or earlier
2008 – 5 = 2003 or earlier
2007 – 5 = 2002 or earlier
Also lacking is a lot of the primary data (all primary data?) of the 2242 undergraduates who were participants in the 42 experiments run in 1999-2008 in Germany, mostly at Jacobs University.
https://www.jacobs-university.de/academic-integrity-code
http://www.mpg.de/199493/regelnWissPraxis.pdf
http://www.dfg.de/download/pdf/dfg_im_profil/reden_stellungnahmen/download/empfehlung_wiss_praxis_1310.pdf
“Safeguarding Good Scientific Practice. Primary data as the basis for publications shall be stored for ten years on durable and secure media in the institution where they originated.” Please read the last part of page 22 (of the German version).
“Primary data as the basis for publications shall be securely stored for ten years in a durable form in the institution of their origin. The published reports on scientific misconduct are full of accounts of vanished original data and of the circumstances under which they had reputedly been lost. This, if nothing else, shows the importance of the following statement: The disappearance of primary data from a laboratory is an infraction of basic principles of careful scientific practice and justifies a prima facie assumption of dishonesty or gross negligence.”
Jens Förster on 29 April 2014: “I cannot understand the judgments by LOWI and UvA.”
This reads like a lot of sloppy science by Forster, and a disregard for that sloppiness. I am also truly appalled by the amount of funding directed towards his research, which is rather obscure outside social psychology. It's frankly all really disgusting. I was hoping there would be some silver lining to this case, but I can't find it…
Cheer up, TS: Cases like this will inspire a new generation of social psychologists to join the ranks of the professional social psychologists out there, to take control of the field, by conducting their science, in all its aspects, rigorously by the book. That is the silver lining.
Thank you for this very sensible comment. I am sad, but also very glad, that this is happening now and not in ten years, so that we can make an effort to make things better from now on.
From QRP (Questionable Research Practices) to QRR (Questionable Research Results)… I suspect this will keep us busy for a while.
Let’s face the real problem.
The real problem is not whether JF manipulated the data or not.
If he says he didn't and there is no way to prove otherwise, I would personally give him the benefit of the doubt.
The real problem is that the data IS manipulated and that neither he nor many other influential representatives of the system saw it, or even see it now, after it has been thoroughly explained.
The real problem is that our science demands, supports and promotes this sort of bad science.
As JF openly admits in his letter (again without the slightest idea of what might be wrong with it), he conducts 80-100 experiments a year and publishes those that confirm his hypothesis.
The empirical part of those articles is mere rhetoric, and by no means what the president of the German Psychological Association called "methodologically rigorous".
The real problem is that all this pseudo-empirical data is worthless at best and misleading at worst. Under these circumstances, one has to applaud the Stapel approach for at least saving taxpayers' money from being wasted on zero evidence by just making up the data.
Just a short addition: in my first comment I said that I would give JF the benefit of the doubt concerning the question of whether it was him or somebody else who manipulated the data.
A different question, of course, is whether somebody who is not able to see what is wrong with the data, even after it has been explained to him, should be teaching students an empirical science.
Nothing wrong with running multiple experiments and publishing the ones that "replicate consistently". The problem is, with the data missing and all, there is no way to check for cherry-picking. There is also no problem with under-reporting dropouts (or not calling them dropouts) if they dropped out before generating any usable data for the study (for example, they only filled out a personality questionnaire, with no experimental measures).

There is a BIG problem with Forster feeling cornered and implying that one (or more) of his co-workers may have conceivably manipulated his data (*conceivably* the big bang is all a big fat lie and so is evolution, *conceivably*). The problem with this is that the only people able to have manipulated the data consistently would have been the ones who worked on those projects the longest. Forster himself is a good candidate, as are the people listed as authors, not the undergraduate research assistants Forster implies would have wanted to please him. Those poor bastards worked in his lab for 1-2 years at most. Denzler, whom Forster so eagerly does not want to implicate, has been his right hand in managing his lab, and one of the people who, alongside Forster, had access to ALL of the data collected in the last 4 years in Germany (at least).

*Conceivably*, one could think that Forster does not want to implicate Denzler (and lies about Denzler not having any role in data collection) because he knows that he doctored the data himself. That's how he's sure of Denzler's innocence. *Conceivably*, this could be what's happening. Or Denzler is one of the people who still has the original files and could rat him out at any minute. In any case, it is a huge red flag to me that Forster is choosing to shift blame onto undergraduate research assistants who blindly ran subjects and entered data one line per subject and one column per variable.