Anatomy of an inquiry: The report that led to the Jens Förster investigation

Jens Förster

We have obtained a copy of the report that led to the investigation of Jens Förster, the social psychologist at the University of Amsterdam; the university is calling for the retraction of a 2012 article by the researcher over manipulated data.

As we reported earlier, Förster has denied any wrongdoing in the matter.

But as the report makes clear, investigators spotted several red flags in Förster’s work. Here’s the abstract, which makes for interesting reading:

Here we analyze results from three recent papers (2009, 2011, 2012) by Dr. Jens Förster from the Psychology Department of the University of Amsterdam. These papers report 40 experiments involving a total of 2284 participants (2242 of which were undergraduates). We apply an F test based on descriptive statistics to test for linearity of means across three levels of the experimental design. Results show that in the vast majority of the 42 independent samples so analyzed, means are unusually close to a linear trend. Combined left-tailed probabilities are 0.000000008, 0.0000004, and 0.000000006, for the three papers, respectively. The combined left-tailed p-value of the entire set is p = 1.96 * 10^-21, which corresponds to finding such consistent results (or more consistent results) in one out of 508 trillion (508,000,000,000,000,000,000). Such a level of linearity is extremely unlikely to have arisen from standard sampling. We also found overly consistent results across independent replications in two of the papers. As a control group, we analyze the linearity of results in 10 papers by other authors in the same area. These papers differ strongly from those by Dr. Förster in terms of linearity of effects and the effect sizes. We also note that none of the 2284 participants showed any missing data, dropped out during data collection, or expressed awareness of the deceit used in the experiment, which is atypical for psychological experiments. Combined these results cast serious doubt on the nature of the results reported by Dr. Förster and warrant an investigation of the source and nature of the data he presented in these and other papers.
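
To make the method described in the abstract a little more concrete, here is a minimal sketch, under our own assumptions, of how such a left-tailed linearity check and the combination of p-values could be computed from descriptive statistics alone (per-group means, standard deviations and sizes). It is only an illustration of the kind of test the abstract describes, not the complainants’ actual code, and the numbers in the example call are invented.

```python
# Minimal sketch (not the complainants' code) of a left-tailed test for
# "too linear" means in a three-level between-subjects design, using only
# descriptive statistics, plus Fisher's method for combining p-values.
import numpy as np
from scipy import stats

def left_tail_linearity_p(means, sds, ns):
    """P(nonlinearity F as small as, or smaller than, observed) under normal sampling."""
    means, sds, ns = map(np.asarray, (means, sds, ns))
    c = np.array([1.0, -2.0, 1.0])                      # quadratic (nonlinearity) contrast
    ss_nonlin = (c @ means) ** 2 / np.sum(c ** 2 / ns)  # contrast sum of squares, 1 df
    ss_within = np.sum((ns - 1) * sds ** 2)
    df_within = np.sum(ns) - len(means)
    f_obs = ss_nonlin / (ss_within / df_within)
    return stats.f.cdf(f_obs, 1, df_within)             # left tail: small = "too linear"

def fisher_combine(pvals):
    """Fisher's method for combining independent p-values."""
    chi2 = -2.0 * np.sum(np.log(pvals))
    return stats.chi2.sf(chi2, df=2 * len(pvals))

# Invented example: three groups of 20 whose means sit almost exactly on a line.
p1 = left_tail_linearity_p([3.00, 4.00, 5.01], [1.2, 1.1, 1.3], [20, 20, 20])
print(p1, fisher_combine([p1, 0.02, 0.005]))            # the 0.02 and 0.005 are made up
```

The logic is the mirror image of an ordinary significance test: means that hug a straight line more tightly than sampling error allows yield a deviation-from-linearity F far below its expectation, and hence a very small left-tail probability.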

Read the whole report here.

Please see an update on this post, including the final LOWI report.


239 thoughts on “Anatomy of an inquiry: The report that led to the Jens Förster investigation”

  1. “one out of 508 trillion (508,000,000,000,000,000,000)” ?

    Mind those zeros! Correct me if I am wrong but a “trillion” in the business world is a thousand billion: that is, a 1 followed by 12 zeros, not by 18 zeros.

    Nevertheless, I do agree that a probability of one in 508,000,000,000,000 is “slight”.

    And “peer reviewed” science by famous scientists sometimes brings much more serious issues in basic maths: see Questions 2, 3 and 4 in http://www.australianparadox.com/pdf/quickquizresearch.pdf

    1. It’s a (long scale) trillion, i.e. (one million)^3 or 18 zeros. I’m not au fait as to whether the Dutch are long or short scalers by rule, but it’s 508 quintillion (short scale).

  2. It is interesting that the “defendants’” names are freely disseminated (despite the committee’s regulations to the contrary) while the “prosecutors” remain anonymous. In a scientific exchange, articles have authors. Not here.

    1. There is always the concern that identification may lead to some retribution by the “defendants” or their associates. There seems little point in identifying them; the analysis is based purely on data available in the paper or obtained from the investigators. Anyone can question the analysis. While I think they could have explained things in a better way, the analysis used is essentially sound. The points lie unusually close to a line.

      An interesting point is that the author claimed that he had waited 5 years after the study before destroying the original forms. In the Australian system the requirement is 5 years from date of publication. It is not clear if that applies to original forms or only derived data. A prudent option seems to be that if you are still analysing data you need to have access to the originals; otherwise, how can you verify any data queries? Use of a scanner may be acceptable. Throwing it out when you still want to publish seems dubious.

      1. Well, how about some “concern” that the premature (as there is no decision yet, or is it?) identification of the defendant has consequences for him, something like: ruining his career? Yes, ethical conduct is important but shouldn’t it go both ways?
        I for instance find it interesting that (a) the investigation was only the second try, after earlier proceedings regarding the same matter did not ascertain any misconduct; someone seems to have an interest here that seems to go beyond mere observation and report; (b) no justification whatsoever of the methods used for the analyses was provided, and alternatives are not discussed (as one would expect from any scientific paper); (c) suspicion is repeatedly raised based on unsupported assumptions: data are simply considered “not characteristic for psychological experiments” without any further justification; (d) the report is not authored (was the author identical with the “prosecutor”? is that a good thing?)–not sure any court would accept unauthored reports. And yes, throwing away the questionnaires was a stupid mistake; but what if we had them, would that rule out fabrication? How?
        All in all, as much as I appreciate critical thinking, I feel that our science went over the top: What about respect and trust (until the final verdict, of course), are they no longer part of our ethical values?

        1. “Someone seems to have an interest here that seems to go beyond mere observation and report”

          That’s speculation and, even if it were true, quite irrelevant. The motives of the critic(s) are nothing; the validity of their argument is all that matters here.

          1. This case is no longer in the process of being evaluated. The LOWI is the legal institution charged with identifying fraud and misconduct. Their lengthy and meticulous investigation has now been finished.

            Hence, Forster does not currently have the status of “accused”. The investigation has concluded that the data are fraudulent (so this is now a legal fact). Both LOWI and the University of Amsterdam hold Forster responsible for the data-manipulation.

            So these aren’t hypotheses that now need to be investigated. They are conclusions that the community and relevant organizations now need to act on.

          2. There is no doubt in my mind that Forster should sue about this. The outrageous way LOWI acted should be challenged.

          3. Nonsense. LOWI’s legal role is merely to advise the university on a course of action. The university elected to follow LOWI’s advice. If anyone can be sued, it is the university.

            Still, having read LOWI’s report, I fail to see what is so outrageous about their actions. They were asked to review the university’s proposed decision, and have done so.

          4. I agree. From the published documents it is not clear why the UvA changed their initial decision. This demonstrates that the report of the complainants, or its implications, is not as crystal clear as the report suggests. This, of course, is only true if both organizations (UvA and LOWI) did a proper investigation.
            This leads me again to the conclusion (as others have claimed as well) that science should take over and that more (maybe also direct) replications are necessary.

          5. The only way to rule that speculation out is to find out who the accuser was. From Forster we know quite a bit: he has exposed himself and his work repeatedly to judgments of his fellow-scientists; he has justified his methods explicitly and has applied standard stats that have been tested for validity before. He has exposed all his work for scrutiny, as any other scientist.
            We have not the slightest idea whether all that applied to the anonymous accuser. Do we know his/her method is valid? I don’t–the guy provides no evidence. Has it been rigorously tested and validated before? I don’t know–the guy provides no evidence. Can it be applied to the present (non-metric) data? I don’t know–the guy provides no evidence. Are there better methods? I can go on like this. Much of that might be due to my (admittedly) limited expertise but the very point of scientific products is that they provide EVIDENCE even for the uninformed reader that he/she is doing proper science. Normally, I would have looked up the writer’s CV to get a rough idea (as I do trust the critical judgments of other scientists working as editors and reviewers, so relevant articles in scientific journals count for me) but I can’t. I also don’t know who has evaluated the report and checked its validity.
            I think there is a reason why scientific products carry the name of their creator; does that not apply here?

          6. Yet in spite of all these judgments from fellow scientists, none of them apparently ever noticed that his data was impossible. The LOWI report also mentions that various other culpable errors were found by the university, something that these fellow scientists also seem to have missed.

            We technically don’t know who the complainants are. I don’t see why you’re so interested in their CVs. Either their work stands up on its own merit or it doesn’t.

          7. Because that is beyond my expertise (and time constraints); it is the obligation of writers with scientific ambitions to make their case in a way the reader understands, provide the necessary references etc. Right? Plus, as Tom Johnstone rightly points out, we actually have no idea what we are discussing here (i, ii or iii)!

  3. On the basis of this report, it seems clear that Förster fabricated the data and it also appears that his skills in generating data are inferior to his skills in paper writing.

  4. What is interesting is that the data would appear equally fake in all three papers mentioned, yet only one is being retracted. What’s going on here?

    1. The report recommends retraction on the basis of similarity in means and does not discuss the linearity argument. The similarity in means has a much smaller probability for the 2012 paper than for the other two.

  5. “The combined left-tailed p-value of the entire set is p = 1.96 * 10^-21, which corresponds to finding such consistent results (or more consistent results) in one out of 508 trillion (508,000,000,000,000,000,000).”

    Lucia de Berk anyone?

    1. Lucia de Berk’s case was not a “random sample”. The whole point of scientific procedure is that these data ARE supposed to reflect random samples. So the comparison fails.
      Perhaps there is an argument to be made here that for some reason these samples behaved non-randomly (like: the same class of 30 students was repeatedly measured over 70 times, although that would be equally damning from the science viewpoint). If so, make that argument. Just saying “there have been other unrelated cases with low p” is not sufficient.

  6. The question of whether the data was actually faked (just thought up) or whether the anomalous results are due to a Questionable Research Practice needs further research. The anomalies are unusual. I never saw anything like this before. Is this the result of incredibly naive data fabrication, or of an “innocent” QRP such as removing subjects from a much larger pool of respondents because the results get better when you do that, so obviously the removed subjects were “outliers”? Or were the studies simply repeated many times till a good result came out? Either way: does that generate *linearity* as well as just “better p-values”? The fact that every study has an identical number of subjects and all are perfectly balanced – i.e. text book experimental design – is incredibly suspicious. There is no good reason to insist on a perfect experimental design. It just looks “more scientific”. The recent affair concerning the publications of the clinical psychologist Elke Geraerts is exactly of this nature. The way the subjects are recruited definitely does not initially give an equal number of subjects in each group … with a complete set of scores. In fact it generates a lot of “supernumerary” subjects in some groups. Later the numbers were reduced – so as to make the study look more scientific??? – thereby opening a window of opportunity for deliberate or accidental bias, by selecting subjects to fit the theory.

    1. No, the data must be created. Basically you have a control group, and two groups one that should produce an increase and the other a decrease. In all experiments in the 2012 paper the increase above control is almost exactly equal to the decrease in the other group. What the reviewers do is to plot the 3 groups showing that they are in a straight line and then show that the variation is much less (unbelievably less) than would be expected by chance.

      The next step has to be a review of all papers, ever, published by this guy.

      1. You are right saying “Basically you have a control group, and two groups one that should produce an increase and the other a decrease. In all experiments in the 2012 paper the increase above control is almost exactly equal to the decrease in the other group. What the reviewers do is to plot the 3 groups showing that they are in a straight line and then show that the variation is much less (unbelievably less) than would be expected by chance.” But what do you mean by “created”? Do you mean “just made up”? One can also “create” impossible results by careful selection procedures which, under the name “questionable research practices”, are actually very very common research practices. Moreover, in this particular field, if you do not indulge in the just-mentioned practices you’ll never ever get a significant result so never get a publication – at least, not often enough to build a career. So the top people in the field are not necessarily top fakers, they can also be top intuitive users of QRP’s. Förster’s total lack of understanding of the statistical issues here certainly could explain why his faking, if he faked, was so transparent. But it could also mean that he inadvertently faked the data by instinctive and really skilled use of QRP’s. I guess we will never know for sure. But does it matter? The integrity of the science has been proven nil. The integrity of the scientist becomes irrelevant. They are incompetent, or worse.

        1. PS and now it should not be a question of the University trying to exact revenge on a cheating employee: they should be asking themselves how come they ever hired someone like this? I think we need to take the morals out of this integrity business and put back the science. If you are a great university you hire lots of really promising people and they become professors and basically have the freedom to follow their scientific instincts and do the research that they (not society or business or whatever) think needs to be done. Academic freedom. One in a hundred will turn out to be duds. Bad apples. OK so you clearly label them as such and separate them a bit from the other apples so they don’t go bad too, and you carry on, business as usual. You review your selection and hiring and promotion procedures, you worry perhaps whether the accent on quantity above quality in academia has led to this kind of disaster, and you let the poor bad apples try to recover their reputation by doing some good research for a change. Some very careful replication experiments, for instance.

          1. I don’t think anyone has hiring practices that are sufficiently sophisticated to allow us to say, post hoc, that this or that factor indicates a high chance that the candidate may later commit fraud. Certainly not without an unacceptably high false-positive rate.

            However, we all (including hiring committees, and the press offices who report the results of this “transfer market”) need to stop fawning over people who get great results simply because their results are great. (Compare financial traders: one in 32 will out-perform the average 5 years straight just by flipping coins for every decision.) There simply aren’t that many highly insightful individuals who have hypothesised something that is at once novel, reproducible, and has a large effect size, because there simply aren’t that many phenomena like that out there that we weren’t already aware of from Plato, Shakespeare, and common sense. In fact I suspect that many of these superstars’ gee-whiz results in social psychology are outliers, due to luck, publication bias, p-hacking, or, more often than we like to admit, massaging of data.

            I also worry that we’re only catching the people who are not very good at making up their numbers. It really isn’t very hard to use Excel to generate two sets of normally-distributed random numbers, whose means differ by just the right amount to give you whatever effect size you need to get invited to deliver that TED talk. (Of course, those random numbers also have to have ecological validity. One of Smeesters’ mistakes was to use numbers that were too random, for the prices that people stated they were willing to pay for an item; he should have used mostly round numbers, but instead he had them all quoting prices to the cent, thus requiring us to believe that an undergraduate, when asked how much they would pay for this T-shirt, might say something like “Uh, I dunno, $10.39 maybe?”)
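
            For the curious, here is a toy sketch of that “too random” point, with entirely made-up numbers and no connection to any real dataset: genuinely quoted prices tend to cluster at round values, whereas naively simulated prices spread their cents uniformly, and even a crude digit-preference check picks this up.

```python
# Toy illustration only: invented "prices", no real data involved.
import numpy as np

rng = np.random.default_rng(0)

human_like = np.round(rng.normal(10, 3, 300) * 2) / 2   # mostly $X.00 / $X.50
naive_fake = np.round(rng.normal(10, 3, 300), 2)        # quoted "to the cent"

def share_round(prices):
    """Fraction of prices whose cents part is .00 or .50."""
    cents = np.round(prices * 100).astype(int) % 100
    return np.isin(cents, [0, 50]).mean()

print(share_round(human_like), share_round(naive_fake))
```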

          2. A sophisticated faker could certainly make detection harder but it is very difficult to mimic the real data if you don’t have the real data to copy. That is unless the data conform to a very simple physical process (which psychological data generally don’t). Notably it is hard to get the intercorrelations between elements of the data set correct.

            What may be difficult is distinguishing fabrication from QRPs.

    2. It would indeed be interesting to know whether questionable research practices can lead to such an outcome. For example, if one engages in selective reporting, how many studies would you need to run to find results that are that “good”? On the other hand, the problem with this argument is that there is no motive to produce data that is so astonishingly perfect. What counts is the right pattern and the significant differences between the 3 means. There is no need to hack the data until it conforms to perfect symmetry.

    3. The authors of the report considered the QRP possibility; they say it is unlikely that there were other studies, or variables, unreported because of the sheer amount of studies you would have to run to find these patterns by chance.

      The possibility that you could create linear results by excluding datapoints is an interesting one, though that would itself imply a high degree of calculated data manipulation and would certainly justify the retraction of the papers involved IMO.

      1. I think what dr. Gill means is that the anomalies are different from anomalies found in papers from other fraudsters and researchers known for QRP.

    4. It is unclear to me how any combination of QRPs would lead to this pattern of results. It seems to me that the LOWI has correctly concluded that the data have been manipulated. You can use QRPs to show that listening to the Beatles makes you younger. But not to get the condition means of dozens of studies to lie dead weight on a linear regression line.

    5. Speculation: What if Foerster recruited a part of his N but fabricated the rest to make his outcomes and claims more impressive (and perhaps removed a bunch of outliers)? This could explain the figures & is in line with the strange discrepancies between number of enrolled undergrads @ UvA and those he claims to have recruited for his studies (another point of criticism in the report). This can imply the effects he investigated could be real (or not), but were made to look more impressive. Like a shot of steroids.

    6. If Förster used those QRPs, then surely he would have made such a statement during the investigation? In that case, the committee would have had to decide on how grave that practice was. He obviously did not make that defense. His defense is that the numbers can be acquired in the way stated in the paper, i.e. without QRPs.
      It is not our job to make up possible excuses, discuss which is least severe, for Förster to then pick that one in his defense.

  7. I am astonished by the tone of the report. It sounds like the authors were on a mission. I am not saying that the analyses are wrong or the like, but it is definitely not a scientific report.
    I would love to see a scientific approach to this issue and the allegations. Science can never ever be as strong as it appears to be in the report and it is typically more cautious. This is one of the strengths of science.
    As a first step in my eyes, the authors should reveal their identities, which would increase their credibility.

      1. Because tone and identity tell us something about motives: true interest in scientific integrity or witch-hunt. Why not focus on respect? Or can you tell from (not well motivated, anonymously performed) statistical analyses (which you somehow seem to trust more than Foerster) who did what and with what intention? Isn’t that what a fair trial is supposed to judge? “Evidence” never speaks for itself, it needs to be weighed and interpreted–that’s what I thought science is about…

        1. Evidence needs to be interpreted which is why I’m suggesting that we look at the evidence, not at personalities.

          The analyses in the report provide prima facie evidence of wrongdoing. But maybe the analyses are flawed, or have been misinterpreted. To find out, we’ll have to look at the analyses… and not get distracted by worrying over who produced them.

        2. “Because tone and identity tell us something about motives: true interest in scientific integrity or witch-hunt.”
          Precisely. The most likely scenario is a close competitor who may have found a pretext to slim down the competition a bit. I hope this information emerges at some point, perhaps if Forster sues and the names become public. Of course, we’ll never know if the accuser would have made the same accusation against a friend or a friendly collaborator.

          1. The analyses come from methods people, who are no competitors whatsoever, but were only concerned about the integrity of Forster.

          2. Why does it matter where the complaint came from, so long as their arguments are valid, as seems to be the case here?

            That is, unless you’re implying that all scientists in the field indulge in QRPs and all complaints would be valid. If this is the case I should think the field is in urgent need of an overhaul.

          3. ” while others are interested in who’s friends with who.”

            Yes! You have hit the nail on the head with this, and it is an important point regarding the lack of action on many claims of misconduct.

            I too am surprised by the number of posters here who seem to be taking the line that the case against Forster must be based on bias or jealousy.

          4. That may be true, but it does not mean that is the main factor in this case.

          5. “I too am surprised by the number of posters here who seem to be taking the line that the case against Forster must be based on bias or jealousy.”

            Perhaps that’s a social psychology thing…

          6. It is quite possible that the accuser does hold a grudge against Foerster – I can’t for the life of me understand why anyone would take on the stress and personal risk of carrying through a complaint otherwise.

            But the fact of this (hypothetical) grudge can’t be used to automatically exonerate Foerster.

          7. Whistleblowers do this for the sake of science. They carry a huge risk to do this. They put a lot of work into this to make it a compelling case. The evidence against Forster is overwhelming. Many people seem to forget that these analytical techniques have been used before, based on which many papers have been retracted and professors left their jobs.

          8. “Whistleblowers do this for the sake of science. ”
            I am sure that is what they (or rather we) say, but in this world of sin motives can often be mixed. Besides, sometimes someone might have excellent reasons for holding a grudge. Possibly the type of person who commits falsification might also be the sort of high functioning sociopath – alternating between charming and bullying – that tends to build grudges.
            A lot of cases you read here can be traced back to personality differences – a splendid one involving a researcher in obesity in Kentucky – his accuser claimed that the scientist in question had killed his mice in front of him and, before going to the ORI, took a number of sexual harassment complaints on behalf of 3rd parties. I am pretty sure he had a grudge! But there is no doubt the scientist complained about had been using photoshop quite heavily.

            All I am saying is the existence of a grudge should be irrelevant in such matters, it only comes into play if the complaint is found to be frivolous.

      2. Because if you know the identity of the author, you can try to undermine their credibility without having to go through the effort of checking their work.

    1. I do not see this as a scientific report. It is a complaint. Please note that we scientists have an obligation to address potential misconduct and a right of complaint. The following excerpt is taken from the uva website:

      “The University of Amsterdam endorses the principles of the Netherlands Code of Conduct for Scientific Practice (see right hand column on webpage). This Code obliges scientific practitioners not only to respect the principles of meticulousness, reliability, verifiability, impartiality and independence, but also to do everything within their power to promote and ensure compliance with these principles in their academic environment.
      One way to verify academic integrity is to exercise the right of complaint when employees of the University of Amsterdam have violated or are suspected of having violated academic integrity. To this end, the UvA has Academic Integrity Complaints Regulations, describing the appropriate measures in the event of a possible violation of academic standards.”

      In the end, the tone of the report is not relevant, and neither is the identity of the complainant. What the complaint entails, and the decision of the appropriate organisations (the uva and the lowi) that was published last Monday, is what is relevant.

  8. I like that this discussion is out in the open – science should be much more transparent, if anything.
    I don’t agree with the standard and tone of some comments here, though. Scientists should vet research and find truth – but not assign intent or ask for consequences.
    That’s someone else’s job, and under much, much more stringent norms, with greater protections for the (under that standard) “accused”.
    So let’s stick to that.
    More here: http://www.maxheld.de/blog/research/2014/05/01/linear-means/

  9. Professionalism.

    There can be little doubt that there was a lack of professionalism (and possibly worse) by Förster in his research practices. But I tend to agree with both Henk Koost and Bernhard Hommel above in that there has also been a lack of professionalism in the way the inquiry has been pursued and findings released.

    When this story broke, it wasn’t through what you would call an official and transparent release of all the relevant information and findings. Instead, it seems, partial information was released, in a manner that did not permit the scientific community to make a properly informed judgement. Either the inquiry is complete and the findings ready for official release, or they are not. If they are not, then it is highly improper to point the finger and name the scientist(s) being investigated, and worse yet to pass judgement. On the other hand, if the report is ready for release, then surely that must be done in a fully open, scientifically transparent manner?

    As for the “confidential” report linked (leaked?) above: it might well be that the analyses are valid and stand up to statistical scrutiny. But the correct way to write a report of this type is to present the facts separate from their interpretation, opinion and conclusions. When you mix facts with opinion, you are engaging in an exercise designed to convince the reader of your point of view. While perhaps (?) you might justify that when arguing a scientific opinion in a journal article, when reporting on allegations of misconduct it is clearly not appropriate.

    1. As has been noted by Richard Gill in another comment on this site, Foerster pretty much reveals the identity of his accuser in his email. This also seems a breach of confidentiality.

      1. Really? Maybe I missed the name? Where is the identity of the authors of the report revealed? The methods department at the University of Amsterdam has roughly 30 people.

        In any case, is your argument that an alleged breach of confidentiality by Forster can justify a previous violation of confidentiality by anonymous parties?

        1. I’ve seen comments on Facebook from people claiming to know now who the accuser is. I am definitely not claiming that one breach of confidentiality justifies another. I’m just pointing out that there is blame to share all around.

    2. Just to be clear: Förster was not identified in the decision of the Universiteit van Amsterdam to request retraction of the 2012 article. He was identified afterwards by journalist Frank van Kolfschooten, who was in possession of the report by the critics/complainants.

    3. And of course the report tries to convince the reader (the University’s Committee for scientific integrity) of a certain point of view: namely that there is sufficient reason to open an inquiry. I don’t see why that is a problem.

  10. Here is one way a QRP can lead to a pattern like this. You do the experiment over and over again, a new sample of 60 psychology students every time, three groups of 20, till you get an extremely significant result for the hypothesis of your theory, which you want confirmed. Because the power of the F test for equality of means of three equally sized groups is maximal when the three means are equally spaced, you will tend to stop and publish when you hit on a sample with the three averages closer than “natural” to a straight line.

    I’m supposing that the three true means are different and do have the right order (low < control < high).

    OK, this is just a hypothesis by me at the moment. Easy to test by a little simulation experiment.
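
    Such a simulation might look roughly like the sketch below. All the numbers in it (effect sizes, n = 20 per group, the stopping threshold) are invented for illustration, and it is not a reconstruction of the actual studies; it simply reuses the left-tailed nonlinearity test idea quoted in the abstract above.

```python
# Rough simulation of the selection hypothesis above. The effect sizes,
# group size and alpha threshold are invented; true means are assumed
# to be exactly equally spaced (low < control < high).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_means = np.array([-0.5, 0.0, 0.5])
n = 20

def one_published_study(alpha=1e-4, max_tries=500):
    """Rerun the experiment until the one-way ANOVA clears a stringent threshold."""
    for _ in range(max_tries):
        groups = [rng.normal(m, 1.0, n) for m in true_means]
        if stats.f_oneway(*groups).pvalue < alpha:
            return groups
    return groups                          # give up after max_tries, keep the last run

def linearity_left_p(groups):
    """Left-tailed p for the deviation-from-linearity (quadratic) contrast."""
    means = np.array([g.mean() for g in groups])
    c = np.array([1.0, -2.0, 1.0])
    ss_nonlin = (c @ means) ** 2 / np.sum(c ** 2 / n)
    ss_within = sum((len(g) - 1) * g.var(ddof=1) for g in groups)
    return stats.f.cdf(ss_nonlin / (ss_within / (3 * n - 3)), 1, 3 * n - 3)

# Without selection these left-tail p-values should be roughly uniform;
# a pile-up near zero would support the hypothesis that selecting on a
# very significant ANOVA also selects for unusually linear means.
ps = [linearity_left_p(one_published_study()) for _ in range(100)]
print(np.median(ps))
```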

    1. The LOWI report states that such a pattern cannot be obtained by QRP. Is this an exaggeration?

      1. I think it is an exaggeration. At present, no-one can imagine that this pattern would result from a QRP. Maybe we just didn’t think hard enough, yet.

        1. I think something like this could work, but is extremely implausible. The issue is the number of participants required for such QRPs. The reported data are 2000+ hours of subject time from independent subjects … repeating the experiment a few times gives you 10s of thousands of subject-hours. Without a very large number of students and a small army of RAs that isn’t possible. The practical limits of the approach make it impossible …

          Besides that approach leads to a standard QRP of running a couple of dozen small sample studies and then selecting the best ones (but not bothering to add the more stringent criterion that the means show a near linear pattern).

  11. By the way, I think these kinds of numbers are just stupid: “Combined left-tailed probabilities are 0.000000008, 0.0000004, and 0.000000006, for the three papers, respectively. The combined left-tailed p-value of the entire set is p = 1.96 * 10^-21, which corresponds to finding such consistent results (or more consistent results) in one out of 508 trillion (508,000,000,000,000,000,000)”. As far as I can see all those p-values are based on some theoretical assumptions or approximations (not sure if they were based on normal theory or on a semi-parametric bootstrap). It’s a good rule of thumb to halve the number of zeros whenever anyone tries to impress you with a very big or a very small number. And if after the exercise the number is still impressive, do it again. The combination of p-values per paper is already based on combining results from many sub-studies which maybe were based on the same subjects.

    They should have said: the probability is so small it’s more or less inconceivable that … .

    If I were Förster, I would sue the LOWI for doing this kind of thing with these numbers. At best they should have been thought of as “descriptive”, only suitable for experts and to be taken with a pinch of salt.

    If I were the Humboldt foundation / University Bochum I would give Förster his position and money and let him use it to redo these experiments in a proper controlled and open environment, so as to correct all the damage he has done.

    1. But if you say he has done damage, he has done this while employed by the University of Amsterdam. Why would the Humboldt Foundation have to pay for cleaning up the mess? And, of course, if the results don’t replicate in Bochum, then we might get into an endless discussion about “moderators.”

  12. So probably all of the about 30 people of the methodology department at Psychology, UvA, stand behind this report and the results that came out. Seems to me that the “prime mover” has to remain strictly anonymous. We must not turn this kind of “job” into a new career path: felling tall trees in the forest. This leads to conflict of interest and to witch hunts. Credit should not be given to a known person for being an anonymous whistleblower, once they have chosen that path.

    Of course the alternative option for the whistleblower is to publish openly (scientifically), not report to integrity committees. That course of action has advantages and disadvantages. (I’m supposing all the time that the whistleblower in question is a person of integrity).

  13. PS in fact the initial report seems to me to be a very competent work of scientific integrity so it should hardly be an issue, what motives lay behind its creation.

  14. In principle, Förster should have 2000-plus former students and 40-plus business people as witnesses to his performing the said experiments – this process has been ongoing for two years or so now, and I find it odd that none seem to have surfaced in his defence. Similarly, it appears that he single-handedly ran the sessions and recorded and computerised the results, as no assistants have surfaced and he has no record of who did help him. The only coauthor did not participate in any of this either. The colleague who suggested shredding the forms to release office space is likewise absent, as is anyone witnessing the computer crash which destroyed the original data files. Are all these people thus part of a conspiracy, or too afraid to make themselves identifiable?

    1. If tekija’s suggestions are right, the similarities with the Stapel case are compelling: Doing all the participant work himself (c’mon, what are grad students for?), doing all the data input himself, no trace of any of the participants… yes, it’s all circumstantial, but plenty of legal systems allow for convictions on the base of overwhelming circumstantial evidence when there’s no dead body or confession. (Stapel confessed, which made it easier for everybody; perhaps he would have avoided the worst of the public opprobrium that was heaped on him if he had continued to deny and obfuscate, because it appears that none of the evidence against him was any stronger than what we see here.)

      And, frankly, the incompetence of losing your only copy of a dataset in an alleged computer crash ought to be grounds for a serious professional misconduct investigation on its own. This is not 1988 with people just learning to use their computers. A USB stick takes up about 7cm³ and even an old 128MB version would hold these datasets many times over. Dragging that precious file to Dropbox or a Gmail draft folder takes 10 seconds. This is like a surgeon claiming to have just “forgotten” to wash her hands, or an airline pilot failing to check the fuel level before take-off: some errors of omission just aren’t forgiveable among serious professionals (even if we believe in the series of dogs that conspired to eat Förster’s homework).

  15. As the post notes, the linked “report” is not an official report from LOWI or UvA. It is a complaint submitted to UvA or LOWI concerning the integrity of data. What we don’t know, as far as I can tell:

    i) was the linked complaint initiated by the original accuser and/or colleagues, or was it requested by UvA or LOWI?

    ii) what is the complete and official finding by UvA and LOWI?

    iii) is there a full report on the matter by UvA or LOWI and if so, can it be publicly accessed?

    Much of what is written here seems to be premised on the linked document being an UvA or LOWI report, but I don’t think it is. If it isn’t, then perhaps we should temper our opinions until such time as we have a more complete picture?

    1. The linked doc in this post is the complaint. The linked doc in the newspaper article is the LOWI & university decisions based on the complaint.

      UvA initially concluded that no evidence of fraud was present.
      The complainant then went to LOWI who decided, not on the basis of linearity but on the basis of homogeneity in means (like part of the complaint in the Smeesters case) that the 2012 paper shows evidence of misconduct and should be retracted.

      The last decision got leaked to the press together with Forster’s name. After that, Forster’s response as well as the complaint appeared online.

  16. I think the discussion should focus on the data and the statistical techniques used to verify the validity of the data, rather than on who made the complaint or the motives of the complainant. It seems that similar techniques have been used before in other (alleged) fraud cases, on the basis of which multiple articles were retracted and professors were forced to leave their jobs. I think that it would be a good idea if the university would post a statement that it will investigate all of Forster’s work (obviously, things look very suspicious), to answer questions about the validity of the data, how he collected data, how he analyzed data, and so forth. That way, Forster’s integrity can be scrutinized, which can clear him (or not) of all blame.

  17. Obviously one can almost never 100% prove data fabrication/manipulation, even less when no original data is available. Thus all we can and should rely on in a case like this is the probability of the reported outcomes. And I’d say if the evidence in the present case is not sufficient, I don’t see how it could ever be (given the report is sound).

    The only valid counter argument I have read above is that with thousands of studies, just by chance, there will be a study with anomalous statistics. Granted. But with *such a degree of perfection* of the results? In *three* studies in *four years* by the *same author*? Involving an anomaly in not just one, but at least *two measures* (similarity in means, linearity)? And just by chance this already unlikely case involves a study (or all three?) where the *original data is lost*? I don’t know what’s the combined p-value for that, but common sense appears to say it’s sufficiently small.

  18. Exactly. One can almost never 100% prove data fabrication/manipulation/novel effects of QRP’s, but altogether, I would say that it is beyond reasonable doubt that the research papers we are talking about completely lack scientific integrity. They should be withdrawn. Whether the papers need to be withdrawn because some particular person lacks moral integrity or is merely incompetent, should not be the primary issue for science and for scientists. Obviously it might be a primary issue for people who put up research money, or in some other way have some direct financial and legal interest in the matter. But still, they should also be wondering about the perverse structures of academic funding and career advancement, the focus on quantity above quality, the lack of quality-control in research and publication of research.

  19. It is amazing that members of a scientific community are eager to outdo one another in crucifying a suspect. Given that the evidence is at best circumstantial and that there exist deviating opinions, I am shocked that concerns about transparency, fairness and due process are generously dismissed.

    1. In what way is/has this process not (been) transparent or fair? I am genuinely curious/ interested.

      Also, I don’t understand the sentence “It is amazing that members of a scientific community are eager to outdo one another in crucifying a suspect.” This seems to point to some form of contest or something. I thought science is helped by objectively trying to come closer to the truth of things. How is it that this process does not help in that? Why mention things like “outdo one another” when that does not have anything to do with trying to come closer to the truth, setting the scientific record straight, and possibly get rid of non-scientific behaviour? I genuinely don’t understand. How else should one try and set the scientific record straight, and possibly get rid of non-scientific behaviour?

    2. Also: as an outsider and reading some of the comments, it seems to me that social psychology is more about politics than about science. It seems a surreal world to be working in at times…I kind of hope that all involved in this field do not have the same feeling about it, that would seem kind of sad in my opinion for multiple reasons.

      1. Lots of academia is about politics and not science. However, the situation in social psychology is pretty bad. The field has always suffered from statistical illiteracy, and has long privileged story-telling over sound results. I get the sense that some are trying to close ranks and defend one of their own in this particular situation. But friends tell me that in the review process in social psych, it is now quite common for reviewers to accuse authors of p-hacking, HARKing (hypothesizing after the results are known), QRP’s, etc.

        The status quo is disintegrating, but many of the people who have been rewarded by the field (and who therefore have gotten and retained professorships based on numerous high-impact publications) don’t have a clue about how to do research that is both theoretically rich and also replicable. In their defense, this is extraordinarily difficult. If anything is clear about this scandal so far, it’s that claims of vendettas or witch-hunts don’t seem to 1) give a sense that social psychologists know what they’re talking about and 2) bring the field anywhere closer to where it arguably needs to be (focused on soundness of results and replicability and with less emphasis placed on striking counter-intuitive claims backed by perfectly unambiguous data).

        1. Reviewers accusing authors of QRPs is generally just pots calling kettles black, because the reviewers are from the same community. That kind of self-flagellation solves nothing.

          If social psychology is to clean up it should make QRPs impossible by introducing preregistration.

          Without such fundamental reforms to actually solve the problem, pious talk about the evils of QRPs is just hot air.

          1. Problem is that preregistration solves absolutely nothing AND it inhibits innovative work.

          2. Pre-registration can be hacked. You run the study and if you like the result you pre-register it and run it again, or, even more simply, present the collected results as the results of the pre-registered experiment.

            I don’t see why pre-registration would inhibit innovative work though.

    3. Also: there always will be “deviating opinions”; that in itself does not say anything. Isn’t science about trying to find out which opinion is most probably the correct one, using logic, data, and arguments to come to that? If this makes any sense, then I would think that it might be more useful to talk about logic, data, and arguments than about the fact that there exist multiple opinions, which is always the case I would think.

    4. I would dare say that members of the scientific community are keen to outdo each other in finding the truth. The truth is quite apparently not to be found in Förster’s work.
      Whether he deliberately defrauded the scientific community is not unclear either: Either he lied now about not using QRPs (less mal-intentioned maybe, but still invalidating his results) or he invented the data. One such mechanism has to have been involved.

      After a two-year-long “due process” and devastating evidence (transparent here for all to see) in favour of fraud or extremely flawed scientific methods, I am shocked that you still defend your former supervisee. I hope you did not teach him what brought him here.

  20. So perhaps this would be a good time for universities to start taking control of researchers’ raw data, as soon as it is generated and before any analyses are done. Then we can stop worrying about the data getting ‘lost’, and the ‘evidence of a crime’ will always be available for inspection.

    Of course this isn’t enough to prevent a determined fraudster.

    1. If you see the author as part of the university, which should have been the default assumption I presume, then the university *is* already taking control of raw data. If you don’t, then you are essentially adding another member to a research group; but given that most of these studies are already multi-author studies, and that still went wrong, that is not likely to solve things. (IMHO, it would just add another administrative layer with unproven effect; we have plenty of those.)

  21. There have certainly been a number of strange procedural developments in this whole affair. For example, in July 2013 the committee for scientific integrity of the University of Amsterdam concluded on the basis of advice (presumably from statisticians) that there was insufficient evidence for scientific fraud. The committee thus rejected the accusations put forward by the accuser. A few months later, the committee for scientific integrity of the Dutch Royal Society (LOWI), supposedly also after advice from their statistical advisers, stated that “the conclusion that research data must have been manipulated is considered unavoidable”.
    It is interesting to note, however, that the LOWI based their conclusion only on the 2012 article and not the other two articles contained in the document of the accuser that is now being published by Retraction Watch. This suggests that in contrast to the accuser, the experts consulted by LOWI found insufficient evidence of fraud in the earlier papers.
    Thus, we seem to have some experts who find insufficient evidence for fraud in all three articles (consulted by the UvA), some experts who find evidence for fraud in one of the three articles (consulted by LOWI), and finally the accuser, who found evidence for fraud in all three articles. This does suggest that there is some cause for reasonable doubt.
    The board of the University of Amsterdam accepted the decision of the royal society on March 28, 2014 in a report that did not name the accused. Within a day, this name was published in several Dutch papers. This is particularly discomforting in a country where newspapers are not permitted to print the names of mass murderers, rapists and terrorists and may only refer to them by their initials. Obviously scientists are not deemed to deserve the same kind of protection. Since journalists need definitive sources to publish such explosive material, one wonders who informed them.
    Scientific fraud does great damage to science and particularly to the image of science in society. As scientists we therefore are all in favor of identifying and punishing scientists who committed fraud. However, the legal rule that people are innocent until proven guilty should also be followed in this process. In this case, one has the impression that this rule has been reversed.

      1. I’ll open by saying that although Professor Forster is not a friend of mine, I met him a couple of times and liked him, and given his passion for doing science my prior that he performed serious malpractice is very low. The issue I want to raise is different though. I really think that some commenters here (e.g., “Neuroskeptic”) are either purposefully or unwittingly blind, deaf or both. There are two orthogonal issues here — the first is a systemic and impersonal effort to force scientists to perform as they should, weeding out false results and fraudulent scientists. Can’t think why anyone would object to this goal.
        The second issue is what some commentators are blind or deaf to — it is about the life of a real and specific human being named Jens Forster. I reject, and am actually disgusted by, the manner in which this issue has been handled, and I think that even someone suspected of committing the most atrocious of crimes should be given a “professional” (to use the term many commentators used), indeed a just, hearing. It is also actually in service of the first (utilitarian) goal that the process should be so, but this is not my point.
        For example, Mr Skeptic: what do you mean by “consider their arguments?”. Imagine a murder trial in which only a prosecutor appears and the people are to “consider the soundness of her arguments” and, if convinced, hang the suspect. Is this your personal version of Fisherian hypothesis testing? Given that (at least) the credibility, honor, life’s work and livelihood of a person is at stake here — how about soliciting another expert opinion? What is this craziness? Also, since when does a statistical argument that is based on objective (long term) probability apply to a single case? As rare as it is in the long term, it is *possible* that this conjunction of events occurred — furthermore, you cannot place an objective probability on the occurrence of an event.
        I end by granting that the pattern is subjectively odd and this may merit an inquiry, but that is merely the beginning point, not the end point, or maybe one piece of evidence. There is a tipping point in these kinds of processes at which the “whistleblower” (weak, in danger, powerless) becomes a hunter (powerful, hungry, energized); I think that we have come to this point. If we don’t want to slip into a dark period of McCarthyism (blacklists and all), the justice of the procedure is crucial. I suggest we all tone down the vehemence, lower the certainty — is anyone really willing to take responsibility for this person’s fate?

        1. Dear dr. Eitam,

          The board of UvA concluded on 10 July 2013 in their preliminary findings in the case against Jens Förster:

          * “the linearity of the research results of the accused is statistically almost impossible and the accused himself [ = Jens Förster] has also been unable to provide an explanation for this.”
          * “to publish an ‘expression of concern’ in the journals in which the relevant articles were published, or, if this is not possible, to urge the accused [ = Jens Förster] to do so himself.”

          Would you be so kind as to provide me, and other people over here who are less familiar with research practices in the field of psychology, some insight into how a professor at UvA with a lot of scientific credits (
          http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/f/o/j.a.forster/j.a.forster.html ) is unable to provide such an explanation, given the fact that Jens Förster was totally free to ask for help from any statistician anywhere on Earth?

          Would you be so kind as to provide me with some of your insights into what would be considered good scientific practice in your field (= ‘doing the right thing’) when your university has asked you to contact the EiC of the journal in which you have published papers to ask them to publish an ‘expression of concern’, as statisticians have pointed out that there might be irregularities in regard to the statistics?

          Thanks in advance for some of your thoughts.

          1. No, thank you Dr. Van Dijk. I think I said all I had to say on this matter.

          2. Dear Baruch Eitam, the report proved Forster’s manipulation beyond reasonable doubt.

          3. Still, I worry that people who come to Forster’s defense, however sympathetic, either did not read the 2012 report or do not understand statistics at all. Apparently, Forster himself does not understand the report. Otherwise he would have realized that there is no way that he can get away with his “To be as clear as possible: I never manipulated data.”

            He must feel terrible, and the best thing that he can do is to concede, resign, and begin a new career outside science.

          4. Dear dr. Eitam,

            I am flattered that you assume that I hold a PhD, but this is not the case. I am just a guy who graduated from one of the Dutch universities. That’s all. I fail to understand why you don’t want to provide this audience with your thoughts about some topics concerning the research integrity of your colleague Jens Förster and of yourself.

            I am not familiar with the Israeli Code of Conduct for Scientific Practice (I am too lazy to ask Google), but I am familiar with the Dutch version of this Code ( http://www.uu.nl/SiteCollectionDocuments/The%20Netherlands%20Code%20of%20Conduct%20for%20Scientific%20Practice%202012.pdf ). This code states:

            * “A second overarching principle is transparency; every scientific practitioner must (be able to) demonstrate how he puts these principles into practice.”
            * “Every practitioner must, if required, be able to explain and motivate if – and if so, to what extent and why – he is at variance with the best practices of the university Code of Conduct.”
            * “This Code obliges researchers not only to conform but also to actively maintain and promote the rules for integer scientific conduct in his academic circle.”

            Excuse me very much, but I have some problems combining the above statements with your refusal to provide this audience with your thoughts. Are you totally sure that your refusal is fully in line with the Israeli Code of Conduct?

            Best wishes & take care.

          5. Dear Mr. van Dijk, I think you have demonstrated my argument well. Although the Dutch code of conduct for scientific practice is new to me, I am pretty sure it does not include the obligation to respond to any inquisitorial-sounding question from any random individual surfing the web. If you are seriously interested in my work you can Google Scholar it, and if you have specific or general questions about it I will gladly answer them. My work email appears on the papers as well as on my departmental website. Now seriously, and I’ll try to stand behind this one — over and out.

        2. There are quite a few people here who have made similar arguments.

            The problem in this case is that although the defendant was found holding a smoking gun and there is a corpse on the floor, the bullets have nevertheless been removed from the scene of the crime by the shooter, by his own admission. Therefore, you argue, it is scandalous to publicly accuse the defendant of murder, because who’s to say that the bullets came from the defendant’s weapon? Maybe there were never any bullets!

        3. “As rare as it is in the long term, it is *possible* that this conjunction of events occurred — furthermore, you cannot place an objective probability on the occurrence of an event.”

          You seem to have misunderstood the argument. The claim in the statistical accusation is that, even if we assume the null hypothesis that the underlying psychological effect was perfectly linear (which is, itself, improbable, but it is the assumption most favourable to Forster), the chance of Forster’s data being as linear as they were would have been astronomically small.

          This is exactly the same logic as used in any scientific paper that uses p-values – such as Forster’s own papers.

          No-one is assigning a probability to an event that did actually occur. The claim is that this event (extreme linearity) would have been extremely unlikely given the theory that the data are unbiased real observations of an effect.
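
          For anyone who wants to see the mechanics, here is a minimal sketch (in Python) of one standard way to run such a test from published descriptive statistics alone: a left-tailed F test for the quadratic (deviation-from-linearity) contrast in a three-level design. This is not necessarily the complainants’ exact computation, the function name is my own, and the numbers fed in below are invented, not taken from the papers.

          import numpy as np
          from scipy import stats

          def left_tail_linearity_p(means, sds, ns):
              """Left-tail p for the quadratic (deviation-from-linearity) contrast
              in a one-way design with three equally spaced levels. A very small
              value means the three group means sit suspiciously close to a line."""
              means, sds, ns = map(np.asarray, (means, sds, ns))
              c = np.array([1.0, -2.0, 1.0])                       # quadratic contrast
              ss_nonlin = (c @ means) ** 2 / np.sum(c ** 2 / ns)   # 1-df contrast sum of squares
              df_within = np.sum(ns - 1)
              ms_within = np.sum((ns - 1) * sds ** 2) / df_within  # pooled error variance
              return stats.f.cdf(ss_nonlin / ms_within, 1, df_within)

          # Illustrative numbers only (NOT from the papers): three nearly linear means
          print(left_tail_linearity_p(means=[3.00, 3.97, 5.00], sds=[1.2, 1.1, 1.3], ns=[20, 20, 20]))

          Under ordinary sampling from a population whose means really are linear, this left-tail p is (approximately) uniformly distributed, which is why a run of forty-odd very small values is the heart of the complaint.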

          1. Just wrote and then deleted a long response about the meaning of p-values etc. Appreciate the trust in my understanding and hope that the allegations are wrong. Over and out. Baruch

          2. Reading Baruch Eitam’s statements and other comments by Förster’s social psychology friends/colleagues, I am convinced that something is going seriously wrong in your field.

          3. Reading the comments by people hiding behind acronyms I am convinced that people can say everything that comes to their mind about anyone and everything without any risk to themselves.

    1. Isn’t one paper for which the committee for scientific integrity of the Dutch Royal Society stated that “the conclusion that research data must have been manipulated is considered unavoidable” – AFTER weighing all previous accusations, defenses and institutional attempts to deal with the case – enough?!

      How can you still see Foerster as “innocent until proven guilty” in light of this verdict? Doing so calls into question the highest authority of self-regulation the Dutch scientific community has given itself, and several of the finest colleagues who devote time and effort to working on this committee.

      Further, would you risk compromising highly reputable institutions like the Humboldt Foundation and the University of Bochum because, one and a half years after credible doubts surfaced and months after the committee for scientific integrity of the Dutch Royal Society ruled that Foerster had at the very least shown severe misconduct in his research practices, the ruling is still being kept under the rug? Or do you have positive confirmation that both institutions were fully informed?

    2. It is indeed interesting to note that the preliminary decision of the UvA differed from that of the LOWI. There is a very good reason, however, why the LOWI exists: to be an independent arbiter among organisations that may feel they have something to win or lose from certain outcomes. I am not saying that that is the case here, because I do not know, but let’s say I am not surprised by the difference in decisions.

      Further, it might be useful to know that the LOWI only takes into account those complaints on which the university has already made a decision. In addition, institutes such as the UvA have to comply with the LOWI. Thus, nothing strange going on so far, at least from my perspective.

      Then about the decision of the LOWI to investigate only one paper instead of three. The conclusion of W. Stroebe that this implies that the other two papers were not deviant is premature. For all we know, the LOWI chose to look into one paper because of time constraints or the like. We probably have to wait for the LOWI report. Why that is taking so long is a mystery to me, as is the lack of any reaction thus far from the University of Amsterdam.

      1. 22 April 2014 is the date of the definitive findings of UvA ( http://www.vsnu.nl/files/documenten/Wetenschapp.integriteit/2014%20UvA%20manipulatie%20onderzoeksgegevens.pdf ).

        The regulations of LOWI state that, in principle, an anonymized version of the LOWI report will be published on the LOWI site three weeks after the date of the definitive findings of the UvA (article 13 of https://www.knaw.nl/shared/resources/thematisch/bestanden/LOWI_werkwijze_publicatievorm.pdf , only in Dutch).

        The direct link to all LOWI-reports (since 2007) is https://www.knaw.nl/nl/thematisch/ethiek/landelijk-orgaan-wetenschappelijke-integriteit-lowi/adviezen-lowi-vanaf-2007

        A link to an English version of the LOWI regulations is https://www.knaw.nl/shared/resources/thematisch/bestanden/regulations_of_the_national_board_for_research_integrity_LOWI_2014.pdf

      2. The university may have restricted the conclusions and areas of inquiry of the investigators, or the investigators may have restricted their own conclusions. They have only what could be called circumstantial evidence, but it is evidence that would be almost impossible to arise by chance. I expect that if they ran their experiments repeatedly, they wouldn’t get anything near as extreme in the lifetime of the universe.

        The time taken is not surprising; most organisations would have followed the history of this type of accusation and would make sure that the proper processes are followed. The decision to commission a report, then to assess the credentials of investigators and the requirements of the investigation, then allowing the accused to reply – it all takes time. People have been known to take legal action, and simply not following process may require that the process be repeated.

  22. I think we had all better wait to give further comments until more information becomes public. Some people seem to doubt the analyses of the complainants, or those of the independent reviewers, based on the fact that (1) at first the UvA did not suggest retracting any of the three articles and (2) the LOWI then suggested retracting only the 2012 paper (and not the others). It could well be that the independent reviews of the analyses were in agreement with those of the complainants, but that the UvA did not want to retract articles based on probabilities only (as there was no confession by Forster) or simply did not have the courage to advise retraction (which is, for instance, counter to what Erasmus University did in the Smeesters case). This is just a matter of differences in philosophy about what can serve as a basis for retraction or for a finding of a violation of scientific integrity.

    The only thing that puzzles me is why the LOWI did not suggest retracting the other two articles as well, as I think the analyses are sound and it’s clear that these results are way too good to be true. The evidence is certainly not merely circumstantial. Whether this is fraud or not, can we ever say? But what matters is that the data in these three papers should no longer be part of the scientific record (the idea might still hold, but the data certainly do not).

  23. Several people are saying “UvA initially concluded that no evidence of fraud was present”. No. They concluded that no proof of fraud was present. What do you mean by proof? And what level of proof do you require? This committee is working in a legal framework. It has to decide if an employee should be fired or not. Lawyers, law. Don’t expect much statistical insight or appreciation of statistical subtleties.

    The patterns in the data could have resulted from chance. (A monkey might type the complete works of Shakespeare. Put thousands of monkeys to work, and the chance that you’ll see the complete works of Shakespeare goes up). Even if it was fraud, there is no proof by whom the fraud was committed.

    However UvA concluded that the work was no good. They were pretty clear about that.

    This yet again supports my argument that in cases like this one should *first* hold an *open* scientific debate about the integrity of the work. If the scientific conclusion is that the work has no integrity, then internal university officers might like to discuss the integrity and/or competence of the researchers, and also the integrity of their own procedures for recruitment and promotion …

    I think that Förster should get his Humboldt fellowship and spend the time and money very carefully re-doing these experiments, in an open and well regulated environment. I suspect that the results will not be publishable according to the usual criterion (p < 0.05), but at least the University of Bochum will be able to publish them anyway. If he does all the work himself, as before, he may well need the whole five years.

  24. And how do you determine the integrity of a scientific work? Both by direct and indirect evidence. Indirect: by checking the procedures which were followed, re-checking the data cleaning and selection and analysis … By subject matter arguments (how psychologically plausible is such and such an effect) and by statistical arguments (how statistically plausible). Absence of evidence can be evidence of absence. So statisticians and psychologists will have to discuss with one another, communicate with one another. Putting lawyers and managers and executives responsible for the reputation of a powerful corporation into the room, is not going to clarify things much, at this stage.

  25. I find it remarkable how many people here try to shoot the messenger. Regardless of how one evaluates the outcome of the procedure and the verdict of the LOWI and University of Amsterdam, the whistleblower did the right thing.

    1. I am interested in learning what the people who downvote my comment think that the whistleblower should have done instead.

      What would you do if you suspected a colleague faked his data? Nothing?

  26. It seems that a lot of attention is given to one single statistical analysis (that assesses linearity) which gives an extremely low p-value (of 2*10^-21), but there is (a) considerable additional circumstantial evidence for scientific misconduct, and (b) a set of other statistical analyses that corroborate the hypothesis of data fabrication.

    (a) additional circumstantial evidence:
    – the sex distribution reported by Forster (54% and 62% female out of 690 and 823 undergraduates in the 2012 and 2011 paper, respectively) deviates strongly from that of the population of psychology students at the UvA (72%).
    – none of the 690+823+736 participants reported any suspicion about the goal of the study, even though undergraduate psychology students get substantial training in research methods and most of them can be expected to be seasoned participants.
    – in the 2012 paper, 48 control analyses showed “all Fs < 1”; this is very unlikely.
    – in experiments 6-10b (2012 paper), a considerable portion of participants performed below chance level, which is unlikely for a group of undergraduate students.
    – the lack of any reported participant dropout or missing data is unlikely with such a large number of participants.
    – with about 500 psychology freshmen enrolling at the UvA per year, it is hard to explain how 2242 participants could have been tested for the experiments. The participants were probably studying at the UvA; none of them reported being aware of the research hypothesis, yet all participants were debriefed after the experiment. It is therefore unlikely that participants took part in more than one of Forster’s experiments, yet the entire population of undergraduate psychology students only barely matches the number of 2242 participants.

    (b) additional statistical analyses
    – the results are not only very linear, but also very consistent across replications. In the 2012 paper, probability for finding the consistency for the differences across conditions (low vs medium, medium vs high, low vs high) is estimated as p=.0065, .002, .00009 for studies 1-5, p=.000001, .0016, .08 for studies 6-10b (analytic part), and p=.002, .0000009, .14 for studies 6-10b (creative part). In the 2011 paper, the estimates are p=.11, .01, .03 for studies 5a-d, p=.02, .06, .04 for explicit manipulation studies, and p=.01, .03 and .55 for implicit manipulation studies.
    – effect sizes reported by Forster (median Eta^2=.281, .308 and .352 for the 2009, 2011 and 2012 papers) are larger than those typically found in creativity research (mean r=.18) and in 10 control papers (median Eta^2=.066). The subtle manipulations used by Forster (e.g. hearing different poems or touching differently shaped objects) affected performance by approximately d=1.5, which on an IQ-like scale (SD = 15) amounts to 22.5 IQ points. Such large effects from such subtle manipulations are unprecedented in the field (a back-of-envelope check of this figure and of the sex-ratio point above follows after this comment).

    Altogether this makes the conclusion that scientific misconduct through data fabrication has taken place almost inevitable.
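
    Two of the figures in the comment above are easy to sanity-check. The sketch below (Python) is not taken from the complaint; it simply treats the reported sample as if it were a simple random draw from the UvA psychology freshman population – which it need not be, as others point out further down – and spells out the effect-size conversions. The only numbers carried over are the ones quoted above; everything else is illustrative.

    from scipy import stats

    # Chance of seeing 54% or fewer females among 690 participants if they were a
    # simple random sample from a population that is 72% female (the report's figure
    # for UvA psychology freshmen). The result is vanishingly small.
    print(stats.binom.cdf(round(0.54 * 690), n=690, p=0.72))

    # Cohen's d expressed on an IQ-like scale (SD = 15): d = 1.5 corresponds to 22.5 points.
    print(1.5 * 15)

    # The field-typical correlation r = .18 converted to d for comparison: roughly 0.37.
    r = 0.18
    print(2 * r / (1 - r ** 2) ** 0.5)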

    1. I am just wondering where in the article it is stated that the participants were undergraduates from the University of Amsterdam? I could not find this information. Maybe the research was conducted at one or several different universities?

      It is precisely this lack of accuracy and carefulness that makes me doubt the report.

      I would have preferred a more thorough and reliable report that is supported by facts (in all details) and is not built on assumptions. This would have been more convincing for me.

      In its present form, I – as an outsider – am just not convinced that this is as clear as it is written in the report.

      What else in the report is based on untested assumptions? Why? What motivated this biased approach?

      1. This is not the LOWI report that led to the University’s conclusion that the Forster & Denzler data were manipulated and the paper should be retracted – that report has still not been made public. This is the complaint that was filed (years ago), the objective of which was to urge the university to conduct an investigation. In the meantime, the complainant and the accused have been heard multiple times, witnesses have been heard, and experts on various issues have been consulted. I assume that the tone of the final report will be quite different, although its findings are likely to be all the more damning to Forster. The LOWI is rarely as explicit in its conclusions as it has been in this case, which suggests that, in addition to the bizarre statistical properties of the data, professor Forster has been unable to sketch a believable picture of when, how, and by whom the data were gathered.

      2. I had the impression that the studies were done earlier, in Bremen perhaps, or was it Bochum, and that Förster “took the data with him” when he came to Amsterdam. But, obviously, not all the boxes of paper, log books, instructions etc. etc., which a so far unidentified future colleague allegedly advised him to discard. Poor Mr Förster seems rather forgetful, and his hard disk crashed too, but obviously he has been under rather extreme stress for close on two years now. (Where did we hear that before?)

        Doesn’t it say in the published papers where all those thousands of psychology students were studying psychology?

        Maybe the data was collected over many, many years. Painstakingly. That would go some way to explain the always beautifully balanced design and total absence of any dropout. I get the impression that Prof Förster is not too aware of the difference between a sample and a random sample.

        I am still not putting a lot of money on “deliberate fraud”. I would still tend to suspect massive “innocent” QRP.

        Trouble is, once accused, and once one has in response claimed total innocence, and only thereafter learnt something about QRPs, it would be hard to change one’s story.

        1. In reply to Gill’s “I am still not putting a lot of money on “deliberate fraud”. I would still tend to suspect massive “innocent” QRP. ”

          I cannot think of QRPs (questionable research practices) that would yield the results in the Forster papers. I don’t know, of course, but I guess that real data have been collected, at least in part, but that Forster added constants to the individual scores afterwards. This might explain both the linearity and the large effect sizes (i.e. the relatively small standard deviations). If this were the case, then I would call this “fraud”, not “QRP”.

          I urge everyone to read the 2012 accusatory report in full, because it contains much more evidence than ‘just’ the impossible linearity.

          1. Yes, I tend to agree with GJ. Richard Gill, would you mind speculating about how QRPs could yield such data? This is not a rhetorical question – I just cannot think of any possibility except for deliberate fraud.

          2. As far as I can see the only QRP that could produce these effects is systematic deletion of datapoints with the explicit goal of generating linear means. If we assume that each sample was originally, say, 80 people, and 20 selected datapoints were removed from each one, you could probably end up with linear means. But – in my opinion – this would be cherry-picking to such an extreme degree that it would be scarcely distinguishable from fraud, anyway.

            Other QRPs, such as publication bias or outcome reporting bias, are just not powerful enough. You would have to run hundreds of samples in order to find one that was extremely linear by chance alone (this is the meaning of those p-values in the statistical report.) There just aren’t enough psychology undergraduates in the Netherlands for that to be possible!
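
            For what it is worth, the cherry-picking route is easy to mimic in a toy simulation. The Python sketch below is purely illustrative – invented numbers, a deliberately crude greedy rule, nothing to do with what actually happened – but it shows that dropping a couple of dozen hand-picked observations can force the middle mean onto the line through the outer two.

            import numpy as np

            rng = np.random.default_rng(0)

            # Three conditions of 80 observations each, with a real but noisy effect.
            low, mid, high = (rng.normal(m, 1.0, size=80) for m in (0.0, 0.4, 1.0))

            mid = list(mid)
            for _ in range(20):  # drop 20 hand-picked observations from the middle group
                target = (low.mean() + high.mean()) / 2
                # remove the point whose deletion moves the middle mean closest to the target
                i = min(range(len(mid)), key=lambda j: abs(np.mean(mid[:j] + mid[j + 1:]) - target))
                del mid[i]

            print(low.mean(), np.mean(mid), high.mean())
            print("deviation from perfect linearity:", np.mean(mid) - (low.mean() + high.mean()) / 2)

            Whether one then calls the result a QRP or fabrication seems, as the comment above says, largely a matter of labels.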

        2. This description of him stands in extremely strong contrast to the description of him on this website: http://www.dgps.de/index.php?id=199

          Here it says (translated from the German): “His research is not only empirically original and methodologically rigorous, it is also conceptually pioneering, particularly in bringing together previously disparate perspectives from different subfields of psychology.”

          It makes me angry that all the reviewers and laudatio writers were not more critical. I mean, some people must have noticed earlier that something was wrong with these papers, like reviewers or editors. Take for example the participants section of the 2012 paper:

          “Participants and design. For each of the 10 main studies, 60
          different undergraduate students (number of females in the
          studies: Study 1: 39; Study 2: 30; Study 3: 29; Study 4: 26;
          Study 5: 38; Study 6: 32; Study 7: 30; Study 8: 30; Study 9a:
          35; and Study 10a: 28) were recruited for a 1-hour experimental
          session including ‘‘diverse psychological tasks.’’ In Studies 9b
          (31 females) and 10b (25 females), 45 undergraduates took
          part. Gender had no effects. Participants were paid 7 Euros
          or received course credit. All studies were based on a 3 Priming
          (global, local, and control) between-factorial design.”

          Why did reviewers not ask where the students were recruited? Well, I guess they would assume that it was the university where the research was done (so it must have been Amsterdam, given that is where both authors were located). It would have been nice if the ages had been given (some undergraduate students are mature students and well beyond 25 years of age), and the recruitment methods indicated. Anyway, by this time the researchers were clearly not “methodologically rigorous”.

          1. Dear Johny,

            Thanks for providing the DGPS link, as it gives the year (2007) in which Jens Förster got his job at the UvA. I was unable to find this information on the homepage of Jens Förster at the UvA ( http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/f/o/j.a.forster/j.a.forster.html ). This homepage (as well as the text on the DGPS site) indicates that Jens Förster has been the scientific director of the Kurt Lewin Institute since 2008, but I am unable to find his name at http://www.kurtlewininstitute.nl/kli/organization/management-structure/

            Can anyone here tell me more about this topic?

            Markus Denzler, the co-author of the 2012 paper, has declared that he took no part in collecting the research data; the other two papers (both behind a paywall) have only one author (Jens Förster). The students might well have been recruited from the other university in Amsterdam (VU University, which also has a psychology department), or maybe from any of the other universities in The Netherlands. Richard Gill already pointed out that maybe some of the participants were students from Bremen (or even Duisburg).

            Jens Förster is the only person who can shed light on this topic. Is there anyone else here who can shed some light on the background of all these 2242 undergraduates?

            Not all papers in the journal “Social Psychological and Personality Science” provide details about the background of the students / participants. On the other hand, quite a few of them indeed mention such details:

            * “A total of 79 undergraduate psychology students (54 female; 25 male; M age =19.2) from the University of Wyoming participated in exchange for course credit.” ( http://spp.sagepub.com/content/3/1/72.full.pdf+html ).

            * The online survey was conducted at a Canadian Science Center. Those who met the inclusion criteria (high school aged youth and nonuniversity student adults who used Facebook) were invited to participate by a research assistant. (…). The youth sample consisted of 288 Facebook users (aged 9–18, M = 14.4, SD = 2.15), with a mix of boys and girls (boys, = 112; girls, N = 171; five participants did not report their gender). While we will describe members of the youth sample as adolescents, some were younger than 13 years of age. We intended to include only those participants who met Facebook’s minimum age criterion (13 years), but found that many younger people reported lying about their age in order to use Facebook. We felt that it was important to include these users in our sample. The adult sample consisted of 285 Facebook users (men, N =118; women, N =165; two participants did
            not specify their gender) who were not students (aged 19–71, M = 31.6, SD = 10.28). ” ( http://spp.sagepub.com/content/3/1/48.full.pdf+html ).

            Quite a few of the papers in this journal are free to read (see, eg, http://spp.sagepub.com/content/3/1.toc ), so anyone can compare the method section in the 2012 paper of Jens Förster & Markus Denzler (see also above) with the details on methods in a variety of other papers in this journal.

            Papers submitted to Social Psychological and Personality Science undergo peer review. The peer review is double blind, but I would still like to ask the reviewers of the 2012 paper of Jens Förster & Markus Denzler to identify themselves and to comment on the concerns about this paper raised by several people over here.

          2. What is also weird about the 2012 paper in the journal SPPS is that there were 600 participants, each earning 7 euros for participation. So the whole study must have cost 4200 euros in participant money. Now, at the bottom of the paper, it says that the researchers did not receive any funding for the research, which is fair enough, assuming it was paid from departmental money. I am not sure how easy it is to just get 4200 euros in departmental money for some research project at the UvA, but at many universities that would not be possible without a small grant (which you would then acknowledge at the end of the paper, as is common).

            I assume that it therefore must have taken place in Amsterdam (otherwise you would expect some acknowledgment of the external funders, like his previous university). Also interesting is that there are NO acknowledgements at all of people who helped with carrying out the research (research which was not trivial to do, if it actually all has been done). Given that this was not simply questionnaire stuff, it was quite a bit of work, and if he did it all himself, I wonder where he found the time for his other activities; we are not talking about a nerdy researcher just doing his studies, but a well connected academic who is apparently involved in many external activities, so he must have been really busy doing this research on the side without any external funding, programming the computers, organising the recruitment, etc. Just the participant time is 600 hours (each session took 1 hour according to the paper), which sums to fifteen 40-hour weeks (doing nothing else, just looking after the participants).

            Of course, that is not a major criticism, but it is sloppy reporting to say the least, which just raises more questions given the problems surrounding the way the data look! I guess in Holland it is common not to acknowledge the lab technician, but at the very least it then must have been the UvA lab technician who helped out; otherwise you would also have expected some acknowledgment of the volunteering assistant or student. Maybe somebody knows about these things; it would help to understand the whole situation better, and it would help future researchers to do a better job.

          3. In my view, it is a lapse of editors’ and reviewers’ responsibilities if they fail to press authors for the relevant details about a sample, including attrition data and attrition analyses in particular. Before a reviewer or editor even goes to the level of conceptual considerations or asking the authors to discuss the implications of their findings for theory X and Y, the nuts and bolts of good scientific reporting need to be checked and, where they are missing, requested. In other words, papers need to be checked first of all for basic indicators of good scientific craftsmanship. Happens much too rarely! And one problem with APA journals and many others is that once an article is accepted and has been in print for some time, it’s difficult to figure out who the editor was who handled the paper. And of course reviewers are never listed. Authors are responsible for what their papers contain, yes, but reviewers and editors are responsible for gatekeeping. They frequently fail to do that, but can then no longer be held accountable. The situation has changed considerably with the advent of journals that list the editor who handled the paper (e.g., PloS One) or both the editor and the reviewers (Frontiers). My hope is that the latter practice in particular will go a long way towards making editors and reviewers more accountable for checking the basics of a paper (such as sample description).

      3. Dear Henk,

        The compilers of the report ( http://retractionwatch.files.wordpress.com/2014/04/report_foerster.pdf ) state:

        * “Although the origin of the undergraduates is not explicated, it is likely that they were (predominantly) from the University of Amsterdam, at least for the 2011 and 2012 papers.”

        * “The sex distribution in the 2011 and 2012 papers deviates from the sex distribution of psychology freshmen at the University of Amsterdam in the years since Dr. Förster arrived there.”

        * “Other issues related to Förster & Denzler (2012) (…) We note that the University of Amsterdam has had around 500 psychology freshmen per year in the last five years and that 72% of these are female (www.uva.nl). The sex distribution in the sample of Förster & Denzler (2012) (54%) deviates strongly from the sex distribution of
        psychology freshmen.”

        * “Other issues related to Förster (2011) The sex distribution in the sample deviates from the sex distribution of psychology freshmen at the University of Amsterdam.”

        Can you please explain why Jens Förster was unable to provide these details to the compilers of the report? Do you have any evidence that the above quotes from the compilers of the report contain inaccurate assumptions?

        I fully agree with you that there are several plausible explanations for why the sex ratio of the participants differs from the sex ratio of the freshmen in psychology at the UvA.

        Would you mind asking Jens Förster if he is able to shed some light on this topic?

        Thanks in advance for a reply.

      4. I find this very ironic. You’re blaming the report for something that should have been in the paper in the first place! Why doesn’t it state where the subjects were run?

        1. I partially agree.

          But:
          1) I am not blaming anyone, I am just wondering.
          2) Obviously, neither the reviewers nor the editor asked for this to be specified.

          Still, if it is not stated (as you also say), how can the report just assume that the research was conducted at the University of Amsterdam, without the slightest hint that this is a (correct or maybe incorrect) assumption?

          Of course, the accusation seems stronger that way, but not more convincing.

          1. That’s a fair point. Still, it would have never been an issue if the original article had been more specific. It’s also clear that the editor and reviewers were missing in action there.

  27. A correspondent just raised a disturbing issue.

    Förster says in his defence that his findings have since been successfully replicated by many other skilled researchers.

    However, unsuccessful replication studies are usually not even submitted for publication (poor researchers, just not skilled enough …)

    If they are submitted they typically get rejected (poor researchers, just not skilled enough …).

    Who might the typical referees be? Guess!

    One can only imagine that the usual (anonymous, highly qualified) referees are Stapel, Smeesters, Förster and their many successful students and associates, or those who aspire to join their ranks.

  28. I’d like to mention another “defect” of the present system. I think it explains the inconsistencies / anomalies between the conclusions of UvA and LOWI. (Which in fact I think are not very significant, really).

    These inquiries are confidential, secret even. For obvious reasons. The university “integrity committee” entrusted with this awful and responsible task solicits help from internal and external referees. This is not done in the form of some kind of public hearing. No: an expert gets a mysterious letter from a university bureau requesting some advice, and once they have promised secrecy, they are sent some material (a selection? everything?) and asked some questions (good questions? wrong questions? Hard to tell if no dialogue is ever entered into). They are asked to reply in the form of a written report by such and such a much too short deadline, and of course they are not allowed to talk about this with any colleagues either.

    OK so the external reviewer does their best and submits their report and then hears nothing. And I do mean: nothing. Meantime apparently the committee has collected a number of reports and probably spoken in person with “the defendant” and perhaps also “the anonymous whistle-blower” but probably not with any of the external “reviewers”.

    They put their report together and pass on their recommendation to the rector / dean / whoever…

    The whistleblower thinks the committee’s conclusion is much, much too lenient, and – following precisely the official formal procedure – appeals to a kind of “higher court”, now located at the Academy of Sciences.

    The same is now repeated though possibly with more personal interviews by the committee of the same and/or new “experts”. And with more distance, figuratively at least, though literally only half a kilometer.

    Both of these committees truly are constituted of very serious and very careful and very wise persons and actually I think they do a rather good job pretty effectively, if one considers all the constraints put on them. I have great respect for them all.

    What *never* happens is that a bunch of statisticians and a bunch of psychologists get together and discuss freely together all the pro’s and con’s .. and explain to one another what the hell they are talking about! If this would happen, probably they would be able to come up with a perhaps more consistent scientific story for managers, executives etc. to do whatever they think fit with.

      1. Because both parties are often manipulating the tools in different ways in order to get the results needed to publish.
        I call it “BSing with Maths”.

  29. I concur that the results in the three papers look odd.

    That said, I think the way statistics is used in these “complaints” is also a “questionable research practice”. First and foremost, an argument is presented that – in its technical form – is full of assumptions that are difficult to understand if you are not methodologically trained. How can you defend yourself well in that situation? You are certainly not given the time to fully study the literature on the matter.

    This is not unlike the legal system, where lawyers are provided to defend untrained individuals who themselves would be unable to grasp the complexities of the legal system. I believe that in “trials” like the one of Jens Förster, a statistical expert should automatically be included to come up with the statistical argument most favorable to the accused.

    Here are some issues I have with regard to the “complaint”. (Yes, I know that this is not the final report.)

    (I) Cherry-picking: Let’s count the number of statistical anomalies we currently know of that could indicate fabrication. (i) The means are too similar across conditions. (ii) The standard deviations are too similar across conditions. (iii) The results are too linear across studies. (iv) The effect sizes are too similar across studies. (v) Digits are not randomly distributed. (vi) Too many F values below 1. (vii) Deviation from the age or gender structure of the population. Many more are certainly to be found.

    With that many indicators, every study will soon show significant anomalies indicative of fabrication. It is necessary to deal with the multiple comparisons problem in a clever way. The “complainants” do not. In fact, they do not even make all these other checks. They only report evidence that supports their claim. That seems to meet all the requirements of a “questionable research practice”.

    As an aside, certainly, we will have “The results look too normal” as additional indicator of problems anytime soon.

    (II) The assumption of independent sampling: Given the small number of students and the large number of experiments, it is natural to expect that many students participated multiple times in the experiments. In fact, Förster may have recruited from his own classes, etc. We know nothing about it. It is also unlikely that he checked the background of willing participants. They may well have come from other degree programs, etc.

    Some clever observer will point out that this should have been reported. How participants are recruited is almost never reported in this field. Nobody ever cared. Over time, more and more experiments had to be included in a paper to be publishable. This has strongly reduced the amount of method information provided in the papers.

    To make an ever stronger point: I have never once seen true random sampling from the student body. Instead, I have seen students tasked with recruiting participants in the hallways, email lists, classroom recruiting, etc.

    For the analysis in question that seems to be an important restriction. If students participated multiple times, this increases the chance to observe very similar patterns of results in subsequent studies reported in one paper. Results SHOULD be correlated over studies. Hence, the assumption of a uniform distribution of the p values from the test for linearity no longer holds. The resulting p-values may no longer be valid.

    (III) The “one out of 508 trillion” chance. I just don’t get the logic behind computing this overall probability. Maybe this disqualifies me as a statistician (a career path, I have wisely avoided). The first question I have: Why do the complainants only consider these three studies? Isn’t that like ignoring all dice throws except for sixes and subsequently claiming that the dice is broken?

    But that’s not the thing that irritates me most. It’s that if I conducted 20 experiments that were not significantly abnormal, I would still get an incredibly small number. Let’s say that those experiments all have p-values around .5. Would the probability of getting exactly these 20 experiments not be .5^20, and hence absurdly small?

    1. Here are some arguments concerning the issues raised by Konrad Shire:

      (I) cherry picking (or the problem of multiple comparisons): the argument is that there are multiple types of anomalies that could indicate fabrication, so given enough types to look for one could always find one that applies to any given study. Let’s be conservative and say there are a thousand types of anomalies one can look for (I would doubt there are more than a few dozen realistic ones), and let’s be even more conservative and assume there are a thousand ways to quantify anomalies statistically (i.e. researcher degrees of freedom). Applying a Bonferroni correction for these million implicit comparisons to the original linearity p-value of p=2*10^-21 gives p_corr ≈ 2*10^-15, or about one out of 508 million million (508 trillion short-scale; 508 billion long-scale).

      (II) the assumption of independent sampling: the argument is that a significant number of participants were ‘recycled’ over experiments, i.e. some participants participated multiple times in experiments. The papers by Forster indicate that participants were (a) debriefed after each experiment, yet (b) participants were asked whether they were aware of the research hypothesis; none of them indicated they were. To support the recycling argument, a significant number of people (all recycled participants) must all either (i) have had an extremely poor memory (which seems unlikely for university students), (ii) have been extremely naive about the goals of the different experiments in which they participated (also unlikely, especially considering the conceptual similarity of many of the experiments), or (iii) have been intentionally lying about their knowledge of the goal of the experiments (why would they?). This does not seem a realistic assumption.

      (III) The “one out of 508 [long scale] trillion” chance – “I just don’t get the logic behind computing this overall probability.” The argument here is that when one runs multiple experiments (e.g. testing for a strong, reliable effect) that are all significant, the product of the associated p-values is also very small and yet does not indicate data fabrication. This conflates the summary p-value for linearity with the summary p-value of finding multiple significant effects over experiments: the former is a test for implausibly linear results, while the latter concerns finding significant results for the manipulation of interest. As shown in the report, a variety of other (control) papers by other researchers show significant effects in 3-level 1-way ANOVAs; multiplying the associated p-values would indeed give an extremely tiny summary p-value. Yet in these control papers, just one out of 21 linearity tests gives a significant p-value at the .05 level (p=0.034 to be precise); such a rate is about what one would expect by chance.
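
      The arithmetic behind (I) and (III) can be made concrete. The complaint’s exact combining rule is not quoted in this thread, so the Python sketch below uses Fisher’s method as a stand-in for a standard way of combining independent p-values; it also shows why the naive product of p-values (the .5^20 worry above) is not the right comparison. The Bonferroni line simply restates the figures already given.

      import numpy as np
      from scipy import stats

      # (I) Bonferroni: even a million implicit comparisons leaves the linearity
      # p-value astronomically small.
      p_linearity = 2e-21                            # combined left-tail p cited in the complaint
      print(min(1.0, 1_000 * 1_000 * p_linearity))   # about 2e-15

      # (III) Twenty perfectly unremarkable experiments, each with p = .5:
      ps = np.full(20, 0.5)
      print(np.prod(ps))                       # naive product: ~1e-6, small but meaningless
      fisher_chi2 = -2 * np.sum(np.log(ps))    # Fisher's method for independent p-values
      print(stats.chi2.sf(fisher_chi2, df=2 * len(ps)))   # combined p of about 0.93: nothing unusual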

      1. NNO: “To support the recycling argument, a significant number of people (all recycled participants) must all either (i) have had an extremely poor memory (which seems unlikely for university students), (ii) have been extremely naive about the goals of the different experiments in which they participated (also unlikely, especially considering the conceptual similarity of many of the experiments), or (iii) have been intentionally lying about their knowledge of the goal of the experiments (why would they?). This does not seem a realistic assumption.”

        Says who based on what? To me, for instance, (i) and (ii) seem very realistic assumptions based on my extensive experience with research participants. (iii) is debatable, but participants may simply say nothing because they want to get out as quickly as possible or because they (irrationally) fear that there will be issues if they report the goal.

        1. Says who based on what? To me, for instance, (i) and (ii) seem very realistic assumptions based on my extensive experience with research participants.

          I agree, particularly since there are some participants who are mainly (or solely) motivated to participate by the money (or the credits) they get for participating. These participants are not necessarily interested in being debriefed and will therefore not always pay attention to the information given to them.

          1. These “participants” would have remembered something from the debriefing of last time’s weird creativity experiment…

          2. Not necessarily. Some labs debrief participants via email (that has the advantage of being able to give them the information after data collection has been finished, i.e., they cannot pass it on). And it is quite easy to not read an email.

          3. Sannanina, thanks for joining this discussion.

            You wrote: “I can say, however, that the papers on data that I have helped to collect DO list me and the other student assistants involved.” Would you mind providing the URLs of some of these papers?

            You wrote: “I am a former student of Jens Förster and I also worked as a student assistant in his lab quite some time ago. I am not currently working with him and haven’t worked for him for quite a while, so I cannot say anything about the papers in question since the data was collected after my time as a student assistant.”

            Jens Förster wrote: “The only thing that can be held against me is the dumping of questionnaires (that by the way were older than 5 years and were all coded in the existing data files) because I moved to a much smaller office. (…). This was suggested by a colleague who knew the Dutch standards with respect to archiving. I have to mention that all this happened before we learned that Diederik Stapel had invented many of his data sets.”

            Jens Förster moved in 2007 from Bremen to Amsterdam (http://www.dgps.de/index.php?id=199).

            So the dumping of the questionnaires took place in Amsterdam in the period 2007-2011 and the questionnaires were collected more than five years ago.

            Would you mind telling me when and where you were a student of Jens Förster and when and where you worked as a student assistant in his lab?

            Would you mind telling me where the data (questionnaires) for all three papers were collected? In Amsterdam? Or in Bremen? Or in Amsterdam as well as in Bremen? Or perhaps at other places as well?

            Thanks in advance for some feedback.

          4. I worked for Jens Förster in the summer of 2003 and from January 2004 to August 2006 in Bremen. I pursued my bachelor’s degree in Bremen from 2002 to 2005, majoring in biochemistry and cell biology. During that time I started to work in Jens’ lab, which in turn got me interested in psychology. I think I took my first class with Jens in the fall of 2003 (my transcript from International University Bremen/Jacobs University does not state the exact date). Several other classes followed. In fact, after I got my bachelor’s degree, I stayed in Bremen for another year to study psychology and to prepare for a master’s degree in this field.

            Concerning the URLs of the papers: I do not have a list with them, nor do I have all of them on file, but I will see what I can do later on.

            Concerning the data collection: I might have been wrong that I was not involved in the data collection for the papers. When I wrote my first comment, I assumed that I had not been involved. But since it looks like the data was collected in Bremen, this might not be the case (i.e., I might have been involved without realizing it). However, I have no way to track this. As I have stated previously, I did rate creativity tasks while working for Jens, but I do not know if I was involved in the ratings for the paper(s) in question.

            I hope this answers some of your questions.

          5. Hi Regina, thanks a lot for your friendly and extensive reply, and thanks even more for disclosing your name.

            No need to spend much time sorting out all kinds of details of these papers. I have already found several papers and checked (parts of) their contents. All these papers seem to be good (when looking at the details of the participants) and the part on the methods is always very extensive and very precise (all seen through the eyes of a biologist).

            I am just trying to reconstruct a proper time line to get clear when and where all these experiments of Jens Förster were conducted (see also one of my other postings).

            Your profile on http://nl.linkedin.com/pub/regina-bode/5/308/46 and on http://www.motivationlab.uni-osnabrueck.de/home.html tells me enough.

            I agree with you that it is indeed not sure if (parts of) the data were already collected when Jens was still working in Bremen (or at other places).

            I can also imagine that certain types of experiments with a certain number of participants (say 95) give a lot of data, and that (parts of) the basic data can also be used for other papers. Please excuse me if this is a dumb question. So maybe you might not have been aware that (data from) certain experiments also turned up in one of the three papers?

            Please excuse me once again if this is a dumb question, as such re-use happens (quite often) in my field of interest (re-use of data for different papers with a different theme / angle / topic, etc.).

    2. This is a new twist to the discussion, indeed!
      I never thought about this and at first glance I have to agree. In my eyes, this supports the claim made by others here that science should not be (ab)used for “prosecution”.

    3. Let me speak here as an independent external mathematical statistician who has occasionally been asked to give confidential advice in matters like this. Here is a link to a talk I have given on several cases I was involved in.

      http://www.slideshare.net/gill1109/heiser-symposium

      It’s an awful responsibility and a very difficult job. I think people who do accept a job like this take the job very seriously and spend an awful lot of time checking and double checking, and also being extremely careful in their written conclusions. *Of course* one’s job is in the first place to try to defend the accused! (A kind of devil’s advocate position). That is exactly what one does. One takes on a serious responsibility like this with the clear task in mind to try to explode the accusation. I think this is the job description of anyone asked to give advice as a “scientific expert”. Be the scientist.

      About the lack of statistical expertise on the part of the “defendant”, on the part of any likely typical “defendant”. (a) they bloody well ought to have more statistical expertise since a huge part of their job description is applied statistics. (b) they ought to easily be able to recruit competent advice from those who do have the required specialist knowledge. Who live in large numbers in another department of the same faculty. And “the specialist knowledge”, by the way, is not so specialist in a case like this! We are talking about elementary, standard, statistical methods going back to Fisher in the 20’s (who did a really pretty meta-analysis of Mendel’s experiments, turning up perhaps the first well documented “too good to be true” statistics in science), and known to all competent applied statisticians.

      About the one in 508 trillion chance: this number is so stupid, it shouldn’t have been published; at least, not in the conclusions of investigating boards. It’s a warning sign, I agree. Whenever you see a number with a large number of zeros you should automatically distrust it. I advise the following rule of thumb: always halve the number of zeros. If it is still a large number, repeat.

      These are the kinds of numbers which cause untold damage in law courts, in science, in society. Nobody has any idea what they mean, or rather, usually the completely wrong idea; and usually no idea at all that they are conditional on a whole heap of assumptions, some of which might be reasonable approximations, some perhaps not. The wise scientist has to explain the scientific significance of the analysis results, not report some abracadabra. When you multiply a lot of small numbers together and do not take account of possible error and possible dependence, the error in the final result is many orders of magnitude.

      Of course a competent external independent mathematical statistician asked to give advice in cases like this points out the likely dependence between the subjects used in the different studies. Worries about the possible non-robustness of the statistical analysis against the obvious departures from theoretical textbook assumptions. Tries out experiments comparing parametric and semi-parametric and completely non-parametric analyses. I know, because these are the things I’ve now done several times, and this is methodology which we discuss at conferences; we learn from experiences and share experiences.

      Maybe these are the reasons why the UvA’s conclusion was on the mild side? The research was not good, the papers should be labelled as such; no proof of cheating. Maybe the LOWI’s further investigations turned up further evidence? Maybe they were less mild because they are more independent? Who knows?

      The main thing here is that the linearity is unbelievable, both from the point of view of psychology and from the point of view of statistics … even after taking account of all the provisos I have just mentioned. The standard errors are unbelievably small and the effects unbelievably large from the point of view of psychology. (Don’t believe anything I say about psychology: check and double check. Ask wise, independent, competent people from the field). So the investigation started, it seemed, by simply looking at the summary statistics with a statistician’s eye and a psychology methodologist’s eye. Psychology faculties have departments filled with this kind of people, people who have exactly both these two ways of looking at experimental results coming out of psychology.

      Having seen this incredible anomaly the scientist looks for evidence to *disprove* the obviously suggested hypothesis. Where is it? Where is the data? Where are the log books? Instead all that is found are more red light warnings. What can you do?

      If I were the LOWI I would recommend that the Humboldt Foundation give Förster his grant and insist on his spending at least half of it (the first half) on careful, and carefully monitored, replication of these experiments. Another 2500 psychology students. That way all possible damage is corrected. Even if the results are negative they should be published widely and the new data sets archived.

  30. Excellent analysis! Indeed, the most overlooked QRP in the current case seems to be the selective use of statistics. This needs to be discussed in much more detail.

  31. The report “Suspicion of scientific misconduct by Dr. Jens Förster CONFIDENTIAL” bears no names of authors. It is written in a ‘we’ form, so I assume there are at least two authors. Their names are irrelevant. The report has a date, 3 September 2012.

    It seems to me that this report was part of the complaint filed against Jens Förster. The anonymized version of the UvA report ( http://www.vsnu.nl/files/documenten/Wetenschapp.integriteit/2014%20UvA%20manipulatie%20onderzoeksgegevens.pdf ) provides a time schedule (‘procedure’).

    1. 2012: a complaint was filed by dr. X. Let us assume this was on 3 September 2012 (or shortly before or after this date).
    2. 2012: board of UvA informs Jens Förster that a complaint was filed and that a Committee was installed to investigate the complaint.
    3. 2012: Jens Förster informs the Board of UvA that he is sick.
    4. 2012: the Committee asks Jens Förster for a response on the complaint.
    5. 2012: Jens Förster informs the Committee that he is still sick, that he is therefore unable to respond, and that it is expected that he will stay sick at least until 11 January 2013.
    6. 2013: Jens Förster informs the Committee that he is still sick.
    7. 2013: Jens Förster informs the Committee that he is still sick.
    8. 2013: the Committee receives a report from prof. dr. Y. (made at the request of the Committee and as a response to the complaint).
    9. 2013: the Committee informs Jens Förster about this report of prof. dr. Y and asks him for data files and other information (after having consulted the UvA GP if this was already possible).
    10. 2013: Jens Förster informs the Committee that he is still sick. He sends them a preliminary response and data.
    11. 2013: Jens Förster informs the Committee that he is still sick.
    12. 2013: the Committee sends Jens Förster a bunch of questions on any of the three papers.
    13. 2013: Jens Förster sends the Committee a second ‘preliminary response’.
    14. 2013: the Committee talks through Skype with prof. dr. (Y?)
    15. 2013: the Committee gets an e-mail from dr. ?. with answers to the questions sent to dr. ?. On the same day, the Committee talks with dr. ?. and with prof. dr. ?.
    16. 2013: the Committee informs Jens Förster that they are aware that he is still sick and that the Committee is therefore unable to talk with him, but that the Committee continues with their inquiry and with hearing other people.
    17. 2013: Jens Förster informs the Committee that he will slowly try to re-start his job in June 2013 (‘re-integratie’).
    18. 2013: the Committee sends Jens Förster and the complainant an invitation for a hearing where both can react to each other and to questions from the Committee.
    19. 2013: the Committee receives a final response from Jens Förster.
    20. 2013: the Committee receives a second report, prepared by prof. dr. Z (made at the request of the Committee). Both reports have been sent to the complainant and to Jens Förster.
    21. 2013: the hearing of the Committee takes place with both the complainant and Jens Förster.
    22. 10 July 2013: date of the preliminary decision of the board of UvA.

    1. I wonder what would happen if a particularly fervent statistician ran a computer program to find the pattern that is most common across several studies?

  32. Let me first identify myself as a friend and a collaborator of Jens Förster. If I understand correctly, in addition to the irregular pattern of data, three points played a major role in the national committee’s conclusion against Jens: That he could not provide the raw data, that he claimed that the studies were actually run in Germany a number of years before submission of the papers, and that he did not see the irregular pattern in his results. I think that it would be informative to conduct a survey among researchers on these points before concluding that Jens’ conduct in these regards is indicative of fraud. (In a similar way, it would be useful to survey other fields of science before concluding anything against social psychology or psychology in general.) Let me volunteer my responses to this survey.

    Providing raw data
    Can I provide the original paper questionnaires of my studies published in the last five years or the original files downloaded from the software that ran the studies (e.g., Qualtrics, Matlab, Direct-Rt) dated with the time they were run? No, I cannot. I asked colleagues around me, they can’t either. Those who think they can would often find out upon actually trying that this is not the case. (Just having huge piles of questionnaires does not mean that you can find things when you need them.) I am fairly certain that I can provide the data compiled into workable data files (e.g., Excel or SPSS data files). Typically, research assistants rather than primary investigators are responsible for downloading files from running stations and/or for coding questionnaires into workable data files. These are the files that Jens provided the investigating committees upon request. It is perhaps time to change the norm, and request that original data files/original questionnaires are saved along with a proof of date for possible future investigations, but this is not how the field has operated. Until a few years ago, researchers in the field cared about not losing information, but they did not necessarily prepare for a criminal investigation.

    Publishing old data
    Do I sometimes publish data that are a few years old? Yes, I often do. This happens for multiple reasons: because students come and go, and a project that was started by one student is continued by another student a few years later; because some studies do not make sense to me until more data cumulate and the picture becomes clearer; because I have a limited writing capacity and I do not get to write up the data that I have. I asked colleagues around me. This happens to them too.

    The published results
    Is it so obvious that something is wrong with the data in the three target papers for a person not familiar with the materials of the accusation? I am afraid it is not. That something was wrong never occurred to me before I was exposed to the argument on linearity. Excessive linearity is not something that anybody checks the data for.
    Let me emphasize: I read the papers. I taught some of them in my classes. I re-read the three papers after Jens told me that they were the target of accusation (but before I read the details of the accusation), and after I read the “fraud detective” papers by Simonsohn (2013; “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone”), and I still could not see what was wrong. Yes, the effects were big. But this happens, and I could not see anything else.
    The commission concluded that Jens should have seen the irregular patterns and thus can be held responsible for the publication of data that include unlikely patterns. I do not think that anybody can be blamed for not seeing what was remarkable about these data before being exposed to the linearity argument and the analysis in the accusation. Moreover, it seems that the editor, the reviewers, and the many readers and researchers who followed up on this study also did not discover any problems with the results, or, if they discovered them, did not regard them as problematic.

    And a few more general thoughts: The studies are well cited and some of them have been replicated. The theory and the predictions it makes seem reasonable to me. From personal communication, I know that Jens is ready to take responsibility for re-running the studies and I hope that he gets a position that would allow him to do that. It will take time, but I believe that doing so is very important not only personally for Jens but also for the entire field of psychology. No person and no field are mistake proof. Mistakes are no crimes, however, and they need to be corrected. In my career, somehow anything that happens, good or bad, amounts to more work. So here is, it seems, another big pile of work waiting to be done.

    1. Many people contributing to this discussion confuse the Research Hypothesis with the Linearity Hypothesis.

      Forster wanted to get significant results for his Research Hypothesis, and that is why he manipulated his data. However, he did this in such a naive and clumsy way that, as a side effect, he produced near-perfect linear relationships between his independent and dependent variables.

      Even if these relationships were truly linear in the population, it would still be essentially impossible to find such near-perfect linearity in samples (a simulation sketch follows at the end of this comment). The Linearity Hypothesis has been used to investigate data manipulation.

      Now that data manipulation has been proven beyond reasonable doubt, it makes no sense to defend Forster by saying that his Research Hypothesis may still be tenable and that he should be given the opportunity to replicate his studies.

      And it does not make sense either to say that the linearity is not easy to spot. Forster did not have to spot the linearity to know that he cheated, he was there when he did it.

      The Linearity Hypothesis was only used to demonstrate the data manipulation, independently from the Research Hypothesis. Of course, the incredibly high effect sizes for the Research Hypothesis may have sparked the interest that led to the accusatory report.

      If Forster had been more modest in his manipulation, introducing somewhat smaller effect sizes, and if he had been more knowledgeable about how to manipulate data, then he would not have been caught.
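      To make the point above concrete, here is a minimal simulation sketch in Python (NumPy/SciPy). It is my own illustration, not the complainant’s actual analysis: it assumes three equal groups whose population means are exactly linear, computes the left-tailed p-value of the standard single-degree-of-freedom deviation-from-linearity contrast, and shows why values consistently close to zero across dozens of independent samples are wildly improbable even when the true effect really is linear.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)

        def left_tail_p_nonlinearity(low, med, high):
            # Left-tailed p for the deviation-from-linearity contrast (1, -2, 1)
            # across three independent groups; small values mean the three sample
            # means sit closer to a straight line than chance alone predicts.
            groups = [low, med, high]
            ns = np.array([len(g) for g in groups])
            means = np.array([g.mean() for g in groups])
            sse = sum(((g - g.mean()) ** 2).sum() for g in groups)   # within-group SS
            mse = sse / (ns.sum() - 3)                               # pooled error variance
            c = np.array([1.0, -2.0, 1.0])
            f_nonlin = (c @ means) ** 2 / (mse * (c ** 2 / ns).sum())
            return stats.f.cdf(f_nonlin, 1, ns.sum() - 3)

        # population means EXACTLY linear (4, 5, 6), ordinary noise
        n, sd = 20, 1.5
        p_one = left_tail_p_nonlinearity(rng.normal(4, sd, n),
                                         rng.normal(5, sd, n),
                                         rng.normal(6, sd, n))
        print(p_one)        # uniformly distributed over (0, 1) under exact linearity

        # chance that ALL of 42 independent samples land below, say, 0.10:
        print(0.10 ** 42)

      Under these assumptions the left-tailed p-value is uniform when linearity really holds, so a long run of samples that nearly all sit in the extreme left tail cannot be explained by the population effect simply being linear.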

      1. I am really surprised by such strong statements based on likelihoods.

        I do not think that this is convincing; rather, it makes me suspicious about the hidden agenda of these (or is it only one?) authors.

        Probably the most plausible explanation: this is a paradoxical intervention by someone who wants to convince the readers that the “proofs” are not as strong as claimed.

        If this is the case: the paradoxical intervention worked – at least for me.

    2. Dear professor Liberman,

      Thanks a lot for your extensive posting, and great that you are using your own name. I would like to respond to your posting by telling you something more about my own experiences while doing research and while publishing the results of this research ( http://scholar.google.es/citations?user=hmhMcScAAAAJ&hl=en ).

      Raw data (= paper lists with all biometric details of all individuals) collected as a student in the summer of 1983 and published in 1991 ( http://wildfowl.wwt.org.uk/index.php/wildfowl/article/view/1408 ) are still available. The same is the case for raw data collected in the summer of 1993 and published in 2002 ( http://onlinelibrary.wiley.com/doi/10.1046/j.0019-1019.2001.00014.x/abstract ). This paper also contains a map and a table with details of all locations (including years and observers, etc.) where the raw data have been collected.

      A few weeks ago, I submitted a paper in which we document a new longevity record of 33 years for a particular species of gull. This bird was marked in the UK in January 1980 by an organisation which has not existed for a very long time. Nevertheless, I received a scan of a paper list with the raw data for this bird (as well as the raw data for all birds captured on that day) within a day of sending an e-mail to an ornithologist who had published a short online report with some general results of this study.

      I don’t want to convince you that all biologists store their raw data in such a way. That’s not the case. Besides that, the data in my paper from 1991 were digitized in the 1980s and the calculations (SPSS) were carried out on a mainframe computer. For sure, this digital information does not exist anymore.

      I still don’t understand why APA guidelines about storing raw data seem to be irrelevant when submitting papers to journals which follow these APA guidelines.

      You are right that Jens should not be blamed for not seeing that something was wrong, in particular because you, and many of your other colleagues, were also unable to detect these irregularities.

      ‘Doing the right thing’ means, for example, that you inform the EiC of the journal as soon as possible that there are concerns about (parts of) your paper when such information is exposed. ‘Doing the right thing’ can also mean that you publish a rebuttal of the report of the complainant. Maybe you might ask Jens whether he has contacted the EiCs of the journals, and maybe you can ask Jens whether he could release his defence that nothing is wrong with all three papers?

  33. This gives further information about the busy life of Prof. Förster: http://www.referentenagentur-bertelsmann.de/speaker/173116/Jens_F_rster.html

    He was clearly in the know about “best practice”, because he was a member of the ethics advisory committee of the Section Social Psychology of the German Psychology Society: “Von 2003 bis 2005 war er Sprecher der Fachgruppe Sozialpsychologie der Deutschen Gesellschaft für Psychologie, deren Ethikrat er außerdem beisaß.” (“From 2003 to 2005 he was spokesman of the Social Psychology section of the German Psychological Society, on whose ethics council he also sat.”)

    Further it says: “Auf internationalen Kongressen werden seine Thesen stark diskutiert. Förster ist einer der produktivsten und meistzitierten Sozialpsychologen seiner Generation und hat zahlreiche Beiträge in internationalen Fachzeitschriften veröffentlicht.” In short, it says that his theses are intensely discussed at international conferences, and that he is one of the most productive and most-cited social psychologists of his generation, with numerous contributions in international journals.

    As I mentioned about this 2012 paper in my earlier comment, I wonder how he would have been able to carry out such a big study all on his own, and it surprises me that someone at this level would make the kind of beginner’s mistakes he did. The story gets more amazing the more you dig into it.

    1. Johny: “As I mentioned about this 2012 paper in my earlier comment, I wonder how he would have been able to carry out such a big study all on his own, and it surprises me that someone at this level would make the kind of beginner’s mistakes he did. The story gets more amazing the more you dig into it.”

      Does publishing a long series of studies as a single author make you a suspect because it makes people wonder?

      Is the fact that a top-level researcher makes ‘beginner mistakes’ somehow indicative of fraud?

      1. LOWI does not say that they believe Förster committed fraud. They say he is responsible for what they consider fraudulently generated scientific conclusions. They believe someone cheated; they don’t know who, but they do know that Förster was responsible and, moreover, that he should have realised that something was badly wrong.

        It’s quite subtle.

        Is it a good thing that top-level researchers make numerous big beginner’s mistakes, and don’t even realize it? Even though they are glaringly obvious? It’s the top-level researchers whose research is noticed outside their own field. And it’s the top-level researchers whose behaviour sets the standard for everyone else inside their field. High exposure to risk, high responsibility. It goes with the job.

    2. Dear Johny,

      Once again thanks for your nice link with some background information (in German, but no problem for me to understand).

      On 21 September 2012, the Royal Netherlands Academy of Arts and Sciences published a Dutch version of a report with the title “Responsible research data management and the prevention of scientific misconduct”. The English version of this report (issued in 2013) can be downloaded for free from https://www.knaw.nl/en/news/publications/responsible-research-data-management-and-the-prevention-of-scientific-misconduct?set_language=en

      Highly recommended for anyone who is interested in research ethics. The Dutch version of the report can also be downloaded for free. On 15 May 2013, the Royal Netherlands Academy of Arts and Sciences has even published another report about this topic ( https://www.knaw.nl/nl/actueel/publicaties/vertrouwen-in-wetenschap , only in Dutch). Once again, highly recommended for anyone interested in research ethics.

      http://www.ncbi.nlm.nih.gov/pubmed?term=jens%20f%C3%B6rster%5BAuthor%5D lists 21 publications (but I am not yet very familiar with working with Pubmed, so please correct me when I am wrong).

    3. Johny, now and then I attend the defence of a PhD thesis at the University of Groningen (The Netherlands). Such a defence is public and parts of it are ceremonial. On the other hand, anyone is free to ask the candidate (tough) questions about any aspect of the thesis. Nowadays, at least since February 2014, the candidate will only get his PhD when he declares in public that he will work strictly according to the rules of the Code of Conduct for Scientific Practice.

      So I am wondering why Jens Förster, given that he was “member of the ethics advisory committee of the Section Social Psychology of the German Psychology Society”, does not respond to all the questions of people over here.

      One example to illustrate my point of view. Several people over here have suggested that Jens has selectively omitted data, but that this is not reported in the papers. The Code states “II.1 The selective omission of research results is reported and justified”.

      In my humble opinion, ‘best practice’ means that Jens Förster, and/or his co-author Markus Denzler, clarify how these suggestions can be rebutted.

  34. I had some more thoughts about what people could do who feel that Förster has been treated unfairly etc. Here’s what they can do. If they are social psychologists they can replicate his experiments. If he has 100 supporters around the world, each one can do one small experiment within just a few months from now, surely? Every experiment should be carefully planned and the protocol of the experiment published in advance. Good statisticians and good methodologists should be consulted to ensure that the experiment is performed to the highest standards. No QRPs. Each experiment leads to a published data set, with all the standard documentary evidence available, and a published outcome. Whether positive or negative.

    We will carefully observe not only whether or not the theory is confirmed, but also if the completely unexpected (and as far as I know, unexplainable) straight line turns up.

    Real experiments are perhaps too expensive or time consuming, but computer experiments are easy. Let the social psychologists who believe in the integrity of Förster’s results work with applied statisticians and methodologists to do computer simulation experiments to test the methodology of the whistle-blower’s initial report. See if they can find holes in the arguments. What is the effect of correlation between sub-studies because the same subjects are taking part in both sub-studies? Does it depend on whether the subjects in different studies are re-randomized to the three treatments or not?

    It seems to me that if science itself later vindicates Förster then his reputation is also restored.

    Förster’s work can be replicated by researchers careful to avoid any possible criticism of QRP’s. The whistleblower’s methodology can be tested.

  35. For the last few days, reading these posts, I’ve been hesitating about whether to post something, probably for the same reasons why we don’t see many other social psychologists responding here. I have to commend Nira and others for openly posting; I’m not that brave. Putting oneself in the spotlight by posting here is not something everyone would risk. Even if there is nothing strange about one’s data or one doesn’t engage in questionable research practices, I know from personal experience that the mere accusation of anything even remotely related to questionable research practices sets in motion a horrible process.

    On the one hand, I wouldn’t go as far as some people here in questioning the motives of the accusers, but I do think it is unfortunate that this complaint was released to this website — who did this is unclear. People are reading this so-called report as if it were the unbiased report, but it is not. It is intended to convince the reader of Jens’ guilt, not to inform in an unbiased way. On the other hand, the fact that someone DID release this complaint does create some questions about the motives. Perhaps a suggestion to the administrators to change the title to something other than “report”…?

    Although I agree with some people that the statistical methods are also debatable, to the extent that I understand them they do make a lot of sense to me and are quite convincing. The way they are used and presented to scold is something that doesn’t make a lot of sense to me, though. Well, I should specify: the argument based on linearity and the correlations between the extents to which the experimental conditions deviate from the control conditions is convincing. As others have hinted at, most of the other evidence is extremely circumstantial at best. That is not to say that the papers do not suffer from sloppiness in terms of reporting. However, for most if not all of them (except the statistical arguments) I can easily think of reasons or situations that might have brought them about. For instance, in the review process one needs to remove references to where the data were collected, because this information in combination with the topic of the paper immediately tells reviewers who wrote the paper. Perhaps, in his joy over getting the paper accepted, Jens failed to put this information back into the final manuscript. The same applies to names of research assistants who conducted the research.

    The effect sizes are very high, but in itself this could just mean he discovered a factor that is very powerful. In itself this does not seem suspicious to me. Of course, if you already think someone is guilty, all ambiguous information points in that direction. Confirmation bias, which is not something one would expect if the accusers are free of biased motives of their own. Did the accusers have the opportunity to look at the effect sizes in the data files themselves? As we know, psychologists are notoriously bad in the use and appreciation of effect sizes – maybe he reported effect sizes as eta-squared but these were actually something else 😉 On a side note, I am not sure whether the comparison sample for the etas is representative. For behavioral or performance measures, I often find eta-squared values around or above .15, which would be considered anomalies according to this comparison.

    Well, people seem to be suggesting that Jens did not provide any information to the committees. That seems highly unlikely to me. If he did make up the data, he could easily have also made up, on the fly, even now, where he collected the data. Also note that research assistants often will not even have a clue about what study they are running or what relationships are expected (a good thing in a double-blind study, mind you), so it is not so surprising that he does not recall either.

    People seem to think it is all unlikely: the number of participants, the lack of knowledge of dates, and so forth. A productive social psychologist who conducts at least 20 experiments/studies in a year (many do, and the most productive people probably do 40–50 experiments in a year) is not going to remember who collected each study and when, especially not after five years.

    We do not as yet have enough information about the case to say anything about these “other” issues. As Jens mentions in his reply, the publication and decision, it seems, suddenly went very quickly. I tend to believe him and I also think it is wise for him not to respond here to every possible comment. These accusations ruin people, can ruin careers and need to be handled carefully.

    My question is why we do not see more people here defending and/or explaining the statistical methods used to make the accusation. If they were so eager to release their complaint, rather than wait for the official institution to deal with it, they should also be here to explain the details. I would very much like to see some more charitable interpretations of the data, some “what if he did this and this…” For example, let’s say we were to add several possibilities such as the following: (1) several control conditions were run, only the nicest one was reported (questionable reporting, certainly, but forgivable maybe) (2) rather than 40 experiments, he ran 120 (definitely not impossible, certainly again questionable although it might have been the case that he was just perfecting measures and manipulations). Let’s say we did add a couple of contingencies like this. What would happen to the probabilities then?

    I think I have a reasonable understanding of the linearity argument/hypothesis. In my reading, the argument hinges on the control conditions. Are the differences between the experimental conditions (without considering the control conditions) also too uniform to be true? Perhaps he just ran a lot of control conditions (perhaps even separately or post hoc) and picked the ones that were best… This is certainly not the way to conduct valid scientific research, but it is not the same as fabricating data.

    My point is, the strongest (and in my reading the only real) argument that there IS something questionable here involves the control conditions. If we set aside the other missing information about participants for a minute, add some contingencies and some questionable practices, and look at the “clean” evidence against Förster, then what remains? Could we test the probabilities of several of these alternative possibilities, rather than jumping to the conclusion of data fabrication?

    Unfortunately, I know I can’t, but someone certainly should.

    1. The LOWI concluded fraud, but did not know who committed it. They did state that Förster was responsible for the research and should have been aware of it. So in their view, he is either fraudulent or incompetent.

      It is not too clear what LOWI understands as fraud and what they understand as QRPs (questionable research practices). Many highly prevalent practices in psychology research, accepted by many researchers as practically necessary and therefore defensible even if not strictly kosher, would be called fraudulent by researchers in other fields.

      So what do you mean by data fabrication? You can “fabricate” results by deliberately selecting just part of the data. You have thereby fabricated the final data set (the one whose summary statistics are reported in the paper), but you have not actually fabricated data in the sense of fabricating subjects’ responses. You can fabricate results by altering the data set. You haven’t fabricated data in the sense that Diederik Stapel fabricated data.

      The question, what were the motives of the accuser, is a very good one. It should always be asked and if possible it should be answered. Hopefully the UvA CWI and the LOWI were able to gain insight into this question.

      1. Dear Richard,

        You wrote: “The question, what were the motives of the accuser, is a very good one. It should always be asked and if possible it should be answered.”

        I tend to disagree with you.

        The Code of Conduct ( http://www.uu.nl/SiteCollectionDocuments/The%20Netherlands%20Code%20of%20Conduct%20for%20Scientific%20Practice%202012.pdf ) and the Integrity Complaints Procedure ( http://www.uu.nl/SiteCollectionDocuments/Corp_UU%20en%20Nieuws/Klachtenregeling%20WI%201%20september%202012_EN%20def.pdf ) provide the motive [these are the Utrecht University versions, but the regulations are essentially identical for anyone doing research at any of the Dutch universities]. The motive is a scientific motive.

        “This Code obliges researchers not only to conform but also to actively maintain and promote the rules for integer scientific conduct in his academic circle.”

        “One way to check academic integrity is to exercise the right of complaint when employees [of Utrecht University] have breached or are suspected of breaching integrity.”

        So anyone working as a researcher at any of the Dutch universities, including you and including the accuser, must actively maintain and promote these rules, and one can easily argue that you and the accuser are morally obliged to file a complaint when there is serious evidence that one of your colleagues is breaching the rules (after you have discussed this topic with your colleague and after it has become clear that he is unwilling to repair his errors, see I.10).

        Well, the accuser was of the opinion that this was the case for Jens Förster, and thus a complaint was filed with the board of UvA. The accuser prepared a report in which it is argued what is wrong with these papers of Jens Förster, and the accuser added this report to support the complaint. That’s all.

        1. In general I agree that the motives of an accuser should not be relevant, but the assumption there is that the accuser’s motives are non-selfish. From reading the accusation, I can’t escape the impression that this person was doing everything in their power to make the case. Should it really be so that a colleague is trying to make their best case against you, or should the colleague also try to present the evidence in a balanced way? If the former, it seems that the motives lean more in the direction of persecution than of scientific purity. Should there really be an army of statisticians checking everyone’s work for ALL things that COULD potentially indicate misconduct? That basically means a witch hunt. I don’t think anyone actually wants a witch hunt, but I could be wrong.

          1. There is no doubt that the report filed by the complainant is biased, as the complainant needed a report (or whatever) in which the concerns were presented as strongly as possible. On the other hand, Jens Förster was free to provide the Committee with anything which would lead to the conclusion that the report of the complainant was rubbish / nonsense, etc. Jens Förster was also totally free to ask any statistician (or anyone else) for help with preparing such a report (or whatever). Releasing this material of Jens Förster would give people the opportunity to judge for themselves whether the report of the complainant is rubbish or not.

            I also have no idea whether and how often the complainant and Jens Förster have discussed the concerns about the three papers with each other, and I also have no idea what the outcome of these discussions was (given that they have taken place). So I have no clue what kind of actions the complainant took before the complaint was filed.

          2. If I were the committee, I would want to know something about the motivation and methods of the complainant. I mean, the procedure by which the report came to be written. Personal gain? Academic gain? [i.e. academic credit from being a successful fraud-buster: *after* your victims have been disgraced, write papers about your wonderful role in that]. Revenge? Fishing expedition? p-values mean very little if we are not told the procedure which led the researcher to calculate them! Which ones were calculated but not reported?

            If I were suspicious, I would tell the complainant to get stuffed, and just go and follow normal scientific procedures. Write a non-anonymous scientific paper and try to get it published. One can always study the scientific integrity of the scientific works first; anyone who wants to can ask questions about the personal integrity and/or competence of the researcher afterwards.

            In this case the complainant apparently comes from the methodology department of the same psychology faculty. It seems to me that anyone studying QRPs in that department would automatically take a look at the work going on around him or her, especially, I am sorry to say, in social psychology. And if they found anomalies, first of all discuss them with the scientist in question. I don’t know if that happened. If I were the CWI I would want to know (a) who the whistleblower was and (b) if they hadn’t already approached the scientist in question, why not.

    2. PS maybe it is not wise to conduct at least 20 experiments/studies in a year; doing 40–50 experiments in a year seems to me quite crazy. Physics PhD students spend four years doing one experiment. Quantity or quality? Is it really so smart to do one experiment a week so you can publish two papers with significant findings per year?

      This is how you can generate anomalies like Förster’s. Do one experiment per week and keep only the best one or two per year. After five years you have a nice dossier for your final paper. The selection procedure automatically (a) selects experiments where the within-group variances were coincidentally extremely small, and (b) selects experiments where the three sample averages lie close to a straight line!!!

      Is this good science?

      1. PS The process I just described certainly generates a biased sample with (a) small within-group sample variances, and (b) sample means close to a straight line, because those two features both lead to a bigger F statistic for the test of differences between the three groups. Does the biased sample also tend to have (c) a surprisingly non-significant F test of non-linearity?

        If so we have a “pure QRP” explanation of the phenomenon observed.

        My hypothesis can easily be tested by simulation, but a clever theoretician should also quickly be able to find out whether this was a brilliant insight or a completely stupid idea.

        If actually the procedure was not to select on a good result of the F test of difference between the three groups, but instead to select on a good result of the F test of the difference between control and the average of the “high” and “low” treatment groups, then we have an “innocent” (i.e. QRP) explanation for the striking linearity. It is not of psychological origin but is an artefact of a QRP: doing many experiments and selecting only the “successful” ones for inclusion in your publications.

        Did Förster know what was a QRP and what was not a QRP five years ago? I doubt it. Does he know it today? I hope so.

          1. This doesn’t generate the results mentioned in the complaint and it wouldn’t explain the way the case unfolded:

          1. If Forster ran 1000 studies and selected those that worked, he would have selected them to fit the research hypothesis, not the linearity hypothesis. To get this pattern of results, you would have to select studies purely on their conformity to the fraud markers, not on their conformity to your theory.

          2. Even if, for some reason, Forster ran 10,000 studies and carefully selected those that showed the fraud markers of linearity and homogeneous variance, why hasn’t he mounted such excessive use of QRPs in his defense?

          1. YM: I’m trying to explain that selecting studies that fitted the research hypothesis might bias them to fit the linearity non-hypothesis, at the same time. It’s a bit technical, sorry.

            The research-hypothesis oriented selection procedure does select data sets *closer* to linearity. That is because the power of the usual research-hypothesis F test (ANOVA, test of equality of three groups) is greatest when the three mean values are equally spaced, other things being equal. However, we must also take account of the within-group variances. The research-hypothesis oriented selection procedure also selects data with small within-group variance, which, other things being equal, makes it easier to *reject* linearity. Which of these two opposing tendencies wins? It should be easy to find out. Maybe someone knows the answer already.
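            The question of which tendency wins lends itself to a quick Monte Carlo check. The sketch below (Python/SciPy, a toy setup of my own, not anyone’s actual analysis) simulates many three-group experiments with a modest, exactly linear true effect, “publishes” only those whose overall ANOVA is significant, and compares the left-tailed non-linearity p-values of the published subset with those of all studies; because the true effect is exactly linear, any shift away from uniform in the published subset would be due to selection alone. The variant based on other selection rules can be checked by swapping the selection line.

              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(2)

              def one_experiment(n=20, sd=1.5, means=(4.0, 4.5, 5.0)):
                  # One 3-group experiment: return (overall ANOVA p-value,
                  # left-tailed p of the deviation-from-linearity contrast).
                  groups = [rng.normal(m, sd, n) for m in means]
                  _, p_overall = stats.f_oneway(*groups)
                  xbar = np.array([g.mean() for g in groups])
                  sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
                  mse = sse / (3 * n - 3)
                  c = np.array([1.0, -2.0, 1.0])
                  f_nonlin = (c @ xbar) ** 2 / (mse * (c ** 2).sum() / n)
                  return p_overall, stats.f.cdf(f_nonlin, 1, 3 * n - 3)

              results = np.array([one_experiment() for _ in range(20000)])
              published = results[results[:, 0] < 0.05]   # keep only "significant" studies

              print("mean left-tail nonlinearity p, all studies:      ",
                    round(float(results[:, 1].mean()), 3))
              print("mean left-tail nonlinearity p, published studies:",
                    round(float(published[:, 1].mean()), 3))
              # If selection on a significant overall F by itself pushed studies toward
              # extreme linearity, the second number would be far below the first.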

    3. Dear Joop,

      two comments on this. First, regarding the effect size issue: your typical etas, although quite large, are still less than half of what Förster reports in the three papers. So that’s really a different league, particularly in a field in which the typical effect size is around .21 (reported by a meta-analysis of social psychology effect sizes from around 2002ish). This corresponds to 4.4% of variance. No, this is not eta squared, which tends to be larger than r squared. But it also provides context for the 20 to 40% of variance (r squared) accounted for by Förster in his studies.

      Which brings me to my second comment. With effect sizes this big, as a behavioral scientist you should not only be happy that your particular hypothesis gets supported time and again. You should also start scratching your head about the phenomenon in general. In behavioral sciences the general rule is that even if you have a proven effect/source/process affecting behavior, you will only see it through a lot of noise generated by other effects/sources/processes, which usually cuts down your effect size. Not the case here. It’s as if Förster has hit upon an effect that slams through with maximum force, probably scratching the upper limits of reliable variance in the dependent measures. As a researcher, I would be fascinated by the fact that I appear to have hit upon an incredible exception from the rule in behavioral science and would want to understand what makes this process so special. It should be clear right away that this is much more than just proving a hypothesis about creativity or global/local processing. Unless, of course, you think that this type of processing is absolutely fundamental and so all-pervasive that it will have these ultra-strong effects.

      The last time I looked at effects this strong, it was in a paper written by a colleague. I was the coauthor. After I was initially thrilled that such strong effects had been found, and after the paper had already gone through one round of review at a peer-reviewed journal (none of the reviewers questioned the effect sizes!), I became concerned about outliers and analyzed all the data myself. That’s when I spotted the wonderful phenomenon of data points aligning closely and uniformly to regression lines across the entire range of scores, and it became clear to me that someone had tinkered with the data. Long story short: my colleague, to his/her credit, followed the data trail and was able to identify one RA who had been responsible for these data fabrications and changes and who had done the same thing on an unrelated bachelor’s thesis, too. We pulled the paper from the review process (rarely have I been so relieved to see a paper lost!). My colleague brought the RA’s behavior to the attention of the department, but was told not to make a fuss about it. The RA had already graduated by that time.

      I tell this story to underscore one basic fact: as scientists, we are all responsible for our data and understanding what processes, valid or invalid, gave rise to them. Before I publish anything, I look at my data from every possible angle, trying to make sure that I understand them, that no artifacts have crept into the file during data processing and transformations, etc. And if I then saw patterns as strong and consistent as those in Förster’s papers, I’d think that I’d found the holy grail and would do everything and anything to understand why I found it and how something as strong and shiny as this is possible in the first place.

  36. I think a number of distinctions are in order:

    1. This keeps coming up. The document enclosed to this webpage is *not* the report. It’s the complaint. The relevant question about the complaint is *not* whether the material presented is sufficient for a conviction. The relevant question is whether the material in the complaint is sufficient to warrant an investigation.

    2. The UvA initially concluded that data manipulation could not be proven because QRPs could not be ruled out. The LOWI investigation will likely have focused on exactly the question of whether QRPs are or are not a plausible alternative explanation of the strange patterns in the data. They concluded that this alternative explanation is implausible. It is likely that they have mounted evidence for precisely this conclusion. The debate is uninformed without this information, and I think that university officials, who hold the report in their possession, should release it now (hint).

    3. Data manipulation was considered proven by LOWI, but Forster was *not* found guilty. Instead, he was found *responsible*. It seems to me that the immediately relevant question is therefore: who manipulated the data? It is remarkable that Forster doesn’t raise the question. It is highly unlikely that he gathered the data himself, so it would seem extremely important to investigate who else might have done it.

    4. Whether or not the *research hypothesis* (“global is good”) replicates is immaterial to the question of whether the data were manipulated. Cyril Burt manipulated his twin data to prove that IQ is heritable. His conclusion is true, but that doesn’t mean he didn’t fabricate data. The same holds for Jan Hendrik Schön, some of whose work was replicated but still fraudulent. Even some of Stapel’s faked work has been replicated.

    5. The probability argument in the complaint computed the likelihood of finding means to lie exactly on the regression line *if the linearity hypothesis were true*. The relevant number has a very precise interpretation, namely: if the population means were *exactly linear* and one repeatedly gathered samples as in Forster’s research (under relevant assumptions), then one would find results as close or closer to linearity once every 508,000,000,000,000,000,000 times. This is not a Bayesian probability (i.e., it does not give the probability of Forster being straight) but a frequentist one (it gives the probability of finding these or more extreme results, *if* Forster were straight, i.e., the linearity hypothesis happened to be true).

    1. Your fifth point is quite interesting. So what is the probability of the pattern given that Forster did not hold a linearity hypothesis but simply hypothesized an ordinal pattern (a < b < c)?

        1. But only if the STATISTICIAN holds a linear hypothesis, although there exists an infinite number of alternative hypotheses that should also be considered when calculating the probability. Am I right, Richard Gill?

          1. Sorry, I am not a statistician but I would like to understand if the probability 1/508,000,000,000,000,000,000 is correct. So far, I have learned that the result depends on the specific (linear) hypothesis the statistician has intentionally selected (cherrypicking?). But what if a curvilinear hypothesis (preserving the ordinal relationship) had been tested? Are we still in the trillions? I happily stand corrected, but “thumbs down” won’t do.

          2. In Germany they say “Keine Antwort ist auch eine Antwort” (no reply is a reply). Thus, until I am convinced of the contrary, I am concluding that the calculation of the incriminating probability is faulty because it was based on a questionable research practice (QRP). I am curious whether this issue will be addressed in the final LOWI report.

          3. Perhaps you should consult with Jelte Wicherts from UvA. He is “specializing in errors with statistics”.

          4. I’m not sure whether I understand the question correctly. The reason why a linear hypothesis was assumed is because the data presented in the papers ARE supposedly linear. Therefore, the most favorable way of testing “against” these data is to assume that there is a linear order of those conditions “in real life”. Testing against an ordinal relationship instead would even further decrease the probability because Förster’s data would then have to be linear “on accident”.

            Now, had the data in Förster’s paper not been linear, you would be correct. Finding an ordinal data pattern would be much likelier than finding a linear pattern (although, again, it might be questionable whether this could happen 40 times in a row if the effect was not ridiculously large). In fact, as I understand it, Förster’s papers have been (conceptually) replicated, which means that the effect is probably real in the sense of an ordinal pattern.

          5. Hmm, you cannot test a general curvilinear hypothesis (like a hyperbola) based on three groups. You don’t have enough degrees of freedom for that. For a general curvilinear relationship, there always exists a possible curve that fits all data exactly. So that approach fails. Bayes doesn’t solve that either, by the way (it also makes assumptions). That is also why the report mentions that the analysis was not applied to some sub-sets of outcomes based on two groups: there always is a perfectly linear relationship through two points.
            If your point is that we cannot infer that data are impossibly unlikely without making some distributional assumptions, then you are right. There is *always* some underlying true data structure that would make an observation “likely, but not too likely”. However, the point here is that even the assumption most favorable to the author, namely that the relationship is linear as suggested by those data themselves, is unable to explain the degree of linearity.
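            A three-line check of the degrees-of-freedom point (purely hypothetical numbers, nothing to do with the actual papers): a second-degree polynomial has three free parameters, so it passes exactly through any three group means, leaving a “general curvilinear” hypothesis nothing to test.

              import numpy as np

              x = np.array([1.0, 2.0, 3.0])        # the three conditions (low, control, high)
              means = np.array([3.1, 4.9, 5.2])    # any three sample means whatsoever
              coeffs = np.polyfit(x, means, deg=2) # quadratic: three parameters, three points
              print(np.polyval(coeffs, x) - means) # residuals are zero (up to rounding)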

          6. Freddie, the linearity hypothesis in the Report has nothing to do with what Forster himself hypothesised; it is based purely on the observed fact that the data are very linear.

            The statistician’s null hypothesis – based on those data – is to assume that the underlying psychological phenomena really were linear. That is the null hypothesis because it is the simplest explanation of the linear data. But it turns out that even on that assumption, the chance of the data being that linear is 1 in 508 quintillion.

            If the underlying phenomena were not linear, the odds would be even smaller. So the 508 quintillion represents the most favorable assumption for Forster, not a cherrypicked one to attack him…

          7. As far as I understand, H(0) says there is no difference in the “population”, whereas H(1) says there is. The qualification that H(1) implies a linear relationship comes solely from the statistician. It was certainly not Forster’s hypothesis. That is, the linear hypothesis was imposed by the complainant (or whoever). My still unanswered question is: what if a curvilinear hypothesis had been tested, which is possible with three means?

          8. No. The linear hypothesis does not come from the complainant, but from Förster’s DATA. Several people have responded with basically the same thing: this hypothesis has nothing to do with what Förster hypothesized, and that is irrelevant to testing whether his data are “too good to be true”. The test is carried out to find out whether his data can be “real” under the assumption most favorable to Förster: by assuming that linearity is the “true” effect, and testing how likely it is to find linear data 40 times in a row under this assumption. If a curvilinear hypothesis were the “true” effect, the likelihood of obtaining linear data would be even smaller.

            I don’t know how I can explain it in a different way.

          9. Hi Hellson,

            Let’s offer Freddie an analogy, to see if it makes it easier to see the shock of it.

            Suppose an associate of yours tests people’s reaction times after drinking (say) coffee versus wine versus control (water). Suppose he reports that it is faster with coffee and slower with wine. It is 2.00 seconds faster with coffee and 2.00 seconds slower with wine. He measured each subject to the nearest 1 ms, and the individual subjects showed wide variation in the effects of coffee versus wine, but it averaged out very neatly like that. You might say “what an amazing coincidence, how funny!”

            Then another study of another pair of superficially opposite interventions, but which have no reason to be exactly the same size, e.g. listening to opera versus listening to rap music. He reports effect sizes of +3.20 s for one and -3.20 s for another. You might say “Wow! Where did you get these two perfectly counteracting pieces of music from? Each subject had different-sized effects from the two pieces but it all cancelled out so the averages were perfectly equal? Amazing! You are the Bermuda triangle of science. You should enter the lottery, you would win every time!”

            But then it happens again and again. Each time, opposite (OK) but perfectly equal (Hmmmmm) effects from interventions that have no reason in principle to be exactly equal in effect size. You might start to distance yourself from him, or if you are a very good friend, start having some firm conversations with him.

            You look deeper into his data and things get worse. When you look amongst the Men, that equal-and-opposite thing doesn’t work at all. Nor in the women. Only when you merge them together.

            That is when you know how it all came to be like this.

            If you can’t persuade him to come clean, the only thing you can do as a friend is go on Retraction Watch and say how nice a chap he is, which is quite probably true. Most people are indeed nice at a human level.

            (My post is about a fictional analogy, and is not data from the Forster query. The idea is to show people what is meant by it all being too linear, and why proposing an excuse that the underlying data is not linear does not help the defendant.)

          10. Suppose the probability of finding these (or more linear) data in one’s samples is p under the hypothesis that the population distributions conform to the linearity hypothesis. Then the probability of finding these (or more linear) data will always be smaller than p if the population distributions do not conform to the linearity hypothesis. So if some curvilinear or ordinal or other hypothesis is true, the probability of finding these (or more linear) data is necessarily *smaller* than 1/508,000,000,000,000,000,000. It *cannot* be larger.

            You can think of it this way. Suppose that a researcher ventures the hypothesis that 50% of all swans are black. The researcher presents evidence in the form of 40 independent samples of 50 swans each, so involving a total of 2000 swans, and states that in every one of these samples the percentage of black swans equals exactly 50%. So in every one of the 40 samples, exactly 25 out of 50 swans were black. Assuming that 50% of the swans are in fact black in the population, the probability of finding exactly 25 out 50 to be black in 40 consecutive samples is astronomically small. This is the type of p-value that the report mentions.

            In the above analogy, your question would be: but what if the researcher’s hypothesis were not true, and some other hypothesis were true, e.g. 51% or 20% or 80% of the swans were black? You can now sense that the probability of finding 40 samples with n=50 in which exactly 25 swans are black will always be larger if in the population 50% of swans are black than if some other hypothesis holds.

            So the p-value in the report is conditional on the best possible scenario for Dr. Forster. It gives the likelihood of these or more consistent results (finding in every sample that exactly 50% of swans are black), if Dr. Forster’s theory is correct (50% of swans are actually black). Under all other scenarios (the actual percentage of black swans in the population is not 50%), the likelihood of these or more consistent results is thus smaller than 1/508,000,000,000,000,000,000.
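            For anyone who would rather check the swan arithmetic than take it on trust, the two numbers behind the analogy can be computed directly (a quick SciPy check of the example above, not a figure from the report):

              from scipy.stats import binom

              p_one_sample = binom.pmf(25, 50, 0.5)    # exactly 25 black swans out of 50
              p_forty_in_a_row = p_one_sample ** 40    # the same exact split, 40 samples running

              print(p_one_sample)        # about 0.11: unremarkable for a single sample
              print(p_forty_in_a_row)    # roughly 1e-38: astronomically small, even if 50% is true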

  37. Richard Gill wrote “If I were the LOWI I would recommend the Humboldt Foundation to give Förster his [five million euro for five years] grant and insist on spending at least half of it (the first half) on careful, and carefully monitored, replication of these experiments.”

    I am not convinced this would be the most logical step.

    1) this would reward, rather than punish, a researcher who beyond reasonable doubt has committed—or at the very least was in the end responsible for—scientific misconduct.

    2) cynics could argue that a week (not a few years) is more than enough to manufacture new data that completely replicates the results presented in the three questionable papers, and this time shows no statistically significant linearity effect, consistency across experiments, or other anomaly.

  38. Richard Gill is spot-on in observing that fictitious relationships can arise not only from data fabrication but from other Questionable Research Practices which may be more common than widely assumed, and whose adverse consequences may be seriously underestimated.

    For example, in a study of correlation between variables, selecting subjects who fit the hypothesis can generate powerful correlations out of nothing, more easily than commonly supposed (http://www.ncbi.nlm.nih.gov/pubmed/22285446). But even without selecting patients, selection amongst multiple slightly different estimates of a variable in each patient can even more powerfully generate correlations of 0.8 or 0.9 or more (a toy simulation follows at the end of this comment). In my field some element of this is sadly rather common. When a scatter plot is shown, this may reveal a characteristic shape to the data, with outliers either sliced off with a straight-line margin, or with dents near the middle of the dataset like an apple core. I have called these phenomena the “Enron Shave” and “Enron Bite” and provided spreadsheets to test for the effect.

    For t-tests between 2 groups, unblinded selection amongst multiple values can also have this effect and markedly exaggerate differences, or synthesise differences between groups when in reality there aren’t any. In a scatter plot it produces an effect that might be called “Enron Kissing Tadpoles”, because the data clusters have heads that are touching and tails pointing away from each other, a pattern that does not arise naturally. Again this can be tested for (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0065323).

    The excessive linearity in these social psychology studies could, similarly, arise from a process of selection, albeit with 3 groups rather than 2, and might produce the same pattern if the raw data points could be obtained. The F test p values described above in this discussion are a more sophisticated way of addressing this phenomenon: the data do not contain enough variability to have arisen by natural processes.

    Richard Gill’s commentaries here have been very comprehensive and illuminating. My only additional thought is that QRPs such as selecting one trial amongst many for publication, or one subject (or one measurement) instead of others for inclusion in a study, should perhaps in future not be accepted as innocent.

    In cardiology we are particularly vulnerable to this sort of inadvertent manipulation because of the nature of our specialty. When we make a measurement of cardiac function in clinical practice, we conventionally pick whichever heartbeat we like from the large number that are available. Nobody tells us that doing so in research practice, without blinding to study arm or the other variable, produces the wrong answer.

    We are trying to spread the message in our specialty (http://www.ncbi.nlm.nih.gov/pubmed/22824106) but wonder whether the same problems may be afflicting other fields too?
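    To illustrate how powerful unblinded selection among repeated measurements can be, here is a toy simulation (Python/NumPy, my own construction rather than anything from the papers cited above): the outcome is genuinely unrelated to the predictor, but an analyst who picks, for each subject, the replicate that best fits the hypothesised relationship manufactures a strong correlation out of pure noise.

      import numpy as np

      rng = np.random.default_rng(3)

      n_subjects, n_repeats = 50, 10
      x = rng.normal(0, 1, n_subjects)                   # predictor, e.g. dose
      # outcome truly unrelated to x, measured n_repeats times with noise
      y_reps = rng.normal(0, 1, (n_subjects, n_repeats))

      # blinded analyst: take the first replicate for every subject
      y_blind = y_reps[:, 0]

      # unblinded analyst: for each subject keep the replicate that best fits the
      # hypothesised positive relationship (here: the one closest to x itself)
      pick = np.abs(y_reps - x[:, None]).argmin(axis=1)
      y_cherry = y_reps[np.arange(n_subjects), pick]

      print("r, blinded choice:", round(float(np.corrcoef(x, y_blind)[0, 1]), 2))
      print("r, cherry-picked: ", round(float(np.corrcoef(x, y_cherry)[0, 1]), 2))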

    1. Dear Darrel,
      Thanks for pointing us to your important work on the topic.
      From the outside, however, I really do not see the difference between QRPs (when QRPs mean selecting subsets of subjects/patients) and deliberate fraud.
      A “finding” based on cherry picking a few data points is similarly useless as a “finding” based on fabrication.

      1. Exactly.

        A statue carved out of wood can be just as artistic as one built up from layers of clay, even though in the former case the artist can be argued not to have “added” anything.

        Likewise, patterns arising from removal of data points can be just as wrong as those arising from adding of fictional data points, even though in the former case the scientist can claim not to have “added” anything and the full audit trail of the raw data will be available: the only thing fake is the pattern.

        Do other Retraction Watchers agree that we should upgrade the status of cherry-picking from QRP to fraud?

        1. I think it should be recognised that there are shades of grey. Removing an outlier can be done for decent reasons, or bad ones. Forcing cherry-picking to be either labeled as QRP or fraud (or acceptable practice!) is bad either way. However, I completely share the feeling in this subthread that it is a bad development that QRPs seem to be more and more regarded as “innocent” practices. The level of cherry-picking or data-massaging that would have to take place to reach the results in this case is so huge that one can no longer argue, in my opinion, that they were “innocent”. Perhaps one might argue that the author thought “others do the same”, but that is not a defense. You do not need much statistical background to understand that repeated selection of “best” subsets of participants is utterly biased.

          Another aspect that worries me is that many feel it is the duty of the community to come up with scenarios that might explain the observed patterns in terms of QRPs rather than fraud. In my view, the report contains sufficient convincing evidence for fraud. Of course, that is open to counter-evidence, but that should be provided by the accused, possibly assisted by some (willing) peers. It is not the duty of all (unwilling) peers to come to the rescue. At the same time, I don’t expect him to read RW, or reply here, so I am very curious as to what he has to say for himself through the proper channels, hopefully soon.

    2. The PubMed 22824106 article looks interesting, but unfortunately behind a paywall. Any preprint available?

        1. Thank you Prof. Francis. Well done paper – I will be sharing it with colleagues in the lab I work in.

          For others reading this thread, I recommend this paper as a good read as to how many ways there are to fall prey to Questionable Research Practices. “No fabrication or falsification is required for these exaggerated or false results: they arise spontaneously when poor reproducibility combines with compromised blinding and a researcher’s prior belief.” It’s these types of phenomena that yield so many irreproducible results in the literature, even by earnest well meaning scientists.

          Beyond teaching biology students how to do a t-test, statistical courses should focus more on ideas presented in this paper, to hammer home for budding scientists the point that science is carried out by people, and people are subject to several unfortunate tendencies that interfere with Good Research Practices.

  39. What I would like to see is that ALL papers of Jens Foerster are investigated.
    As was done in, e.g., the Stapel case.

  40. Reading the discussion with great interest. I arrived a couple days late, and didn’t get completely through yet.

    I do like to take this opportunity to ask for a reference/source on issues associated with (large proportions of) F < 1. As it happens, I commented on a load of F < 1’s just last week when discussing an (otherwise unrelated) paper, yet my colleagues did not share this concern or consider it odd in any way. Now that a similar concern pops up in the complainant’s ‘report’, I’d like to share a source that backs this concern, yet I have trouble finding one (googling f < 1 doesn’t work well). Is anyone familiar with a more ‘to the point’ exposé than this one (http://www.psycontent.com/content/t0g1460240217300/), which is a rather steep read?

    1. My intuitive argument would be that, under the null hypothesis, the variance-per-degree-of-freedom in the numerator and denominator of the F-ratio should be about equal, so given stochastic variations it is roughly equally likely for F to be larger than or smaller than 1. It doesn’t literally have to hold exactly, but as a rule of thumb that should be the case. That means that lots of F < 1 values, with hardly any F > 1, is itself unlikely under the null. (The same would of course hold for lots of F > 1 but not F < 1.)
      That is not the "more 'to the point' exposé" that you requested, perhaps, but common sense works for me.

      1. This is true. The assumptions of regression and ANOVA lead to an expectation of F = 1 if the null hypothesis is true. So tests expected to have a null effect might reasonably have test statistics that cluster around F = 1.

        There is, however, a wrinkle: this assumes that the data are continuous. Most of the data we are discussing are discrete (e.g., rating scales), and this makes distributions of F statistics for true null effects behave differently – particularly in small samples. For non-null effects the discrete scores don’t tend to be so much of a problem. Ceiling and floor effects can also impact the distribution of statistics for true null effects.

        Without detailed simulations this sort of pattern is merely suggestive of a too good to be true pattern. I was more struck by the F ratios all hovering around the same value (though this is a side-effect of the linearity of the means).
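        For anyone who wants numbers rather than intuition, both points are easy to check with a small null simulation (a sketch of my own in Python/SciPy, with made-up sample sizes): it reports the average F and the share of F-values below 1 when the null is true, for continuous data and for a coarse 7-point rating scale.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(4)

          def null_f_values(draw, n=20, groups=3, reps=20000):
              # One-way ANOVA F statistics when all group means are truly equal.
              fs = np.empty(reps)
              for i in range(reps):
                  fs[i], _ = stats.f_oneway(*[draw(n) for _ in range(groups)])
              return fs

          continuous = null_f_values(lambda n: rng.normal(0, 1, n))
          seven_point = null_f_values(lambda n: rng.integers(1, 8, n))  # 1..7 rating scale

          for label, fs in [("continuous   ", continuous), ("7-point scale", seven_point)]:
              print(label, " mean F =", round(float(fs.mean()), 2),
                    " share of F < 1 =", round(float((fs < 1).mean()), 2))

        The numbers only speak to the marginal distribution of F under the null; as noted above, whether a particular collection of F-ratios is “too good to be true” still needs a simulation tailored to the design in question.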

        1. Thank you both! I kinda do understand the ‘problem’, it’s more that I am looking for a source to also convince my colleagues. I’ll go with the Voelkle paper, their recommendations are pretty clear. Further suggestions remain welcome.

  41. There’s a very important clue that’s been overlooked here: in “Sense Creative” Forster says that

    “for the creative task, we handed out a cartoon picture with a dog sitting on a sofa and asked participants to find the most creative title for it… Four experts who were blind to conditions later rated the creativity on a 7-point scale”

    In other words, these four experts reportedly saw the raw data (for five experiments, numbered 6-10, in the 2012 paper). I think this makes these four the only other people who, by Forster’s account, ever saw any of the raw data (his coauthor didn’t).

    If one or more of them were to name themselves and testify that they were indeed involved, it would be a powerful point in Forster’s favor. If Forster is unable to get any of them to testify, that would look suspicious.

    1. @Neuroskeptic: The problem is that the people that were involved in rating the creativity task might not know that they were rating the task for the paper(s) in question. As I have stated before, I did work for Jens Förster some time back. And I also did rate creativity tasks such as the one described above. But I usually did not know for which studies I made these ratings (which has the advantage that I also did not know the specific hypotheses as Joop has pointed out).

      I do not know what Jens’ lab looked like in Amsterdam (though as I understand it the data might have been collected in Bremen). During my time in Jens’ lab in Bremen, the lab had 16 student assistants. Granted, not all were involved with rating creativity tasks; in particular, only native German speakers were trained to rate those (for obvious reasons). However, given that there was some fluctuation due to students joining or leaving the team, this still leaves quite a large number of people who might potentially have been involved.

      As I said below, Jens has generally been very good at acknowledging the students involved in his research. But these acknowledgments have usually been fairly general, which is okay with me. In a lab with a large number of student assistants who help with collecting and rating data from a large number of studies, it becomes very hard to track exactly who did what and when.

      Maybe it should nonetheless be documented more carefully which assistant does what, and when. But I can understand that Jens might not have done this (or even thought about doing this) in the past, and I doubt that things are much better in other labs. Documenting these things would require more money and resources – money and resources that are hard to come by. If it is not common practice to have assistants who take care of documenting and archiving data, you will have a hard time convincing your institution or the third party that finances your research to give you the necessary money and/or to provide you with the necessary infrastructure. And considering that Jens already runs an – in my opinion – extraordinarily efficient lab, he would probably not have been able to free up the necessary resources by restructuring.

      1. And just to be clear: I am not sure if Jens cannot track the names of the assistants involved. I just wanted to explain why this might be the case.

    1. Now that I have read it: The LOWI report contains some new information in addition to what we learned from the 2012 accusatory report.

      The most interesting thing is that the perfect linearity (in itself already impossible) is not present in separate subgroup analyses of male and female participants. This is further proof of data manipulation: There is no linearity in males, nor in females, but the male means and female means compensate each other perfectly to create perfect linearity in the overall means.

      This leaves no other explanation than that Forster calculated the overall means first, then decided on the differences between the (overall) means that he wanted to report, and in that way determined the constants that he needed to add to the individual scores to create the desired differences. However, because the constants were derived from the overall means and added uniformly within each condition, any non-linearity already present in the male and female subgroups is left intact, so we do not see linearity in the subgroup analyses.
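
      A minimal R sketch (my own illustration, with made-up numbers) of that hypothesised mechanism: shift every score in a condition by one constant so that the overall condition means fall exactly on a line, and then look at the male and female means separately:

      set.seed(2)
      n <- 30                                    # participants per condition (made up)
      sex <- rep(c("m", "f"), each = n / 2)
      raw <- list(low = rnorm(n, 3.0, 1), mid = rnorm(n, 3.3, 1), high = rnorm(n, 3.4, 1))

      target <- c(2.5, 3.5, 4.5)                 # desired, perfectly linear overall means
      shift  <- target - sapply(raw, mean)       # one constant per condition
      tweaked <- Map(`+`, raw, shift)            # add that constant to every individual score

      sapply(tweaked, mean)                             # overall means: exactly linear
      sapply(tweaked, function(x) tapply(x, sex, mean)) # subgroup means: generally not linear

      Because the same constant is added to men and women alike, any non-linearity already present within the subgroups survives untouched, even though the overall means are forced onto a perfect line.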

      I find this particular case very interesting, but of course it is very bad news for the field of social psychology. It also is terribly sad, especially for everyone who came to Forster’s defense. He should come clean now.

      1. That’s funny. For the combined analysis, Forster is accused of too much consistency (too good to be true). Now, the diminished consistency in the subgroups is also held against him.

        1. But not on its own account. What is strange is the lack of linearity in the subgroups against the background of linearity in the overall sample.

        2. The point is that the (expected) variation in the subgroups shows that nothing in the nature of the data would itself give rise to such weird patterns, and yet the (unexpected) lack of variation in the entire dataset was observed all the same. The explanation would be that the manipulation of the data took place on the basis of the overall means (which were tweaked to be linear), but that the manipulation was not carried out in all subgroups separately. It would be pretty hard, if not impossible, to tweak data such that they would be super-linear under any type of subgrouping; tweaking the whole set is easier (but foolish).

          1. It would be possible to subgroup-tweak, surely. You’d just tweak the males to be linear, then tweak the females to also be linear (perhaps with a different ‘slope’) and then combine the two. The mean will be linear because both of the subgroups are (assuming equal cell sizes).

          2. Yes, you could, provided you know I will subgroup based on gender. But I could also subgroup based on age intervals, handedness, odd/even subject numbers, or even random assignment. I believe gender was just chosen as an arbitrary criterion to form subgroups in that stratified analysis. It would be very difficult (impossible, even) to make all of those subgroups show “ultra-linear” behavior. This shows that the observed anomaly is not just a feature of the data itself (e.g., a result of the discreteness and limited range of rating scores). The fact that the anomaly only exists for the dataset as a whole (which – in contrast to the subgroups, I believe – is what was reported on) further suggests that the data were manipulated intentionally.
            [this was meant as a reply to Freddie’s comment above]

      2. The only question – in this hypothetical scenario – is why would you choose the constants so that they made the means end up as almost, but not 100%, linear? Was the remaining nonlinearity intentional, a way to avoid looking ‘too good to be true’? If so, surely you would set the remainder higher. Then it would be undetectable.

        Or was it unintentional – perhaps due to rounding errors?

        1. This “near-perfection” *could* be explained by the discreteness of ratings, perhaps? The rounding errors you suggest.
          Consider this small thought experiment: suppose I have three groups of e.g. 7, 6, and 8 subjects. Say the ratings of subjects in group 1 add to 15, i.e. average 15/7 = 2.14.., and those in group 2 add to 19, i.e. average 19/6 = 3.16.. Then I would “like” the mean of group 3 to be 88/21 to form the linear progression of averages 2.14.., 3.16.., 4.19. Unfortunately, the 8 subjects would then have to score a total of 33.52 points. Tweaking integer ratings in group 3, the best I can arrive at is either 33/8 = 4.12.. or 34/8 = 4.25, which is slightly off perfect linearity. For larger groups the possible match would be better, but likely still not perfect. So, I suppose, changing a few numbers here and there in the spreadsheet and then misplacing the original forms would work. I am not saying it happened like that; but in the absence of a credible explanation it is at least imaginable that it happened like that. Of course, this is a very stupid and lazy way of tweaking data. However, I can’t think of any better mechanism that might be likely to generate these data.
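
          For what it is worth, a quick R check (my own) of the numbers in this thought experiment:

          m1 <- 15 / 7                # group 1: 7 subjects, ratings summing to 15
          m2 <- 19 / 6                # group 2: 6 subjects, ratings summing to 19
          m3 <- 2 * m2 - m1           # group 3 mean needed for a perfectly linear progression
          c(m1, m2, m3)               # 2.142857, 3.166667, 4.190476 (= 88/21)
          m3 * 8                      # required total for 8 subjects: 33.52..., not an integer
          c(33, 34) / 8               # nearest achievable means with integer ratings: 4.125, 4.25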

  42. I posted this concern over at Neuroskeptic’s blog before I came over to read this discussion, but I’ll post it again since it is relevant. A lot of commenters here are claiming that the p value is computed under the hypothesis most favorable to Foerster, but this is not true. The test here is entirely post hoc – that is, the data drove the test because the data looked too linear. The null hypothesis assumes that this is not the case, and the control sample was *not* selected for its linearity. If you compare a sample selected as suspicious for its linearity against another sample not selected for its linearity, of course you’re going to find differences. This makes the test a post hoc test, and the computed “naive” p value a meaningless number. I agree, it looks mighty suspicious, but any null hypothesis significance test without a correction for the post hoc nature of the test is completely invalid.
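
    To make the worry concrete, here is a rough R sketch (my own, and only a simplified stand-in for the test actually used in the report): simulate many experiments whose population means really are linear, compute for each a left-tailed 'too linear' p-value from the quadratic-contrast F, and then look only at the single most linear experiment:

    set.seed(4)
    k <- 3; n <- 30; nexp <- 1000
    g <- rep(1:k, each = n)
    p_left <- replicate(nexp, {
      y <- rnorm(k * n, mean = rep(c(2, 3, 4), each = n))  # truly linear population means
      m <- tapply(y, g, mean)
      msw <- mean(tapply(y, g, var))                       # MS_within (equal group sizes)
      Fnl <- (n * (m[1] - 2 * m[2] + m[3])^2 / 6) / msw    # nonlinearity (quadratic) F, 1 df
      pf(Fnl, 1, k * (n - 1))                              # left tail: small means 'too linear'
    })
    min(p_left)    # the cherry-picked study looks extreme on its own: around 1/nexp
    mean(p_left)   # yet over all studies the p-values are roughly uniform (mean near 0.5)

    (Whether selection of this kind could plausibly account for dozens of experiments and combined p-values of the order reported is a separate question; the sketch only illustrates the selection effect itself.)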

  43. Helen Arbib, thanks for your comments.

    The LOWI-report is also available, see https://www.knaw.nl/shared/resources/thematisch/bestanden/LOWIadvies2014nr5.pdf

    The view of LOWI with regard to the raw data that had been dumped by Jens Förster (quoted here in English translation):

    “The requirements in the Netherlands Code of Conduct for Scientific Practice 2004/2012 and the applicable Rules of the American Psychological Association (APA) that research data must be retained for verification are clear enough. They are not set aside by the fact that colleagues likewise dumped their raw data when moving, or by the fact that not all research psychologists have strictly followed the APA rules over the past ten years. This complaint was rightly raised and is therefore well founded, and will be taken into account under point 5 of the LOWI’s judgment below.”

    “5.1 The LOWI considers the conclusion that research data must have been manipulated to be inescapable. (…) On this basis, and additionally on the basis of the inadequate accounting for the data collection and for the original data, there has been a violation of scientific integrity;”

    —————————————————————————

    The LOWI report also mentions the suggestion of one of LOWI’s advisors that the scientific community should get access to the complainant’s report, so that scientists could discuss its findings with each other in public. He also advised that this report be published in a journal.

    (“Moreover, none of these QRPs could offer an explanation for the highly improbable linearity in the outcomes found under the ‘local’ and ‘global’ conditions and in the control group. Asked about the Complainant’s criticism of his QRP explanation, … advised in his conversation with the LOWI to bring the discussion about the possible explanations for the improbable patterns before the scientific forum, to publish the Complainant’s analysis in the journal concerned, and to inform the scientific community of the peculiar patterns that had been found.”)

  44. — Case closed —

    GJ has made a dynamite observation (search for “create perfect linearity” above). I am afraid I now have to “retract” the discussion with etb, Dave Langers, et al, about whether this escapade could have arisen from extreme cherrypicking, and whether that would be misconduct or only moderate naughtiness. It’s way beyond that now.

    This is not bad for social psychology, only for the public relations image of it. The actual science is much improved because we now have identified a body of work as unreliable and can cancel it from our thoughts. Retraction Watch is likely to be the most up to date forum because journals will take ages to retract the papers.

    That Prof Liberman has taught on the data, and has re-read the papers and found nothing wrong, sadly doesn’t contribute a significant defence. There are papers that have been read and cited by thousands of people, which make absolutely no sense, and nobody notices until it is pointed out (http://www.sciencedirect.com/science/article/pii/S0167527313008012). I use one as a test of observance in junior staff who want to do a PhD with me (http://tinyurl.com/DOH-test). Most people see <10% of the problems. I saw only 2% when I first looked, as is embarrassingly obvious from our skeptical letter to the journal.

    Dave Langers pointed out, "It is not our job to make up possible excuses, discuss which is least severe, for Förster to then pick that one in his defense." But with GJ's revelation, it's worse than that now: none of the excuses any of us scientists have proposed here are possible any more.

    Sadly, it is time to employ a person from another profession – the one which specialises in making up excuses…

  45. Some other remarkable items in the LOWI report ( https://www.knaw.nl/shared/resources/thematisch/bestanden/LOWIadvies2014nr5.pdf ), quoted here in English translation:

    “The Defendant has not made sufficiently clear – not even before the LOWI – how his data sets came into being, and where and when exactly the experiments reported in the … articles were carried out.”

    Paraphrased: “Jens Förster was unable to reveal to LOWI when and where quite a few of his experiments were conducted and how he had composed quite a few of the available data files (SPSS files).”

    “Many of the experiments reported in the … articles were carried out in … and not exclusively in …. That the co-author … cannot remember a number of the experiments carried out in … is, according to the Defendant, well explained both by the time that has passed since these experiments were conducted and by the fact that many of the experiments resembled one another.”

    Paraphrased: Jens Förster states that quite a few of the experiments used for 2 or 3 of the papers were not only conducted at site Z (UvA?), but also at site Y (Bremen?). Markus Denzler was unable to recall some of the experiments conducted in (Amsterdam?). Jens Förster states that this is plausible, given the high similarity of several of the experiments and the amount of time since these experiments were conducted.

    “The CWI has, however, followed the advice of the expert …, that it cannot be ruled out that QRPs are involved, which in this scientific field are, in the opinion of this expert, common, if not ‘prevalent’. The strict requirements of good scientific research have not been met, namely that adequate account is given of, among other things, the research protocols followed and laid down in advance, of data processing and of statistical analysis (see Report to the CWI, … …).”

    Paraphrased: Professor Unknown (asked by the University of Amsterdam during the first procedure) did not want to exclude QRPs as an explanation of the findings in the papers of Jens Förster. According to him, such QRPs were common, if not prevalent, among scientists working in this field. He also reported that Jens Förster had not worked according to strict procedures with regard to data collection, data processing and statistical analysis.

  46. A colleague of mine hinted at relabeling subjects as a way to produce the findings of the accusers’ report. This could explain both the linear patterns and the way-too-small variances of the means: suppose you collect your data, discard the group membership, rank-order the scores, then assign the Low group label to the lowest score, the Mid label to the second-lowest, the High label to the third-lowest, then Low to the next score, Mid to the next, High to the next, and so on… That will result in these linear patterns and also in way-too-small variances… Here’s a small R script that simulates the situation: https://dl.dropboxusercontent.com/u/609029/Forster/Simulation_Forster.R

    # Example relabeling with sorting of subjects: discard the true group labels,
    # rank-order all k*n scores and deal them out to the Low/Mid/High groups in turn
    nstudies = 16   # number of simulated studies
    n = 45          # subjects per group
    k = 3           # number of groups
    sim = replicate(nstudies, {
      res = apply(matrix(sort(rnorm(k*n)), , k, byrow = TRUE), 2,
                  function(x) c(mean(x), var(x)))
      list(m = res[1, ], v = mean(res[2, ]))
    })
    layout(matrix(1:nstudies, nstudies^0.5))
    sapply(sim[1, ], plot, type = 'b') # linear patterns

    Fvals = n*sapply(sim[1, ], var) / unlist(sim[2, ]) # ANOVA F = MS_between / MS_within
    1 - pf(Fvals, k-1, k*(n-1)) # way too small p-values

    # Example relabeling *without* sorting subjects (i.e., group labels stay random)
    nstudies = 16
    n = 45
    k = 3
    sim = replicate(nstudies, {
      res = apply(matrix(rnorm(k*n), , k, byrow = TRUE), 2,
                  function(x) c(mean(x), var(x)))
      list(m = res[1, ], v = mean(res[2, ]))
    })
    layout(matrix(1:nstudies, nstudies^0.5))
    sapply(sim[1, ], plot, type = 'b') # no systematic patterns

    Fvals = n*sapply(sim[1, ], var) / unlist(sim[2, ]) # ANOVA F = MS_between / MS_within
    1 - pf(Fvals, k-1, k*(n-1)) # normal p-values

    1. where it says “way too small p-values” it should say “way too large p-values”

        1. I think Raoul was withdrawing his 8:56 remark about p values.

          I, too, think the relabelling theory is good. In fact (sorry to be boring) my colleague and I wrote a paper about the 3 evil “R”s:
          _R_emove (delete subjects who don’t fit)
          _R_emeasure (throw the dice again until you find what you want)
          _R_eclassify (relabel)

          http://tinyurl.com/false-effects-from-true-data

          All 3 can generate false effects and could give group means that are less variable than if they arose from unmanipulated data.
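
          A minimal R sketch (my own, not taken from that paper) of the first of these, showing how removing subjects who 'don't fit' can conjure an effect out of pure noise:

          set.seed(3)
          n <- 40
          a <- rnorm(n); b <- rnorm(n)        # two groups with identical true means (no effect)
          t.test(a, b)$p.value                # honest test: usually non-significant

          a2 <- a[a > quantile(a, 0.25)]      # "Remove": drop the lowest quarter of group A
          b2 <- b[b < quantile(b, 0.75)]      # ... and the highest quarter of group B
          t.test(a2, b2)$p.value              # a spurious "effect" now tends to appear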

  47. I’m not a statistician, but reading the article carefully as a social psychologist, I’m surprised this article wasn’t flagged by peers earlier. It’s basically saying that priming people to think ‘globally’ or ‘locally’, via manipulations that apparently work amazingly well, leads to increases in cognitive performance of around 1.5 standard deviations. As the report that led to the investigation says:

    “The cognitive test used in experiments 6-10b has only four items, yet the effect sizes are around d = 1.5, which represent very large effects given the expected low reliability of the scale.”

    Although this was not an official IQ test, these kinds of reasoning tasks tend to correlate quite strongly with each other and with IQ scores. An effect of 1.5 standard deviations would translate into an increase of 1.5 * 15 = 22.5 IQ points, which seems absolutely ridiculous to me. And he did not find it once, but 7 times! If you look at the literature on one of the most extensively investigated ‘experimental effects’ on cognitive tasks, stereotype threat, you’ll find that effect sizes lie around d = 0.3.

    If people’s cognitive performance could truly be improved by 1.5 standard deviations by some kind of experimental manipulation, it should have been big news. If people’s cognitive performance could truly be improved by 1.5 standard deviations just by ‘smelling locally’ rather than ‘smelling globally’, or by ‘tasting locally’ rather than ‘tasting globally’, or even ‘touching locally’ rather than ‘touching globally’, etc., then these results should have been world news, and school systems all over the world should have been drastically reformed. So I wonder: why didn’t anyone take these results as seriously as they would deserve if true?

    Regarding the other dependent variable, creativity, I’m surprised about the effects too. Participants were shown some kind of simple drawing and were asked to come up with a title for it. Four ‘experts’ rated how ‘creative’ each title was. The most striking result there, to me, is that the interrater reliabilities were extremely high in all studies (all Cronbach’s alphas > .85), which is uncommon in creativity tasks.

    So, just looking at the content of the paper makes me suspect that something odd is going on. This in itself may be no reason to draw conclusions about the veracity of the results, but the combination with all the vagaries about the loss of data does make it highly suspicious to me. For example, I don’t see how downsizing in office space forces you to remove raw data files from your computer. Forster says he had to chuck out the raw (paper?) questionnaires because he was moving to a smaller office, but if you read the paper you’ll see that almost all of the data was collected by means of computers. It does not even mention any paper questionnaires.

    Even without statistical evidence that these results are unlikely, I’m convinced that the results, found in all 42 experiments, are too good to be true.

  48. Hi there, there is a chilly atmosphere in this room. I do not think that this is representative..
    Formerly, many replications were a good thing. Now, it seems to raise suspicion.
    But isn’t it amazing that the effects replicate? Does anybody try to understand why in this case linearity repeats? Maybe this is part of the phenomenon? What do we know now?
    The studies were presented at a meeting, I vaguely remember, it must have been 2003 or even earlier..
    First, of course we were very surprised about the huge effects. However, the effects even replicated in a controlled setting when we did 6 of the studies shortly later. Among them were the heute/tagesschau studies and the creativity studies. 5 replicated. 4 of them with straight lines like in the papers. One was not significant. I would call this a success.
    I do believe that there is something that we should understand about the effects. We should try to understand the psychology and not only the statistics.

    1. The data are too linear to be real not by some abstract standard but given the level of variance within each group, in the published papers.

      If the psychological effects were very linear, the variances within each group would be small. Then the linearity of the means would be expected.

      But in these papers the variance within group was quite high, yet somehow, despite each datapoint being random over quite a wide range, the means always came out as linear.

      For that to be real it would not involve psychology but parapsychology, because it would imply that if one participant happens to be uncreative (say), the next participant would become more creative to compensate… despite them never having met…and perhaps it would also have to work backwards in time!

    2. > Formerly, many replications were a good thing. Now, it seems to raise suspicion.

      No. Nobody here has argued against the merits of independent replications or put forth the idea that effects that actually replicate should automatically raise suspicions.

      > But isn’t it amazing that the effects replicate?

      They do? Honest question: care to cite?

      > Does anybody try to understand why in this case linearity repeats?

      The point isn’t the linearity of the means in and of itself. Linearity might well be a property of the underlying phenomenon – if it exists – in the sense that it would not violate any natural laws. Is the linearity plausible to assume? No, and in fact Förster himself denies *ever even noticing* the linearity. And why would vastly different and sometimes very weak manipulations to induce local vs global processing modes all reveal essentially identical, equally strong, equally linear effects?

      None of that is the point, though. The point is that *even under the assumption of a strong linear relationship in the population*, Förster’s data are virtually impossible to have arisen from proper random sampling. They don’t vary as much as one must expect when drawing a (small) sample from a large population. Just look at the standard errors.

      > The studies were presented at a meeting, I vaguely remember, it must have been 2003 or even earlier..

      That’s interesting. Given that the main publication under consideration here dates from 2012, the others from 2009 and 2011, that is a baffling publication lag for such amazing results. Thoughts? Comments? Also: what meeting? Where?

      > First, of course we were very surprised about the huge effects. However, the effects even replicated in a controlled setting when we did 6 of the studies shortly later.

      Who is “we”? Also, where are the data for those six studies? Are they (still) available? Have articles involving those data been published? If not, are there manuscripts?

      > Among them were the heute/tagesschau studies and the creativity studies. 5 replicated. 4 of them with straight lines like in the papers.

      This is again in stark contrast to e.g. Förster’s claim that *no one ever noticed the linearity* — not him, not his colleagues, coauthors and collaborators, not any reviewers or research assistants. So, you *did* notice, right? Then

      > What do we know now? […] I do believe that there is something that we should understand about the effects. We should try to understand the psychology and not only the statistics.

      You’ve had — by your own account — more than nine years to make sense of it. So what are your conclusions?

      Thanks!

  49. To contradict some statements that have been made here: the investigators did not have the data. Their report specifically mentions that they obtained additional information from the authors for the 2012 paper (n and sd); everything else comes from the papers. If the data files were made available, that would certainly help explain what was done, which is probably why they aren’t being released. It would be extremely difficult to create data that matched the given results but did not show signs of manipulation. It was also mentioned somewhere that the original paper records were thrown out. So no consent forms, nothing. That does happen, but it is rather convenient in this case, and poor research conduct. In this type of study, often not all information from participants is transferred to computer files.

  50. Another interesting issue:

    In 2010 Forster published this theoretical article:
    http://www.tandfonline.com/doi/pdf/10.1080/1047840X.2010.487849

    Here (http://www.socolab.de/main.php?id=66) Forster writes the following about the experiments described in the article for which LOWI concluded that data manipulation must have occurred:

    “The series of experiments were run 1999 – 2008 in Germany, most of them Bremen, at Jacobs University; the specific dates of single experiments I do not know anymore”.

    So, when the theoretical article was published in 2010, all those 24 experiments described in the 2011 and 2012 articles had already been conducted?
