Förster report cites “unavoidable” conclusion of data manipulation

Jens Förster

Last week we wrote about the 2012 complaint that triggered the investigation into Jens Förster, the social psychologist at the University of Amsterdam whose work has come under scrutiny for possible fraud.

Now we have the findings of the official investigation by the Landelijk Orgaan Wetenschappelijke Integriteit (the Dutch National Board for Scientific Integrity, often referred to as LOWI), which clearly indicate that the institution believes Förster made up results.

Here are some of the highlights from the document, which we’ve had translated by a Dutch speaker:

“According to the LOWI, the conclusion that manipulation of the research data has taken place is unavoidable […] intervention must have taken place in the presentation of the results of the … experiment”

“The analyses of the expert … did add to [the Complainant’s report] that these linear patterns were specific for the primary analyses in the … article and did not show in subsets of the data, such as gender groups. [The expert stated that] only goal-oriented intervention in the entire dataset could have led this result.”

“[There is an] absence of any form of accountability of the raw data files, as these were collected with questionnaires, and [there is an] absence of a convincing account by the Accused of the way in which the data of the experiments in the previous period were collected and processed.”

“[T]he assistants were not aware of the goal and hypotheses of the experiments [and] the LOWI excludes the possibility that the many research assistants, who were involved in the data collection in these experiments, could have created such exceptional statistical relations.”

Read the whole report here in Dutch and English.

118 thoughts on “Förster report cites “unavoidable” conclusion of data manipulation”

    1. Parallel with the technical issues (if there are any), a discussion will develop about
      WHY the whole, yes really the WHOLE, German (science) journalism mise-en-scène snapped
      into a state of total media blackout for one full week!


      I will try to share the media coverage of the many thorny roads concerning this Förster case via:


      which is just a small partial study of the many
      issues involved as covered in:

      University Inc: http://pearltrees.com/p/2qZ1

      1. The German media was also rather mute on Smeesters and Stapel (and Sanna too for that matter). And it is constantly pretty insensitive to all the other indications of fraud floating around on RW every day (I mean outside of psychology).

        1. Yes, but Smeesters, Sanna … were not German and there were no prestigious German institutions involved. So this silence really is strange. *Today* Förster was to have received a 5 M Euro prestigious research grant and prestigious professorship…

          1. As a German scientist, I am very sensitive to the topic. But I must say that I am not surprised by the media. To the media, he might appear primarily as a Dutch scientist who has a paper under fire (occurs daily, as documented on RW). Also, the Humboldt grant is already postponed (and perhaps will be cancelled). Again, keep in mind that Stapel resulted in only very few articles in the German media. And compared to Stapel, the current state of the Förster affair may appear to be just not enough of a scandal.

          2. Perhaps, because this happens all over, not just in (social) psychology….

          3. Original statement by the Stiftung ( don’t forget that the foundation was set up by the government of the Federal Republic of Germany and is funded by the German Foreign Office, the Ministry of Education and Research and the Ministry of Economic Cooperation and Development ) somehow in the meantime, quite mysteriously, lost its translation:


            but I preserved the original

            (mirror) http://web.archive.org/web/20140502112116/http://www.humboldt-professur.de/de/nachrichten/stellungnahme-foerster

            There are more telling aberrations from the original.

          4. Interesting. I hope he won’t get away with it. They explicitly refer to a matter of disagreement between the university’s first decision and the one by the LOWI, which might work in his favor. I really don’t understand why his university doesn’t want to open a wider investigation. And what will happen to the other two “too good to be true” papers? I guess that simulations will also show that the results in these papers are unlikely to occur.

          5. It seems that the following text has been deleted from the final paragraph of the German:

            “In diesem Zusammenhang kritisiert die Humboldt-Stiftung nachdrücklich den Bruch der in diesem Verfahren vereinbarten und zu wahrenden Vertraulichkeit. Das am 29. April in anonymisierter Form auf der Website der Universität Amsterdam veröffentlichte Urteil war am selben Tag in den niederländischen Medien veröffentlicht und Herrn Professor Dr. Förster zugeordnet worden.”


            “In this connection, the Humboldt Foundation strongly criticises the breach of confidentiality that had been agreed and was to be observed during the proceedings. The decision that was released in an anonymised form on the website of the University of Amsterdam on 29 April was published in the Dutch media and linked to Professor Dr Förster on the same day.”

            Also gone is the bolded part of the sentence saying that “In light of the circumstances in question and the lack of agreement on their assessment – which apparently also exists in the Netherlands – the Foundation calls upon the media and the parties involved”

            What did that mean in the first place? Why is it gone?

          6. Excuse me very much, but there is no lack of agreement between UvA and LOWI. No way. See http://www.vsnu.nl/files/documenten/Wetenschapp.integriteit/2014%20UvA%20manipulatie%20onderzoeksgegevens.pdf

            UvA states in its final decision (22 April 2014): “The findings and recommendations of the LOWI, in so far as they are represented below, will be adopted: ‘The conclusion that research data must have been manipulated is considered unavoidable; the diversity found in the scores of the control group is so improbably small that this cannot be explained by sloppy science or QRP (questionable research practices); intervention must have taken place in the presentation of the results of the 19 experiments described in the 2012 article. Based on this and based also on the inadequate explanation regarding the data set and the original data, a violation of academic integrity can be said to have taken place.’”

            Maybe LOWI has decided to concentrate on one paper (Förster & Denzler 2012), because supplementary data were only available for this paper?

            “We acquired SDs and cell frequencies in all twelve studies from Dr. Förster via email. We also requested raw data but this was to no avail.” (report of the complainant about the Förster & Denzler 2012 paper, no details on supplementary information and/or requests without results for both other papers).

            “On … 2013 the accused indicated that he was still ill but has already sent the committee a ‘preliminary response’ and data.” (translated from the Dutch) Quote from the UvA report (preliminary findings), which states that Jens Förster has sent ‘data’ to the UvA committee.

            “After his acceptance, the LOWI sent him in December 2013 the file and the USB stick with the research data of the article from … by … and … …, published in ….” (translated from the Dutch) Quote from the LOWI report, which states that LOWI has sent supplementary information (data) of the Förster & Denzler (2012) paper to an expert.

            That’s all I can find about ‘supplementary data’ in the official documents which are publicly available.

            I fail to understand why NRC (a major Dutch newspaper) would not be allowed to publish the name of Jens Förster at the moment when the final decision of the UvA investigation was published in an anonymized version on the website of VSNU.

            I also fail to understand why it seems that the Alexander von Humboldt-Stiftung does not promote a post-publication peer-review of anything that has been published by Jens Förster.

          7. I suspect that they are hinting that at first UvA declined to go so far as to conclude deliberate manipulation. LOWI later concluded differently. So that’s a disagreement, of a sort.

  1. The University of Amsterdam should take the initiative to investigate the papers of Forster, at least those papers he wrote while working there.

    We, scientists, should know which results to believe… And such an investigation would also be beneficial for Forster, if he is completely innocent (although this may be hard to believe given the evidence) or even if he committed fraud in only 1 or a few papers.

    1. According to an article from the UU, the UvA has stated they are not planning on doing any follow-up investigation: http://www.dub.uu.nl/artikel/nieuws/duitsland-slikt-fraudebeschuldiging-uva-hoogleraar-niet-voor-zoete-koek.html

      “De Universiteit van Amsterdam beschouwt de zaak echter als gesloten. Een woordvoerder: “Als we nieuwe klachten krijgen zullen we die volgens procedure opnieuw beoordelen en onderzoeken. Maar vooralsnog zien we geen reden meer artikelen onder de loep te nemen.”

      translation: The University of Amsterdam considers the case closed. A spokesman: If we get new complaints we will go through the procedure of investigating those complaints but we currently do not see any reason to investigate any other articles.

      1. This certainly does not speak well of UvA! Whether the outcome would be positive or negative, doing so regarding his entire work in UvA would only be good for its reputation.

        The decision not to do so is not.

  2. I have a stupid question. I simply don’t understand the experiment. It seemed to be aimed at showing that there exists some intermediate level of sensory-analytic processing — apparently independent of the type of sensory input. The outputs of this cognitive module depend on whether the individual is looking at a gestalt or at details, and also correlates with creative vs analytical thinking. Is that correct? If so, what was the experimental manipulation used to get the subjects to focus on gestalt vs. details, or was this a test of some unforced innate tendency? Sorry to be dense.

    1. I confess I have not read this particular paper. However, the abstract seems to refer to some of the classic global/local attention manipulation researchers use in visual perception experiments (starting from the seminal work by Navon). These are well documented modes of operation of attention. I think he is claiming that these two modes affect the social domain as well. I’ve been suspicious of many social psych findings for a long time. Note, experimental psychology has a ton of really solid research, I’m referring to some of the fluffier social psychology findings that end up in Psych Science, for example. Many claims that perceptual phenomena extend into the social domain and affect general personality variables are bogus, in my opinion. At minimum, the effects are MUCH weaker than they are said to be in the published papers (generalized QRP?). So weak, indeed, that they do not replicate…

    2. I think the experiment looked at interactions, but this didn’t matter for the investigation of manipulation. They looked at one aspect of the data and clearly it had a lot less variability than would be expected by chance, and this was repeated for all sub-experiments. A reasonable assumption is that they have taken one group, duplicated and offset it to create the others and then added and subtracted almost identical interaction terms.

      As the investigators still have the data files it shouldn’t be too hard to work out exactly what was done, but they are unlikely to hand them over.

  3. I quite like the idea of subset analysis. If the trend is legitimately, biologically (or psychologically) linear when looking at the whole group, then it should also be linear or near linear –at least most of the time–when analyzing sub groups like gender or age or date of completion of the test. Good work.

    1. On this, I’m not so certain. Gender effects have been documented in multiple domains. If, e.g., women show a linear effect and men don’t, then the men’s data will reduce the effect when averaged, but the linear trend might remain. Clearly this is not what happened here, but I’m not so certain of the general validity of the general gender-subset approach.

    2. The problem with the data is not so much that the relationship that describes the three groups is so linear (although that admittedly could also be argued to be unlikely). The problem is that the level of linearity far exceeds what you would expect given the amount of noise that is present in the data. In any dataset with noise, there should be some random deviations from linearity. Those were not present in this (whole) dataset. If the subgroup data are linear too (in the sense that they do not reject the hypothesis that the effects are linear), then this is still no explanation for the *unexpectedly high degree of linearity* in the total group.
      Apart from that, I agree with Hamash that subgroups need not all be linear for the total group to behave linearly, in principle.
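The point about noise can be illustrated with a small simulation. This is only a sketch under invented assumptions, not the complainant's analysis: three groups of 20, true means exactly on a line (2, 3 and 4), normally distributed scores with a standard deviation of 1. It measures how far the control group's sample mean lands from the midpoint of the two outer sample means.

```python
import random
import statistics

def midpoint_gap(n_per_group, sd, rng):
    """Simulate low/control/high groups whose TRUE means are exactly linear
    and return |control mean - midpoint of the low and high means|."""
    low = [rng.gauss(2.0, sd) for _ in range(n_per_group)]
    control = [rng.gauss(3.0, sd) for _ in range(n_per_group)]
    high = [rng.gauss(4.0, sd) for _ in range(n_per_group)]
    midpoint = (statistics.fmean(low) + statistics.fmean(high)) / 2
    return abs(statistics.fmean(control) - midpoint)

def simulate_gaps(n_sims=5000, n_per_group=20, sd=1.0, seed=1):
    """Repeat the hypothetical experiment many times with a fixed seed."""
    rng = random.Random(seed)
    return [midpoint_gap(n_per_group, sd, rng) for _ in range(n_sims)]

gaps = simulate_gaps()
typical_gap = statistics.median(gaps)
# Share of simulated experiments that look almost perfectly linear.
share_tiny = sum(g < 0.01 for g in gaps) / len(gaps)

print(f"median gap: {typical_gap:.3f}")
print(f"share of runs with gap < 0.01: {share_tiny:.3f}")
```

Under these assumptions the gap has a standard deviation of about 0.27 (the square root of 1.5/20), so even with a perfectly linear true effect a single experiment rarely lands almost exactly on the line, and a whole series of experiments doing so is far less likely still.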

  4. Compare these two press releases by the Humboldt Foundation:

    Alexander von Humboldt Professorship – The Award Winners 2014
    [10 awards]

    30 April 2014, No. 11/2014
    Invitation to the award ceremony for the Alexander von Humboldt Professorships
    [6 presentations – you can guess which 6 of the 10, from the named six sponsoring universities]

    That’s all Humboldt is saying.

    1. I find Humboldt’s recent press announcement on Förster quite adequate for the time being. No doubt they have not yet had a chance to hear Förster and reach a final decision.

  5. The investigation methods are of more value than the Förster paper to be retracted.
    Non-linearity in subsets of the data cannot just neatly cancel itself out, giving (fortuitously for the hypothesis) linearity in the complete data set.

  6. I’m not a statistician, but reading the article carefully as a social psychologist, I’m surprised this article wasn’t flagged by peers earlier. It’s basically saying that priming people to think ‘globally’ or ‘locally’ by apparently amazingly working manipulations leads to increases in cognitive performance of around 1.5 standard deviations. As the report that led to the investigation says:

    “The cognitive test used in experiments 6-10b has only four items, yet the effect sizes are around d = 1.5, which represent very large effects given the expected low reliability of the scale.”

    Although this was not an official IQ-test, these kinds of reasoning tasks tend to correlate quite strongly with each other and with IQ scores. An effect of 1.5 standard deviations would translate into an increase of 1.5*15 = 22.5 IQ points, which seems absolutely ridiculous to me. And he did not find it once, but 7 times! If you look at the effects in the literature on one of the most extensively investigated ‘experimental effects’ on cognitive tasks, stereotype threat, you’ll find that effect sizes lie around d = 0.3.

    If people’s cognitive performance could truly be improved by 1.5 standard deviations by some kind of experimental manipulation, it should have been big news. If people’s cognitive performance could truly be improved by 1.5 standard deviations just by ‘smelling locally’ rather than ‘smelling globally’, or by ‘tasting locally’ rather than ‘ tasting globally’ or even ‘touching locally’ rather than ‘touching globally’ etc., then these results should have been world news, and school systems all over the world should have been drastically reformed. So I wonder, why didn’t anyone take these results as serious as they are?

    Regarding the other dependent variable, creativity, I’m surprised about the effects too. Participants were shown some kind of simple drawing, and were asked to come up with a title for this drawing. Four ‘experts’ rated how ‘creative’ each title was. The most striking result there to me is that the interrater reliabilities were extremely high in all studies (all Cronbach’s alphas > .85), which is uncommon in creativity tasks.

    So, just looking at the content of the paper makes me suspect that something odd is going on. This in itself may be no reason to draw conclusions about the veracity of the results, but the combination with all the vagaries about loss of data does make it highly suspicious to me. For example, I don’t see how downsizing in office space forces you to remove raw data files from your computer. Forster says he had to chuck out the raw (paper?) questionnaires because he was moving to a smaller office, but if you read the paper, you’ll see that almost all of the data was collected by means of computers. It does not even mention any paper questionnaires.

    Even without statistical evidence that these results are unlikely, I’m convinced that the results, found in all 42 experiments, are too good to be true.

    1. Indeed.

      Also odd: despite being well written and empirically very impressive, with a huge number of well-designed experiments and results that told a beautifully consistent “story”, ‘Sense Creative’ was published in a journal with an impact factor of just 1.26.

    2. 1.5 standard deviations difference in IQ in a sample from the whole population of say Dutch adults between 20 and 70 years old is a huge difference, but 1.5 standard deviations difference in IQ in a sample of Dutch psychology students is not a big difference. The population is very homogenous (I suppose) so “1 standard deviation” is not a great deal.
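The arithmetic in this exchange can be made explicit. The sketch below converts a standardized effect size into raw score points under two different standard deviations. The population value of 15 is the conventional IQ scale; the value of 8 for a homogeneous student sample is purely a hypothetical placeholder, since no such figure is given in the thread.

```python
def effect_in_points(d, sd_points):
    """Convert a standardized effect size d into raw score points."""
    return d * sd_points

POPULATION_SD = 15.0  # conventional SD of the IQ scale
STUDENT_SD = 8.0      # hypothetical SD within a homogeneous student sample (assumption)

reported_population = effect_in_points(1.5, POPULATION_SD)
reported_students = effect_in_points(1.5, STUDENT_SD)
stereotype_threat = effect_in_points(0.3, POPULATION_SD)

print(f"d = 1.5 on the population scale: {reported_population:.1f} points")
print(f"d = 1.5 within the student sample: {reported_students:.1f} points")
print(f"d = 0.3 on the population scale: {stereotype_threat:.1f} points")
```

Both commenters' points survive the conversion: a restricted sample shrinks the effect in raw points, yet d = 1.5 remains five times the d = 0.3 cited for stereotype threat, whichever standard deviation is used.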

  7. The LOWI-report reveals that three different experts (two by UvA and one by LOWI) have investigated the report of the complainant. I am unable to find indications of such a report made by an expert asked by Jens Förster and aimed to rebut the report of the complainant. I was wondering why Jens Förster was unable to provide such an expert report.

    I tend to think that the job of Jens Förster at UvA will be finished on 1 June 2014. This is the first date when Jens would start with his new job in Bochum. The scenario ‘case closed’ seems not a bad option for UvA (= just wait 3 weeks and Jens Förster is no longer employed by UvA). Bochum has not yet decided that Jens Förster gets his new job.

    Jens Förster wrote: “For me, the case is not closed. I will write a letter to UvA and LOWI (see article 13 of the regulations), and will remind the commissions of the information that they overlooked or that did not get their full attention.”

    Article 13 at https://www.knaw.nl/shared/resources/thematisch/bestanden/reglementlowi.pdf states (translated from the Dutch): “If the Complainant and/or the Accused is of the opinion that the handling of the complaint by the LOWI has not been carried out properly, the Complainant and/or the Accused may, on the basis of chapter 9 of the General Administrative Law Act, address the LOWI.”

    So Bochum will wait for what LOWI decides. They will take their time. That also seems not a very bad option.

    Jens Förster wrote also: “Moreover, now is the time to test the statistical analyses by the complainant – note that he stated that those would not be publishable because they would be “too provocative”.”

    I am sure that Jens Förster is very happy that RW has decided to publish the report of the complainant. Quite a few people over here, including various statisticians, have scrutinized the statistical analyses of the complainant.

    I am hereby inviting Jens Förster to join the ongoing debate on RW about the validity of the statistical analyses by the complainant.

    1. The data sets which were analysed by the various experts should also be made freely available. I wonder who is the official “owner” of them? Who would have to give permission? UvA? Förster? It seems to me that there are absolutely no confidentiality issues here.

      1. Dear Richard,

        1. Statement of UvA: “case closed” ( http://www.dub.uu.nl/artikel/nieuws/duitsland-slikt-fraudebeschuldiging-uva-hoogleraar-niet-voor-zoete-koek.html ) so no confidentiality issues.

        2. Förster & Denzler (2012) has only one affiliation (= UvA) and Markus Denzler has declared that he was not involved in collecting the data of this paper. Jens Förster will leave UvA on 1 June 2014 and he has published the paper. The research was totally funded by UvA (= public money). The UvA affiliation means that UvA is responsible for the contents and that you can ask UvA for the raw data when the author does not respond (or when he is sick / gone / passed away, etc.). UvA was also responsible for all Stapel papers published with a UvA affiliation (as long as they are not retracted).

        3. Code of Conduct for Scientific Practice: “III.3 Raw research data are stored for at least five years. These data are made available to other scientific practitioners at request.”

        4. So send an e-mail to both Jens Förster and UvA with a request for all the raw data and refer to this rule in the Code. File a complaint with UvA if UvA refuses to send you these data (wait until 1 June 2014 before you file such a complaint).

        5. I don’t have access to the other two papers, but both have only one author (= Jens Förster). http://www.ncbi.nlm.nih.gov/pubmed/21480742 (only UvA as affiliation)
        http://www.ncbi.nlm.nih.gov/pubmed/19203171 (only UvA as affiliation).

        6. Seems that you can just send 1 e-mail to both Jens Förster and UvA with your request (permission is not needed, just refer to the Code).

        Good luck.

        How about proposing to Jens Förster that the two of you soon (= within 3 weeks) hold a public debate with each other, one at Leiden University and one at UvA? Free access for all students, psychology students get study credits when attending this debate / lecture.

          1. Hi Richard,

            http://www.uu.nl/SiteCollectionDocuments/The%20Netherlands%20Code%20of%20Conduct%20for%20Scientific%20Practice%202012.pdf for the Code of Conduct. This Code of Conduct is identical for all Dutch universities (so also for Leiden, see http://organisatie.leidenuniv.nl/klachtenloket-medewerkers/commissie-wetenschappelijke-integriteit.html ). Maybe there is also an English version on the website of Leiden University.

            I would contact the UvA council for scientific integrity (Hanneke de Haes, http://www.uva.nl/contact/medewerkers/item/j.c.j.m.de-haes.html?f=de+haes )
            and/or the secretary of CWI (Miek Krol, http://www.uva.nl/contact/medewerkers/item/j.m.c.krol.html?f=Krol )
            and/or Jacqueline Groot Antink ( http://www.uva.nl/contact/medewerkers/item/j.b.groot-antink.html?f=groot+antink ).

            See also http://www.uva.nl/onderzoek/onderzoek-aan-de-uva/wetenschappelijke-integriteit/wetenschappelijke-integriteit-uva.html (only in Dutch).

  8. I just realised that the fact that the linearity completely disappears when we split the data into males and females isn’t interesting at all, given what we already know.

    If the combined sexes linearity is just an extraordinary chance outcome because the experiment was “done by the book”, then the sexes-separately linearity would be an even more extraordinary chance outcome. Its disappearance would only be to be expected.

    If the data has been manipulated to get good p-values for the test of the research hypothesis, and the overall linearity is just an artefact of fairly simple data manipulation procedures (such as a not too statistically sophisticated person could fairly easily do), then linearity of both groups separately would be totally unexpected. Its disappearance would only be to be expected.

    So both ways, we got to see exactly what we would expect on separating the sexes! Since we saw exactly what we would have expected to see under either “hypothesis”, what we saw does not discriminate between them.
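The argument above is Bayes' rule in odds form: an observation only shifts the balance between two hypotheses if it is more probable under one than under the other. A minimal sketch follows; every number in it is hypothetical and chosen only to illustrate the logic, not taken from the case.

```python
def posterior_odds(prior_odds, p_obs_given_fraud, p_obs_given_chance):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    likelihood_ratio = p_obs_given_fraud / p_obs_given_chance
    return prior_odds * likelihood_ratio

# Assumed odds of "manipulation" versus "chance" before the subset analysis.
prior = 4.0

# "Linearity vanishes in the subsets" is taken to be equally likely either way.
uninformative = posterior_odds(prior, 0.95, 0.95)

# For contrast: an observation three times as likely under manipulation.
informative = posterior_odds(prior, 0.90, 0.30)

print(f"after an equally likely observation: {uninformative:.1f}")
print(f"after a discriminating observation:  {informative:.1f}")
```

When the two conditional probabilities are equal, the likelihood ratio is 1 and the odds stay where they were, which is the sense in which the subset result does not discriminate between the two hypotheses.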

    1. That’s right, but the nonlinearity of the subgroups does rule out the hypothesis that the linearity is some kind of artifact of the analysis method, driven e.g. by the fact that the data is categorical.

      Initially, I thought that that was just about possible, and was Forster’s best hope of challenging the findings.

      Now, it is clearly not the case, because the male and female data showed normal Delta-Fs.

      1. Dear Neuroskeptic,

        I have checked parts of the methods section in various psychology papers in various journals in which the results of experiments are presented. Some of these papers don’t have an acknowledgments section and don’t provide details / names of other people involved in collecting data / helping with statistics / commenting on drafts and/or details on funding. Invariably, sample size (broken down by sex), some background information on the participants and detailed information on compensation are listed in all these papers for all separate experiments presented in these papers.

        Förster & Denzler (2012) provide precise information on sample size (number of participants, broken down by sex) for each of the 12 experiments and they state “undergraduate students” and “participants were paid 7 Euros or received course credit”. That’s all I can find in this paper and it seems to be the bare minimum that psychologists need to disclose (please correct me if I am wrong).

        “How goal-fulfillment decreases aggression” (Markus Denzler, Jens Förster & Nira Liberman 2009, Journal of Experimental Social Psychology 45: 90-100) presents the results of 3 experiments. This paper provides much more details:

        Exp. 1. “Ninety-one participants (51 women, 40 men) from University of Würzburg participated in a series of studies and received €12 (at the time approximately US$14) as compensation. (..). We examined speed of lexical decision after excluding incorrect responses (1.2% of the responses).”

        Exp. 2. “Fifty-two participants (25 women, 27 men) from Bremen University participated in a battery study and received €12 (at the time approximately US$14) as compensation. One participant had to be excluded because he was not a native German speaker. (..). We excluded incorrect responses (2.3% of the responses).”

        Exp. 3. “Eighty-five (44 women, 41 men) participants from Bremen University were recruited for a battery study and received €12 (at the time approximately US$14) as compensation. Because not all reaction times were recorded for two participants due to computer problems, we excluded them from the analyses. (..). We excluded incorrect responses (2.8% of the responses).”

        “Acknowledgments. This research is part of the first author’s doctoral dissertation and was supported by a grant of the German Science Foundation (DFG) awarded to Jens Förster (FO 392/8-2). We thank Fritz Strack from the University of Würzburg, Germany, for providing us with the research facilities for the first study. For collecting and coding the data we thank Florian Albert, Anna Berencsy, Regina Bode, Aga Bojarska, Maren Breuer, Nina Burger, Laura Dannenberg, Maria Earle, Karla Fettich, Marcela Fialova, Hana Fleissigova, Rebecca Hitchcock, Kirils Jegorovs, Dora Jelen, Sebastian Karban, Kaska Kubacka, Alan Langus, Janina Marguc, Petra Markel, Mayuri Nigam, Basia Pietrawska, Sonja Ranzinger, Dinah Rohling, Gosia Skorek, Anna Steidle, Thomas Stemmler, Aska Styczynska, Karol Tyszka, Rytis Vitkauskas, Alexandra Vulpe, Julian Wucherpfennig, and Nika Yugay. Sarah Horn is thanked for proof-reading the manuscript. We would also like to thank Amina Özelsel, Stefanie Kuschel and Katrin Schimmel for invaluable discussions, and three anonymous reviewers for their constructive comments.”

        So a lot of people were involved in collecting and in coding the raw data for this paper (in total 3 different experiments with in total 228 participants) and quite a few others are listed for help as well. All of them have a name and all of them can be approached in case of uncertainties. The last author, Nira Liberman, and one of the people listed in the acknowledgements, Regina Bode, were brave enough to join the debate on RW about the more recent activities of Jens Förster. Once again, thanks a lot for your comments and your thoughts and for participating in this debate under your own name.

        I am not a statistician and I am also not a psychologist, but I do have the opinion that not much is wrong with the contents of the paper “How goal-fulfillment decreases aggression”. Please correct me when I am wrong.

      2. True, the artefact hypothesis was one thing bothering me from time to time: the very small number of levels of these variables means less room to manoeuvre. It would be good to do some simulations to see if this feature would affect the F test for linearity. The literature on robustness of the F test to departures from normality focusses on what goes on in the right tail. People were never much interested in the left tail before.

    2. Dear Richard, I agree, “given what we already know,” but I am very much surprised by your “isn’t interesting at all” when you mention the subset analysis.

      The subset analysis is what made the difference between the CWI and LOWI reports, and between the first and second UvA board decision.

      The experts who were consulted by the CWI did not conduct a subset analysis and did not want to exclude QRPs as a possible explanation for the linearity in Forster’s results. Therefore, the UvA board had to play safe (in view of judicial consequences), and only asked for an “expression of concern.”

      The new expert consulted by the LOWI did conduct subset analyses and decided that Forster’s linearity must have been the result of deliberate manipulation. The LOWI report therefore excluded “sloppy science” and QRPs as possible explanations, and the UvA board asked for retraction of the Forster & Denzler paper after all.

      The CWI-experts may have been overly cautious. I don’t blame them, but because of their timidity the subset analyses have played a crucial role in ultimately establishing “deliberate manipulation”.

      In earlier comments, you yourself did not want to exclude QRP’s, so to you the subset analyses must have been highly interesting as well.

      1. Well, first the subset analysis came across to me as brilliant but on further consideration I come to the conclusion it adds nothing. And I explained why: Whether we assume the hypothesis “fraud” or the hypothesis “chance”, given that we have too-good-to-be-true linearity overall, we would expect too-good-to-be-true linearity to vanish when we go to subsets. So the fact that it vanished when we went to subsets does not give us any further information to distinguish between the two hypotheses. The Bayes factor is 1. The likelihood ratio hasn’t changed.

        I’m afraid that this “disappearance” of the too-good-linearity merely has a psychological impact on non-statistically thinking persons. Psychologists who still want to believe linearity as some kind of escape clause for Förster and don’t understand statistical variation.

        I suspect that the CWI itself was overly cautious because honest and cautious experts would have to say that what they had seen so far did not “prove” data-fabrication. They were no doubt asked to evaluate the work of the complainant, not to mount a massive new investigation themselves. They no doubt know less than the complainant, because the complainant is apparently a psychometrician (a methodologist) so is at home both in the worlds of psychology research and the world of applied statistics, and completely familiar with life in a social psychology research lab, whereas if the external expert is a mathematical statistician, then what do they know about psychology research? They can pronounce on the correctness and appropriateness of the methodologist’s methodology, not much more than that. Then, very important: CWI is thinking of the legal meaning of their ruling, not of the scientific ruling. The LOWI on the other hand is thinking more of the scientific meaning. Remember, if the UvA would want to fire Förster for fraud, the committee’s work will be contested by lawyers, not by scientists.

        Secondly, the LOWI has more information, since more facts had been uncovered in the meantime, and more experts, so more variety of expert opinion, so also experts who were totally convinced of fraud, as well as experts who maybe still weren’t quite certain of fraud. Finally, LOWI themselves were convinced, and after that they can cite the statements of experts which fit their conviction.

        Sorry, I have a lot of experience with judges’ written verdicts and the Dutch legal system.

        1. “I’m afraid that this “disappearance” of the too-good-linearity merely has a psychological impact on non-statistically thinking persons.” (e.g. myself in day to day thinking mode)

        2. Richard, according to you the subset analysis and the LOWI report did not add any new information to what we already knew. So you still contend that “innocent QRP” is a more likely explanation than “deliberate fraud”?

          1. GJ, I *never* contended that innocent QRP is a *more* likely explanation than “deliberate fraud”.

            And I do not contend that the LOWI report adds no new information. It adds lots of new information.

            I do believe that the subset analysis is a red herring!

            I believe that the papers concerned exhibit such total lack of scientific integrity that they should be withdrawn. I believe that either Förster is grossly incompetent or that he has faked data big-time.

            Moreover these have been my opinions ever since I got to know about the case.

            As time has gone by and more information has become available I am tending more and more to the “fraud” variant.

            I used to think it unlikely that someone who fakes their data would fake them in such a naive way as to create this linearity, but now I am no longer so sure. Since the scores are whole numbers it is rather easy to create a nice control group with very low standard deviation, then subtract “1” from all the scores to get a nice low group, and add “1” to all the scores to get the high group. Just occasionally make it 0 or 2 that you add or subtract. Nowadays I find super dumb fraud not so difficult to believe after all (especially now we know how little Förster understands statistics and his total lack of any credible defence whatsoever).

            On the other hand, I *can* think of a QRP mechanism (though it is one which is close to fraud) whereby selection for significance of the research hypothesis could tend to create too-good-linearity, though more research needs to be done to check.

            Please remember: when judging whether a paper should be withdrawn or not, we don’t have “innocent till proven guilty”. Actually, we never “prove” innocence. We just keep “innocence” as a kind of working approximation but always stay looking for proof of guilt. No scientific theory is ever definitive.

            But judging a person is different from judging a paper. We should start with a presumption of innocence.

            Moreover, judging that someone has committed scientific fraud is much more serious than judging that someone is incompetent at scientific research. So the burden of proof should be much higher in the former case than the latter case. Incompetent scientists can keep their jobs till they get their pension: they just won’t get many more publications, research grants, students … But fraudulent scientists get fired.

          2. As an aside, you now say: “I *never* contended that innocent QRP is a *more* likely explanation than “deliberate fraud”.”

            But on May 3, in the “Anatomy”-thread you said: “I am still not putting a lot of money on “deliberate fraud”. I would still tend to suspect massive “innocent” QRP.”

            I do not hold it against you; in an open discussion we are allowed to make mistakes and to change positions (I often do). But your opinion is important, because as long as there are statistical experts saying that QRP’s cannot be excluded, the UvA board, the Humboldt committee and the Bochum university cannot take appropriate action.

          3. Thank you GJ! That is a nice example of how, when your mind changes, your memory changes too. I stand corrected.

            Now you are getting the different responsibilities mixed up. UvA, Bochum, Humboldt all have to judge. They have to take decisions. I don’t have to take a decision. They have to take a decision and it has to be their decision. Their decision is not dictated by some crazy scientist.

            I can only state my scientific opinion, try to explain it, and try to make sure that people understand what I mean. I am not doing very well, I agree. Sorry.

            I think that massive yet “innocent” QRP means such a level of incompetence that the appropriate action of the UvA board, the Humboldt committee, the Bochum university would be the same as if fraud had been “proven”. However, I don’t tell them what to do, and they can’t hide from their responsibilities because not everyone is agreed on everything.

            LOWI essentially says the same: they believe “fraud”, but they don’t say that Förster did it, but they say he is responsible and should have known it happened.

          4. Actually I have told Humboldt what to do: give Förster his grant and force him to painstakingly (and under close supervision) repeat his experiment; and publish the results. That will take at least three years (200 psychology students!). And I told Förster that he can tell his lawyers that LOWI is hiding behind numbers in the trillions, which are totally unreliable. I mean: could easily be many orders of magnitude wrong. And certain to be misinterpreted by the press and the public.

          5. Orders of magnitude wrong: because we are multiplying lots of small numbers together. If those small numbers have a bias to be too small, then on a logarithmic scale we have an accumulating negative bias. That means: orders of magnitude.

            And why a bias to be too small, to begin with? Because we are using a parametric instead of a non-parametric analysis, or an asymptotic approximation (bootstrap) instead of an exact non-parametric approach. I have observed in similar cases that this leads to p values which are too small. Eg in the Geraerts case the nominal p-value is 0.001 or something like that but the actual p-value is 0.05 or something like that.

            In the Geraerts case there were groups which were too equal. This means that we can do exact permutation tests. Also the same subjects occurred in different tests, and one can take account of the resulting dependence by computing the permutation test p-value (which is definitely valid) after combining nominal p-values (eg resulting from nominally assuming independence using the Fisher method after doing nominal F test assuming normality).

            It made a rather big difference in the number of zeros between “0.” and the first non-zero digit.

            Does it matter? Probably not.
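            A rough sketch of how the bias accumulates on the log scale. The per-test factor and the number of tests are hypothetical numbers chosen only to show the arithmetic; they are not estimates for the Förster or Geraerts data.

```python
import math


def combined_bias_orders(per_test_factor, n_tests):
    """Orders of magnitude by which a product of n_tests p-values is too small,
    if each individual p-value is too small by the factor per_test_factor."""
    return n_tests * math.log10(per_test_factor)


# Hypothetical: every nominal p-value is only a factor 2 too small,
# and 40 of them are multiplied together.
orders = combined_bias_orders(per_test_factor=2.0, n_tests=40)
print(f"combined result is off by about {orders:.1f} orders of magnitude")
```

            So a bias that looks harmless for a single test can move a combined figure "in the trillions" by a factor of a trillion or so.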

          6. In addition, your Bayes factor is not 1.

            First we considered two options (besides fraud): The linear relationships between the independent variable and the overall means of the various outcome variables are the result of (1) perfect underlying linearity or (2) pure coincidence (without underlying linearity).

            The subset analyses have shown that #1 cannot be true, making Forster’s results even more unlikely than they already were.

          7. Well OK there are really three options: coincidence, QRP’s, fraud

            If the overall super-linearity were coincidence, we wouldn’t expect it to persist in the subgroups
            If the overall super-linearity were a QRP, we wouldn’t expect it to persist in the subgroups
            If the overall super-linearity were fraud, we wouldn’t expect it to persist in the subgroups

            We didn’t see it persist, so we are not any more the wiser from seeing it not persist.

            But the totality of all we know (before looking at the subgroups) fits perfectly to the hypothesis of fraud, fits badly to the hypothesis of QRP, and incredibly badly to the hypothesis of coincidence.

            That is my personal opinion at the moment. I am not a lawyer or a judge (and I’m glad of that).

            People who *judge* (eg LOWI) have to take account not only of the likelihoods but also the costs and benefits of each possible decision under each possible scenario, and prior probabilities of the three scenarios. So if there are 3 possible “states of nature”, and if the LOWI’s decision consists of choosing one of three states of nature as the true one, with enormously different consequences for the “suspect” under each possible decision, then there are 9 combinations of “actual truth, LOWI determined truth”. 9 different “costs” (costs to society, costs to the “suspect”). All this has to be wisely weighed together. That is what we have judges for.
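            A toy version of that 3 x 3 table, to show how costs can matter as much as probabilities. Every number below (the probabilities and the costs) is invented for illustration and has nothing to do with what LOWI actually weighed.

```python
from itertools import product

STATES = ("coincidence", "QRP", "fraud")


def expected_costs(probabilities, cost):
    """Expected cost of each verdict, where cost[(truth, verdict)] is the
    cost of announcing `verdict` when `truth` is the actual state of nature."""
    return {
        verdict: sum(probabilities[truth] * cost[(truth, verdict)] for truth in STATES)
        for verdict in STATES
    }


# Hypothetical probabilities of the three states of nature.
probabilities = {"coincidence": 0.01, "QRP": 0.19, "fraud": 0.80}

# Hypothetical costs: 0 for a correct verdict, 1 for a wrong one, and much
# more for wrongly declaring fraud (9 truth/verdict combinations in total).
cost = {(truth, verdict): 0.0 if truth == verdict else 1.0
        for truth, verdict in product(STATES, STATES)}
cost[("coincidence", "fraud")] = 10.0
cost[("QRP", "fraud")] = 5.0

costs = expected_costs(probabilities, cost)
best_verdict = min(costs, key=costs.get)
print(costs, best_verdict)
```

            With these made-up numbers the cheapest verdict is "QRP", even though "fraud" is by far the most probable state. Change the costs and the answer changes, which is exactly why this weighing is a judge's job and not a statistician's.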

          8. If you phrase it like above, then there are many more options, among which the option of “perfect underlying linearity” that you forgot to mention, but that we would expect to persist in the subgroups.

            The subset analyses showed that it did not persist, so we *are* wiser, because we can now rule out “perfect underlying linearity”, further diminishing the likelihood of the observed linearity of the overall means.

          9. GJ: perfect *underlying* linearity is not an interesting hypothesis, anyway. I am only talking about statistically *impossible* sample linearity. It’s statistically close to impossible even if the underlying truth happened to be linear; there is no reason to suppose it should be; every reason to suppose therefore that the sample linearity is even more statistically impossible. (I’m talking about the “too good to be true” observed linearity. It was too good to be true even if the underlying truth was exactly linear, and even more too good to be true if it weren’t)

          10. The phenomena that we are trying to explain are the observed linear relationships between the categories of the independent variable and the category means of the various outcome variables.

            Everyone (except for a few die-hard Forster supporters) agrees that the observed linear relationships are “too good to be true,” that is not the point.

            The point is that you are mixing explanations with qualifications.

            “QRP” and “fraud” are qualifications (and not very well defined).

            “Coincidence” and “underlying linearity” are explanations, and both are VERY unlikely. “Deliberate manipulation” is also an explanation and is very likely.

            “Deliberate manipulation” can be qualified as “fraud.”

            But as long as experts (like you) do not want to exclude the possibility of explanations that can be qualified as “QRP”, it remains interesting to conduct additional analyses to rule out other explanations.

            The subset analyses and a recent additional simulation study (http://datacolada.org/2014/05/08/21-fake-data-colada/) help to rule out “underlying linearity”.

            You say this is not interesting because we all already knew that “underlying linearity” is impossible. But you still list “coincidence” in your list of options, and that seems even more impossible.

            Moreover, I do not agree with your “If the overall super-linearity were fraud, we wouldn’t expect it to persist in the subgroups.” I do not agree because it is of course possible (and easy) to generate linear data.

            If Forster would have generated his data, linearly, with more credible effects, and with random error, then he would not have been caught.

            However, I do not think that Forster generated data. I still think that he really conducted these experiments, and analyzed true data, but added constants to individual scores, in order to introduce the effects that he wanted to report.

          11. GJ, we are arguing about the route from a priori possibilities (coincidence? QRPs? fraud) to a final conclusion (which we both agree, and the LOWI does too, is fraud). We are using the same words in different ways and there are a lot of technical subtleties here.

          12. Circumstantial evidence which casts doubt on data collection seems to be that apparently 1) none of the over 2000 subjects who allegedly participated in the experiments, who supposedly studied psychology and who supposedly were Dutch and as such exposed to press coverage of the investigation, have found their way to this or other discussion forums, 2) Förster evidently has not summoned any of them to provide evidence in his defence to UvA or LOWI, and, in fact, these bodies even need to presume in which centre the experiments were carried out because of lack of such or any other concrete evidence of data collection provided by Förster during the investigation.

          13. In my view the subgroup analysis decisively rules out any explanation in terms of the nature or the acquisition of the data. In those cases the linearity would be pervasive. Instead it only appears in the overall means across the three groups. Thus it occurred at a late stage. It occurred after all of the data was collated in one place.

          14. nskeptic, there are reasons to suppose observed super-linearity *can* be explained through how the data is collected. Certain QRP’s (QRP’s akin to fraud) would tend to generate super-linearity in the over-all means, but not in the subgroups, because the QRP selection of data is targeting the over-all means, namely targeting a significant F test of the experimental hypothesis concerning the over-all means.

            That might well lead to a significantly insignificant (if you see what I mean) F test of certain non-targeted hypotheses.

            Sorry it is all rather subtle and I am probably not explaining it well and I might even be wrong but I haven’t heard a sound argument yet *against* my argument.

          15. In all honesty, the argument in favour is also qualitative at best. It would be interesting to see the suggested effect emerge in an actual simulation, for instance.
            Testing for significance of the presence of a linear effect is not the same as testing for superlinearity of that effect. Significant F-values don’t require superb linearity. They just require that a (close-to-)linear term is present that is strong enough to dominate the noise. A decent degree of “curvilinearity” would not much affect that, I believe. So although QRP may bias towards an unexpectedly linear effect in principle, my bet is that that would be a very small effect.
            Intuitively, one is testing a different tail of a different distribution: QRP would bias towards a p too close to 0 for the null-hypothesis, whereas the reported anomaly is more like a p too close to 1 for the linearity-hypothesis.
            This is gut feeling, I admit, but no guts no glory… 😉
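            For anyone who wants to try, a simulation along these lines might start from the sketch below. It models only one very simple selection mechanism (keep a study only if the omnibus F test is significant), which is an assumption of mine and not necessarily the QRP proposed above. Group size, true means, SD and the critical value are all hypothetical choices.

```python
import random
import statistics

# Approximate 5% critical value of F(2, 57); an assumption for this sketch.
F_CRITICAL = 3.16


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    group_means = [statistics.fmean(g) for g in groups]
    k = len(groups)
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (len(all_scores) - k))


def nonlinearity(groups):
    """Distance of the middle group mean from the midpoint of the outer means."""
    low, mid, high = (statistics.fmean(g) for g in groups)
    return abs(mid - (low + high) / 2)


def simulate(n_studies=2000, n=20, means=(3.0, 3.5, 4.0), sd=1.0, seed=1):
    """Average nonlinearity over all studies and over 'significant' ones only."""
    rng = random.Random(seed)
    all_dev, selected_dev = [], []
    for _ in range(n_studies):
        groups = [[rng.gauss(m, sd) for _ in range(n)] for m in means]
        dev = nonlinearity(groups)
        all_dev.append(dev)
        if f_statistic(groups) > F_CRITICAL:
            selected_dev.append(dev)
    return statistics.fmean(all_dev), statistics.fmean(selected_dev), len(selected_dev)


mean_all, mean_selected, n_selected = simulate()
print(f"all studies: {mean_all:.3f}  selected ({n_selected}): {mean_selected:.3f}")
```

            Comparing the two averages shows whether this particular kind of selection pulls the means towards a straight line at all. More elaborate mechanisms (dropping subjects, optional stopping) would have to be coded separately.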

          16. The QRP argument is moot because Forster has denied using any QRPs.

            OK, he has also denied fraud, and we suspect him of fraud nonetheless. But it would be absurd to imagine that he is in fact guilty of QRPs, but not fraud. Were that the case he would surely confess to the QRPs, thus avoiding getting convicted of fraud, and he would get away with a lighter punishment. Perhaps he could even become a martyr of the “Stapel backlash”. Why would he deny QRPs when they are his main alibi for the much more serious fraud charges?

          17. @Neuroskeptic. You are totally right. Jens Förster has firmly denied conducting any QRP’s and he has also firmly denied fraud. Jens Förster also said that he had thrown away a lot of paper questionnaires (etc.). I fail to understand the opinion of Jens Förster that ‘throwing away raw data before they are published’ is not classified as QRP.

            I also fail to understand why Jens Förster did not file a complaint to LOWI in which he stated that the preliminary decision of UvA was ‘a terrible misjudgement’, in particular given his firm statements (see above). Please read carefully, and once again carefully, the whole text and all findings of the preliminary decision of the Board of UvA. It seems to me that this decision cannot be classified as ‘the findings of the complainant are nonsense’. No way.

            How would you (or anyone else over here) feel when the Board of your university (assuming you are working at a university) had made a similar decision on three of your papers published in peer-reviewed journals?

          18. @Neuroskeptic. My interpretation is that Foerster denied QRPs since he had committed none (at least not in the p-hacking / statistical sense; not keeping the original files is silly/sloppy/wrong, but not immoral). I also believe that he denied fraud since he committed none. I know Jens only superficially, but I know quite a few of his colleagues/teachers/students and I know the academic and the ethical climate within which he acts – and I find it impossible to believe he made up data. Besides, I would expect him to do a better job at falsification; he might be sloppy, dumb he is not. At the same time, the data are obviously made up, no doubt in that. Can these two contradicting “facts” (both are not, but for me, very close to being facts) be reconciled? My only hypothesis that solves the contradiction is that they are made up, but not by Foerster. I think that a thorough investigation is needed into the administrative side of this sad story, as the statistical side is beyond any doubt, IMO. But would a person of integrity cooperate with an investigation against his former employees/students? Not clear.

  9. It would be very, very bad for the reputation of the University of Amsterdam if they would consider this case to be closed. First, in the Stapel case, they ducked their responsibility by not really investigating the studies of Stapel while he was employed in Amsterdam (the studies were published more than 5 years ago and the authors had no obligation to keep the raw data, but it was very suspicious that none of the authors and co-authors had this data somewhere on an old disk or floppy) and they also ignored the KNAW (Royal Academy of Science) report that was very critical of the co-authors who didn’t notice some clear mistakes or impossibilities in these articles. At least, if they really cared, they would have spent some money to redo the Stapel studies.

    In the Förster case no serious effort was made to get to the bottom of it: in the first stage Förster declined to give the data files to the statistician who asked for them, in the second stage the committee decided that they could not ask to retract the paper because the data was missing, and only in the third stage, when the case went to LOWI, did some of the data apparently turn up so that it could be analyzed (with a very clear conclusion).

    This is not my field and I never before heard about Förster, but when I googled I found that one of his most spectacular articles is “Why Love Has Wings and Sex Has Not: How Reminders of Love and Sex Influence Creative and Analytic Thinking” (DOI: 10.1177/0146167209342755). To summarize, when one thinks of love one is future-oriented and more creative and when one thinks of sex one focuses more on the present and is more analytical. This paper reports very strong results, and again, control conditions are always very, very close to the average of the sex and love conditions. You don’t have to be a close reader to see that the analysis is sloppy; for example when two treatments with each 20 subjects are compared every time t-tests with 57 degrees of freedom instead of 38 degrees are reported. In study 2 the first regression has creative performance as a dependent and as independent variable (love=1, sex=-1), which means that they do exactly the same t-test as reported a few lines above, but they report other t-values. And it may be my dirty mind, but how likely is it that no student who thinks about love imagines sex and the other way round (footnote 3)? Of course, my casual reading of this article cannot in any way be a proof of QRP or worse, but I think there is much reason for the University of Amsterdam to take a closer look. Not looking suggests that they are afraid of what they may find.

  10. Intriguing.

    1.) it seems unlikely that eating cornflakes on their own rather than together with other cereals would dramatically improve your ability to complete a logic puzzle (an example of Forster’s experiments).
    2.) But the p-values of the post-hoc analysis look like there is some underlying fundamental ‘scientific truth’ (or, of course, fraud)
    3.) Forster denies all wrongdoing.

    The only way I can reconcile all this is that Group category labels were inadvertently incorporated as actual ‘values’ into the calculations of means (really poor code-writing!). For example, there are three groups of 20 which all come out with a true mean of about 4 on a 7-point scale. But then to Group 1 you add 20 ‘1’s, to Group 2 you add 20 ‘2’s and to Group 3 you add 20 ‘3’s. Then divide each group by the new N of 40 and you get group means of 2.5, 3.0 and 3.5 (pretty close to several of Forster’s results). Obviously you also reduce your SD significantly by adding 20 identical values.

    Alternatively, if the groups all had true means of 4.2 and their category labels (now used as ‘values’) were 1, 3 and 5 then you would get new means of 2.6, 3.6 and 4.6. Again, pretty close to Forster, especially allowing for a bit of noise and perhaps a small ‘genuine’ effect from his experiments.

    So the ‘scientific truth’ behind the results could be, for example, that 2 lies exactly midway between 1 and 3!

    If this was the case it would represent gross incompetence but not necessarily malicious intent. Just an idea.
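    The arithmetic in the two examples above is easy to check with a few lines. This is only a sketch of the hypothesised coding error; the function name is made up, and nothing suggests the real analysis scripts looked like this.

```python
def contaminated_mean(true_mean, n, label):
    """Mean of n real scores (averaging true_mean) pooled with n copies of
    the group's category label, as in the hypothesised coding error."""
    return (true_mean * n + label * n) / (2 * n)


# Three groups of 20 with a true mean of 4, labels 1, 2 and 3.
first = [contaminated_mean(4.0, 20, label) for label in (1, 2, 3)]

# Three groups of 20 with a true mean of 4.2, labels 1, 3 and 5.
second = [contaminated_mean(4.2, 20, label) for label in (1, 3, 5)]

print([round(m, 1) for m in first])   # 2.5, 3.0, 3.5
print([round(m, 1) for m in second])  # 2.6, 3.6, 4.6
```

    Because each group gets the same true mean plus half its label, equally spaced labels always give equally spaced means, which is where the perfect straight line would come from.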

    1. Interesting idea, David.

      However, this would only work if Forster obtained identical means (4 out of 7 in your example) across conditions in all studies, which is as impossible as obtaining almost perfectly linear means across all studies.

    2. I am pretty sure that the statisticians who redid the analysis with identical results using his files would have noticed this.

      1. Yes, fair points! I suppose you could just about argue that categorical data will ‘cluster’ somewhere, and that you are especially likely to see ‘regression to the (probably central) mean’ when taking averages from 4 independent scorers using an arbitrary scale. But I’d missed the fact that they had his data files – I thought they were just working from data in the papers.

        So unless the category labels were somehow incorporated into the data points…? No, seems too improbable and I can’t see how it would have worked. Sadly now also have to agree with manipulation.

        Only last hope for Forster’s redemption is that he’s set this whole thing up and we’ve all become unwitting players in a great social psychology experiment looking at tendencies to defend, accuse, jump to conclusions or remain dispassionate! Now that would be cool…

  11. I don’t know if anyone has posted this before, but: Smeesters’ now-retracted 2011 article (with Liu), “The effect of color (red versus blue) on assimilation versus contrast in prime-to-behavior effects” cites four articles by Förster (as sole or co-author). One of these is Förster’s 2009 article, “Relations between perceptual and conceptual scope: How global versus local processing fits a focus on similarity versus dissimilarity search”, which is one of the articles investigated by LOWI. This, of course, proves nothing, but it seems to have some irony value, at least.

    1. Here is one tip to catch fraudsters. See who “replicates” results from experiments that turned out to have been fabricated. It is likely that the person who “replicates” such findings cheated herself to produce the same pattern.

  12. Just sent the following email to Prof. Förster:

    Dear Prof. Förster

    As you know there is a lot of debate about the statistical methodology used by the complainant in the recent investigations of CWI-UvA and LOWI. It seems to me that this debate should be open and scientific and that you too should have an opportunity to defend yourself in it.

    I have a request, and a suggestion.

    1) Is it possible to obtain from you the data-sets which were also analysed by the experts in the two investigations? I would like to post them on internet so as to facilitate the scientific debate. I believe there should be no confidentiality issues.

    2) Would it be an idea to organise a public debate e.g. at Leiden University Psychology department or some other neutral venue with presentations by yourself and supporters of your position as well as by critics? And with people from social psychology, psychology more broadly, methodology (applied statisticians), theoretical statisticians?

    Yours sincerely …

  13. I disagree with the firm statement of Jens Förster that the Board of UvA had found “no evidence of academic misconduct” in their first decision of 10 July 2013. Please read carefully the whole document of UvA on the site of VSNU.

    The conclusions of the Board of UvA (10 July 2013):

    (1): academic misconduct could not be ruled out. [That’s not the same as “no evidence of academic misconduct”]
    (2): sloppy science / QRP could not be ruled out.
    (3): Jens Förster was unable to provide an explanation for the linearity of the research results presented in the three papers.
    (4): the patterns identified by Jens Förster could not be confirmed by other, comparable research.
    (5): Jens Förster was unable to retrieve raw data of experiments presented in the three papers. It was therefore not possible to ascertain whether data may have been manipulated. [Once again, that’s not the same as “no evidence of academic misconduct”]
    (6): based on (3), (4) and (5) urge Jens Förster to send an e-mail to both journals that an ‘expression of concern’ should be published for all three papers.


    It seems to me that any honest and serious scientist would immediately send such an e-mail to both journals and would also immediately start to think about how to solve this problem. See, eg,
    http://retractionwatch.com/2013/12/24/doing-the-right-thing-yale-psychology-lab-retracts-monkey-papers-for-inaccurate-coding/ for a nice example how other psychologists cope with “problems with their data.”

    It seems to me that any serious journal will make a quick decision what to do with such a request and it seems likely that both journals would indeed soon publish such an ‘expression of concern’. It seems likely that Retraction Watch will soon be informed of these statements and that RW will post about these ‘expressions of concern’. Likely, many people will start to scrutinize the papers when such a posting is published on RW, and these people will also ask Jens Förster for details / raw data / data files (etc.).

    It seems to me that Jens Förster has not sent such an e-mail (am I right? Please correct me when I am wrong). Is there anyone over here who knows what Jens Förster has done (and why he seems unwilling to disclose this information)?

    Jens Förster did not file a complaint to LOWI. So Jens Förster agreed with this first decision of the Board of UvA. I am unable to reconcile the aforementioned statement of Jens Förster with the findings in the first decision of the Board of UvA. Please correct me when I am wrong.

    The regulations of UvA (and as well of all other Dutch universities) state that, sooner or later, [an anonymized version of] the above mentioned conclusions of the Board of UvA must be published on the site of VSNU.

    “While no one can read every paper, someone, somewhere has read your paper, very, very carefully. If there is something they are unhappy about, we will all know about it in due course.” [Dave Fernig]

    “You publish BECAUSE people want to hear about your work and they will read your paper. Some will read it very carefully, so if you have got something wrong it will come to light. Defensiveness, silence, summoning the self-righteous shield of peer-review is more damaging than coming out into the open.” [Dave Fernig]

    1. Amazing: The studies were not conducted in Amsterdam, but in Bremen (Germany)?
      Does that mean the panel was not responsible for this case?…..

      1. The papers were published by Förster from his Amsterdam academic address. So the UvA has a big responsibility and a big say in this. Data needs to be kept for at least five years from the date of publication. This is a standard rule and it has been a standard rule for years. Violation of that rule is a failure of scientific integrity. It doesn’t make a difference that all your friends ignore that rule, too. A top professor should be setting a good example, not following a bad example. Scientific papers should say something about when and where the data was gathered. These papers don’t. That’s what we call sloppy research. A university whose professors do sloppy research have to take action of some kind or other…

    2. Rolf, thanks a lot for this new information. Some preliminary remarks:

      JF: “The series of experiments were run 1999–2008 in Germany, most of them Bremen, at Jacobs University”. Can anyone tell me where I can find this information either in the LOWI-report or in the UvA report? Why not mention Jacobs University as well as affiliated institute at any of these three papers when all the research has been carried out at Jacobs University?

      JF: “the specific dates of single experiments I do not know anymore.”
      I am a biologist working with (large) datafiles with various records of individual birds, so excuse me when I am asking a dumb question. Any record of these birds can always be linked to a particular set of basic information: date, site, unique IDnumber, age (often in two or three age classes), sex (if appropriate), capture method, etc.

      Why do psychologists not (need to) store this kind of information (= age, sex, date, site, type of compensation, other background information, etc.) in such files?

      I have a file with records of 380 gulls found dead in The Netherlands in the mid 1980s. These data were published in 1997 in a peer-reviewed paper in Bird Study (http://www.tandfonline.com/doi/abs/10.1080/00063659709461066#.U3CFKYF_vTo). About 8 to 10 years ago, the last author of this publication sent me an Excel file with all his records. Each individual gull has a unique row with information on site, date when it was found, age, and with all details on measurements, etc. Parts of the data of this file have been quoted in the discussion of a paper I published in 2008, and parts of the data of this file are quoted in a draft of a new paper. I doubt if the original paper forms (= labwork when taking measurements and when examining the sex) still exist, but I consider such an Excel file as ‘raw data’. Do people disagree with my opinion?

      JF: “The series of experiments were run 1999 – 2008 in Germany, most of them Bremen, at Jacobs University”.
      JF & MD in their 2012 paper: “Participants were paid 7 Euros or received course credit.”

      Please show me papers with results of such kind of psychology experiments conducted in the Bremen area / at the Jacobs University and/or elsewhere in Germany where it is noted that the participants got 7 Euro as compensation.

    3. I think Förster is guilty of selectively quoting from the expert report, in advancing his theory that someone in his lab may have manipulated the data. The expert report, as quoted in the LOWI report, states:

      “It seems to me that only goal-oriented intervention in the entire data set can lead to this result.”

      I emphasise: the *entire* data set.

      So, Förster’s new theory assumes that after all these studies were run from 1999 to 2008, presumably in 2008 or 2009, but at any rate before Förster started seriously looking into the analysis for publication, someone who worked in his lab manipulated the entire data set, without Förster’s knowledge.

      I think the list of people who could have done this thing would have to be pretty small. Not only that, but I can’t think of any reason why someone in his lab might have done this, seeing as none of them were given credit in the publication, and so would have little to gain.

    4. “If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”

      I assume this modus operandi was not reported in any publications. Isn’t that a QRP in itself?

    5. Thanks for posting this.

      In my eyes, this at least creates some doubts regarding what was going on and it also shows that science cannot be an ultimate tool in prosecuting someone; maybe lawyers should deal with this part of the whole story from now on.

      Regarding science: How to proceed from here?

      I can only encourage the author(s) of the complaint to submit their work to a peer-reviewed journal.
      I am already surprised that they haven’t done this yet – in other cases such manuscripts were published. Did the authors of the complaint already submit their manuscript? If so, what did the reviewers say?
      Has it been accepted?

    6. It occurs to me that the following two statements are not quite in agreement.

      “I always emphasized that the assistants are not responsible for the results, but only for conducting the study properly, and that I would never accept any “questionable research practices”.”

      “If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”

      It sounds to me as if assistants were in fact somewhat responsible for the results, because if the results didn’t support the hypothesis, they would have to come up with ways to alter the study, and then run it again.

      The above is not necessarily problematic, but it seems inconsistent.

  14. Thanks Rolf. Call me naive, but I am inclined to believe him.

    However, my favourite part of his response is this:

    “If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”

    So, if I’ve understood correctly: when you don’t get the results you ‘want’, keep running the experiment until you do – then disregard all the negative results and publish the positive ones. Maybe that’s where I’ve been going wrong all these years…!

    Maybe that’s also why some research assistant fabricated the data, so that they could stop re-running the damned experiment!

    1. This is the way an experiment is run in a laboratory. No different from calibrating an instrument, finding the correct dosage of a medicine, etc. But: once you find the right instrument/procedure – you are supposed to be able to rinse and replicate the results every time you use the same method – in your lab or anyone else’s. [I have problems with “conceptual” replications that Foerster mentions as the ***only*** type of replication, but this is a different discussion]

      1. I’m not sure this is the way an experiment is run in a laboratory. Sure you optimise the methods to use minimum reagents etc but at that point you are just getting things going: you are not collecting data to confirm or refute a hypothesis. Forster says that they tweaked things “if the data did not confirm the hypothesis”. Perhaps I’m reading too much into it, but that sounds different to me…

      2. This seems to be a misunderstanding with respect to replication (“you are supposed to be able to rinse and replicate the results every time you use the same method”). If an effect exists, the probability that your experimental data indicate its presence depends on the statistical power of your test. In most cases in psychological research, replicating an effect over and over again without failures is a highly unlikely event due to low power.
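        To put rough numbers on this, here is a minimal sketch. It assumes the studies are independent and all have the same power; the function name and the example values (power .5, up to 12 studies) are only illustrative and are not taken from any of the papers.

```python
def prob_all_significant(power: float, n_studies: int) -> float:
    """Chance that every one of n independent studies comes out significant,
    assuming (hypothetically) that each study has the same power."""
    return power ** n_studies


# Illustrative values only: a power of .5 and series of 1, 3, 5 and 12 studies.
for k in (1, 3, 5, 12):
    print(k, prob_all_significant(0.5, k))
```

        Even with a power of .8 per study, an unbroken run of five significant results is already less likely than not under these assumptions.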

    2. No one talks anymore about the other two papers. The LOWI only focused on the SPPS paper, but the other two papers also showed the same unlikely patterns. So, were all these unlikely patterns in at least three papers due to the intervention of someone else?

      1. What makes me wonder even more is the fact that the complainants (I assume academics) did not publish or succeed in publishing their analyses in a peer-reviewed journal.
        It is my understanding that this is the most appropriate scientific level of discourse – at least in Psychology.

        Any idea, why?

        1. You seem to assume the authors were unsure of themselves. I suspect they were sure, and the proper venue to report scientific misconduct is not a peer-reviewed journal. I wonder whether there even exists a journal, that would not be ignored by the psychology community, where such an analysis would be published.

          1. No, I do not assume that they were unsure.
            I just do not understand, why they have not used this tool, as well. I just know from peer-review processes in other fields that they can be very tough, because they typically ask experts to evaluate your manuscript.
            I assume that this is the same in Psychology. And – from a quick search – there are cases (e.g., in Psychonomic Bulletin & Review), where other academics succeeded in publishing criticism of studies.

        2. hi Henk,

          Jens Förster himself has stated that UvA has opened an investigation on his academic integrity after “a colleague from the methodology department at the University of Amsterdam (UvA) observed some regularities in data of three articles that are supposedly highly unlikely.” So I assume that this colleague, likely the complainant, will also be an academic.

          There are at least three reasons why this scientist (the complainant) has -not yet- decided to publish his findings in a peer-reviewed journal.

          (1): he does not feel comfortable to disclose his name. Filing a complaint to UvA means that he does not need to disclose his name; submitting a paper to a peer-reviewed journal means that he must disclose his name.

          (2): filing a complaint to UvA when you are sure that a colleague has violated academic rules, or when you have firm evidence that this colleague has violated the academic rules, is ‘doing the right thing’. The Dutch Code of Conduct states that “permitting and concealing the misconduct of colleagues” must be “punished”. It also states “A researcher or director has a duty of due care with respect to the science as a whole and particularly to the researchers in his immediate circle.” Please be aware that you should first try to convince your colleague that he should change his behaviour (e.g., by doing a replicative study, or whatever), before filing a complaint.

          (3): first see how (the experts of) UvA and LOWI and the accuser react to the contents of the report and/or the wish not to disclose the identity of Jens Förster before the investigation is finished (etc.).

  15. JG (May 8, 2014) wrote: “This is not my field and I never before heard about Förster, but when I googled I found that one of his most spectacular articles is “Why Love Has Wings and Sex Has Not: How Reminders of Love and Sex Influence Creative and Analytic Thinking” ( DOI: 10.1177/0146167209342755 ). To summarize, when one thinks of love one is future-oriented and more creative and when one thinks of sex one focuses more on the present and is more analytical. This paper reports very strong results, and again, control conditions are always very, very close to the average of the sex and love conditions. You don’t have to be a close reader to see that the analysis is sloppy; for example when two treatments with each 20 subjects are compared every time t-tests with 57 degrees of freedom instead of 38 degrees are reported. In study 2 the first regression has creative performance as a dependent and as independent variable (love=1, sex=-1), which means that they do exactly the same t-test as reported a few lines above, but they report other t-values. And it may be my dirty mind, but how likely is it that no student who thinks about love imagines sex and the other way round (footnote 3)? Of course, my casual reading of this article cannot in any way be a proof of QRP or worse, but I think there is much reason for the University of Amsterdam to take a closer look. Not looking suggests that they are afraid of what they may find.”

    I found an early online version and I have done some casual reading of this paper. My gut feeling tells me that nothing is wrong with this paper; please correct me if I am wrong. The contents of the paper and the way the experiments are described are totally different from the 2012 paper by Jens Förster and Markus Denzler.

    1. “Received May 6, 2008, Revision accepted April 20, 2009, OnlineFirst, published on September 17, 2009.”
    2. Authors and affiliations: “Jens Förster, University of Amsterdam and Jacobs University Bremen; Kai Epstude, University of Groningen; Amina Özelsel, Jacobs University Bremen”.
    3. “We conducted two studies.” So not 12.
    4. “Study 1. Sixty students were recruited (31 women, 29 men; average age = 21.30 years) to participate in a battery of unrelated experiments that lasted approximately 2 hr and for which they received 20 euros. Note that each cell had 20 participants with a balanced gender distribution.” I conclude that recruitment of the participants was aimed at getting equal numbers of males and females. Nothing wrong with this kind of recruitment; please correct me if I am wrong.
    5. “Study 1. Two independent raters evaluated the notes participants had taken when imagining the proposed scenarios with respect to …..”.
    6. “Study 2. Sixty students were recruited (30 women, 30 men; average age = 23.40 years) under the same conditions as in Study 1.”
    7. “One may wonder whether these high percentages are due to our young college student sample. However, in a different experiment (Förster, Özelsel, & Epstude, 2009), in which we used an elderly sample from the same region,….”
    8. “This research was supported by a grant from the Deutsche Forschungsgemeinschaft (FO-244/6-3) to Jens Förster. We thank Aga Bojarska, Alexandra Vulpe, Anna Rebecca Sukkau, Basia Pietrawska, Elena Tsankova, Gosia Skorek, Hana Fleissigova, Inga Schulte-Bahrenberg, Kira Grabner, Konstantin Mihov, Laura Dannenberg, Maria Kordonowska, Nika Yugay, Regina Bode, Rodica Damian, Rytis Vitkauskas, and Sarah Horn who served as experimenters. Special thanks go to Nira Liberman, Markus Denzler, and Stefanie Kuschel for invaluable discussions, and to Gregory Webster for his helpful comments on an earlier version of this article. We thank Regina Bode and Sarah Horn for editing the manuscript.”

    So another paper of Jens Förster, but with a long list with names of people who have helped in conducting these 2 (not 12) experiments (etc.), with detailed information about the participants (as well as very accurate information on design of the experiments, etc. etc.).

    Regina Bode (May 6, 2014) on Retraction Watch: “I worked for Jens Förster in the summer of 2003 and from January 2004 to August 2006 in Bremen. I pursued my bachelor’s degree in Bremen from 2002 to 2005, majoring in biochemistry and cell biology. During that time I started to work in Jens’ lab, which in turn got me interested in psychology. I think I took my first class with Jens in the fall of 2003 (my transcript from International University Bremen/Jacobs University does not state the exact date). Several other classes followed. In fact, after I got my bachelor’s degree, I stayed in Bremen for another year to study psychology and to prepare for a master’s degree in this field. (..). Concerning the data collection: I might have been wrong that I was not involved in the data collection for the papers. When I wrote my first comment, I assumed that I had not been involved. But since it looks like the data was collected in Bremen, this might not be the case (i.e., I might have been involved without realizing it). However, I have no way to track this. As I have stated previously, I did rate creativity tasks while working for Jens, but I do not know if I was involved in the ratings for the paper(s) in question.”

    No red flags. Disclaimer: I am not a statistician and I have not checked any of the results. Please tell me if my gut feelings are wrong.

    I’m convinced that UvA has handled this case very well: a quick publication of an anonymized version of the final decision (and not a very short summary), a short declaration that this case is closed (without telling that Jens Förster will soon leave UvA) and a statement that anyone can file a new complaint to UvA about any other paper with a UvA affiliation and/or about any other researcher with a UvA affiliation.

    UvA will be aware that The Netherlands has some very keen science journalists. So there is no need to bother about a UvA press release about a member of staff who will soon leave UvA. The journalists know what’s going on and they will immediately run a story when the case concerns a high-profile professor at one of the Dutch universities.

  16. JF on 11 May 2014: “The series of experiments were run 1999–2008 in Germany, most of them Bremen, at Jacobs University.” JF in the LOWI report: “The position of the Accused. Many experiments reported in the … articles were conducted in … and not exclusively in ….”

    JF on 11 May 2014: “I still hesitate to share certain insights with the public. It is hard for me to decide how far I can go to reveal certain reviews or results.” Please disclose the dots in the above sentence in the LOWI-report.

    JF on 11 May 2014: “Research assistants and more people would conduct experimental batteries for me. They entered the data when it was paper and pencil questionnaire data. They would organize computer data into workable summary files (one line per subject, one column per variable).”

    OK, so maybe it is no big problem to throw away a huge pile of (old) paper questionnaires when all information has been entered into a database (and after a good check that everything on paper has been digitized properly).
    So JF moved to UvA with a huge pile of paper questionnaires. Can anyone attest to the existence of these piles of paper questionnaires with German-language text in the office at UvA?

    And where are these files with the testing results of in total at least 2242 undergraduates? In Bremen? At UvA? Lost in cyber? On floppy disks? Lost?

    JF on 11 May 2014: “Indeed, the SPSS files on the creativity experiments for 2012 paper include the 390 German answers.” The 2012 paper reports 690 participants. So how about the answers of the 300 other participants? Lost? Gone?

    Anyone any idea if any of the test results of any of the 2242 undergraduates listed in the 3 papers has also been published in any of the other papers of Jens Förster (and / or his co-workers)?

    Please be aware that Jens Förster got punished by LOWI because he had thrown away raw data (“Deze klacht is gegrond”, i.e. “This complaint is well-founded”). This is the language of lawyers, but ‘gegrond’ is a keyword for any lawyer.

    Please also be aware that the chair of LOWI (prof. dr. mr. C.J.M. Schuyt) is both a sociologist and a lawyer (‘jurist’) and that he recently resigned as a member of the Dutch Council of State ( http://www.raadvanstate.nl/the-council-of-state.html ). On top of that, prof. Schuyt was chair of a Committee of the Royal Dutch Academy of Arts and Sciences which published in September 2012 a report with the title “Responsible research data management and the prevention of scientific misconduct”. The report can be downloaded for free.

    See https://www.knaw.nl/en/news/publications/responsible-research-data-management-and-the-prevention-of-scientific-misconduct?set_language=en (English version) and

    https://www.knaw.nl/nl/actueel/publicaties/zorgvuldig-en-integer-omgaan-met-wetenschappelijke-onderzoeksgegevens (Dutch version).

    Highly recommended reading, in particular for any scientist who is of the opinion that throwing away raw data is ‘good scientific practice’.

  17. Foerster’s latest reply to the accusations:


    “Coworkers analyzed the data, and reported whether the individual studies seemed overall good enough for publication or not. If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next, which would typically involve brainstorming about what needs to be changed, implementing the changes, preparing the new study and re-running it.”

    OK, this is obviously questionable research practice. Rerunning experiments until a significant finding is found. Perhaps he should have considered that his hypothesis is just false.

    “The complainant wonders about the size of the effects. First let me note that I generally prefer to examine effects that are strong and that can easily be replicated in my lab as well as in other labs. There are many effects in psychology that are interesting but weak (because they can be influenced by many intervening variables, are culturally dependent, etc.) – I personally do not like to study effects that replicate only every now and then. So, I focus on those effects that are naturally stable and thus can be further examined.”

    This is utter nonsense, see, for instance:

    “I did not see the unlikely patterns…”
    How about testing the “likeliness” of his data directly?:

    Quite frankly, this deeply flawed reply leaves me speechless.

    1. I am confused regarding the effect size comment (“utter nonsense”). Does this mean that any strong effect (in Psychology) is a priori based on some fraud or QRP?

      I thought that the controlled (or artificial) laboratory setting might as well contribute to the effect size.

      Am I wrong?

    2. “OK, this is obviously questionnable research practice. Rerunning experiments until a significant finding is found. Perhaps he should have considered that his hypothesis is just false.”
      This statement is misleading. Often, a phenomenon only occurs under very precise conditions. This is true in social psychology and is true in the natural sciences. Changing the conditions until the parameters under which the phenomenon occurs are found is not QRP, it’s science at its best. What is not acceptable is making claims that the phenomenon occurs under a more general set of conditions. This is what social psychologists often do, when they write a splashy paper. They imply the results are general and robust, whereas in reality they can only be found under a very specific set of conditions (which may make the results much less interesting).

      1. Hi Helen and psychometrician. Please see our earlier discussion on exactly this point, here (I think!):


        I’m afraid I agree with psychometrician. I still think there is a fundamental difference between optimising an experiment so that the methodology is working well before you collect data and ‘optimising’ an experiment to confirm the hypothesis you want if you don’t like the initial data. Forster wrote that they played with the methods “if the data did not confirm the hypothesis.”

        Perhaps it’s all in the translation and we’re over-calling this but it does seem dubious to those of us in lab science.

        1. There is nothing wrong with finding the conditions under which a phenomenon occurs. But suppose there is no phenomenon at all. Then repeating the experiment with varying conditions until you get a statistically significant result (on the average, it will take you 20 attempts before you succeed), and then publishing the results “as if” you proposed this experiment out of the blue, and did it, with success, is stupid.
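          The “20 attempts” figure follows from a geometric distribution, and can be sketched as below. This assumes a significance level of 0.05, independent attempts and no true effect at all; the function names are made up for the illustration.

```python
def expected_attempts(alpha: float) -> float:
    """Mean number of independent null experiments needed until the first
    'significant' one (mean of a geometric distribution)."""
    return 1.0 / alpha


def prob_at_least_one_hit(alpha: float, n_attempts: int) -> float:
    """Chance of at least one false positive within n independent attempts."""
    return 1.0 - (1.0 - alpha) ** n_attempts


print(expected_attempts(0.05))
print(prob_at_least_one_hit(0.05, 20))
```

          Note that 20 is only the average: under the same assumptions there is still roughly a one-in-three chance of having no significant result after 20 tries.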

          What you must do is repeat with varying conditions till you think you have got it right. Then publish your proposal to do an experiment. Then repeat exactly what you just did and publish the results, whether significant or not.

          1. PS the same holds for replications. If 20 people replicate the experiment and one succeeds and that one, and that one only publishes, we have learnt nothing. Förster keeps saying that his findings have since been replicated by many other researchers. However if they use his methodology then of course the successful replications replicate his findings, the unsuccessful replications are either not submitted for publication at all, or, if submitted, they are rejected by editors and reviewers, because clearly the new experimenters were just not skilled enough.

          2. Upload all these replicative studies to http://biorxiv.org/

            “By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals.”

          3. “Then publish your proposal to do an experiment. Then repeat exactly what you just did and publish the results, whether significant or not.”

            IFF YOU GET IT PUBLISHED! Preregistration does not solve this problem. That’s where inferential statistics meets its limits. Face it, science is about communication. P-levels, power and effect size are rhetorical devices to convince an audience about the robustness of a phenomenon. But who really believes that alpha quantifies the probability of erroneously rejecting the null hypothesis? I call it the Fisher illusion.

      2. I agree, very often (at least in other disciplines) the experimental setting is just not appropriate, e.g., because the manipulation is not strong enough etc.

        @psychometrician: Is it just your feeling that this is all QRP? Based on some prior assumptions?

        I totally understand that running the same experiment several times (and then reporting only the instances when it was successful) is utterly wrong.
        However, learning something from “mistakes” in the set-up and the measurement is in many disciplines a gold standard.

        Would you accuse everyone who learns from empirical mistakes and changes something based on this learning process that he/she committed QRP?

        1. Henk – again this might just be the way he phrased it. But let’s be clear: he didn’t say that they changed things “if we realised the methods were wrong” (which clearly would be fine) but they changed things “if the data did not confirm the hypothesis”.

          We have to assume that if the data *did* confirm the hypothesis then no further inquisition of the methods was undertaken. It *seems* to imply (to some of us) that the only acceptable answer was confirmation of the hypothesis…

        2. As David (see below) already pointed out, the following sentence is an indication of QRP: “If the data did not confirm the hypothesis, I talked to people in the lab about what needs to be done next”. A kind of confirmation bias, I would say, indicating bad scientific habits. But this is rather typical among psychologists because falsification seems no longer to be a virtue.

    3. What leaves me speechless (almost) is: calling a pretty civilized and intelligent public discussion on “Retraction watch” an “internet chatroom”.

  18. I think that there is a confusion about the effect size. The size of the effect is large compared to the within group standard deviation. Of course it is. The effect is statistically significant. Otherwise it wouldn’t have been published. Some people have said that the measurement is essentially an IQ test, and an effect size equal to one standard deviation is the same as 15 points difference on an IQ test, which is hardly likely to be influenced by whether your breakfast cereal was one brand or a mixture of several brands. However we are talking about a very homogenous group of subjects (they are all psychology students…). I suppose the distribution of IQ within psychology students has a smaller standard deviation than the distribution of IQ in the whole population. So the size of the effect in psychological terms might be small, after all. Maybe psychology students have very similar IQ and maybe their apparent IQ is very easily affected by their breakfast cereal … this is a very special sub-population of the whole population. Actually probably it’s extra-sensory perception on the part of the psychology students.

  19. @Henk: What I wanted to say is that it is flawed statistical thinking to assume that large effect size estimates are more replicable than smaller ones. That kind of flawed thinking is explained in the related link I provided. See also Geoff Cumming’s paper entitled “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better”. Social scientists typically overestimate the degree of replicability. See http://www.youtube.com/watch?v=5OL1RqHrZQ8 for an illustration. Psychologists typically neglect the sampling variation when estimating population parameters (see http://www.nicebread.de/at-what-sample-size-do-correlations-stabilize/ ). Let us take the following example from http://www.nicebread.de/a-comment-on-we-cannot-afford-to-study-effect-size-in-the-lab-from-the-datacolada-blog/ : “…you can also directly compute the necessary sample size for a desired precision. This is called the AIPE-framework (“accuracy in parameter estimation”) made popular by Ken Kelley, Scott Maxwell, and Joseph Rausch (Kelley & Rausch, 2006; Kelley & Maxwell, 2003; Maxwell, Kelley, & Rausch, 2008). The necessary functions are implemented in the MBESS package for R. If you want a CI width of .10 around an expected ES of 0.5, you need 3170 participants”
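    The order of magnitude of that sample size can be checked with a back-of-the-envelope sketch. This is not the MBESS routine: it assumes two equal-sized groups, a 95% interval, and the usual large-sample normal approximation Var(d) ≈ 2/n + d²/(4n) for a standardized mean difference with n participants per group. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_for_ci_width(d: float, width: float, conf: float = 0.95) -> int:
    """Approximate per-group n so that the CI around Cohen's d has the given
    full width. Large-sample normal approximation, two equal groups (assumed)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    # The CI half-width is z * sqrt(Var(d)); solve for the variance we can afford.
    var_needed = (width / (2.0 * z)) ** 2
    # Var(d) ~= (2 + d^2 / 4) / n for two groups of size n each.
    return math.ceil((2.0 + d * d / 4.0) / var_needed)


print(n_per_group_for_ci_width(0.5, 0.10))
```

    Under these assumptions the answer lands in the neighbourhood of the quoted figure when that figure is read as a per-group size; the quote itself does not say whether it is per group or in total.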

    Moreover, sample sizes in the social sciences tend to be too small, resulting in underpowered studies (see http://www.ncbi.nlm.nih.gov/pubmed/15137886): median estimated power in the social sciences to find a medium-sized effect (i.e., d = 0.5 in the population) is around .5 (see Maxwell, 2004, p. 148).

    @Helen: Yes, I agree with you that an experiment can and should be re-run using a modified experimental setting if a kind of technical error (bad operationalisation of the dependent/independent variables etc.) occurred. If this is the kind of modification Foerster applied, I have no issues with that. However, trying out different kinds of operationalisations until a significant finding was found, without any sound justification why one operationalisation is better than the other, is bad science.

    1. Thanks for the clarification. The paper (by Cumming) is indeed very interesting.

      However, why is the corner stone in the complainant report almost exclusively built upon p-values and the logic of significance testing, if it seems to be a new trend in Psychology to not only rely on significance testing?

      In the light of this development in psychology, the use of likelihoods in the report does not seem to be (the methodological/analytical) state of the art in this discipline anymore. No confidence intervals etc.

      Why not?

      1. No confidence intervals because what would you like to put an interval around? We do not have a single theory with one parameter, theta say, such that fraud, QRP’s and “bad luck” correspond to different intervals of values of theta. We have (per study) just one model: three independent random samples. That’s the null hypothesis. The alternatives are: QRPs; and fraud. Some QRP’s are already known and well studied and understood, in some respects at least. Some “natural” kinds of fraud are known but no doubt there are many which we still haven’t thought of.

        The investigating committees have to make a decision. Either (e.g.) conclude fraud, or don’t. For them, it’s a decision problem. We the spectators don’t have to make a decision. Anyone can request the data from Förster, do their own analyses, form their own opinions, and submit their findings for peer review if they think they are worth putting on record.

  20. About the subgroup analysis. I made a comment some time back which still seems to be “awaiting moderation” so I’ll repeat the essentials again.

    At first I found the subgroup analysis, instinctively, a stunning confirmation of fraud, and a brilliant idea by whoever had it. Now I do not find it interesting at all. Here is why.

    Suppose we start all over again, and are first looking at the super-linearity (the observed linearity which would be too good to be true even if the true population means happened to be exactly linear). Suppose we are looking at this because we want to distinguish between pure chance (bad luck), QRPs, and fraud (I neglect the post-hoc issue. If we were careful Bayesians it wouldn’t matter: the order in which things happened changes our priors, not the likelihoods from the different pieces of data).

    Under pure chance, the super-linearity is extraordinarily unlikely
    Under QRP, the super-linearity is more likely
    Under fraud, the super-linearity is even more likely

    Probably everyone agrees so far, except perhaps Förster.

    Now we look at the subgroups.

    Under pure chance and given the overall super-linearity, the subgroup *non* linearity is just what I’d expect
    Under QRP’s and given the overall super-linearity, the subgroup *non* linearity is just what I’d expect
    Under fraud and given the overall super-linearity, the subgroup *non* linearity is just what I’d expect

    … notice, that’s my judgement, three times. You may have one or more different judgements. But at this moment, in my judgement, subgroup *non* linearity is just what I’d expect anyway, given everything else observed so far, whatever the true underlying state of affairs (pure chance, QRPs, or fraud).

    I hope everyone can understand my argument. You can disagree with the judgements I made (the building blocks). But I think the argument stands up. You can easily destroy my conclusion by simply destroying any of the building blocks.
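    In Bayesian terms the argument can be sketched as follows. All the numbers below are made up purely for illustration (the priors, the likelihoods of super-linearity, and the common likelihood of the subgroup finding); only the structure matters: if a new observation is equally likely under every hypothesis, it leaves the posterior untouched.

```python
def posterior(prior: dict, likelihood: dict) -> dict:
    """Bayes' rule over a finite set of hypotheses."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical numbers, chosen only to show the mechanics.
prior = {"chance": 0.90, "QRP": 0.08, "fraud": 0.02}
lik_super_linearity = {"chance": 1e-6, "QRP": 1e-3, "fraud": 1e-1}

after_linearity = posterior(prior, lik_super_linearity)

# Subgroup non-linearity judged equally expected under all three hypotheses.
lik_subgroups = {h: 0.3 for h in prior}
after_subgroups = posterior(after_linearity, lik_subgroups)

print(after_linearity)
print(after_subgroups)
```

    Anyone who judges the subgroup finding to be more likely under one hypothesis than another only has to change `lik_subgroups` to see how far the conclusion moves.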

    1. Richard, I would like to discuss your claim that ‘Under pure chance and given the overall super-linearity, the subgroup *non* linearity is just what I’d expect’.

      Let me start by saying that I have not been able to find the actual expert report that is quoted in the “LOWI Advies 2014, nr. 05” document, so I have to base my conclusions on the quotes and interpretation of the expert report provided there. Based on the response of the complainant on the expert report, I came to the conclusion that the subgroup analyses indicated nonlinearity at the subgroup level. The report is however slightly ambiguous in this respect, since it could also be read as stating that the subgroup analyses simply did not show similar super-linearity rather than pointing at actual indications of nonlinearity. In what follows, I am assuming that the subgroup analyses indicate deviations from linearity at the subgroup level. If this is incorrect, and if what is meant in the report is simply that the p-values of the F-statistic for the subgroup analyses are approximately uniformly distributed over the different studies, my main argument does not hold (although there is still a matter of dependency between results that should be taken into account).

      I would argue that in talking about the results being ‘due to chance’, we should distinguish two ‘pure chance’ explanations. There is a ‘pure chance’ explanation of the super-linearity under the assumption of actual linearity of the group means (captured in the p-value of the original analysis, indicating ). There is another ‘pure chance’ explanation where we abandon the assumption of linearity, and where both the observed linearity and the super-linearity are taken to be due to chance. Such an observation, as mentioned in the original report, is even less likely (and probably much less likely, although without a null distribution it is impossible to quantify this difference). Hence it matters whether we understand the ‘pure chance’ explanation to mean ‘pure chance and true linearity’ or ‘pure chance, no actual linearity’.

      The lack of linearity at the subgroup level casts doubt on the claim that we should expect the overall group means to display linearity, since we would expect this linearity to also show up at the subgroup level. I agree with you to some extent that under ‘pure chance and linearity’ we would not expect this linearity to show up at the subgroup level in the sense of p-values for the F-tests again being too close to one. Note, however, that we have some reason to actually expect this to happen even under pure chance: the overall group always consists of the two subgroups, and hence a chance finding of p-values too close to 1 when we only look at the overall groups would make it plausible that if we split those groups into subgroups, we may still expect to find some excess linearity in the sample means (and correspondingly high p-values) at the subgroup level. However, evaluating the strength of this dependency probably requires running simulations, and it is not my main argument for why the subgroup findings matter.
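A minimal sketch of the kind of simulation mentioned above, under assumptions that are not taken from the report: three conditions with exactly linear true means, two equally sized subgroups per condition, a known standard deviation (so a z-test on the contrast low − 2·mid + high stands in for the F-test), and cell means drawn directly rather than from individual scores. The function name `simulate` and all parameter values are hypothetical. It compares how often a subgroup's linearity p-value exceeds 0.5 across all simulated studies with how often it does so in studies whose pooled p-value is above 0.9.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def contrast(means):
    """Deviation from linearity for three ordered condition means."""
    low, mid, high = means
    return low - 2 * mid + high


def simulate(n_sims=200_000, n_cell=10, sigma=1.0, threshold=0.9, seed=1):
    """Return (unconditional, conditional) share of subgroup p-values > 0.5.

    'Conditional' means: only among simulated studies whose pooled
    linearity p-value exceeds `threshold` (i.e. look 'super-linear').
    """
    rng = random.Random(seed)
    true_means = (0.0, 0.5, 1.0)            # assumed: exactly linear
    se_cell = sigma / math.sqrt(n_cell)     # SD of one subgroup cell mean
    se_sub = se_cell * math.sqrt(6)         # SE of the contrast in a subgroup
    se_all = se_sub / math.sqrt(2)          # SE of the contrast when pooled

    hits_uncond = hits_cond = n_selected = 0
    for _ in range(n_sims):
        sub_a = [rng.gauss(mu, se_cell) for mu in true_means]
        sub_b = [rng.gauss(mu, se_cell) for mu in true_means]
        pooled = [(a + b) / 2 for a, b in zip(sub_a, sub_b)]

        p_all = two_sided_p(contrast(pooled) / se_all)
        p_sub = two_sided_p(contrast(sub_a) / se_sub)

        hits_uncond += p_sub > 0.5
        if p_all > threshold:
            n_selected += 1
            hits_cond += p_sub > 0.5

    return hits_uncond / n_sims, hits_cond / n_selected


if __name__ == "__main__":
    uncond, cond = simulate()
    print(f"share of subgroup p > 0.5, all studies:          {uncond:.3f}")
    print(f"share of subgroup p > 0.5, super-linear studies: {cond:.3f}")
```

In this toy setting the conditional share should come out noticeably above one half, which is the dependency described above. It says nothing about the real data, where sample sizes, variances and the subgroup split differ.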

      Even if we ignore the potential dependency between the p-values found for the subgroup and total group analyses and if we assume that under the ‘pure chance and true linearity’ explanation we would expect the p-values of the F-tests at the subgroup level to be uniformly distributed, it matters whether we find evidence of the absence of linearity at the subgroup level. If over these different studies, linearity seems to consistently be violated at the subgroup level, this casts doubt on the ‘true linearity’ assumption that has ‘generously’ been made throughout the different analyses (even though we may have strong a priori reasons based on substantive knowledge to question this assumption). This means that it becomes even harder to maintain the ‘pure chance given actual linearity’ explanation, and that people adhering to the ‘pure chance’ defense may be forced to switch to the ‘pure chance without actual linearity’ explanation, which would result in the conclusion that the observed super-linearity is even less likely to occur due to chance.

      Since the subgroup analyses seem to provide us with further reasons for questioning the ‘actual linearity’ assumption, and since it makes a difference for the assessment of the likelihood of the observed superlinearity being due to chance whether we can assume linearity of the means, I would argue that these subgroup results make the ‘pure chance’ explanation even less likely than it already was, although the extent to which it has become less likely will be difficult to quantify. If we assume that observing a mismatch between the p-values for the F-test at the overall group and the subgroup level does match the fraud hypothesis (and we have already seen explanations as to how this could happen under the assumption of fraud), we have to conclude that this new subgroup analysis provides additional support for the fraud explanation over the ‘pure chance and true linearity’ explanation. The subgroup analysis provides less strong support (but due to the expected dependency possibly still some support) for the fraud explanation over the ‘pure chance and no true linearity’ explanation.

      How this new evidence relates to the explanation of QRPs is less straightforward given the diversity of possible QRPs, but given that Förster repeatedly does not avail himself of this explanation, I do not think that this is the most fruitful explanation to pursue. It also seems to me that the extent to which QRPs would need to be applied to obtain this kind of outcome would end up being indistinguishable from fraud, but that may be a matter of definition.

      What are your thoughts on this? And do you share my reading of the findings of the expert with regard to the subgroup analyses (or can it even be found somewhere)?

  21. Jesper, I think my answer to your response to my response would be even longer still, so maybe we should discuss this directly, e.g. by email. Moreover, one really needs to do some experiments and calculations, and for that one needs the data. I wrote to Förster asking if I could have it. He told me that he was happy for me to have it. He does have a list of conditions concerning its use.

    1. You are probably right, my post ended up being longer than planned as well. I am however definitely interested in your thoughts on the matter, so I’ll contact you by email.

      It is good news that Förster is willing to have people look into the data, even if it is with a list of conditions attached to it. The report remains relatively vague with regard to the subsequent results provided by the expert, and it would be great to get some more insights into the actual patterns in and composition of the data.

  22. After reading the “verdict”, I am left with the clear conclusion that some manipulation must have taken place on the means of the final complete data sets. But, to be fair, the following is not addressed:
    1. What could have been manipulated which way to achieve the result?
    2. Could those manipulations have arisen unintentionally? (I assume that in psychology a lot of scaling and normalizing is done; some wrongful procedure might have had unintended consequences…)
    And what struck me most is the flaw in the subsetting rationale. Imagine the data were unmanipulated and showed the “same” trend for the whole samples. If you then subdivide the samples into any category, for stochastic reasons, those categories will differ in each sample, and the differences will differ between samples. So, clearly, it will look like the differences are cancelling each other out in order to achieve the intended result, but that is of course a mirage, because the result is given a priori, i.e., if it were otherwise you would have a different result for the unsubdivided samples.
    That is to say: the subsetting finding does NOT add anything to the argument, i.e., that the variance of the “linearity” is too low.
    Looks like the committee and the statistical expert have become entangled in a little pitfall here – that is not to say that this is a point against the manipulation suspicion, only that several of the commenters might better avoid that one…
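The “cancelling out” point can be written down in a few lines. This is only a sketch under an assumption not stated in the comment: each condition c (low, medium, high) is split into subgroups A and B with the same share w in every condition. The symbols w, d, d_A and d_B are introduced here for illustration.

```latex
\begin{align*}
\bar{y}_c &= w\,\bar{y}_{c,A} + (1-w)\,\bar{y}_{c,B}, \qquad c \in \{1,2,3\} \\
d &= \bar{y}_1 - 2\bar{y}_2 + \bar{y}_3 \\
  &= w\,\underbrace{(\bar{y}_{1,A} - 2\bar{y}_{2,A} + \bar{y}_{3,A})}_{d_A}
   + (1-w)\,\underbrace{(\bar{y}_{1,B} - 2\bar{y}_{2,B} + \bar{y}_{3,B})}_{d_B} \\
d \approx 0 \;&\Longrightarrow\; d_B \approx -\frac{w}{1-w}\, d_A
\end{align*}
```

So once the pooled deviation from linearity d is close to zero, the subgroup deviations are forced to have opposite signs, whatever produced the pooled result. Whether the sizes of d_A and d_B are themselves unusual is a separate question that this identity does not answer.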
