Ioannidis (2005)
“Why most published research findings are false” — PLoS Medicine.
- Treats research findings as a screening test. Positive predictive value (PPV) depends on:
  - Prior odds R of a true relationship in the field
  - Power 1 − β of the study
  - Significance threshold α (Type I error rate)
  - Bias u (flexibility, selective reporting)
- PPV = ((1 − β)R + uβR) / (R + α − βR + u − uα + uβR)
- With low prior odds (exploratory hypotheses) and low power (small N), most “significant” findings are false.
- Bias can be modelled and dominates as it grows; small studies with extreme flexibility have PPV near zero.
- Foundational reference for the field-level pessimism that the OSC project tests empirically.
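To make the PPV formula concrete, here is a minimal sketch that evaluates it for two illustrative scenarios. The function name `ppv` and the parameter values are assumptions chosen for illustration, not part of these notes.

```python
def ppv(R, power, alpha=0.05, u=0.0):
    """Positive predictive value of a 'significant' finding.

    R     : prior odds that a tested relationship is true
    power : 1 - beta
    alpha : Type I error rate
    u     : bias, the share of would-be null results reported as positive
    """
    beta = 1.0 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# Illustrative (hypothetical) scenarios.
confirmatory = ppv(R=1.0, power=0.80, u=0.10)  # well powered, 1:1 odds, little bias
exploratory = ppv(R=0.2, power=0.20, u=0.20)   # underpowered, 1:5 odds, more bias
no_bias = ppv(R=1.0, power=0.80)               # same as confirmatory with u = 0

print(f"{confirmatory:.2f}")  # 0.85
print(f"{exploratory:.2f}")   # 0.23
print(f"{no_bias:.2f}")       # 0.94
```

Setting u = 0 recovers the bias-free form (1 − β)R / (R + α − βR), so the drop from 0.94 to 0.85 in the first scenario is attributable to bias alone.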
Bem (2011) — precognition in JPSP
“Feeling the future” — nine experiments, 1,000+ participants, claimed evidence that future events influence past responses.
- Used standard JPSP-acceptable methods: that is the point.
- Wagenmakers, Wetzels, Borsboom & van der Maas (2011) Bayesian re-analysis: same data, but the Bayes factors give weak to no support for precognition.
- Galak, LeBoeuf, Nelson & Simmons (2012) ran direct replications across seven studies, N > 3,000: no effect.
- Ritchie, Wiseman & French (2012): three further failed replications, initially rejected by JPSP.
- Crystallised the question: if standard methods can deliver “evidence” for precognition, what else are they delivering?
Simmons, Nelson & Simonsohn (2011)
“False-positive psychology” — Psychological Science.
- Quantified researcher degrees of freedom: optional stopping, selective reporting of conditions/measures, covariate inclusion, transformations.
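- Simulations showed the Type-I error rate can reach ~60% when four such choices are combined and left undisclosed; a companion experiment used the same flexibility to reach a “significant” but impossible result.
- Named “researcher degrees of freedom” and anticipated the “garden of forking paths” intuition (so named and formalised by Gelman & Loken 2013).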
- Six author requirements + four reviewer guidelines — including: justify N a priori, list all conditions, report all variables, all exclusions.
- Direct intellectual ancestor of pre-registration and Registered Reports.
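The effect of one researcher degree of freedom, optional stopping, can be sketched with a small simulation. This is not the authors' simulation: the look schedule (20, 30, 40, 50 per group), the known-variance z-test, and all names are assumptions made to keep the example dependency-free.

```python
import math
import random


def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming known sd = 1."""
    n = len(a)
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_sims=5000, looks=(20, 30, 40, 50), alpha=0.05, seed=1):
    """Return false-positive rates for (fixed N, optional stopping).

    Both groups are drawn from the same N(0, 1) distribution, so every
    'significant' result is a false positive.
    """
    rng = random.Random(seed)
    n_max = looks[-1]
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n_max)]
        b = [rng.gauss(0, 1) for _ in range(n_max)]
        # Fixed design: test once, at the planned final sample size.
        if z_test_p(a, b) < alpha:
            fixed_hits += 1
        # Optional stopping: test at every look, stop at the first p < alpha.
        if any(z_test_p(a[:n], b[:n]) < alpha for n in looks):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"fixed N: {fixed_rate:.3f}, optional stopping: {peeking_rate:.3f}")
```

The fixed-N rate should sit near the nominal 5%, while the optional-stopping rate is higher because each extra look is another chance to cross the threshold. Stacking further undisclosed choices (extra measures, covariates, dropped conditions) compounds the inflation.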
Stapel (2011) — fraud
Diederik Stapel, social psychologist at Tilburg, suspended after PhD students raised the alarm.
- ~58 retractions by 2015, including a high-profile Science paper (on disordered environments promoting stereotyping).
- Levelt, Noort & Drenth Commission (2012) report: years of outright data fabrication, undetected by co-authors, reviewers, editors.
- The point is not that fraud causes the crisis; fraud is rare. The point is that Stapel's fabricated data were indistinguishable from the products of normal practice until students checked. That diagnosis was the wake-up call.
- Catalysed Dutch and broader policy on data archiving and verification.
Doyen et al. (2012) — elderly priming
Direct replication of Bargh, Chen & Burrows (1996), which claimed that priming “elderly” stereotypes slowed participants’ walking speed as they left the lab.
- Doyen et al. ran two experiments (Brussels) with infrared timing instead of a stopwatch.
- No effect of prime on walking speed when experimenters were blind to condition.
- Effect appeared when experimenters knew the condition — consistent with experimenter expectancy, not unconscious priming.
- John Bargh’s combative blog response amplified the controversy and drew the wider community in.
- Part of a broader wave of failures for social priming effects (money priming, professor priming, etc.).
Many Labs 1 — Klein et al. (2014)
First large coordinated replication: 36 sites, 6,344 participants, 13 classic effects.
- Each site ran the same standardised protocol on its local sample.
- 10 of 13 effects replicated in the expected direction; some very robustly (anchoring), others not at all (currency priming, flag priming).
- Heterogeneity across sites was small for most replicated effects: the variation people feared (culture, language) was less of a story than expected.
- Demonstrated feasibility of coordinated, pre-registered, multi-site replication; ran in parallel with the OSC Reproducibility Project and set the template for Many Labs 2/3/5.
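Cross-site heterogeneity of the kind discussed above is commonly summarised with Cochran's Q and the I² statistic. The sketch below shows the arithmetic on made-up site-level effect sizes; the numbers and the function name `heterogeneity` are hypothetical and are not Many Labs data.

```python
def heterogeneity(effects, variances):
    """Fixed-effect pooled estimate, Cochran's Q, and I^2 across sites.

    effects   : per-site effect size estimates
    variances : per-site sampling variances of those estimates
    """
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation beyond what sampling error predicts.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical sites whose estimates agree closely: Q falls below its
# degrees of freedom, so I^2 is truncated at zero.
similar = heterogeneity([0.30, 0.25, 0.35, 0.28, 0.32], [0.01] * 5)

# Hypothetical sites that disagree sharply: most variation is between sites.
divergent = heterogeneity([0.0, 0.6], [0.01, 0.01])

print(similar)
print(divergent)
```

A low I² means site-to-site differences are about what sampling error alone would produce, which is the pattern these notes describe for most of the replicated effects.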