The Replication Crisis: Causes, Impact, and Solutions

Science's credibility rests on a deceptively simple promise: if a finding is real, someone else should be able to produce it again. The replication crisis is what happens when that promise breaks down at scale — not in fringe journals or fringe fields, but at the center of psychology, medicine, economics, and cancer biology. This page explains what the crisis is, how structural incentives produced it, where it hits hardest, and what the scientific community is doing to rebuild the foundations.

Definition and scope

In 2015, the Open Science Collaboration published results from its Reproducibility Project: Psychology, in which 270 researchers attempted to replicate 100 published psychological studies. Only 36 of those 100 studies produced results that held up (Science, Vol. 349, Issue 6251). That single figure — 36 out of 100 — became a kind of shorthand for a problem that had been simmering for decades.

The replication crisis refers to the widespread finding that a substantial fraction of published scientific results cannot be reproduced by independent researchers following the same methods. It is distinct from outright fraud. Most studies that fail to replicate were not faked; they were conducted by researchers doing what the system rewarded, using methods the field considered standard. That distinction matters enormously, because it means the problem is structural, not merely ethical.

The scope extends well beyond psychology. A 2012 analysis by Amgen scientists found that only 6 of 53 landmark cancer biology studies could be reproduced internally (Begley & Ellis, Nature 483). In economics, the Institute for Replication has documented replication failures across top journals including the American Economic Review. The peer review process, as traditionally practiced, was never designed to catch these failures — reviewers evaluate manuscripts, not underlying datasets or raw analysis code.

How it works

The replication crisis did not appear overnight. It emerged from a set of interlocking incentive structures that, taken together, systematically inflated the literature with unreliable results.

The core mechanism is publication bias: journals historically preferred to publish positive, statistically significant findings over null results or failed replications. A study that finds an effect gets published; a study that finds nothing sits in a file drawer. Over time, the published literature comes to represent a biased sample of all experiments ever run — skewed toward results that look clean and dramatic.
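
To make the file-drawer mechanism concrete, here is a minimal simulation sketch; the effect size, sample size, and acceptance rule are illustrative assumptions, not values taken from any study cited here. Many experiments are run on the same small true effect, only the significant positive results are "published", and the average published effect ends up several times larger than the truth.

```python
# Illustrative simulation of publication bias (all parameters are assumptions).
# Many two-group experiments are run; only positive, significant results are "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 5000, 30
true_effect = 0.15  # small true standardized effect, identical in every experiment

published = []
for _ in range(n_experiments):
    treatment = rng.normal(true_effect, 1, n_per_group)
    control = rng.normal(0, 1, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    observed = treatment.mean() - control.mean()   # effect estimate (sd ~= 1, so roughly Cohen's d)
    if p < 0.05 and observed > 0:                  # the file-drawer filter
        published.append(observed)

print(f"true effect: {true_effect}")
print(f"share of experiments 'published': {len(published) / n_experiments:.1%}")
print(f"mean published effect: {np.mean(published):.2f}")  # several times the true effect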

Compounding that is a practice researchers call p-hacking (also described as "researcher degrees of freedom"). A researcher running an experiment has dozens of small choices to make: which participants to exclude, which covariates to include, when to stop collecting data. Each choice, made after seeing preliminary results, subtly shapes the outcome. Done without preregistration or transparency, this process can push a marginal finding past the conventional p < 0.05 threshold without any conscious intent to deceive.
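
One of those degrees of freedom, the decision of when to stop collecting data, is easy to simulate. In the sketch below (all parameters are illustrative assumptions), there is no true effect at all, yet checking the result after every small batch of participants and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%.

```python
# Illustrative simulation of optional stopping, one form of p-hacking (parameters are assumptions).
# Both groups are drawn from the same distribution, so every "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_simulations, batch_size, max_n = 2000, 10, 100

false_positives = 0
for _ in range(n_simulations):
    treatment, control = [], []
    while len(treatment) < max_n:
        treatment.extend(rng.normal(0, 1, batch_size))  # no true effect in either group
        control.extend(rng.normal(0, 1, batch_size))
        _, p = stats.ttest_ind(treatment, control)
        if p < 0.05:                                    # stop as soon as the result "works"
            false_positives += 1
            break

print("nominal false-positive rate: 5.0%")
print(f"rate with optional stopping: {false_positives / n_simulations:.1%}")  # typically around 15-20%
```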

Then there is underpowered study design. Statistical power measures a study's ability to detect a real effect when one exists. Studies with small sample sizes are underpowered — and underpowered studies that nonetheless achieve statistical significance are, paradoxically, more likely to have overestimated the size of the effect. This is sometimes called the "winner's curse." A finding dramatic enough to get published from a small sample is probably not as dramatic as it looks.
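
The winner's curse follows from the arithmetic of significance testing in small samples, as this sketch illustrates (the true effect size and sample size are assumed values chosen for illustration): only unusually large observed effects clear the significance bar, so the studies that do reach significance systematically overstate the effect.

```python
# Illustrative "winner's curse" simulation; the true effect and sample size are assumptions.
# With n = 20 per group, only large overestimates of a d = 0.2 effect can reach p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n_per_group, n_studies = 0.2, 20, 5000

significant = []
for _ in range(n_studies):
    treatment = rng.normal(true_d, 1, n_per_group)
    control = rng.normal(0, 1, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05 and t > 0:
        significant.append(treatment.mean() - control.mean())

print(f"true effect size: {true_d}")
print(f"power (share of studies reaching significance): {len(significant) / n_studies:.1%}")
print(f"mean estimate among significant studies: {np.mean(significant):.2f}")  # well above 0.2
```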

The relationship between statistical practice and reproducibility is direct: analytical flexibility without preregistration creates a garden of forking paths in which almost any dataset can be made to yield a significant result.
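
A rough calculation conveys how quickly the forking paths multiply. Treating each defensible analysis specification as an independent chance at a false positive is a simplifying assumption (real specifications overlap), but the scale is the point:

```python
# Chance of at least one p < 0.05 result across k analysis paths on pure noise.
# Independence of paths is a simplifying assumption made for illustration.
for k in (1, 5, 10, 20):
    p_any = 1 - 0.95 ** k
    print(f"{k:>2} paths -> {p_any:.0%} chance of at least one 'significant' result")
```

With twenty such choices, the chance of at least one nominally significant result on pure noise is close to two in three.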

Common scenarios

The replication crisis manifests differently depending on the field and study type.

  1. Social priming studies (psychology) — High-profile findings claiming that subtle environmental cues dramatically alter behavior (e.g., exposure to words associated with old age making people walk more slowly) failed repeated replication attempts. The original studies used small samples, often fewer than 40 participants.

  2. Preclinical cancer research — The Amgen analysis cited above is not an isolated finding. A similar effort by Bayer HealthCare found that only approximately 25% of published preclinical studies could be confirmed internally (Prinz et al., Nature Reviews Drug Discovery, 2011). Drug candidates built on unreliable preclinical findings fail in clinical trials at enormous cost.

  3. Candidate-gene studies (psychiatry) — Decades of research claimed associations between specific genes and conditions like depression or schizophrenia. Large-scale genome-wide association studies (GWAS) using sample sizes in the tens of thousands have failed to confirm most of these earlier associations, which were typically based on samples of fewer than 500 individuals; a power sketch after this list shows how large that gap is.

  4. Ego depletion (behavioral economics/psychology) — The theory that willpower is a finite resource that depletes with use generated over 200 published studies. A preregistered multi-lab replication involving 23 laboratories found no evidence for the core effect (Hagger et al., Perspectives on Psychological Science, 2016).
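
The candidate-gene example comes down to statistical power. The sketch below uses the Fisher z-transform normal approximation and an assumed, illustrative per-variant effect of r = 0.05 (small, in the range modern GWAS results suggest for single common variants) to compare sample sizes:

```python
# Approximate power to detect a small correlation (illustrative r = 0.05, an assumed value)
# at alpha = 0.05, using the Fisher z-transform normal approximation.
import numpy as np
from scipy.stats import norm

def power_for_correlation(r, n, alpha=0.05):
    """Approximate two-sided power to detect correlation r with n observations."""
    z_effect = np.arctanh(r) * np.sqrt(n - 3)   # Fisher z, scaled by 1 / standard error
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

for n in (400, 5000, 40000):
    print(f"n = {n:>6}: power = {power_for_correlation(0.05, n):.2f}")
```

Under these assumptions, a sample of a few hundred has well under 20% power, while samples in the tens of thousands detect the effect almost every time; the earlier positive findings came from exactly the underpowered, significance-filtered regime described above.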

Decision boundaries

Not all failures to replicate mean the same thing. The scientific community increasingly distinguishes between three categories: false positives, where the original finding reflected noise, analytical flexibility, or bias, and there is no underlying effect to recover; hidden moderators, where the effect is real but depends on conditions the replication did not recreate, such as the population, the context, or subtle details of the protocol; and flawed replications, where the replication attempt itself was underpowered or departed from the original method in ways that matter. Which category a failure falls into determines the appropriate response: correcting the record, refining the theory, or running a better-designed replication.

A similar set of distinctions applies to proposed solutions, each of which targets a different part of the problem. Preprints and open-access publishing accelerate the dissemination of replication attempts. Preregistration, which means registering a study's hypotheses and analysis plan before data collection, directly limits p-hacking. Open data mandates allow independent analysts to check raw files. Larger, multi-site collaborations through networks like the Many Labs project increase statistical power and cross-cultural generalizability.
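
A rough power calculation shows why single labs rarely reach adequate power on their own. The sketch below uses the standard normal-approximation formula for a two-group comparison (the effect sizes are illustrative assumptions) to estimate the sample size needed for 80% power:

```python
# Approximate per-group sample size for a two-group comparison at 80% power, alpha = 0.05.
# Normal approximation; the effect sizes below are illustrative assumptions.
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n needed to detect standardized effect d."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.8, 0.5, 0.2):
    print(f"effect size d = {d}: ~{n_per_group(d)} participants per group")
```

At the small effects typical of contested findings, a single site would need hundreds of participants per condition; pooling data across many laboratories is the practical way to reach that scale.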

None of these are panaceas. Preregistration prevents fishing expeditions but cannot substitute for good theory. Open data is valuable but introduces privacy complications in clinical research governed by Institutional Review Boards. The research ethics and integrity frameworks that institutions apply are necessary but operate downstream of the incentive structures that created the problem in the first place.

The broader question — how science self-corrects and at what speed — connects directly to how findings travel from journals to policy and public understanding. That chain is worth examining carefully, especially for anyone using scientific findings as a foundation for decisions. A good starting point is the National Science Authority home page, which maps the broader landscape of how scientific research is produced, evaluated, and communicated.

References