The Replication Crisis: Causes, Scope, and Reform Efforts

The replication crisis refers to a widespread, documented failure in scientific practice: independent researchers attempting to reproduce published findings frequently cannot. It spans psychology, medicine, economics, and biology — fields that collectively inform public health decisions, government policy, and clinical care. This page examines how the crisis is defined, what structural forces drive it, where the boundaries of the problem lie, and what the scientific community has done to address it.


Definition and scope

In 2015, the Open Science Collaboration published a landmark study in Science that attempted to replicate 100 psychology experiments drawn from three leading journals. Only 36% of the replications produced statistically significant results in the same direction as the originals (Open Science Collaboration, Science, 2015). The scientific community had been quietly aware of the problem for years, but that number — printed plainly in one of the most cited journals in the world — made it impossible to dismiss.

Replication, in its strictest definition, means that an independent research team, following the original methodology as closely as possible, obtains results consistent with the original study's conclusions. The replication crisis is what happens when this process fails at scale, systematically, across disciplines. It is distinct from ordinary scientific revision — the gradual updating of knowledge as better data arrives — because it implicates not future refinements but whether the original findings were reliable in the first place.

The scope extends well beyond psychology. A 2016 survey of 1,576 researchers conducted by Nature found that more than 70% of respondents had tried and failed to reproduce another scientist's experiments (Baker, Nature, 2016). Biomedical research has been particularly scrutinized: a 2011 analysis by Bayer HealthCare found that only about 25% of published preclinical studies could be reproduced internally (Prinz et al., Nature Reviews Drug Discovery, 2011). The Reproducibility Project: Cancer Biology, launched in 2013, set out to replicate experiments from dozens of high-profile cancer papers; in the 50 experiments it ultimately completed, replication effect sizes were, on average, substantially smaller than those originally reported.

These numbers, taken together, are not a rounding error. They represent a structural feature of how science has been conducted and published — and understanding that structure is the starting point for any honest conversation about scientific research.


Core mechanics or structure

At the mechanical level, the replication crisis operates through a chain of decisions — many individually defensible, collectively corrosive. A researcher collects data, analyzes it, finds a statistically significant result (p < 0.05 by convention), and submits it for publication. A journal, incentivized to publish novel and positive findings, accepts it. No one has checked whether a different lab, running the same protocol, gets the same answer.

The p-value threshold is particularly central. The convention that p < 0.05 signals a "real" finding was never intended to function as a binary pass/fail gate for truth. Statistician Ronald Fisher, who formalized the p-value in the 1920s, intended it as one piece of evidence within a broader inferential context — not as a certification stamp. Yet it became exactly that, and a cottage industry of practices evolved around meeting it.
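What the threshold does guarantee is a fixed false-positive rate when there is no effect at all. The short simulation below is an illustrative sketch, not drawn from any cited study: it assumes NumPy and SciPy and uses arbitrary parameters, drawing two groups from the same distribution many times and counting how often a t-test nonetheless crosses p < 0.05.

```python
# Illustrative sketch (assumes NumPy and SciPy): the 0.05 threshold fixes the
# false-positive rate under the null hypothesis; it does not certify that a
# "significant" finding is real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies = 10_000      # simulated single-test studies
n_per_group = 30
false_positives = 0

for _ in range(n_studies):
    # Both groups come from the same distribution, so the true effect is exactly zero.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' results with no true effect: {false_positives / n_studies:.1%}")
# Prints roughly 5%: about one in twenty null comparisons clears the bar by chance.
```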

Those practices have names: p-hacking (testing multiple hypotheses or analysis variations until one crosses the threshold), HARKing (Hypothesizing After the Results are Known: presenting a post-hoc explanation as if it had been predicted in advance), and selective reporting (publishing only the experiment that worked while file-drawering the four that didn't). None of these practices requires deliberate fraud. They can emerge from honest researchers under pressure, making small decisions that each seem reasonable in isolation.
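The cumulative effect of that analytic flexibility is easy to quantify. The sketch below is illustrative only; it assumes NumPy and SciPy, and the number of candidate outcomes and the sample sizes are arbitrary choices rather than figures from any cited study. Each simulated "paper" tests twenty outcome measures with no true effects and stops at the first one that crosses the threshold.

```python
# Illustrative sketch of p-hacking: many uncorrected looks at data with no true effect.
# A single pre-specified test is "significant" about 5% of the time; with 20 candidate
# outcomes the chance of at least one hit rises to roughly 1 - 0.95**20, about 64%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_simulated_papers = 5_000
n_candidate_outcomes = 20   # alternative measures, subgroups, or analysis variations
n_per_group = 30
papers_with_a_hit = 0

for _ in range(n_simulated_papers):
    for _ in range(n_candidate_outcomes):
        treatment = rng.normal(size=n_per_group)   # true effect is zero
        control = rng.normal(size=n_per_group)
        _, p_value = stats.ttest_ind(treatment, control)
        if p_value < 0.05:
            papers_with_a_hit += 1
            break                                  # stop at the first "publishable" result

print(f"Simulated papers with at least one significant effect: "
      f"{papers_with_a_hit / n_simulated_papers:.1%}")
# Prints roughly 64%, even though every underlying effect is exactly zero.
```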

The peer review process, which is supposed to catch these problems, occurs after the decisions have been made — reviewers see a finished manuscript, not a lab notebook. They can evaluate the logic of the analysis but rarely have access to the raw data needed to spot undisclosed analytic flexibility.


Causal relationships or drivers

Publication bias is the gravitational field that pulls everything else into a predictable orbit. Journals have historically been far more likely to publish positive results than null results, which means negative findings — often carrying equal scientific information — disappear into file drawers. An analysis by Daniele Fanelli found that the proportion of papers reporting positive results increased by more than 22% between 1990 and 2007 across scientific disciplines (Fanelli, Scientometrics, 2012).
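A short simulation can show how the file drawer inflates what does get published. The sketch below is illustrative; it assumes NumPy and SciPy, and the true effect size and sample sizes are arbitrary choices, not estimates from Fanelli's data. Many small studies of a genuinely small effect are run, but only the "significant" ones reach print.

```python
# Illustrative sketch of publication bias: a real but small effect (d = 0.2) studied
# with small samples, where only studies reaching p < 0.05 are "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d = 0.2
n_per_group = 30
all_estimates, published_estimates = [], []

for _ in range(5_000):
    treatment = rng.normal(loc=true_d, scale=1.0, size=n_per_group)
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    # Observed standardized effect (Cohen's d with a pooled standard deviation).
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    d_hat = (treatment.mean() - control.mean()) / pooled_sd
    _, p_value = stats.ttest_ind(treatment, control)
    all_estimates.append(d_hat)
    if p_value < 0.05:
        published_estimates.append(d_hat)

print(f"True effect:                d = {true_d:.2f}")
print(f"Average across all studies: d = {np.mean(all_estimates):.2f}")
print(f"Average across 'published': d = {np.mean(published_estimates):.2f}")
# The published-only average lands near d = 0.6, roughly three times the true effect,
# because only the lucky overestimates cross the significance threshold.
```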

Incentive structures at the institutional level amplify this. Academic hiring, promotion, and grant funding have historically rewarded publication count and high-impact placements — not methodological rigor or replication attempts. A researcher who spends two years carefully replicating someone else's study, finds the original doesn't hold, and publishes a null result in a mid-tier journal has produced less career capital than a colleague who ran five quick studies, got five positive results through flexible analysis, and published in three top-tier journals. The crisis is, in part, a rational response to irrational incentives.

Underpowered studies compound the problem. Statistical power — the probability that a study will detect a true effect if one exists — depends on sample size. Small samples produce noisy estimates, inflated effect sizes, and results that are unlikely to hold in a larger, pre-registered replication. Many published studies in social psychology and behavioral economics used sample sizes of 20 to 40 participants, providing dramatically insufficient power to detect the small-to-medium effects that most human behavior research involves.
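The power arithmetic behind that claim is straightforward to check. The sketch below assumes the statsmodels library is available; the effect sizes and per-group sample sizes are illustrative choices, not figures from any cited study.

```python
# Illustrative sketch: power of a two-sample t-test (alpha = 0.05, two-sided)
# at the effect sizes and sample sizes discussed above.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

for d in (0.2, 0.5, 0.8):                 # small, medium, large by Cohen's rough benchmarks
    for n_per_group in (20, 40, 100):
        power = power_calc.power(effect_size=d, nobs1=n_per_group,
                                 alpha=0.05, ratio=1.0, alternative="two-sided")
        print(f"d = {d:.1f}, n = {n_per_group:>3} per group -> power = {power:.2f}")

# With 20-40 participants per group, power for a small effect (d = 0.2) is roughly
# 0.09-0.14, and even a medium effect (d = 0.5) reaches only about 0.34-0.60,
# far below the conventional 0.80 target.
```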


Classification boundaries

Not all replication failure is equivalent, and conflating types produces confused diagnosis. Three distinct categories matter:

Direct replication involves running the same study with the same methods, materials, and population. Failure here implicates the original finding most directly.

Conceptual replication tests the same hypothesis using different methods. Failure may mean the original finding was method-dependent — real under narrow conditions, not generalizable.

Research misconduct — fabrication, falsification, or plagiarism — is a separate category governed by federal oversight bodies, including the Office of Research Integrity (ORI) at the U.S. Department of Health and Human Services. The replication crisis, strictly defined, is not primarily a fraud problem. It is a methodological and incentive problem that operates largely within the bounds of accepted — if problematic — practice.

The distinction matters because fraud and structural bias require different remedies. Conflating them misidentifies the problem and leads to solutions aimed at bad actors rather than at the system producing unreliable results from well-intentioned researchers.


Tradeoffs and tensions

Reform efforts have introduced genuine tensions that resist easy resolution.

Pre-registration — publicly logging a study's hypotheses and analysis plan before data collection — eliminates HARKing and constrains p-hacking. The Center for Open Science's OSF platform hosts tens of thousands of pre-registered studies (OSF). But critics note that pre-registration can restrict the kind of exploratory, hypothesis-generating work that has historically produced breakthroughs. Requiring researchers to commit to a specific path through a dataset before looking at it may be appropriate for confirmatory research and poorly suited to discovery science.

Open data mandates increase transparency but create legitimate concerns about participant privacy, particularly in clinical and behavioral research. The institutional review board framework that governs human subjects research was not designed for a world where raw datasets are posted publicly.

Replication studies consume resources that might otherwise fund new discovery. Allocating, say, 20% of a federal grant budget to replication could improve reliability, but it would reduce the volume of new science. The National Institutes of Health (NIH) and the National Science Foundation (NSF) have taken incremental steps toward requiring data sharing, but neither agency mandates large-scale replication funding (NIH Data Sharing Policy).


Common misconceptions

Misconception: A failed replication proves the original finding is false.
Replication failure is evidence against a finding, not proof that it is wrong. Effect sizes vary across populations, contexts, and time. A social psychology finding that holds in U.S. undergraduate samples may not hold in other populations — that is a scope limitation, not necessarily evidence that the original claim was false.
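There is also a purely statistical reason to treat a single failed replication as evidence rather than proof: at common sample sizes, even a perfectly real effect often fails to reach significance in one replication attempt. The sketch below is illustrative; it assumes NumPy and SciPy, and the effect size and sample size are arbitrary choices.

```python
# Illustrative sketch: replications of a genuinely real effect (d = 0.4) at a
# typical sample size still "fail" to reach p < 0.05 much of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
true_d = 0.4
n_per_group = 30
n_replications = 5_000
failed = 0

for _ in range(n_replications):
    treatment = rng.normal(loc=true_d, scale=1.0, size=n_per_group)
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p_value = stats.ttest_ind(treatment, control)
    if p_value >= 0.05:
        failed += 1

print(f"Replications that fail despite a real effect: {failed / n_replications:.1%}")
# Prints roughly two-thirds, because power at this sample size is only about 0.33.
```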

Misconception: The replication crisis means science cannot be trusted.
The crisis became visible precisely because science includes mechanisms for self-correction that other knowledge systems lack. The Open Science Collaboration's 2015 study was published in Science — the field policed itself. That process is slow and uncomfortable, but it is functioning.

Misconception: The problem is unique to social science.
High-profile replication failures in cancer biology, pharmacology, and economics contradict the idea that only "soft" sciences have problems. The Bayer HealthCare analysis referenced above found a 25% replication rate in preclinical biomedical research — an area with substantial quantitative rigor.

Misconception: Peer review catches these problems.
Standard peer review does not include independent data re-analysis, replication attempts, or pre-registration checks. It evaluates plausibility and logic, not reproducibility; assessing the latter requires checks that traditional review was never designed to perform.


Checklist or steps

Indicators of methodological rigor in published research (evaluation framework):

Pre-registration: hypotheses and the analysis plan were logged publicly before data collection, with confirmatory claims distinguished from exploratory findings.
Statistical power: the sample size is justified in advance for the size of effect the study is designed to detect.
Open data and materials: raw data, analysis code, and materials are shared, within the limits of participant privacy.
Complete reporting: all studies, conditions, and analyses that were run are reported, not only those that reached significance.
Effect size reporting: effect sizes and their uncertainty are reported alongside p-values rather than a bare pass/fail verdict.
Independent replication: the finding has been directly replicated by another team, or its replication status is stated.


Reference table or matrix

Replication rates and reform responses across selected disciplines

Field | Representative Replication Study | Approximate Replication Rate | Key Reform Mechanisms
Psychology | Open Science Collaboration (2015) | 36% of 100 studies | Pre-registration, open data, Registered Reports
Cancer Biology | Reproducibility Project: Cancer Biology (2021) | Substantial effect-size inflation across the completed subset | Detailed protocol publication, material sharing
Preclinical Biomedicine | Bayer HealthCare internal analysis (2011) | ~25% | Blinded analysis, independent validation cohorts
Economics | Camerer et al., Science (2016) | 61% of 18 studies from top economics journals | Pre-analysis plans, AEA RCT Registry
Social Science (general) | Camerer et al., Nature Human Behaviour (2018) | 62% of 21 social science studies | OSF pre-registration, effect size reporting standards

Replication rates generally reflect the share of replications that produced a statistically significant effect in the same direction as the original finding; replication effect sizes were typically smaller than the originals. Figures are drawn from the cited publications; methodological definitions vary across projects.


References