Research Data Management: Storage, Sharing, and Best Practices

Research data management (RDM) sits at the intersection of scientific rigor, institutional policy, and practical logistics — and it matters far more than most researchers realize until something goes wrong. This page covers how RDM is defined and scoped, how its core mechanisms operate in practice, where it most commonly comes into play across the research lifecycle, and how researchers and institutions navigate decisions about storage, access, and long-term preservation.

Definition and scope

A hard drive fails. A graduate student graduates and takes institutional knowledge with them. A journal requests raw data for post-publication review, and no one can locate a consistent file structure. These are not hypothetical edge cases — they are documented failure modes that motivated the formalization of research data management as a discipline.

RDM refers to the organized processes and policies governing how research data is collected, stored, documented, shared, and preserved throughout a project's lifecycle and beyond. The scope extends from the moment data is generated — a sensor reading, a survey response, a sequenced genome — through publication, archiving, and eventual reuse or deletion.

Federal funders have formalized this expectation. The National Science Foundation's Dissemination and Sharing of Research Results policy requires that most grant proposals include a Data Management Plan (DMP), a document specifying how data will be handled, stored, and made available. The National Institutes of Health similarly issued a Data Management and Sharing Policy that took effect in January 2023, applying to all NIH-funded research generating scientific data. These are not suggestions — proposals without adequate DMPs can be returned without review.

The scope of what counts as "research data" is deliberately broad. Under the NIH definition, it includes experimental measurements, observational records, images, interview transcripts, computational code, and the metadata needed to interpret any of the above.

How it works

A functional RDM system operates across four practical layers:

  1. Documentation and metadata — Data must be described consistently so it remains interpretable without the original researcher present. This means standardized file naming conventions, README files, and discipline-specific metadata schemas. In biodiversity research, for instance, general-purpose schemas from the Dublin Core Metadata Initiative and domain standards like Darwin Core define which fields must be recorded alongside biological specimen data.

  2. Storage architecture — Active data (in-use during analysis) is typically held on local or institutional servers with redundancy. Best practice follows the 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 copy stored off-site or in the cloud.

  3. Access and permissions management — Not all data can be made fully open. Human subjects data governed by HIPAA or the Common Rule (45 CFR §46) requires tiered access controls, data use agreements, and sometimes de-identification before any sharing occurs. This is where research ethics and integrity intersect directly with data infrastructure.

  4. Long-term preservation and sharing — Repositories such as the Inter-university Consortium for Political and Social Research (ICPSR), Zenodo, and Dryad provide persistent identifiers (DOIs) for datasets, enabling citation and long-term access independent of any individual researcher's institutional affiliation.
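The documentation layer lends itself to simple automation. As an illustrative sketch (the naming convention and README filename here are hypothetical, not a standard), a script can flag the most common documentation failures before they accumulate:

```python
import re
from pathlib import Path

# Hypothetical convention: PROJECT_YYYYMMDD_instrument_runNNN.ext
NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+_\d{8}_[a-z0-9]+_run\d{3}\.[a-z0-9]+$")

def check_data_dir(path: str) -> list[str]:
    """Return a list of documentation problems found in a data directory."""
    problems = []
    d = Path(path)
    if not (d / "README.txt").exists():
        problems.append("missing README.txt describing the dataset")
    for f in d.iterdir():
        if f.is_file() and f.name != "README.txt" and not NAME_PATTERN.match(f.name):
            problems.append(f"nonconforming file name: {f.name}")
    return problems
```

Run periodically against active project directories, a check like this catches naming drift while the original researcher is still around to fix it.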

The NIH's 2023 policy requires that all submitted datasets include sufficient metadata for discovery and use — not just deposit — which is a meaningfully higher bar than what most prior archiving practices met.
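What "sufficient metadata for discovery" means varies by repository, but a minimal dataset-level record might look like the following sketch. The field names loosely echo Dublin Core elements and are illustrative, not the NIH's prescribed schema:

```python
import json

def discovery_record(title, creator, description, date, identifier, keywords):
    """Assemble a minimal dataset-level metadata record for repository deposit.

    Field names loosely follow Dublin Core elements; illustrative only."""
    record = {
        "title": title,
        "creator": creator,
        "description": description,
        "date": date,              # ISO 8601, e.g. "2024-06-01"
        "identifier": identifier,  # persistent ID (DOI) assigned by the repository
        "subject": keywords,
    }
    # Refuse deposit-ready output if any discovery field is empty.
    missing = [k for k, v in record.items() if not v]
    if missing:
        raise ValueError(f"discovery metadata incomplete: {missing}")
    return json.dumps(record, indent=2)
```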

Common scenarios

RDM challenges cluster around three recognizable scenarios:

High-volume instrument data. Labs running mass spectrometers, genome sequencers, or imaging instruments can generate terabytes per week. The bottleneck is rarely storage cost; it is the metadata discipline required to make those files findable six months later. Computational and data-driven research pipelines increasingly automate metadata capture at the point of data generation to address this.
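Automated capture at the point of generation can be as simple as writing a provenance sidecar the moment an instrument file lands. A sketch, with illustrative sidecar fields and a hypothetical instrument ID:

```python
import hashlib
import json
import time
from pathlib import Path

def capture_metadata(data_file: str, instrument_id: str) -> str:
    """Write a JSON sidecar recording provenance for a newly generated file.

    A sketch of point-of-generation capture; the sidecar fields are
    illustrative, not a community standard."""
    p = Path(data_file)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    sidecar = {
        "file": p.name,
        "instrument": instrument_id,
        "size_bytes": p.stat().st_size,
        "sha256": digest,  # fixity value for later integrity audits
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    out = p.parent / (p.name + ".meta.json")
    out.write_text(json.dumps(sidecar, indent=2))
    return str(out)
```

Hooked into an instrument's post-acquisition step, this records the checksum and timestamp that make a file findable and verifiable six months later.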

Human subjects and sensitive data. Clinical trial datasets, survey responses linked to identifiable individuals, and health records require different handling than ecological field observations. Clinical trials operating under FDA oversight and NIH funding must comply with both the Common Rule and trial-specific reporting requirements, typically submitting summary results to ClinicalTrials.gov within 12 months of the trial's primary completion date; participant-level data, when shared, is de-identified and deposited in controlled-access repositories.
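The mechanical part of de-identification, dropping direct identifiers and pseudonymizing the participant key, can be sketched as follows. The field list is illustrative, and a script like this is no substitute for IRB-reviewed procedures:

```python
import hashlib

# Hypothetical direct-identifier fields. HIPAA Safe Harbor enumerates 18
# identifier categories; this short list is illustrative, not exhaustive.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "street_address", "ssn"}

def deidentify(record: dict, salt: str) -> dict:
    """Drop direct identifiers and pseudonymize the participant ID.

    A sketch only: real de-identification also requires attention to
    indirect (quasi-)identifiers such as dates and small geographic areas."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    pid = clean.pop("participant_id", None)
    if pid is not None:
        # Salted hash: stable within a study, not reversible without the salt.
        digest = hashlib.sha256((salt + str(pid)).encode()).hexdigest()
        clean["pseudo_id"] = digest[:16]
    return clean
```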

Longitudinal and collaborative projects. A 10-year cohort study with 15 participating institutions generates a coordination problem as much as a storage problem. Version control systems — increasingly Git-based even for non-code datasets — and shared data dictionaries become essential. The research collaboration and partnerships model in large-scale science has accelerated demand for formalized data governance agreements between institutions before a single data point is collected.
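A shared data dictionary earns its keep when contributions from each site are checked against it automatically. A sketch, assuming a hypothetical three-column dictionary:

```python
import csv
import io

# Hypothetical shared data dictionary: column name -> expected type caster.
DATA_DICTIONARY = {
    "site_id": str,
    "visit_date": str,   # ISO 8601 date kept as text here
    "systolic_bp": int,
}

def validate_csv(text: str) -> list[str]:
    """Check a contributed CSV against the shared data dictionary."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    if set(reader.fieldnames or []) != set(DATA_DICTIONARY):
        return [f"columns {reader.fieldnames} do not match dictionary"]
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        for col, caster in DATA_DICTIONARY.items():
            try:
                caster(row[col])
            except ValueError:
                errors.append(f"row {i}: {col}={row[col]!r} is not {caster.__name__}")
    return errors
```

Running this in a pre-merge hook (Git-based workflows make that natural) keeps fifteen institutions' submissions structurally compatible.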

Decision boundaries

The critical fork in RDM decision-making is open versus restricted access, and it is rarely a binary choice.

Data type                                  Typical access model            Governing standard
Non-sensitive observational/environmental  Fully open, public repository   NSF DMP requirements
De-identified human subjects data          Controlled access with DUA      Common Rule, HIPAA
Identifiable clinical data                 Restricted; IRB-supervised      HIPAA, 45 CFR §164
Proprietary/industry-sponsored research    Embargo periods common          Institutional IP agreements
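These decision boundaries can be encoded directly, which helps when a project inventory spans many data types. The categories and labels below mirror the table; the function itself is an illustrative sketch, not a policy engine:

```python
# Access-model lookup mirroring the table above; illustrative only.
ACCESS_MODELS = {
    "non_sensitive": ("fully open, public repository", "NSF DMP requirements"),
    "deidentified_human": ("controlled access with DUA", "Common Rule, HIPAA"),
    "identifiable_clinical": ("restricted; IRB-supervised", "HIPAA, 45 CFR 164"),
    "proprietary": ("embargo periods common", "institutional IP agreements"),
}

def access_model(data_type: str) -> tuple[str, str]:
    """Look up (access model, governing standard) for a data category."""
    try:
        return ACCESS_MODELS[data_type]
    except KeyError:
        # Unknown categories should be escalated, not silently defaulted open.
        raise ValueError(f"unknown data category: {data_type!r}") from None
```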

A second decision boundary involves retention timelines. Federal regulations under 2 CFR §200.333 require that grant recipients retain financial and programmatic records — including data — for a minimum of 3 years after the final expenditure report. Many institutions set longer retention windows, particularly for clinical or regulatory research where data may need to support post-market surveillance or litigation.
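The retention arithmetic is simple but worth making explicit, since institutional windows extend the federal floor. A sketch assuming the 3-year minimum:

```python
from datetime import date

def retention_end(final_expenditure_report: date, extra_years: int = 0) -> date:
    """Earliest allowable disposal date: 3 years after the final expenditure
    report, plus any longer institutional window (extra_years)."""
    years = 3 + extra_years
    target_year = final_expenditure_report.year + years
    try:
        return final_expenditure_report.replace(year=target_year)
    except ValueError:
        # Feb 29 source date landing in a non-leap year: fall back to Feb 28.
        return final_expenditure_report.replace(year=target_year, day=28)
```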

Understanding where a project falls on these axes early — ideally before the data collection methods phase begins — determines which infrastructure choices are even available later. The broader landscape of scientific research practices this connects to is covered across nationalscienceauthority.com.
