GA4GH Standards: How Genomic Data Is Shared Without Moving It

The first time I tried to help two research hospitals run a joint genomics analysis, the architecture review lasted ten minutes before it hit a wall that no amount of cloud budget could move. Site A's data couldn't legally leave its jurisdiction. Site B's consent forms permitted use only on B's premises. And even if both had said yes, a single whole-genome sequence is tens to hundreds of gigabytes — multiply by a cohort and "just copy it to a shared bucket" stops being a transfer and starts being a logistics project measured in weeks and five figures of egress. The data physically and legally did not want to move.

That constraint is the entire reason GA4GH exists, and it's why genomics solved a problem the rest of data engineering is only now taking seriously: how do you analyze data you can't centralize? The answer is to stop moving data to the compute and start moving compute to the data — but that only works if everyone agrees on how to discover datasets, how to access a slice of one, how to describe a variant, and how to prove a researcher is allowed. GA4GH is the body that wrote those agreements down as open standards. This is a tour of the ones that matter and how they fit together into a working federated system.

What is GA4GH?

The Global Alliance for Genomics and Health (GA4GH) is a not-for-profit standards alliance — over 5,000 individuals and organizations across six continents — formed in 2013 to enable responsible sharing of genomic and health-related data. It doesn't host data or run a platform. It produces open technical standards and policy frameworks, the way the IETF produces protocols, so that the patchwork of national biobanks, hospital systems, and research consortia can interoperate instead of each building a private silo. If you've touched clinical or research genomics infrastructure, you've used GA4GH standards whether you knew the name or not — they sit under tools like htsget-backed genome browsers, Beacon networks, and the workflow runners on every major cloud's genomics offering.

The standards split cleanly into five jobs: discover what data exists, access a piece of it, represent the science consistently, authorize who's allowed, and compute across sites. I'll go through them in that order, because that's the order a real federated query travels.

Discovery: finding data without exposing it

Beacon is the one most people meet first, and it's a beautifully minimal idea: a Beacon answers the question "do you have any genomes with a variant at this position?" with, in its simplest form, yes or no. That single bit lets a researcher discover that a relevant cohort exists somewhere — without that site ever exposing a patient, a genotype, or a record. Richer Beacon v2 responses can return counts and aggregate detail, always gated by the dataset's consent terms. It's discovery designed around the privacy constraint instead of fighting it.
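To make that yes/no contract concrete, here is a minimal sketch of what a Beacon-style allele query and its one-bit answer might look like. The query parameter names follow the Beacon convention as I understand it; the base URL, the `/g_variants` path layout, the coordinates, and the sample response are all assumptions for illustration, not any real deployment.

```python
from urllib.parse import urlencode

# Hypothetical Beacon base URL; substitute a real deployment's address.
BEACON_BASE = "https://beacon.example.org/api"

def build_variant_query(chrom, start, ref, alt, assembly="GRCh38"):
    """Build a GET URL asking whether a specific allele has been observed."""
    params = {
        "assemblyId": assembly,
        "referenceName": chrom,
        "start": start,            # Beacon positions are 0-based
        "referenceBases": ref,
        "alternateBases": alt,
    }
    return f"{BEACON_BASE}/g_variants?{urlencode(params)}"

def variant_exists(response):
    """Reduce a Beacon-style response to the single bit discovery needs."""
    return bool(response.get("responseSummary", {}).get("exists", False))

# Arbitrary example coordinates and a made-up response payload.
url = build_variant_query("7", 55191821, "T", "G")
sample_response = {"responseSummary": {"exists": True}}

print(url)
print(variant_exists(sample_response))
```

Nothing in the answer identifies a patient or a record; the caller learns only that a matching allele exists at that node.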

Data Connect (formerly GA4GH Search) is the heavier sibling: a standard API for querying and discovering datasets regardless of how they were stored or formatted, returning tabular results with attached schemas. Beacon tells you a needle exists in some haystack; Data Connect lets you ask structured questions across catalogs of haystacks. Together they're how a federated analysis starts — by finding the nodes worth talking to before any data access happens.
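A rough sketch of the client side of a Data Connect search, assuming the shape I understand the API to have: a SQL string posted to a `/search` endpoint, and a response carrying `data` rows plus a `pagination.next_page_url` link. The host, the table name, and the fake transport below are hypothetical, there only so the sketch runs offline.

```python
def search_all(post, get, base_url, sql):
    """Run one SQL query and follow pagination links until they run out."""
    page = post(f"{base_url}/search", {"query": sql})
    rows = list(page.get("data", []))
    while (page.get("pagination") or {}).get("next_page_url"):
        page = get(page["pagination"]["next_page_url"])
        rows.extend(page.get("data", []))
    return rows

# Fake transport standing in for HTTP; URLs and rows are made up.
PAGES = {
    "https://dc.example.org/search": {
        "data": [{"cohort": "A", "n": 120}],
        "pagination": {"next_page_url": "https://dc.example.org/search/page2"},
    },
    "https://dc.example.org/search/page2": {
        "data": [{"cohort": "B", "n": 45}],
        "pagination": None,
    },
}

def fake_post(url, body):
    return PAGES[url]

def fake_get(url):
    return PAGES[url]

rows = search_all(fake_post, fake_get, "https://dc.example.org",
                  "SELECT cohort, n FROM catalog.cohorts")
print(rows)
```

Passing the transport in as `post` and `get` keeps the paging logic separate from whatever HTTP client and authentication a real deployment would need.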

Access: streaming a slice instead of copying the file

Once you know which dataset you want, two standards govern getting at it without the wholesale-copy problem. htsget is a protocol for downloading reads and variants for a subsection of the genome over HTTP: give it a region (say, chr7:55,019,000-55,211,000 around EGFR) and the server answers with a short ticket of URLs, each with the headers and byte ranges needed to fetch it, which the client retrieves and concatenates to get just those records out of a BAM/CRAM/VCF instead of shipping the whole multi-gigabyte file. For most analyses you care about a handful of genes, not the whole genome, so htsget turns a 100 GB transfer into a few megabytes.
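Here is a minimal sketch of the htsget exchange: build the region request, then assemble the blocks named in the server's JSON ticket. The host, the read ID, and the fake `fetch` function are hypothetical; the ticket shape (`htsget.urls`, each entry with a `url` and optional `headers`, with inline `data:` URIs allowed) reflects the protocol as I understand it.

```python
import base64
from urllib.parse import urlencode

def build_reads_url(base_url, read_id, chrom, start, end, fmt="BAM"):
    """htsget request for one region; start/end are 0-based, half-open."""
    query = urlencode({"format": fmt, "referenceName": chrom,
                       "start": start, "end": end})
    return f"{base_url}/reads/{read_id}?{query}"

def assemble(ticket, fetch):
    """Concatenate the blocks listed in an htsget ticket, in order."""
    chunks = []
    for entry in ticket["htsget"]["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            # Inline block: the payload follows the comma, base64-encoded.
            chunks.append(base64.b64decode(url.split(",", 1)[1]))
        else:
            # Remote block: pass along the headers (Range, Authorization, ...).
            chunks.append(fetch(url, entry.get("headers", {})))
    return b"".join(chunks)

# Made-up ticket: one inline header block, one byte-ranged remote block.
inline = base64.b64encode(b"HEADER").decode("ascii")
ticket = {"htsget": {"format": "BAM", "urls": [
    {"url": f"data:application/vnd.ga4gh.bam;base64,{inline}"},
    {"url": "https://storage.example.org/sample.bam",
     "headers": {"Range": "bytes=1000-1999"}},
]}}

def fake_fetch(url, headers):
    # Stand-in for an HTTP GET that honors the Range header.
    return b"|block for " + headers["Range"].encode("ascii")

# Region from the EGFR example, treated as 0-based half-open for simplicity.
request_url = build_reads_url("https://htsget.example.org", "sample-1",
                              "chr7", 55019000, 55211000)
print(request_url)
print(assemble(ticket, fake_fetch))
```

The byte ranges are why the transfer stays small: the client only ever requests the blocks of the file that overlap the region.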

DRS (the Data Repository Service) solves the other half: a standard way to resolve a data object's ID to its access methods and physical locations, across clouds and institutions. A workflow references a genome by a DRS ID (drs://server/object-id) rather than a hardcoded S3 path, and DRS tells the runtime where the bytes actually are and how to fetch them — so the same pipeline runs unmodified whether the data sits in AWS, GCP, or an on-prem store. Pair it with refget, which identifies reference sequences by the checksum of their content rather than by a filename like hg38.fa, and you can guarantee two sites aligned against bit-identical references — a subtle source of irreproducibility that content-addressing kills cleanly.
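The DRS indirection can be sketched in a few lines: turn a hostname-based `drs://` URI into the HTTPS endpoint that describes the object, then pick an access method the runtime can use. The sample object and its bucket paths are made up, the preference order is my own assumption, and this sketch does not handle the compact-identifier form of DRS URIs.

```python
def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to the HTTPS endpoint describing the object."""
    prefix = "drs://"
    if not drs_uri.startswith(prefix):
        raise ValueError(f"not a DRS URI: {drs_uri}")
    host, _, object_id = drs_uri[len(prefix):].partition("/")
    if not host or not object_id:
        raise ValueError(f"DRS URI needs a host and an object id: {drs_uri}")
    return f"https://{host}/ga4gh/drs/v1/objects/{object_id}"

def pick_access_method(drs_object, preferred=("s3", "gs", "https")):
    """Choose the first access method whose type this runtime can use."""
    by_type = {m["type"]: m for m in drs_object.get("access_methods", [])}
    for kind in preferred:
        if kind in by_type:
            return by_type[kind]
    return None

# Made-up DRS object, shaped like a response from the endpoint above.
sample_object = {
    "id": "object-id",
    "access_methods": [
        {"type": "https",
         "access_url": {"url": "https://files.example.org/genome.cram"}},
        {"type": "s3",
         "access_url": {"url": "s3://example-bucket/genome.cram"}},
    ],
}

print(drs_uri_to_url("drs://server/object-id"))
print(pick_access_method(sample_object)["access_url"]["url"])
print(pick_access_method(sample_object, preferred=("gs", "https"))["access_url"]["url"])
```

The pipeline only ever holds the `drs://` ID; which copy it reads is decided at run time by the `preferred` order of whatever environment it lands in.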

Representation: making the science computable

Federation is pointless if two sites encode the same biological fact differently, so GA4GH standardizes the data model too. Phenopackets — now an ISO standard — is a human- and machine-readable schema for packaging an individual's phenotypic and clinical data: observed features (using ontology terms like HPO), diagnoses, measurements, biosamples, and the genomic findings tied to them. It's the envelope that lets a phenotype-driven query mean the same thing in Toronto and Tokyo.
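For a feel of what that envelope looks like, here is a trimmed, illustrative Phenopacket as a Python dict, plus the kind of phenotype check a federated query might run against it. The field names follow the Phenopacket v2 schema as I understand it; the identifiers are hypothetical, and a real packet's `metaData` would also list the ontology resources and versions used, which I've left out for brevity.

```python
import json

phenopacket = {
    "id": "example-packet-1",                        # hypothetical identifiers
    "subject": {"id": "patient-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        # 'excluded' records a feature that was looked for and NOT observed.
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"},
         "excluded": True},
    ],
    "metaData": {
        "created": "2024-01-01T00:00:00Z",
        "createdBy": "example-curator",
        "phenopacketSchemaVersion": "2.0",
    },
}

def has_feature(packet, hpo_id):
    """True if the packet asserts the HPO term as observed (not excluded)."""
    return any(
        feature["type"]["id"] == hpo_id and not feature.get("excluded", False)
        for feature in packet.get("phenotypicFeatures", [])
    )

print(json.dumps(phenopacket["phenotypicFeatures"], indent=2))
print(has_feature(phenopacket, "HP:0001250"))
print(has_feature(phenopacket, "HP:0001263"))
```

Because the features are ontology IDs rather than free text, the same `has_feature` check means the same thing at every site that emits Phenopackets.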

VRS, the Variation Representation Specification, does the same for variants: it defines a canonical, computable way to represent a genetic variant and derive a stable, normalized identifier for it. The problem it solves is that the "same" variant can be written half a dozen ways in different VCFs; VRS gives it one identity, so you can join across datasets without a fragile string-matching layer. If you've ever tried to reconcile variant calls from two pipelines by hand, you understand exactly why this exists. (This is also where genomics meets the knowledge-graph approach to clinico-genomics — stable variant IDs are what make the graph joinable.)
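The core idea, normalize first and then derive the identifier from the content, can be shown with a toy. To be clear about the assumptions: this is not the VRS algorithm. Real VRS has its own normalization rules and serialization format, references sequences by digest rather than by chromosome name, and emits identifiers with a `ga4gh:` prefix. What the sketch does borrow is the truncated-digest scheme (SHA-512, first 24 bytes, base64url) that GA4GH computed identifiers use, as I understand it.

```python
import base64
import hashlib
import json

def sha512t24u(blob):
    """Truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

def normalize(chrom, pos, ref, alt):
    """Reduce a VCF-style record to a minimal 0-based interval plus alt bases."""
    chrom = chrom[3:] if chrom.lower().startswith("chr") else chrom
    ref, alt = ref.upper(), alt.upper()
    start = pos - 1                                  # VCF positions are 1-based
    while ref and alt and ref[-1] == alt[-1]:        # trim shared trailing bases
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:          # trim shared leading bases
        ref, alt, start = ref[1:], alt[1:], start + 1
    return {"sequence": chrom, "start": start, "end": start + len(ref), "state": alt}

def toy_variant_id(chrom, pos, ref, alt):
    """Content-derived ID: same normalized variant, same identifier."""
    canonical = json.dumps(normalize(chrom, pos, ref, alt),
                           sort_keys=True, separators=(",", ":"))
    return "toy:VA." + sha512t24u(canonical.encode("utf-8"))

# Two spellings of one substitution: padded with a leading base vs. minimal.
id_padded = toy_variant_id("chr7", 100, "CA", "CT")
id_minimal = toy_variant_id("7", 101, "a", "t")
print(id_padded)
print(id_padded == id_minimal)
```

Once both pipelines emit the same ID for the same change, the cross-dataset join is an equality test instead of a string-matching layer.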

Authorization: passports, visas, and computable consent

Here's where genomics governance gets genuinely clever, and where the rest of us should be taking notes. You can't run federated analysis if every site maintains its own account for every researcher — that doesn't scale past a handful of partners. GA4GH's AAI (Authentication and Authorization Infrastructure) and Passports solve it with a visa metaphor: a researcher authenticates once with a trusted broker and receives a Passport containing signed visas — verifiable claims like "approved by Data Access Committee X for dataset Y" or "is a bona fide researcher." When they hit a data node, the node validates the visas cryptographically and decides access locally. No central account store, no per-site onboarding — just portable, signed assertions a resource server can verify on its own.

The other half of authorization is encoding what the data permits, and that's the Data Use Ontology (DUO). DUO lets data stewards tag a dataset with machine-readable permitted-use terms: "general research use," "disease-specific (oncology)," "no commercial use," "ethics approval required." Now the match between a researcher's authorization and a dataset's allowed uses can be computed automatically, instead of a committee reading consent language off a PDF. This is the computable cousin of the legal Data Use Agreement: DUO makes the DUA's restrictions something a machine can enforce at query time. Protecting the bytes themselves is the job of Crypt4GH, an encrypted file format that keeps genomic data encrypted at rest and in transit throughout its lifetime, so a file can sit in shared storage without being readable by the storage operator.
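Here is a simplified sketch of what computing the DUO match could look like. The shorthand labels (GRU for general research use, DS for disease-specific, NCU for non-commercial use only, IRB for ethics approval required) mirror DUO's abbreviations as I understand them; real DUO terms carry ontology IDs, and real disease matching would walk a disease ontology rather than compare strings. The class and field names are my own invention.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class DatasetTerms:
    permission: str                      # "GRU" or "DS" in this sketch
    disease: Optional[str] = None        # only meaningful for "DS"
    modifiers: FrozenSet[str] = frozenset()

@dataclass(frozen=True)
class Request:
    disease: Optional[str]               # what the researcher intends to study
    commercial: bool
    ethics_approved: bool

def evaluate(terms, request):
    """Return (allowed, reasons); any reason at all means the request is denied."""
    reasons = []
    if terms.permission == "DS" and request.disease != terms.disease:
        reasons.append(f"dataset limited to {terms.disease} research")
    elif terms.permission not in ("GRU", "DS"):
        reasons.append(f"unsupported permission term {terms.permission}")
    if "NCU" in terms.modifiers and request.commercial:
        reasons.append("commercial use not permitted")
    if "IRB" in terms.modifiers and not request.ethics_approved:
        reasons.append("ethics approval required")
    return (not reasons, reasons)

oncology_cohort = DatasetTerms("DS", "oncology", frozenset({"NCU", "IRB"}))

print(evaluate(oncology_cohort, Request("oncology", commercial=False, ethics_approved=True)))
print(evaluate(oncology_cohort, Request("cardiology", commercial=True, ethics_approved=True)))
```

Returning the reasons alongside the decision matters in practice: a denial that names the failed consent term is something a data steward can audit.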

The Passport/visa model is the part worth stealing for non-genomics systems. Most data platforms still authorize by maintaining per-system accounts and access lists, which collapses under multi-organization sharing. GA4GH's bet — authenticate once, carry signed claims, let each resource verify locally — is the same pattern that makes federated identity work on the web, applied to data access. If you're designing cross-org data sharing of any sensitive kind, the visa model scales where account provisioning doesn't.

Federated compute: bring the analysis to the data

The payoff standards are WES and TES, because they're what let the compute travel. The Workflow Execution Service (WES) is a standard REST API for submitting and running a workflow — a Nextflow or WDL/CWL pipeline — on any compliant backend. The Task Execution Service (TES) is the lower-level standard for running an individual task (a container with inputs and outputs) on any compute environment. Because the API is the same everywhere, you submit the identical analysis to each participating site's WES endpoint, it runs locally next to that site's data, and only the results — summary statistics, model updates, aggregate counts — come back. The raw genomes never move.
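A sketch of the fan-out: build one WES run request, send it unchanged to every site, and pool only the aggregates that come back. The form field names and the `/ga4gh/wes/v1/runs` path follow the WES API as I understand it; the site URLs, workflow URL, token, fake transport, and result counts are all made up for illustration.

```python
import json

def build_run_request(workflow_url, params, wf_type="CWL", wf_version="v1.0"):
    """Form fields for a WES run submission (POST .../runs)."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": wf_type,
        "workflow_type_version": wf_version,
        "workflow_params": json.dumps(params),
    }

def submit_everywhere(sites, form, token, post):
    """Send the identical request to every site; return {site: run_id}."""
    headers = {"Authorization": f"Bearer {token}"}
    return {site: post(f"{site}/ga4gh/wes/v1/runs", form, headers)["run_id"]
            for site in sites}

def combine(site_counts):
    """Pool per-site aggregate counts; no row-level data is involved."""
    total = {"carriers": 0, "samples": 0}
    for counts in site_counts.values():
        total["carriers"] += counts["carriers"]
        total["samples"] += counts["samples"]
    total["frequency"] = total["carriers"] / total["samples"] if total["samples"] else 0.0
    return total

# Fake transport so the sketch runs offline; it records what each site received.
received = {}

def fake_post(url, form, headers):
    received[url] = form
    return {"run_id": f"run-{len(received)}"}

sites = ["https://wes.site-a.example.org", "https://wes.site-b.example.org"]
form = build_run_request("https://workflows.example.org/carrier-count.cwl",
                         {"region": "chr7:55019000-55211000"})
run_ids = submit_everywhere(sites, form, "passport-token", fake_post)

# Made-up aggregates, standing in for each run's outputs.
pooled = combine({sites[0]: {"carriers": 3, "samples": 100},
                  sites[1]: {"carriers": 1, "samples": 100}})
print(run_ids)
print(pooled)
```

A real client would also poll each run's status until it reaches a terminal state before reading outputs; I've left that out to keep the shape of the fan-out visible.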

graph TD
    R["Researcher"]
    BROKER["AAI broker<br/>(issues Passport + visas)"]
    R --> BROKER
    BROKER -->|"signed visas"| R
    R -->|"discover"| BEACON["Beacon / Data Connect<br/>(which sites have relevant data?)"]
    subgraph SiteA["Site A (jurisdiction 1)"]
      WESA["WES endpoint"] --> DATAA["Genomes via DRS / htsget<br/>(never leave site)"]
    end
    subgraph SiteB["Site B (jurisdiction 2)"]
      WESB["WES endpoint"] --> DATAB["Genomes via DRS / htsget<br/>(never leave site)"]
    end
    R -->|"submit same workflow + visas"| WESA
    R -->|"submit same workflow + visas"| WESB
    WESA -->|"aggregate results only"| AGG["Combined result"]
    WESB -->|"aggregate results only"| AGG

A federated analysis. The researcher discovers relevant sites via Beacon, carries signed visas from an AAI broker, and submits the identical workflow to each site's WES endpoint. Compute runs next to the data — accessed locally through DRS/htsget under DUO-encoded consent — and only aggregate results return. The raw genomes never cross a jurisdiction.

The standards at a glance

Standard	Job	One-line definition
Beacon	Discovery	"Do you have this variant?" across datasets, honoring consent.
Data Connect	Discovery	Query/search datasets regardless of how they're stored.
htsget	Access	Stream reads/variants for a genomic region over HTTP.
DRS	Access	Resolve a data object ID to its locations and access methods.
refget	Access	Identify reference sequences by content checksum.
Phenopackets	Representation	Standard schema for phenotypic + clinical data (ISO).
VRS	Representation	Canonical, computable identifiers for variants.
Passports / AAI	Authorization	Portable signed visas asserting a researcher's permissions.
DUO	Authorization	Machine-readable permitted-use tags on datasets.
Crypt4GH	Security	Encrypted genomic file format, secure through its lifetime.
WES / TES	Compute	Standard APIs to run workflows / tasks on any backend.

Compliance: the framework underneath the protocols

The standards ride on a policy foundation, and skipping it is how technically-correct systems still get shut down by an ethics board. GA4GH's Framework for Responsible Sharing of Genomic and Health-Related Data sets the principles — consent, privacy, security, accountability — that the technical standards implement. Its companion policies (the Consent Policy, the Data Privacy and Security Policy) translate into design requirements that map onto regulation you already have to satisfy: GDPR in Europe (genomic data is special-category personal data), HIPAA in the US, and the various national biobank laws.

The thing to internalize is that GA4GH's design treats consent and privacy as architecture, not paperwork bolted on afterward. Beacon returns the minimum information that satisfies discovery. DUO makes consent computable so it can be enforced per query. Passports make authorization auditable and revocable. Federation means data stays under its origin's legal control. None of this is a compliance layer you add at the end — it's why the standards are shaped the way they are. (The same "governance as architecture" principle runs through how I've seen clinico-genomics governance built on Snowflake — different platform, identical instinct.)

The honest trade-offs

Federation isn't free, and anyone selling it as a clean win hasn't operated it. The real costs:

Operational complexity. A federated query that touches five sites depends on five WES endpoints, five sets of credentials-via-visas, and five teams' uptime. Debugging a failure that happened inside someone else's cluster, which you can't see into, is genuinely hard. Centralized data is easier to operate — that's the honest tension.
Uneven adoption. The standards are mature, but not every node implements every version. You will hit a site running Beacon v1 when you need v2, or a WES that supports CWL but not the workflow language you wrote in. Federation is only as strong as its least-capable participant.
Trust in the broker. The Passport model moves trust to the AAI brokers issuing visas. That's a feature for scale and a risk surface for security — a compromised or sloppy broker undermines every node that trusts it.
Performance. Bringing compute to data means your analysis runs on whatever hardware each site has, with no shuffle across sites. Algorithms that need all the raw data in one place (certain joins, some ML training) don't federate cleanly and need a redesign or a privacy-preserving technique layered on.

Adopting GA4GH standards is not the same as having an interoperable system. The trap I've watched teams fall into is checking the box — "we expose a Beacon, we have a DRS server" — and assuming federation now works. It doesn't, until you've tested an actual cross-site workflow end to end, reconciled standard versions with every partner, and proven the visa/DUO authorization path grants and denies correctly against real consent terms. The standards remove the need to invent protocols; they don't remove the integration work, the version negotiation, or the governance sign-off. Treat conformance as the start of interoperability testing, not the end.
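One cheap piece of that testing is a preflight check before any workflow is submitted. The sketch below reads each site's WES service-info document and flags sites that can't run the workflow as written. The `workflow_type_versions` structure follows the WES service-info response as I understand it; the site names and version lists are made up.

```python
def preflight(service_infos, wf_type, wf_version):
    """Return {site: problem} for sites that cannot run the workflow as written."""
    problems = {}
    for site, info in service_infos.items():
        supported = (info.get("workflow_type_versions", {})
                         .get(wf_type, {})
                         .get("workflow_type_version", []))
        if not supported:
            problems[site] = f"no {wf_type} support"
        elif wf_version not in supported:
            problems[site] = f"{wf_type} versions offered: {supported}"
    return problems

# Made-up service-info documents for three hypothetical sites.
service_infos = {
    "site-a": {"workflow_type_versions": {
        "CWL": {"workflow_type_version": ["v1.0", "v1.2"]},
        "WDL": {"workflow_type_version": ["1.0"]},
    }},
    "site-b": {"workflow_type_versions": {
        "CWL": {"workflow_type_version": ["v1.0"]},
    }},
    "site-c": {"workflow_type_versions": {
        "WDL": {"workflow_type_version": ["draft-2"]},
    }},
}

print(preflight(service_infos, "WDL", "1.0"))
print(preflight(service_infos, "CWL", "v1.0"))
```

This only catches the mismatches a site advertises. It is no substitute for running a real cross-site workflow end to end, but it turns the most common version surprise into a failure before submission instead of after.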

What to carry away

GA4GH is the standards body that made federated genomics work, and the load-bearing idea is inversion: when data can't move for reasons of size, law, or consent, you move the compute instead. The standards exist to make that possible across organizations — discover with Beacon and Data Connect, access a slice with htsget and DRS, represent the science consistently with Phenopackets and VRS, authorize portably with Passports/AAI and computably with DUO, and run the same workflow anywhere with WES and TES — all sitting on the Framework for Responsible Sharing so consent and privacy are designed in, not bolted on.

Two ideas travel well beyond genomics. The Passport/visa model — authenticate once, carry signed claims, verify locally — is how cross-organization data sharing should work in any sensitive domain. And DUO's computable consent is the same instinct as a data contract: take an agreement humans enforced by hand and make a machine enforce it. For where these standards become a working platform, see building a clinico-genomics RAG on AWS and the real-world-evidence clinico-genomics series.