Data Catalog Architecture: Build vs Buy — DataHub, Amundsen, Purview, Unity Catalog

I've watched two data catalogs get switched off within a year of launch, and both failures looked identical from the outside: a technically sound deployment that nobody used. Not because the architecture was wrong — because adoption is a search-and-discovery UX problem first and a metadata-completeness problem second, and both rollouts optimized for the second while ignoring the first. If picking a catalog were purely a matter of comparing feature checklists, every team would land on the same answer. They don't, because the right answer depends on operational appetite as much as feature set — and that's the part vendor comparisons gloss over.

This is the architecture-first version of that decision: how the major open-source catalogs are actually built under the hood, where platform-native catalogs fit, when commercial makes more sense than open source, and the cost most teams don't budget for until they're a year into running one.

What is a data catalog, architecturally?

A data catalog is a searchable inventory of an organization's data assets — tables, columns, dashboards, pipelines — enriched with metadata that helps people find, understand, and trust them. Architecturally, every catalog has to solve the same three problems regardless of vendor: ingest metadata from every system that holds data, store and index it in a way that supports both structured queries and free-text search, and serve it through a UI and API that people actually want to use. Where catalogs diverge is in how much infrastructure each of those three steps demands, and that's the axis that should drive a build-vs-buy decision more than the feature comparison chart.

How do the open-source catalogs differ architecturally?

DataHub (originated at LinkedIn), Amundsen (originated at Lyft), and OpenMetadata are the three open-source projects that dominate self-hosted catalog conversations, and their component architectures reflect genuinely different design philosophies, not just different logos.

	DataHub	Amundsen	OpenMetadata
Origin	LinkedIn	Lyft	Independent (Collate)
Core stack	Relational DB + Elasticsearch + a graph backend (Neo4j, or Elasticsearch itself serving as the graph index) + Kafka for streaming ingestion	Neo4j or Atlas + Elasticsearch	MySQL/Postgres + Elasticsearch
Update model	Real-time, event-driven via Kafka	Primarily batch ingestion	Batch + incremental connectors
Operational footprint	Heaviest — multiple stateful components to run and tune	Lighter — fewer moving parts	Simplest — fewest components
Strongest at	Real-time lineage, fine-grained governance, large-scale enterprise metadata	Fast, simple, Google-like search and discovery	Collaboration features, broad connector library out of the box

The pattern worth internalizing: DataHub's architectural sophistication is also its operating cost. Real-time, event-driven metadata propagation through Kafka into a graph database is genuinely powerful — column-level lineage, near-instant updates when an upstream schema changes — but it means standing up and operating a graph database and a Kafka pipeline as platform infrastructure, on top of the catalog itself. Amundsen's lighter stack is a direct trade against that capability: less real-time sophistication, meaningfully less to operate. OpenMetadata sits architecturally between the two but leans toward simplicity, which is part of why it's become the default recommendation for teams that want broad connector coverage without DataHub's operational weight.

graph TD
    SRC["Source systems<br/>(warehouses, BI tools,<br/>pipelines, dashboards)"]
    ING["Metadata ingestion<br/>(push via API, or pull via crawlers/connectors)"]
    STORE["Storage layer<br/>(relational + search index,<br/>+ graph DB for DataHub)"]
    UI["Search & discovery UI"]
    GOV["Governance layer<br/>(ownership, tags, lineage, classification)"]
    SRC --> ING --> STORE
    STORE --> UI
    STORE --> GOV

The shape every catalog shares. Metadata arrives either pushed by an instrumented source (an Airflow or dbt integration emitting events) or pulled by a scheduled crawler/connector hitting the source's API. It lands in a storage layer that has to support both structured filtering and free-text search — which is why most catalogs run a relational store plus a search index rather than one general-purpose database. The governance layer (ownership, classification, lineage) and the search UI are both read paths over the same store, but they're the two halves users actually judge the catalog by.
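To make the two-store split concrete, here is a minimal sketch, assuming an in-memory SQLite table as the relational store and a plain Python dictionary standing in for the search index. The schema, the sample rows, and the `search` helper are all hypothetical; a real deployment would use something like Elasticsearch for the text side.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assets (id INTEGER PRIMARY KEY, name TEXT, platform TEXT, "
    "owner TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO assets (id, name, platform, owner, description) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "orders_daily", "snowflake", "finance", "Daily order revenue rollup"),
        (2, "orders_raw", "snowflake", "platform", "Raw order events from checkout"),
        (3, "churn_scores", "databricks", "growth", "Customer churn model output"),
    ],
)


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Stand-in for the search index: token -> set of asset ids.
inverted_index = defaultdict(set)
for asset_id, name, description in conn.execute(
    "SELECT id, name, description FROM assets"
):
    for token in tokenize(f"{name} {description}"):
        inverted_index[token].add(asset_id)


def search(text, platform=None):
    """Free-text match from the index, then a structured filter in the relational store."""
    tokens = tokenize(text)
    if not tokens:
        return []
    # An asset must contain every query token to match.
    ids = set.intersection(*(inverted_index.get(t, set()) for t in tokens))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = f"SELECT name FROM assets WHERE id IN ({placeholders})"
    params = sorted(ids)
    if platform is not None:
        sql += " AND platform = ?"
        params.append(platform)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


orders_hits = search("orders")                          # both order tables
databricks_hits = search("churn", platform="databricks")
```

The point of the sketch is the division of labor: the index answers "which assets mention these words", the relational store answers "which of those are on this platform", and neither store is good at the other's question.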

Push versus pull: how does metadata actually get into the catalog?

Pull (crawler-based) ingestion means the catalog periodically scans source systems — querying a warehouse's information schema, hitting a BI tool's API, walking an orchestrator's metadata store — and reconciles what it finds against what it already knows. This is simple to set up (point a connector at a source, schedule it) but inherently stale between scans, and it can miss anything transient that happened between runs. Push (event-based) ingestion means instrumented systems emit metadata events as things happen — closer to how OpenLineage captures lineage — which gets you near-real-time freshness at the cost of needing every source system instrumented rather than just crawlable. Most production catalog deployments end up using both: pull for the long tail of sources where instrumentation isn't worth the effort, push for the handful of high-value pipelines where freshness actually matters.
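The two ingestion paths can be sketched side by side. This is an illustrative toy, not any catalog's real API: the `Catalog` class, the event shapes, and the `schema_changed` / `table_dropped` type names are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    # table name -> {column name: column type}
    tables: dict = field(default_factory=dict)

    def reconcile(self, scanned):
        """Pull path: diff a full scan of one source against what the catalog knows."""
        added = sorted(set(scanned) - set(self.tables))
        removed = sorted(set(self.tables) - set(scanned))
        changed = sorted(
            name
            for name in set(scanned) & set(self.tables)
            if scanned[name] != self.tables[name]
        )
        # After the diff, the scan result becomes the new known state.
        self.tables = {name: dict(cols) for name, cols in scanned.items()}
        return {"added": added, "removed": removed, "changed": changed}

    def apply_event(self, event):
        """Push path: apply one change event as soon as the source emits it."""
        if event["type"] == "table_dropped":
            self.tables.pop(event["table"], None)
        elif event["type"] == "schema_changed":
            self.tables[event["table"]] = dict(event["columns"])
        else:
            raise ValueError(f"unknown event type: {event['type']}")


catalog = Catalog()

# First scheduled crawl discovers one table.
first_scan = catalog.reconcile({"orders": {"id": "int"}})

# An instrumented pipeline pushes a schema change between crawls.
catalog.apply_event(
    {"type": "schema_changed", "table": "orders",
     "columns": {"id": "int", "total": "decimal"}}
)

# The next crawl sees the new column as already known; only "users" is new.
second_scan = catalog.reconcile(
    {"orders": {"id": "int", "total": "decimal"}, "users": {"id": "int"}}
)
```

Note what the pull path cannot see: if a table were created and dropped between the two crawls, neither scan would report it. That is the "transient" gap the push path closes.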

What's the actual difference between technical, business, and governance metadata?

This distinction is easy to gloss over and it's the one that determines whether your catalog gets adopted by engineers only, or by the analysts and business stakeholders who are the actual point of having a catalog. Technical metadata is what a crawler can extract automatically — column names, types, row counts, last-modified timestamps. Business metadata is the human layer crawlers can't infer — what does this table mean, who owns it, is it safe to use for a board report. Governance metadata is the compliance and policy layer — classification tags (PII, confidential), retention rules, access policies. A catalog that's all technical metadata and no business metadata is a glorified information_schema browser; it'll get used by the data team and ignored by everyone else, which is exactly the failure mode I opened with. The catalogs that get genuine cross-org adoption are the ones where filling in business metadata is easy enough that people actually do it — which is a UX problem, not an architecture problem, and it's why two catalogs with identical ingestion architecture can have wildly different adoption outcomes.
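One way to picture the three layers is as separate parts of a single catalog entry, where only the first can be filled in by a crawler. The class and field names below are hypothetical, chosen to mirror the examples in the paragraph above rather than any particular catalog's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TechnicalMetadata:
    """What a crawler can extract automatically."""
    columns: dict        # column name -> type
    row_count: int
    last_modified: str   # ISO 8601 timestamp


@dataclass
class BusinessMetadata:
    """The human layer; None means nobody has filled it in yet."""
    description: Optional[str] = None
    owner: Optional[str] = None
    certified_for_reporting: Optional[bool] = None


@dataclass
class GovernanceMetadata:
    """Compliance and policy layer."""
    classifications: list = field(default_factory=list)  # e.g. ["PII"]
    retention_days: Optional[int] = None


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)
    governance: GovernanceMetadata = field(default_factory=GovernanceMetadata)


def business_gaps(entry):
    """Names of the business fields that are still empty."""
    return [name for name, value in vars(entry.business).items() if value is None]


# A freshly crawled table: technical metadata present, business layer empty.
entry = CatalogEntry(
    name="orders_daily",
    technical=TechnicalMetadata({"id": "int"}, 1000, "2024-01-01T00:00:00Z"),
)
gaps_after_crawl = business_gaps(entry)

# A human fills in one field; the gap list shrinks by one.
entry.business.owner = "finance"
gaps_after_edit = business_gaps(entry)
```

A check like `business_gaps` is the kind of thing a catalog UI can surface as a prompt ("this table has no owner"), which is where the enrichment UX starts.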

Where do platform-native catalogs like Unity Catalog and Purview fit?

If most of an organization's data genuinely lives in one platform, a platform-native catalog removes an entire category of problem the standalone tools have to solve from scratch: it already knows every table, every permission, every lineage edge, because it is the platform's own metadata layer rather than a system reconciling against it from outside. Unity Catalog is the clearest example for a Databricks-centric estate — governance, lineage, and access control are native to the platform rather than bolted on. Microsoft Purview plays a similar role across the Azure/Fabric estate, and AWS Glue Data Catalog is the equivalent metastore-as-catalog for AWS-centric lakehouses. The honest trade-off: platform-native catalogs are excellent within their platform's boundary and noticeably weaker the moment your estate spans multiple platforms — a shop running Databricks, Snowflake, and a half-dozen SaaS tools will find that no single platform-native catalog actually sees the whole picture, which is exactly the gap standalone catalogs (open source or commercial) exist to close.

When does commercial (Alation, Collibra) beat open source?

The honest framing: open-source catalogs are not free, they're a different allocation of cost. Running DataHub, OpenMetadata, or Amundsen well takes real, ongoing engineering time: provisioning and tuning the storage layer, building and maintaining connectors for sources without an out-of-the-box integration, handling upgrades, and fielding the inevitable "why doesn't this table show up" tickets. That's commonly somewhere around half to a full engineer's time on an ongoing basis once a deployment is past the pilot stage, which is real cost even though no invoice says so. Commercial tools like Alation and Collibra trade that engineering time for licensing spend, plus typically stronger out-of-the-box governance workflows, stewardship features, and vendor support. They're often the better fit for organizations where data governance is a compliance-driven mandate with dedicated budget and a non-engineering team expected to own stewardship, rather than an engineering-led initiative. If you have neither spare engineering capacity nor governance budget, that's a real signal the rollout will stall regardless of which tool you pick.
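The half-to-a-full-engineer range turns into a budget line with simple arithmetic. The sketch below shows the calculation only; the loaded engineer cost and infrastructure figures are placeholder assumptions, not benchmarks, and should be replaced with your own numbers before comparing against a license quote.

```python
def self_hosted_annual_cost(fte_fraction, loaded_engineer_cost, infra_cost):
    """Annual cost of running an open-source catalog: people time plus infrastructure."""
    return fte_fraction * loaded_engineer_cost + infra_cost


# Placeholder assumptions, purely illustrative.
LOADED_ENGINEER_COST = 200_000  # fully loaded annual cost of one engineer
INFRA_COST = 30_000             # annual compute/storage for the catalog stack

low_estimate = self_hosted_annual_cost(0.5, LOADED_ENGINEER_COST, INFRA_COST)
high_estimate = self_hosted_annual_cost(1.0, LOADED_ENGINEER_COST, INFRA_COST)
# With these inputs: low_estimate is 130000.0 and high_estimate is 230000.0
```

Whatever your real inputs are, the resulting range is the number a commercial quote should be compared against, not zero.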

The architecture comparison is the easy 20% of this decision; the integration tax is the hard 80%, and it's invisible until you're past the pilot. A catalog that doesn't talk to your lineage tool, your data-quality checks, and your access-control system becomes a second source of truth that drifts from the first — exactly the rot that makes hand-maintained lineage diagrams useless, now applied to your entire metadata layer. Before committing to any catalog, walk through how it will actually connect to whatever you're using for lineage and quality (does it consume OpenLineage events, or do you need a separate sync job?) and how its access model maps onto your real governance layer (row and column-level security enforced where the query actually runs, not just documented in the catalog). A catalog with beautiful search and a governance model nobody else's tooling respects is theater.
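As a rough illustration of what "consuming OpenLineage events" involves on the catalog side, the sketch below pulls dataset-to-dataset edges out of a single run event. The field names (`eventType`, `job`, `inputs`, `outputs`, `namespace`, `name`) follow the OpenLineage run-event shape as I understand it; the sample event, the `lineage_edges` helper, and the choice to only trust `COMPLETE` events are simplifying assumptions.

```python
def lineage_edges(run_event):
    """Return (input, output) dataset pairs from one OpenLineage-style run event."""

    def qualified(dataset):
        return f"{dataset['namespace']}/{dataset['name']}"

    # Simplification: only record lineage for runs that finished successfully.
    if run_event.get("eventType") != "COMPLETE":
        return []
    inputs = [qualified(d) for d in run_event.get("inputs", [])]
    outputs = [qualified(d) for d in run_event.get("outputs", [])]
    # Every input of the job is treated as upstream of every output.
    return [(src, dst) for src in inputs for dst in outputs]


# Hypothetical event, shaped like what an instrumented pipeline might emit.
sample_event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "job": {"namespace": "airflow", "name": "build_orders_daily"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.orders_daily"}],
}

edges = lineage_edges(sample_event)
# edges == [("warehouse/raw.orders", "warehouse/analytics.orders_daily")]
```

If the catalog you're evaluating can't ingest events of this kind directly, this translation step is the "separate sync job" you would end up owning.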

What's the actual adoption driver, if it's not feature completeness?

Search-and-discovery UX, full stop. Every catalog post-mortem I've seen comes down to the same root cause: people couldn't find what they were looking for fast enough, so they went back to asking in Slack or pinging the data team directly, and usage cratered. Amundsen's whole design philosophy — fast, simple, Google-like search with usage/popularity signals surfaced prominently — exists because Lyft learned this the hard way before building it. The lesson generalizes past any one tool: a catalog with perfect metadata completeness and a clunky search experience loses to a catalog with 70% metadata coverage and search that returns the right table in the first three results. Optimize for that ruthlessly before you optimize for connector count or governance feature depth.
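The "right table in the first three results" goal usually comes down to blending text relevance with a usage signal. Here is a minimal sketch of that idea; the relevance tiers, the log damping, and the sample query counts are hypothetical choices for illustration, not Amundsen's actual ranking formula.

```python
import math


def score(query, table_name, query_count_30d):
    """Text relevance scaled by a damped popularity signal (hypothetical weighting)."""
    q = query.lower()
    name = table_name.lower()
    if name == q:
        relevance = 3.0
    elif name.startswith(q):
        relevance = 2.0
    elif q in name:
        relevance = 1.0
    else:
        return 0.0
    # log1p keeps a heavily used table from drowning out an exact name match.
    return relevance * (1.0 + math.log1p(query_count_30d))


def rank(query, usage):
    """Return matching table names, best first; ties break alphabetically."""
    scored = [(score(query, name, count), name) for name, count in usage.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in ordered if s > 0]


# Hypothetical tables with the number of queries each received in the last 30 days.
usage = {
    "orders": 40,
    "orders_tmp_backfill": 0,
    "daily_orders": 900,
    "users": 5000,
}

results = rank("orders", usage)
# results == ["orders", "daily_orders", "orders_tmp_backfill"]
```

The unused backfill table is a better prefix match than `daily_orders`, but the popularity signal pushes it to the bottom, which is usually what the person searching wants.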

What to carry away

Every catalog solves the same three problems (ingest, store/index, serve), but the open-source options make genuinely different architectural bets: DataHub's Kafka-plus-graph-database stack buys real-time, fine-grained lineage at real operational cost; Amundsen's lighter stack buys simplicity and fast search at the cost of real-time sophistication; OpenMetadata sits in between, leaning toward simplicity, with broad out-of-the-box connector coverage. Platform-native catalogs (Unity Catalog, Purview, Glue Data Catalog) are excellent within a single platform's boundary and weak across a multi-platform estate, which is exactly where standalone catalogs, open source or commercial, earn their keep.

Self-hosting open source is not free; it's roughly half to a full engineer's ongoing time once you're past pilot, and that's the honest comparison point against commercial licensing. But none of this matters if the catalog isn't searchable fast enough to beat asking in Slack: adoption lives or dies on discovery UX and how easy it is to fill in business metadata, not on the architecture diagram. Pick the storage and ingestion model that fits your operational appetite, then spend the real effort on making the thing fast to search and easy to enrich, because that's what determines whether anyone outside the data team ever opens it twice.