Data Quality: Dimensions, Scoring, and Where It Actually Lives in the Stack

"Is the data good?" is the question every stakeholder eventually asks, and it's a bad question — not because it's unfair, but because "good" isn't one thing. I've watched teams argue past each other in exactly this way: the platform team points at 99.9% pipeline uptime, the analyst points at a report with three duplicate customer records, and both are technically right, because uptime and duplicate-freedom are different dimensions of quality that happen to both be true at once. Data quality isn't a single score — it's a small set of named, independently measurable properties, and most of the friction in a "the data is bad" conversation comes from nobody having agreed which property is actually broken.

This is the framework that pipeline testing (dbt tests, Great Expectations, Soda — pre-deploy CI checks) and data observability (freshness, volume, schema, distribution, lineage — production monitoring) both quietly assume but never define: what "quality" actually means, how to score it, how to triage when it breaks, and which layer of the stack should be catching which class of problem.

What are the actual dimensions of data quality?

The DAMA-DMBOK (Data Management Body of Knowledge) framework — the closest thing the data management field has to a standard reference — names six dimensions that show up, by one name or another, in nearly every serious data quality program: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Each one describes a genuinely different way data can be wrong, and a dataset can score well on some and badly on others simultaneously — which is exactly the source of the "is the data good" disagreement above.

| Dimension | What it means | Example violation |
| --- | --- | --- |
| Accuracy | Data correctly represents the real-world entity it describes | A customer's recorded address is their old one: the record exists, is well-formed, and is wrong |
| Completeness | All necessary data is present, no missing required fields or records | 30% of orders have a null shipping_country, breaking regional revenue reports |
| Consistency | The same fact agrees across systems and sources | A customer's tier is "Gold" in the CRM and "Silver" in the billing system |
| Timeliness | Data is available and current enough for its intended use | Yesterday's inventory count loads at 2pm, after the morning restocking decision already happened |
| Validity | Data conforms to its defined format, type, or domain of allowed values | A status field contains "shiped", a value outside the defined enum |
| Uniqueness | No unintended duplicate records for the same real-world entity | The same customer appears three times from three signup channels, inflating a headcount metric |

DMBOK 2.0 lists additional dimensions in its full taxonomy (integrity, reasonableness, currency among them), but these six do almost all of the practical work — they're the ones with unambiguous, machine-checkable definitions, which matters because a dimension you can't turn into a query isn't one you can actually monitor or enforce.
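As a rough illustration of what "turn a dimension into a query" can look like, here is a minimal sketch that scores three of the dimensions as pass rates over a handful of in-memory rows. The sample rows, the field names, and the `ALLOWED_STATUSES` enum are all hypothetical; in practice the same logic would more likely live in SQL or a testing framework.

```python
from collections import Counter

# Hypothetical sample rows; field names and the status enum are illustrative only.
ORDERS = [
    {"order_id": 1, "shipping_country": "DE", "status": "shipped"},
    {"order_id": 2, "shipping_country": None, "status": "shiped"},
    {"order_id": 2, "shipping_country": "FR", "status": "pending"},
    {"order_id": 3, "shipping_country": "US", "status": "delivered"},
]
ALLOWED_STATUSES = {"pending", "shipped", "delivered"}


def completeness(rows, field):
    """Share of rows where a required field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def validity(rows, field, allowed):
    """Share of rows whose value falls inside the allowed domain."""
    return sum(r[field] in allowed for r in rows) / len(rows)


def uniqueness(rows, key):
    """Share of rows whose key value appears exactly once."""
    counts = Counter(r[key] for r in rows)
    return sum(counts[r[key]] == 1 for r in rows) / len(rows)


scores = {
    "completeness": completeness(ORDERS, "shipping_country"),
    "validity": validity(ORDERS, "status", ALLOWED_STATUSES),
    "uniqueness": uniqueness(ORDERS, "order_id"),
}
print(scores)
```

Each function returns a number between 0 and 1 for one dimension only, which is the point: the same four rows score differently depending on which property is being asked about.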

How are DQ dimensions different from observability's "five pillars"?

This is the distinction that trips people up, because both frameworks use the word "quality" loosely and both eventually point at the same underlying bad data. The dimensions above are a rubric — they describe what "quality" means for a given dataset, independent of how or when you check it. Data observability's pillars (freshness, volume, schema, distribution, and lineage — see the observability deep dive for the full breakdown) are a monitoring strategy — they describe how you detect drift in production, continuously, without anyone writing a specific check for a specific failure in advance.

They overlap but aren't the same thing. A freshness alert (an observability pillar) might be the symptom that surfaces a timeliness violation (a DQ dimension) — but a completeness violation (a null column creeping in) might never trip a freshness or volume alert at all, because the data arrived on time and in the expected row count, just with a field silently empty. Observability is very good at catching the failure modes it's built to watch for — drift, delay, volume anomalies — and comparatively weak at catching a dimension violation that doesn't manifest as drift, like a systematically wrong-but-stable accuracy problem (a mis-mapped currency conversion that's been wrong, consistently, since day one). That gap is exactly why a DQ framework and an observability strategy are complementary layers, not substitutes for each other.

graph TD
    DIM["DQ dimensions<br/>(the rubric: what 'quality' means)"]
    A1["Accuracy"] --> DIM
    A2["Completeness"] --> DIM
    A3["Consistency"] --> DIM
    A4["Timeliness"] --> DIM
    A5["Validity"] --> DIM
    A6["Uniqueness"] --> DIM
    OBS["Observability pillars<br/>(the monitoring strategy: how drift is caught)"]
    B1["Freshness"] --> OBS
    B2["Volume"] --> OBS
    B3["Schema"] --> OBS
    B4["Distribution"] --> OBS
    B5["Lineage"] --> OBS
    DIM -.->|"a dimension violation<br/>MAY surface as"| OBS

Two related but distinct frameworks. Dimensions define what "quality" means for a record or dataset; observability pillars define how drift gets detected continuously in production. A dimension violation sometimes trips an observability alert (a completeness collapse in which whole records go missing shows up as a volume anomaly) and sometimes doesn't (a null creeping into one column leaves the row count untouched, and a stable, systematic accuracy error never looks like drift to a monitoring system, because it was wrong from the start and stays wrong).
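The null-column case can be made concrete with a small sketch. It assumes two hypothetical batches with identical row counts, a simple row-count monitor standing in for the volume pillar, and a null-rate check standing in for the completeness dimension; the 10% tolerance is an arbitrary illustrative value, not a recommendation.

```python
# Hypothetical batches: same row count, but today's load has a field silently empty.
yesterday = [{"id": i, "shipping_country": "DE"} for i in range(100)]
today = [
    {"id": i, "shipping_country": None if i % 3 == 0 else "DE"}
    for i in range(100)
]


def volume_ok(current, baseline, tolerance=0.10):
    """Observability-style check: row count within +/- tolerance of the baseline."""
    return abs(len(current) - len(baseline)) <= tolerance * len(baseline)


def null_rate(rows, field):
    """Dimension-style check: fraction of rows with the field missing."""
    return sum(r[field] is None for r in rows) / len(rows)


# The volume monitor sees nothing unusual...
print("volume ok:", volume_ok(today, yesterday))
# ...while the completeness check shows roughly a third of the column is empty.
print("null rate:", null_rate(today, "shipping_country"))
```

The two checks look at the same batch and disagree, which is the overlap-but-not-identity relationship between the two frameworks in miniature.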

How do you turn six dimensions into one number leadership can act on?

Composite quality scoring aggregates dimension-level checks (each one either a pass rate — "98.7% of rows passed the completeness check" — or a binary pass/fail at the dataset level) into a single score per dataset or table, and the design decision that matters most is weighting. Treating all six dimensions as equally important is the default most teams start with, and it's usually wrong: a marketing analytics table can tolerate a completeness gap in an optional field far better than a billing table can tolerate an accuracy error in a charge amount, and a score that weights both the same way produces a number nobody trusts, because it doesn't track what actually matters for that specific dataset. The fix is weighting each dimension per table by business criticality — decided deliberately, with the data owner, not defaulted to uniform — so the composite score for a finance table punishes accuracy and consistency failures much harder than it punishes a timeliness miss on a field nobody urgently needs same-day.

# A composite DQ score definition. Weights are set deliberately per table,
# not defaulted to equal, because "quality" means something different
# for a billing table than for a marketing engagement table.
# Weights sum to 1.0; timeliness is not listed, so it carries no weight here.
table: billing.invoices
dimensions:
  accuracy:     { weight: 0.35, check: charge_amount_reconciliation }
  completeness: { weight: 0.25, check: required_fields_not_null }
  consistency:  { weight: 0.20, check: cross_system_customer_tier_match }
  validity:     { weight: 0.10, check: enum_and_format_checks }
  uniqueness:   { weight: 0.10, check: dedup_by_invoice_id }
composite_score_threshold: 0.95
owner: billing-data-team
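One plausible way to evaluate a definition like the one above is a weighted average of per-dimension pass rates. The sketch below assumes that aggregation (the definition itself does not say how the score is computed), copies the weights and threshold from the `billing.invoices` example, and uses made-up pass rates for a single run.

```python
import math

# Weights and threshold copied from the billing.invoices definition above.
WEIGHTS = {
    "accuracy": 0.35,
    "completeness": 0.25,
    "consistency": 0.20,
    "validity": 0.10,
    "uniqueness": 0.10,
}
THRESHOLD = 0.95


def composite_score(pass_rates, weights):
    """Weighted average of per-dimension pass rates (each in [0, 1])."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(pass_rates)
    if missing:
        raise ValueError(f"no pass rate for: {sorted(missing)}")
    return sum(weights[d] * pass_rates[d] for d in weights)


# Hypothetical results for one run: only the accuracy check has failures.
pass_rates = {
    "accuracy": 0.85,
    "completeness": 1.00,
    "consistency": 1.00,
    "validity": 1.00,
    "uniqueness": 1.00,
}

weighted = composite_score(pass_rates, WEIGHTS)
uniform = composite_score(pass_rates, {d: 0.2 for d in WEIGHTS})

print(round(weighted, 4), weighted >= THRESHOLD)  # 0.9475 False
print(round(uniform, 4), uniform >= THRESHOLD)    # 0.97 True
```

With these illustrative numbers, the same accuracy failure pushes the table below its threshold under the deliberate weights and sails through under uniform ones, which is the case for weighting per table.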

How should DQ issues be triaged — is every violation a page?

No, and treating every violation as equally urgent is how a DQ program trains people to ignore alerts. A working severity framework separates issues by actual business impact: critical (blocks a financial close, breaks a regulatory report, corrupts a customer-facing number — pages someone now), high (a key internal dashboard is wrong, needs same-day attention but doesn't wake anyone up), medium (a non-critical field or a low-traffic report, fix within the sprint), and low (cosmetic, batched into routine cleanup). The ownership question matters as much as the severity level — a triage framework without a named owner per severity tier just relocates the argument from "is this bad" to "whose problem is this," which is a worse argument to have during an actual incident than before one.

Severity should be a property of the table and the specific check, decided in advance — not improvised in the moment an alert fires. The fastest way to build alert fatigue is discovering, live, that a "critical" alert fires every day for something that turns out to be routine, or that a genuinely critical failure got labeled "low" by default and sat unaddressed for a week. Set severity when the check is written, alongside the DQ weighting decision above, and revisit it on a schedule — not reactively, after the framework has already lost people's trust.
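A minimal sketch of "severity decided in advance" is a registry keyed by table and check, filled in when the check is written. Everything named here is hypothetical: the `CheckPolicy` structure, the `marketing.engagement` table, the owner names, and the response strings, which only paraphrase the four tiers described above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Response per tier, paraphrasing the tiers described above.
RESPONSE = {
    Severity.CRITICAL: "page on-call now",
    Severity.HIGH: "same-day ticket",
    Severity.MEDIUM: "fix within the sprint",
    Severity.LOW: "batch into routine cleanup",
}


@dataclass(frozen=True)
class CheckPolicy:
    table: str
    check: str
    severity: Severity
    owner: str  # a named owner, so triage never starts with "whose problem is this"


# Hypothetical registry, populated when each check is written.
POLICIES = {
    ("billing.invoices", "charge_amount_reconciliation"): CheckPolicy(
        "billing.invoices", "charge_amount_reconciliation",
        Severity.CRITICAL, "billing-data-team",
    ),
    ("marketing.engagement", "optional_fields_not_null"): CheckPolicy(
        "marketing.engagement", "optional_fields_not_null",
        Severity.LOW, "marketing-analytics",
    ),
}


def route(table, check):
    """Look up the pre-agreed severity; an unregistered check is an error, not a default."""
    policy = POLICIES.get((table, check))
    if policy is None:
        raise KeyError(f"no severity registered for {table}.{check}")
    return policy.owner, RESPONSE[policy.severity]


print(route("billing.invoices", "charge_amount_reconciliation"))
```

Raising on an unregistered check, rather than falling back to a default tier, is a deliberate choice in this sketch: it keeps a genuinely critical failure from being labeled "low" by omission.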

Where in the stack should DQ actually be enforced?

Across four layers, each catching a different class of problem, and knowing which layer is responsible for which class is what keeps a DQ program from either leaving gaps or duplicating the same check three times for no benefit.

| Layer | Catches | Misses |
| --- | --- | --- |
| Ingestion-time validation | Malformed records, schema violations, type mismatches, all before bad data ever lands | Semantic accuracy (a well-formed but wrong value), cross-system consistency |
| Transformation-time (dbt/GE tests) | Logic bugs, referential integrity, known-shape assertions on transformed data (see pipeline testing) | Anything that only manifests over time, or in data the tests didn't anticipate checking |
| Runtime observability | Drift, freshness delays, volume anomalies, schema changes appearing after deploy (see data observability) | Stable, systematic errors present since day one: nothing "drifted," so nothing trips |
| BI-layer sanity checks | The last line of defense before a human sees a wrong number: reasonableness bounds on a dashboard metric | Everything upstream that already should have been caught; this layer is a safety net, not a strategy |

The practical rule: push each class of check to the earliest layer capable of catching it, and don't rely on a later layer to catch what an earlier one should have. Malformed and type-invalid data belongs at ingestion, not discovered three transformations downstream. Logic and referential-integrity problems belong in transformation-time tests, where dbt's test framework runs on every model build. Drift and freshness genuinely can't be caught earlier than runtime, because they're properties of behavior over time, not of a single record — that's the class observability exists for. And a BI-layer sanity check that's actually catching real problems regularly is a signal something upstream is under-covered, not a sign the program is working as designed.
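To make the earliest layer concrete, here is a minimal sketch of ingestion-time validation that quarantines malformed records before they land. The record shape, the field names, and the `ALLOWED_STATUSES` enum are assumptions for illustration; a real pipeline would more likely lean on a schema library or the ingestion tool's own validation.

```python
from datetime import date

ALLOWED_STATUSES = {"pending", "shipped", "delivered"}  # hypothetical enum


def validate_record(record):
    """Return a list of problems; an empty list means the record may land."""
    problems = []
    if not isinstance(record.get("order_id"), int):
        problems.append("order_id must be an integer")
    if record.get("status") not in ALLOWED_STATUSES:
        problems.append(f"status {record.get('status')!r} not in allowed values")
    try:
        date.fromisoformat(str(record.get("order_date")))
    except ValueError:
        problems.append("order_date must be an ISO 8601 date")
    return problems


def split_batch(records):
    """Separate records fit to land from those routed to a quarantine area."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


batch = [
    {"order_id": 1, "status": "shipped", "order_date": "2024-05-01"},
    {"order_id": 2, "status": "shiped", "order_date": "2024-05-01"},
    {"order_id": "3", "status": "pending", "order_date": "05/01/2024"},
]
accepted, quarantined = split_batch(batch)
print(len(accepted), "accepted;", len(quarantined), "quarantined")
```

Note what this layer cannot see: a record with a valid status and a well-formed date that happens to be the wrong date passes cleanly, which is why semantic accuracy sits in the "misses" column for ingestion.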

What to carry away

"Is the data good" only becomes an answerable question once you name the dimension in play — accuracy, completeness, consistency, timeliness, validity, and uniqueness are six genuinely different properties, and a dataset can pass some while failing others at the same time. DQ dimensions are the rubric for what quality means; observability's pillars are the monitoring strategy for catching drift — related, overlapping in practice, but not interchangeable, and a program that only has one of the two has a real, specific blind spot (dimensions alone won't catch drift over time; observability alone won't catch a stable error that was wrong from day one).

Turn dimensions into one number only by weighting deliberately per table against business criticality, never defaulting to equal weights across dimensions that don't actually matter equally. Triage by real business impact, decided in advance, with a named owner per severity tier. And enforce each class of problem at the earliest layer capable of catching it — ingestion for malformed data, transformation-time tests for logic, runtime observability for drift, BI-layer checks as the last-resort net — because relying on a downstream layer to catch what an earlier one should have is how "the data is bad" conversations keep happening after the DQ program is supposedly in place.