Text-to-SQL and the Semantic Layer: Why Chat-With-Your-Data Breaks

"Chat with your data" is the demo that sells the project and then quietly knifes it. I watched one land perfectly: the VP typed "what was revenue last quarter by region," the assistant wrote SQL, returned a tidy table, and the room was sold. Two weeks into the pilot, an analyst noticed the numbers were off — the model had summed an amount column that included refunds and cancelled orders, joined customers to orders in a way that fan-out-duplicated a chunk of rows, and used the order created date when finance reports on the closed date. Every answer was fluent, fast, and wrong, and a wrong number looks exactly like a right one until someone with context catches it.

That's the whole problem with naive text-to-SQL in one story. The model isn't bad at SQL — it's excellent at SQL. It's bad at knowing what your business means, because that meaning isn't in the schema. The fix isn't a smarter model; it's giving the model a smaller, governed problem to solve. This is why text-to-SQL over a raw warehouse breaks, what a semantic layer changes, and the patterns and verification UX that make conversational analytics trustworthy enough to put in front of an executive.

What text-to-SQL really is — and why it's so easy to demo

Text-to-SQL is translating a natural-language question into a SQL query that answers it. It demos trivially because a modern model, shown a schema, will write syntactically perfect SQL for almost anything you ask — and on a clean question against a tidy table, it's usually right. That early success is exactly what makes it dangerous: the failure mode isn't an error message, it's a plausible number that's subtly wrong, and there's no red squiggle for "this joined the wrong way."

The root cause is that a schema describes structure, not meaning. The columns tell the model that orders.amount is a number and orders.status is a string. They don't tell it that "revenue" means net amount excluding refunds and test accounts, that the canonical date is closed_at not created_at, that you join orders to customers through a bridge table to avoid double-counting, or that "active customer" has a specific 90-day definition the whole company agreed on. A human analyst carries that context in their head; the model has to guess it from column names — and it guesses confidently.

The model's problem isn't writing SQL — it's inferring business semantics the schema doesn't encode. Which revenue, which date, which join path, which filters, what a metric even means. Naive text-to-SQL asks the model to re-derive your company's analytics conventions on every question, fluently and invisibly. It will be wrong often enough that no one can trust any single answer.

The specific ways it produces wrong numbers

It's worth naming the failures precisely, because each one is silent:

Wrong measure. Sums gross when you mean net, counts rows when you mean distinct customers, ignores the "exclude internal accounts" filter everyone applies by reflex.
Join fan-out. A one-to-many join multiplies the fact rows, so a SUM double- or triple-counts. The query runs, the number is just inflated.
Wrong grain or date. Created vs closed vs shipped date; transaction grain vs daily grain. Off by a reporting convention nobody wrote down.
Ambiguity resolved by guessing. "Top customers" — by revenue or count? This year or all time? The model picks one silently instead of asking.
Right SQL, wrong table. Two tables that look similar (a raw staging table and a curated mart); it picks the one with the convenient name.

None of these throw. That's why "it's about 90% accurate" — a phrase I've heard sell many a pilot — is a trap for analytics. A chatbot that's 90% right about trivia is delightful. A system that gives executives a wrong number one time in ten, with no signal which time, doesn't get more cautious use — it gets distrusted entirely the first time someone's caught quoting a bad figure in a board deck.

The fix: make the model choose metrics, not write SQL

The semantic layer is the answer, and the shift is subtle but total: instead of the model generating free-form SQL against raw tables, it generates a structured query against defined metrics and dimensions, and a deterministic engine compiles that to correct SQL. You define — once, in a governed artifact — what the entities are, what each metric means (revenue = net, excluding refunds and test accounts), what dimensions you can group by, how tables join, and what filters always apply. The model's job collapses from "author arbitrary SQL" to "pick the right metric, dimensions, and filters from a documented menu." That's a vastly smaller, safer search space, and the SQL it compiles to is correct by construction because a human defined the joins and the measure logic.

graph LR
    Q["User question
(natural language)"]
    LLM["LLM
maps question to a
structured metric query"]
    SEM["Semantic layer
(metrics, dimensions, joins,
filters — defined once)"]
    SQL["Compiled SQL
(correct by construction)"]
    WH["Warehouse"]
    ANS["Answer + the SQL,
shown for verification"]
    Q --> LLM
    SEM -->|"governed vocabulary"| LLM
    LLM -->|"metric: revenue
group_by: region
filter: last quarter"| SEM
    SEM --> SQL --> WH --> ANS

The model never writes raw SQL against tables. It maps the question onto governed metrics/dimensions from the semantic layer; the layer compiles deterministic SQL with the correct joins and measure logic baked in. The hard, ambiguous step (what things mean) was solved once, by a human, not re-guessed per question.

This is precisely what the serious tools do. Snowflake Cortex Analyst runs over a semantic model/view you define; the dbt Semantic Layer (MetricFlow) compiles metric queries to SQL; Looker's LookML, Cube, and AtScale are the same idea from the BI side. They differ in syntax and ecosystem, but the architecture is identical: a governed metrics definition in the middle, the model constrained to it. A trimmed semantic model reads like a contract for analytics:

name: orders
description: One row per order; the company revenue source of truth.
base_table: PROD.ANALYTICS.FCT_ORDERS
dimensions:
  - name: region
    expr: region
  - name: order_date
    expr: closed_at          # finance reports on close date, not created
metrics:
  - name: revenue
    description: Net revenue, excluding refunds and internal/test accounts.
    expr: SUM(net_amount)
    filters: ["is_test = FALSE", "status = 'completed'"]
  - name: active_customers
    description: Distinct customers with a completed order in the last 90 days.
    expr: COUNT(DISTINCT customer_id)
synonyms:
  revenue: ["sales", "net sales", "turnover"]

Now "revenue last quarter by region" can only compile one way — the right way — and "sales" resolves to the same metric because the synonyms say so.

What helps even before you have a full semantic layer

Building a complete semantic model is real work (more on that below), so here's what moves the needle incrementally, roughly in order of impact:

Curate and describe the schema. Point the model at vetted marts, not raw tables; add column descriptions, units, and synonyms. Half of naive failures are "it used the wrong table."
Verified queries. A library of trusted question→SQL pairs. On a new question, retrieve the closest verified examples as few-shot context — and for a near-exact match, serve the verified SQL directly instead of generating. This is the single highest-trust pattern.
Constrain access. A read-only role, mandatory LIMIT, statement timeouts, and query validation (parse before executing, reject anything touching tables outside the allowlist). The model should be unable to do harm even when wrong.
Show the SQL and the assumptions. Display the generated query and a plain-language restatement ("net revenue, by region, for Q1, excluding test accounts"). Users catch wrong assumptions instantly when they can see them.
Let it say "I'm not sure." A graceful "I can't answer that confidently with the modeled metrics" beats a confident wrong number every time. Design the out.

Evaluate it like you mean it

You cannot ship this on vibes — the failure mode is invisible, so you need measurement. The standard metric is execution accuracy: run the generated query and the gold query against the warehouse and compare result sets, over a curated benchmark of real questions (easy lookups, multi-step, deliberately ambiguous). Track it as a gate the way you'd gate any model change. This is the same eval-driven discipline as any LLM system — and text-to-SQL is one of the few places you get a near-objective scorer for free, because a query either returns the right rows or it doesn't. Use that gift.

The semantic model is the 80% of the work, and skipping it is why most "chat with your data" projects die. The model and the chat UI are the easy, demo-able 20%. The semantic layer — pinning down every metric definition, every join path, every filter the business takes for granted, and keeping it in sync as the warehouse changes — is the unglamorous 80%, and it's exactly the part teams skip because the demo already "worked." It's also organizationally hard: getting finance, sales, and product to agree on one definition of "active customer" surfaces disagreements that were always there and merely hidden. But that agreement is the deliverable. A text-to-SQL system is only as trustworthy as the semantic layer under it, and a beautiful chat box over undefined metrics is a confident-wrong-number generator with a friendly face.

What to carry away

Naive text-to-SQL breaks not because models are bad at SQL but because they're forced to guess business meaning the schema doesn't encode — and they guess fluently, producing wrong numbers indistinguishable from right ones. The fix is architectural: a semantic layer that defines metrics, dimensions, joins, and filters once, so the model selects from a governed vocabulary and a deterministic engine compiles correct SQL. That's what Cortex Analyst, the dbt Semantic Layer, and LookML-style tools all implement.

Until you have the full layer, you still win with curated schemas, a verified-query library, read-only guardrails, showing the SQL, and a graceful "I'm not sure" — and you keep yourself honest with execution-accuracy evals. The load-bearing truth: the semantic layer is the work, not the chatbot. Get the definitions agreed and modeled, and conversational analytics becomes trustworthy; skip them, and you've built a faster way to mislead your executives. It's the same single-source-of-truth instinct behind good BI semantic models — now doing double duty as the thing that keeps an LLM honest.