# Evaluating LLM and Agent Systems in Production: Evals That Actually Work

The assistant demoed beautifully. It answered the five questions the stakeholders asked in the room, the answers were fluent and confident, everyone clapped, and we shipped it. Three weeks later someone forwarded a screenshot: the bot had cheerfully told a customer the opposite of our refund policy, cited a document that didn't say that, and nobody had noticed because nobody was looking. There was no alarm to trip. We had unit tests on the retrieval code and the API, and exactly zero tests on the only thing that mattered — whether the answers were right.

That's the demo-to-production gap, and evaluation is the bridge across it. An LLM system that "works in the demo" has been tested on a handful of questions someone thought to ask, in the best possible framing, with a human ready to wave away anything weird. A production system gets ten thousand questions you didn't anticipate, phrased badly, about edge cases, with no human in the loop — and it will fail confidently, which is the worst way to fail. This is how I evaluate these systems for real: the offline harness, LLM-as-judge without fooling yourself, what's different about RAG and agents, online evaluation, and the discipline of letting evals gate what ships.

## What does it mean to evaluate an LLM system?

Evaluating an LLM or agent system means systematically measuring whether it produces correct, faithful, safe, and useful outputs — across a representative set of inputs, offline before you ship and online once you have — so you can change the system with evidence instead of vibes. The last clause is the point. Without evals, every prompt tweak, model swap, or retrieval change is a leap of faith: you eyeball a few outputs, they look fine, you ship, and you find out in production whether you made it better or worse.

The reason this needs its own discipline is that LLM output breaks every assumption traditional testing rests on. There's no single correct answer to compare against — many phrasings are right. Output is non-deterministic; the same input can produce different text. Quality is multi-dimensional — an answer can be fluent but wrong, correct but unfaithful to its sources, right but unsafe. You can't write `assert output == expected`. So you need a different toolkit, built around scoring rather than equality.

**An eval is a dataset of inputs plus a way to score the outputs.** That's the whole primitive. The art is in choosing inputs that represent (and stress) real usage, and scorers that actually track the quality you care about. Everything else — harnesses, judges, dashboards — is plumbing around those two choices.
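
To make that primitive concrete, here is a minimal sketch of a dataset plus a scorer. Everything in it is an assumption for illustration: the `EvalCase` shape, the `must_include_score` scorer, the 30-day refund policy, and `fake_assistant`, which stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    question: str
    must_include: list[str]  # facts a correct answer has to mention


# A scorer maps (case, system output) to a score in [0, 1].
Scorer = Callable[[EvalCase, str], float]


def must_include_score(case: EvalCase, output: str) -> float:
    """Fraction of required facts that appear in the output (case-insensitive)."""
    if not case.must_include:
        return 1.0
    text = output.lower()
    hits = sum(1 for fact in case.must_include if fact.lower() in text)
    return hits / len(case.must_include)


def run_eval(
    system: Callable[[str], str], cases: list[EvalCase], scorer: Scorer
) -> dict[str, float]:
    """Run the system on every case and return one score per case id."""
    return {case.case_id: scorer(case, system(case.question)) for case in cases}


def fake_assistant(question: str) -> str:
    # Hypothetical stand-in for the real LLM call.
    return "Refunds are available within 30 days of purchase."


cases = [
    EvalCase("refund-window", "How long do I have to request a refund?", ["30 days"]),
    EvalCase("refund-method", "How will I be refunded?", ["original payment method"]),
]
scores = run_eval(fake_assistant, cases, must_include_score)
mean_score = sum(scores.values()) / len(scores)
```

The per-case scores matter more than the mean: the second case fails outright, and an average alone would hide which one.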

## The scoring toolkit: from string match to LLM-as-judge

Scorers fall into a few families, cheapest and most reliable first. Use the cheapest one that captures what you care about, and reach for a judge only when nothing simpler will do.

| Scorer | How it works | Good for |
| --- | --- | --- |
| Deterministic checks | Exact/regex match, valid JSON, schema conformance, "contains X", length | Structured output, format, refusals, must-include facts |
| Reference-based | Compare to a gold answer: exact label match, or embedding similarity for free text (BLEU/ROUGE are weak for this) | Tasks with a known answer (extraction, classification) |
| LLM-as-judge | A strong model scores the output against a rubric, with reasons | Open-ended quality: helpfulness, correctness, tone, faithfulness |
| Human review | A person labels a sample against guidelines | Ground truth, calibrating judges, the cases you can't automate |

A surprising amount is catchable with deterministic checks: does it return valid JSON in the schema, did it refuse the thing it should refuse, did it include the disclaimer legal requires. Those are effectively free, instant, and deterministic (the same output always gets the same verdict), so they belong in every suite. But the questions that matter most ("is this answer actually correct and grounded in the source?") are open-ended, and that's where **LLM-as-judge** earns its place: you give a capable model the input, the output, and a rubric, and ask it to score, ideally with a written rationale you can audit.
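
The deterministic checks named above can be sketched with nothing but the standard library. The schema (`answer` and `sources` keys), the disclaimer text, and the refusal pattern are all hypothetical; swap in whatever your own output contract requires.

```python
import json
import re

# All three constants are assumptions for illustration.
REQUIRED_KEYS = {"answer": str, "sources": list}
DISCLAIMER = "this is not legal advice"
REFUSAL_PATTERN = re.compile(r"\b(can't|cannot|unable to) help with\b", re.IGNORECASE)


def check_valid_json(raw: str) -> bool:
    """True if the raw model output parses as JSON at all."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def check_schema(raw: str) -> bool:
    """True if the output is a JSON object with the required keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in REQUIRED_KEYS.items())


def check_disclaimer(text: str) -> bool:
    """True if the required disclaimer appears in the text."""
    return DISCLAIMER in text.lower()


def check_refusal(text: str) -> bool:
    """True if the text looks like a refusal, per the (crude) pattern above."""
    return REFUSAL_PATTERN.search(text) is not None


good_output = (
    '{"answer": "Refunds take 5 days. This is not legal advice.", '
    '"sources": ["policy.md"]}'
)
results = {
    "valid_json": check_valid_json(good_output),
    "schema": check_schema(good_output),
    "disclaimer": check_disclaimer(json.loads(good_output)["answer"]),
}
```

A regex refusal check is deliberately crude; it catches the obvious phrasing and leaves the subtle cases to a judge or a human.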

## LLM-as-judge, without fooling yourself

LLM-as-judge is powerful and seductive, and the seduction is the danger: it produces a confident number that *feels* like ground truth and isn't. A judge model has biases — it favors longer answers, it favors the first option in a pairwise comparison (position bias), it rates its own family of models more highly, and it can be wrong in exactly the cases your system is wrong. Treat the judge as a noisy instrument that needs calibration, not an oracle.

What makes it trustworthy in practice: a **specific rubric** (not "rate 1–10" but "score 1 if the answer contradicts the context, 3 if unsupported, 5 if fully grounded"), **pairwise comparison** (A vs B is more reliable than absolute scores, with the order randomized to cancel position bias), and — the step everyone skips — **calibrating the judge against human labels** on a sample, so you know its agreement rate before you trust its verdicts. A judge that agrees with humans 60% of the time is a random number generator with good prose.

```text
You are grading an answer for FAITHFULNESS to the provided context.

Context: {{context}}
Question: {{question}}
Answer: {{answer}}

Score strictly:
  1 — the answer states something the context contradicts
  3 — the answer includes claims the context does not support
  5 — every claim in the answer is supported by the context

Return JSON: {"score": <1|3|5>, "reason": "<one-sentence rationale>"}
```
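
The prompt above covers absolute scoring. For the pairwise variant with randomized order, a minimal sketch might look like the following. The judge is modeled as any callable that returns `"first"` or `"second"`; the two stand-in judges are hypothetical and exist only so the example runs without a model.

```python
import random
from typing import Callable

# A pairwise judge sees (question, first answer, second answer)
# and returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def compare(question: str, answer_a: str, answer_b: str,
            judge: PairwiseJudge, rng: random.Random) -> str:
    """Return "A" or "B", randomizing which answer is shown first."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Map the slot the judge picked back to the system that produced it.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def win_rate_a(pairs: list[tuple[str, str, str]],
               judge: PairwiseJudge, seed: int = 0) -> float:
    """Share of comparisons won by system A, with a seeded RNG for repeatability."""
    rng = random.Random(seed)
    wins = sum(compare(q, a, b, judge, rng) == "A" for q, a, b in pairs)
    return wins / len(pairs)


def position_biased_judge(question: str, first: str, second: str) -> str:
    # Hypothetical worst case: always prefers whatever is shown first.
    return "first"


def longer_is_better_judge(question: str, first: str, second: str) -> str:
    # Hypothetical judge with a real (if silly) preference.
    return "first" if len(first) >= len(second) else "second"


pairs = [("q", "a much longer and more detailed answer", "short")] * 1000
biased_rate = win_rate_a(pairs, position_biased_judge)
length_rate = win_rate_a(pairs, longer_is_better_judge)
```

With a purely position-biased judge the win rate lands near 50%, which is the point: randomization turns position bias into noise instead of a systematic advantage for one system. A genuine preference still comes through at 100%.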

## RAG and agents need their own metrics

For a [RAG system](rag-fundamentals), a single "is the answer good" score hides where it broke. Decompose it: **context relevance** (did retrieval fetch the right chunks?), **faithfulness/groundedness** (is the answer supported by those chunks, or did the model make it up?), and **answer relevance** (does it actually address the question?). This decomposition is what tools like RAGAS popularized, and it's diagnostic gold — a low faithfulness score with high context relevance means your retrieval is fine and your prompt is letting the model hallucinate; the reverse means fix retrieval first.
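
The diagnostic logic in that paragraph is simple enough to write down. This is a sketch under assumptions: scores are normalized to the range 0 to 1, and the 0.7 threshold is arbitrary. The `RagScores` and `diagnose` names are mine, not from RAGAS or any other tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RagScores:
    context_relevance: float  # did retrieval fetch the right chunks?
    faithfulness: float       # is the answer supported by those chunks?
    answer_relevance: float   # does the answer address the question?


def diagnose(scores: RagScores, threshold: float = 0.7) -> str:
    """Point at the stage to fix first. The threshold is an assumed cut-off."""
    if scores.context_relevance < threshold:
        return "fix retrieval first"
    if scores.faithfulness < threshold:
        return "retrieval is fine; the prompt is letting the model hallucinate"
    if scores.answer_relevance < threshold:
        return "grounded but does not address the question"
    return "ok"


# Hypothetical per-case scores from an eval run.
run = [
    RagScores(0.9, 0.3, 0.9),
    RagScores(0.9, 0.4, 0.8),
    RagScores(0.2, 0.9, 0.9),
    RagScores(0.9, 0.9, 0.9),
]
summary = Counter(diagnose(s) for s in run)
```

Counting diagnoses across the whole eval set tells you where to spend the next week: here, two of four cases point at generation, one at retrieval.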

For **agents**, the final answer is only half the story — you have to evaluate the *trajectory*. Did it choose the right tool, with the right arguments, in a sensible order? Did it loop? Did it stay within a cost and latency budget? An agent that returns the right answer after 14 tool calls and $0.80 is failing even though the output is correct. So agent evals run at two levels: component (each tool call, each step) and end-to-end (task completion), with cost and step-count as first-class metrics alongside correctness.
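
A trajectory check along those lines can be sketched as below. The `ToolCall` record, the tool names, and the budget numbers are hypothetical; the sketch assumes your traces give you an ordered list of tool calls with JSON-serializable arguments and a per-call cost.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict
    cost_usd: float = 0.0


@dataclass
class TrajectoryReport:
    correct: bool
    steps: int
    cost_usd: float
    repeated_calls: int   # identical (tool, args) calls, a cheap loop signal
    tools_in_order: bool
    within_budget: bool

    @property
    def passed(self) -> bool:
        return self.correct and self.tools_in_order and self.within_budget


def in_order(expected: list[str], actual: list[str]) -> bool:
    """True if the expected tools appear in `actual` in the same relative order."""
    remaining = iter(actual)
    return all(tool in remaining for tool in expected)


def evaluate_trajectory(calls: list[ToolCall], final_correct: bool,
                        expected_tools: list[str],
                        max_steps: int, max_cost_usd: float) -> TrajectoryReport:
    seen: set[tuple[str, str]] = set()
    repeated = 0
    for call in calls:
        key = (call.name, json.dumps(call.args, sort_keys=True))
        if key in seen:
            repeated += 1
        seen.add(key)
    cost = round(sum(call.cost_usd for call in calls), 6)
    return TrajectoryReport(
        correct=final_correct,
        steps=len(calls),
        cost_usd=cost,
        repeated_calls=repeated,
        tools_in_order=in_order(expected_tools, [c.name for c in calls]),
        within_budget=len(calls) <= max_steps and cost <= max_cost_usd,
    )


# The "right answer after 14 tool calls and $0.80" case from the text.
wasteful = [ToolCall("search_docs", {"q": "refund policy"}, 0.06)] * 13
wasteful.append(ToolCall("answer", {}, 0.02))
report = evaluate_trajectory(wasteful, final_correct=True,
                             expected_tools=["search_docs", "answer"],
                             max_steps=5, max_cost_usd=0.10)
```

The report marks the run as correct but not passed, which is exactly the distinction an answer-only eval would miss.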

```mermaid
graph TD
    CHANGE["Change (prompt, model, retrieval, tool)"]
    OFFLINE["Offline eval (golden set + adversarial cases)"]
    GATE{"Pass vs last version?"}
    DEPLOY["Deploy"]
    ONLINE["Online eval (sample prod traffic, judges, user feedback)"]
    FAIL["New failure modes"]
    SET["Eval dataset"]
    CHANGE --> OFFLINE --> GATE
    GATE -->|"regression"| CHANGE
    GATE -->|"clears bar"| DEPLOY --> ONLINE
    ONLINE --> FAIL
    FAIL -->|"add as cases"| SET
    SET --> OFFLINE
```

Eval-driven development: every change clears an offline gate before deploy, production is sampled and scored online, and the new failures you find in production become permanent cases in the eval set — so the system can't regress on the same mistake twice.
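
The "add as cases" arrow is the part teams skip, so here is one way it might look. This sketch assumes the eval set lives in a JSONL file, one case per line; the field names and the hashing scheme for case ids are my own choices, not a standard.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def case_id_for(question: str) -> str:
    """Stable id from the normalized question, so duplicates collapse."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def add_failure_case(dataset_path: Path, question: str,
                     bad_output: str, note: str) -> bool:
    """Append a production failure to the eval set. Returns False if already there."""
    case_id = case_id_for(question)
    existing: set[str] = set()
    if dataset_path.exists():
        with dataset_path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["case_id"] for line in f if line.strip()}
    if case_id in existing:
        return False
    record = {
        "case_id": case_id,
        "question": question,
        "bad_output": bad_output,  # kept as a reference for reviewers
        "note": note,
        "source": "production",
    }
    with dataset_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_cases.jsonl"
    first = add_failure_case(
        path, "How long do I have to request a refund?",
        "You can request a refund at any time.", "contradicts refund policy",
    )
    duplicate = add_failure_case(
        path, "  how long do I have to request a REFUND? ",
        "Refunds never expire.", "same question, reported again",
    )
    stored = path.read_text(encoding="utf-8").splitlines()
```

Keeping the file in the repository gives you the "versioned like code" property for free: every new case shows up in a diff and gets reviewed.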

## Offline harness and eval-driven development

The offline harness is where evals change how you build, not just how you report. You curate an **eval dataset** — a "golden set" of representative inputs plus deliberately nasty ones (ambiguous questions, adversarial phrasings, the edge cases that bit you before) — and you run every candidate version against it, scoring each. Then you wire it into CI as a **gate**: a prompt change or model upgrade that regresses the scores doesn't merge. This is eval-driven development, and it flips the loop from "ship and hope" to "prove it's better, then ship."
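
The gate itself can be very small. A sketch, assuming each eval run produces a dict of metric name to mean score where higher is better; the 0.02 tolerance is an arbitrary allowance for run-to-run noise, and the metric names are examples.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """List metrics where the candidate is worse than baseline by more than tolerance."""
    regressions = []
    for metric, base in baseline.items():
        cand = candidate.get(metric)
        if cand is None:
            regressions.append(f"{metric}: missing from candidate run")
        elif cand < base - tolerance:
            regressions.append(f"{metric}: {base:.2f} -> {cand:.2f}")
    return regressions


def exit_code(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """0 if the candidate clears the gate, 1 otherwise (for use as a CI exit status)."""
    return 1 if find_regressions(baseline, candidate) else 0


# Hypothetical scores from the last released version and a new prompt.
baseline = {"faithfulness": 0.90, "answer_relevance": 0.85}
candidate = {"faithfulness": 0.80, "answer_relevance": 0.86}
problems = find_regressions(baseline, candidate)
```

In a CI job you would print `problems` and pass `exit_code(baseline, candidate)` to `sys.exit`, so a regression fails the build rather than decorating a report.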

It also makes the otherwise-terrifying decisions tractable. Should you move from the expensive frontier model to a cheaper one? Run both against the eval set; if quality holds within your bar, take the savings with evidence. Does the new retrieval strategy help? The eval set answers in minutes. You start small — twenty to fifty hand-built cases beat zero, and beat a thousand auto-generated ones nobody curated — and you grow the set from real production failures.

## Online evaluation: production is the eval set you can't write

No offline set covers what real users do. So the second half is online: **sample production traffic**, run judges and guardrail checks asynchronously over it, track quality metrics as a time series, and capture **user feedback** (thumbs, corrections, escalations) as a cheap real-world label. This is where evaluation meets [LLM observability](llm-observability) — you need the traces (inputs, retrieved context, tool calls, outputs) to score anything, and the same traces power both debugging and online scoring. The payoff is detecting the policy-contradiction screenshot from my opening *from your own dashboard*, in hours, not from a customer weeks later.

**The two failure modes are vibes shipping and eval theater, and the second is sneakier.** Everyone knows shipping on vibes (eyeball three outputs, deploy) is bad. The subtler trap is building an elaborate eval dashboard with twelve metrics that nobody reads and no gate acts on: eval theater, all ceremony, no decisions. Right behind it come two more: over-trusting the LLM judge (a number is not ground truth until you've checked it against humans) and overfitting to a stale golden set (you optimize the prompt until the eval is green while production quietly rots, because the eval set stopped resembling reality). An eval is only real if a bad score *stops something*: blocks a merge, pages someone, rolls back a deploy. If nothing happens when the number drops, you don't have evaluation. You have a chart.

## What actually works

- **Start with 20–50 hand-curated cases.** Real inputs, a few adversarial ones, the failures you already know about. Small and real beats large and synthetic.

- **Layer the scorers.** Cheap deterministic checks first (format, refusals, must-includes), LLM-as-judge for open-ended quality, human spot-checks to calibrate the judge.

- **Decompose RAG and agents.** Score retrieval and generation separately; score agent trajectories and cost, not just final answers.

- **Gate CI on it.** A regression against the last version blocks the merge. This is the step that turns evals from reporting into engineering.

- **Close the loop.** Every production failure becomes a permanent eval case, so you never regress on the same bug twice. The eval set is a living asset, versioned like code.

- **Calibrate the judge.** Measure its agreement with human labels before you trust it, and re-check when you change the judge model.
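
For the calibration step, the raw agreement rate is the number to start with. The sketch below also computes Cohen's kappa, which is my addition rather than something the list above prescribes: it corrects agreement for what two raters would hit by chance. The labels are made-up rubric scores on the 1/3/5 scale.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Share of items where the judge gave the same label as the human."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Probability both raters pick the same label if they labeled independently.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels on a ten-item calibration sample.
human_labels = [5, 5, 3, 1, 5, 3, 1, 5, 3, 5]
judge_labels = [5, 5, 3, 3, 5, 3, 1, 5, 5, 5]
raw = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Here the judge agrees on 8 of 10 items, and kappa comes out lower than the raw rate because some of that agreement is what you'd expect from both raters favoring the same label. Ten items is far too few for a real calibration; treat the size as a placeholder.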

## What to carry away

Evaluation is the discipline that separates an AI demo from an AI product, because the demo only ever faced the questions you chose and production faces the ones you didn't. An eval is just a dataset of inputs plus a scorer; build the harness around representative-and-adversarial inputs and the cheapest scorer that captures real quality. Use **deterministic checks** where you can, **LLM-as-judge** (with a rubric, pairwise, and calibrated against humans) where you must, and **decompose RAG and agents** so a low score tells you where it broke. Run it **offline as a CI gate** and **online over sampled production traffic**, and feed every real failure back into the set.

The one idea to keep: an eval is only real if a bad score stops something. Wire it to a gate, and "we think this is better" becomes "we measured that it's better" — which is the entire difference between the systems that survive contact with real users and the ones that quietly embarrass you three weeks after the applause. This is the measurement layer under everything in [AI strategy](ai-strategy), the safety net beside [LLM observability](llm-observability), and the gate that makes shipping a [production assistant](building-ai-assistant-snowflake-cortex) defensible.
