A single agent loop, left uninstrumented, once routed every step, including the trivial ones, to the same frontier reasoning model the hard steps genuinely needed. The bill wasn't a surprise so much as an indictment: nobody had made a deliberate choice about which model each step deserved, so the system defaulted to the most expensive one for all of them. That's the story behind most out-of-control LLM spend I've seen: not a single dramatic mistake, but the absence, on the token side, of the discipline warehouse FinOps already applies to compute. The unit economics of tokens are also different enough that the old playbook doesn't transfer cleanly.
This is that discipline: why a task can cost roughly two orders of magnitude more depending purely on which model handles it; prompt caching and the output-token asymmetry it doesn't touch; model routing as the highest-leverage lever available; and the FinOps loop (visibility, attribution, optimization, accountability) applied to spend a warehouse cost dashboard never sees.
Why is LLM cost so much more volatile than warehouse compute cost?
A warehouse query costs roughly what its data volume and complexity dictate, and that relationship is stable — the same query costs about the same amount whenever you run it. An LLM call's cost depends on a choice that's invisible in the code path unless someone explicitly logs it: which model handled it. A task routed to a frontier reasoning model can cost on the order of 190 times more than the identical task handled by a fast, smaller model — which means the single most consequential cost decision in an LLM system isn't infrastructure sizing, it's model selection, made per call, often by a default nobody revisited after the prototype.
Model routing — sending a query to a fast, cheap model by default and escalating to a deeper, more expensive one only when complexity actually warrants it — has become standard practice specifically because that cost gap is real and because most production traffic doesn't need the expensive model's capability. This is the single highest-leverage lever in LLM cost management, ahead of caching, ahead of prompt engineering, precisely because the multiplier it controls is the largest one in the system.
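As a rough illustration of the default-cheap, escalate-on-complexity pattern, here is a minimal routing sketch. The model names, the keyword list, the weights, and the 0.5 threshold are all hypothetical, and the heuristic score is only a stand-in for whatever complexity or confidence signal a real system would use (a trained classifier, for instance).

```python
from dataclasses import dataclass

# Hypothetical model identifiers, for illustration only.
FAST_MODEL = "fast-small"
DEEP_MODEL = "frontier-reasoning"


@dataclass
class RoutingDecision:
    model: str
    reason: str


def complexity_score(prompt: str) -> float:
    """Crude stand-in for a real complexity signal, clamped to [0, 1].

    Assumption: longer prompts and a few task keywords hint at harder work.
    """
    keywords = ("prove", "refactor", "multi-step", "plan", "debug")
    lowered = prompt.lower()
    keyword_hits = sum(1 for word in keywords if word in lowered)
    length_part = min(len(prompt) / 4000, 1.0)
    return min(1.0, 0.6 * length_part + 0.2 * keyword_hits)


def route(prompt: str, threshold: float = 0.5) -> RoutingDecision:
    """Default to the cheap model; escalate only at or above the threshold."""
    score = complexity_score(prompt)
    if score >= threshold:
        return RoutingDecision(DEEP_MODEL, f"score {score:.2f} >= {threshold}")
    return RoutingDecision(FAST_MODEL, f"score {score:.2f} < {threshold}")
```

Returning the reason alongside the model makes the routing choice loggable, which matters because the decision is otherwise invisible in the code path.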
What does prompt caching actually save, and what doesn't it touch?
Prompt caching stores the computed key-value tensors behind a repeated prompt prefix (a static system prompt, a tool schema, a long few-shot block) so that a subsequent call reusing that exact prefix doesn't pay full price to reprocess it. Depending on the provider, cached input tokens are billed at a discount of up to 90%. For an agent loop that re-sends the same system prompt and tool definitions on every step, this is close to free money: static structure that never changes is exactly what caching is built for, and teams commonly see 50–90% savings on the cached portion of input tokens across a long agent trajectory.
The trap is assuming caching solves the whole cost problem, because it only discounts input tokens — and the honest number to know is that the median output-to-input cost ratio across major providers sits around 4:1, with some premium reasoning models reaching 8:1. No provider's caching scheme touches output-token cost at all. A system that generates long, verbose responses — or an agent that "thinks out loud" at length before acting — can be fully cached on the input side and still expensive, because the output side, where the real asymmetry lives, isn't a caching problem; it's a prompt-design and generation-length problem.
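To make the asymmetry concrete, the sketch below prices one call with and without a cached prefix. The input price of $1 per million tokens and the token counts are made up; the 4:1 output ratio and the 90% cache discount are the figures quoted above, used here as defaults rather than as any provider's actual rates.

```python
def call_cost(
    cached_input_tokens: int,
    fresh_input_tokens: int,
    output_tokens: int,
    input_price: float = 1.0,     # hypothetical: dollars per million input tokens
    output_ratio: float = 4.0,    # output price as a multiple of the input price
    cache_discount: float = 0.9,  # fraction taken off cached input tokens
) -> dict:
    """Split the cost of a single call into input and output components."""
    per_token = input_price / 1_000_000
    cached = cached_input_tokens * per_token * (1 - cache_discount)
    fresh = fresh_input_tokens * per_token
    output = output_tokens * per_token * output_ratio
    total = cached + fresh + output
    return {
        "input": cached + fresh,
        "output": output,
        "total": total,
        "output_share": output / total if total else 0.0,
    }


# Same call twice: an 8,000-token static prefix plus 500 fresh tokens,
# producing 2,000 output tokens, first uncached and then cached.
uncached = call_cost(0, 8500, 2000)
cached = call_cost(8000, 500, 2000)
print(uncached)
print(cached)
```

With these illustrative numbers, caching removes roughly 85% of the input cost, yet the output cost is identical in both runs and ends up as about 86% of the cached call's total.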
| Lever | What it controls | What it doesn't touch, or what it requires |
|---|---|---|
| Model routing | Which model handles a given call — the largest cost multiplier in the system | Requires a reliable complexity/confidence signal to route on |
| Prompt caching | Repeated input prefixes — system prompts, tool schemas, static context | Output tokens, and any input that genuinely changes every call |
| Semantic caching | Near-duplicate queries with acceptable answer reuse | Requires a defined error-rate tolerance for "close enough" |
| Prompt compression | Redundant or low-information tokens within a prompt | Diminishing returns past a point; can degrade output quality if overapplied |
What is semantic caching, and when is it worth the complexity?
Exact-match caching only helps when a request is byte-identical to a previous one — rare for genuinely user-driven queries. Semantic caching extends the idea to queries that are meaningfully similar rather than identical, returning a cached response when a new prompt is close enough in embedding space to a previously answered one, under an explicit, configured error-rate tolerance. This is worth the added complexity specifically for high-volume, narrow-domain traffic — a support bot fielding the same handful of question shapes repeatedly — and not worth it for genuinely open-ended, high-stakes queries where "close enough" isn't actually close enough. Production stacks increasingly layer exact-match, semantic, and prefix caching together rather than picking one, because each catches a different repetition pattern.
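A minimal sketch of the lookup, assuming cosine similarity over embeddings and a linear scan. The `SemanticCache` class, the 0.92 similarity floor, and `toy_embed` are hypothetical: `toy_embed` is a bag-of-words stand-in so the example runs without an embedding model, and a real system would plug in an actual embedding function and a vector index.

```python
import math
import re
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Return a stored response when a new prompt is similar enough."""

    def __init__(self, embed: Callable[[str], Vector], min_similarity: float = 0.92):
        self._embed = embed
        # The similarity floor is the configured "close enough" tolerance.
        self._min_similarity = min_similarity
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((self._embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        query = self._embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._min_similarity else None


# Toy embedding: word counts over a tiny fixed vocabulary (illustration only).
VOCAB = ("reset", "password", "refund", "order", "how", "my")


def toy_embed(text: str) -> Vector:
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password?", "Use the reset link on the login page.")
print(cache.get("how can I reset my password"))
print(cache.get("Where is my refund for my order?"))
```

The threshold is the knob that trades hit rate against wrong-answer risk, so it should be set from measured error rates on real traffic rather than picked by feel.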

    graph TD
        REQ["Incoming request"]
        EXACT{"Exact match<br/>in cache?"}
        SEM{"Semantically similar,<br/>within error tolerance?"}
        ROUTE{"Complexity<br/>signal"}
        FAST["Fast, cheap model"]
        DEEP["Frontier reasoning model"]
        REQ --> EXACT
        EXACT -->|"yes"| CACHED["Return cached response<br/>(near-zero cost)"]
        EXACT -->|"no"| SEM
        SEM -->|"yes"| CACHED
        SEM -->|"no"| ROUTE
        ROUTE -->|"low"| FAST
        ROUTE -->|"high"| DEEP

The layered cost-control path most production LLM systems converge on: check exact-match cache, then semantic cache, and only pay for a live model call when neither hits — at which point routing, not caching, becomes the dominant cost decision.
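The same path can be written as one small function. This is a sketch under assumptions: `handle_request` and its parameters are hypothetical names, and the semantic lookup, the model picker, and the model call are passed in as plain callables so the control flow can be read (and tested) without any provider SDK.

```python
from typing import Callable, Dict, Optional, Tuple


def handle_request(
    prompt: str,
    exact_cache: Dict[str, str],
    semantic_lookup: Callable[[str], Optional[str]],
    pick_model: Callable[[str], str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (response, source), where source names the layer that answered."""
    key = prompt.strip()

    # Layer 1: exact-match cache, the cheapest possible hit.
    if key in exact_cache:
        return exact_cache[key], "exact-cache"

    # Layer 2: semantic cache, which returns None when nothing is close enough.
    similar = semantic_lookup(prompt)
    if similar is not None:
        return similar, "semantic-cache"

    # Layer 3: a live call, where model choice is the dominant cost decision.
    model = pick_model(prompt)
    response = call_model(model, prompt)
    exact_cache[key] = response
    return response, model
```

Returning the source with the response keeps the cache hit rate and the per-model traffic split observable, which the attribution step depends on.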
How does the FinOps loop actually apply to LLM spend?
The FinOps loop of visibility, attribution, optimization, and accountability (a practitioner's unpacking of the FinOps Foundation's Inform, Optimize, and Operate phases) carries over to GenAI spend almost unchanged in structure, even though the underlying cost drivers are different. Visibility means knowing token counts, model choice, and cost per call in the first place, which most teams don't have until they add explicit instrumentation: by default, an LLM API bill tells you a total, not which feature or which agent step drove it. Attribution means tagging every call back to the feature, team, or customer that generated it, the same discipline as tagging cloud resources for chargeback, and it's the step that turns "AI spend went up" into "this specific agent's retry loop went up." Optimization is where routing, caching, and prompt design live. Accountability means the team whose feature drives the cost actually sees that cost and owns the trade-off. That is the same behavioral-control mechanism showback and chargeback provide for cloud compute, applied to a cost category most engineering orgs still treat as an unowned platform expense.
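
    # Minimal per-call cost attribution: the visibility step most
    # LLM systems skip until the bill forces the question.
    # Assumes the OpenAI Python SDK; other providers name these fields differently.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def log_cost_event(**fields):
        print(fields)  # placeholder sink; swap in your metrics pipeline

    def call_llm(messages, model, feature_tag, team_tag):
        response = client.chat.completions.create(model=model, messages=messages)
        usage = response.usage
        # Cached-token counts live under prompt_tokens_details and may be absent.
        details = getattr(usage, "prompt_tokens_details", None)
        log_cost_event(
            feature=feature_tag,
            team=team_tag,
            model=model,
            input_tokens=usage.prompt_tokens,
            cached_tokens=getattr(details, "cached_tokens", 0) or 0,
            output_tokens=usage.completion_tokens,
        )
        return response
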
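Once events like these are logged, attribution is a rollup. The sketch below is one hypothetical way to do it: the price table is invented (real per-million-token prices vary by provider and change over time), and it assumes `input_tokens` already includes the cached tokens, so the cached portion is subtracted before pricing the rest at the full input rate.

```python
from collections import defaultdict

# Hypothetical prices in dollars per million tokens, for illustration only.
PRICES = {
    "fast-small": {"input": 0.25, "cached": 0.025, "output": 1.00},
    "frontier-reasoning": {"input": 15.00, "cached": 1.50, "output": 60.00},
}


def event_cost(event: dict) -> dict:
    """Price one logged event, keeping input and output cost separate."""
    price = PRICES[event["model"]]
    # Assumption: input_tokens is the total prompt size, cached tokens included.
    fresh_input = event["input_tokens"] - event["cached_tokens"]
    input_cost = (
        fresh_input * price["input"] + event["cached_tokens"] * price["cached"]
    ) / 1_000_000
    output_cost = event["output_tokens"] * price["output"] / 1_000_000
    return {"input": input_cost, "output": output_cost}


def cost_by_feature(events: list) -> dict:
    """Roll events up to (team, feature), the unit a showback report needs."""
    totals = defaultdict(lambda: {"input": 0.0, "output": 0.0})
    for event in events:
        cost = event_cost(event)
        key = (event["team"], event["feature"])
        totals[key]["input"] += cost["input"]
        totals[key]["output"] += cost["output"]
    return dict(totals)
```

Keying the rollup on team and feature rather than on model is deliberate: it answers who owns the spend first, and the model breakdown can be added as a second dimension.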
I've seen a team celebrate a caching rollout that cut their input-token bill by 70% and completely miss that output tokens — untouched by caching — had grown 3x in the same period as the product added longer, more elaborate agent responses. The total bill barely moved, and the caching win looked like a failure until someone actually split input from output cost. Track input and output token cost as separate line items from day one, not a blended total — a caching or routing win on one side can be silently cancelled by unmonitored growth on the other, and a blended number will hide exactly that.
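The arithmetic behind that anecdote is easy to reproduce. The dollar figures below are invented so that the two effects cancel exactly; only the 70% input reduction and the 3x output growth come from the story above.

```python
def split_report(
    input_before: float,
    output_before: float,
    input_change: float,   # multiplier on input cost, e.g. 0.3 for a 70% cut
    output_change: float,  # multiplier on output cost, e.g. 3.0 for 3x growth
) -> dict:
    """Compare a blended total against separate input and output line items."""
    input_after = input_before * input_change
    output_after = output_before * output_change
    return {
        "input_delta": input_after - input_before,
        "output_delta": output_after - output_before,
        "total_before": input_before + output_before,
        "total_after": input_after + output_after,
    }


# Illustrative monthly figures: $1,000 of input cost and $350 of output cost.
report = split_report(1000.0, 350.0, input_change=0.3, output_change=3.0)
print(report)
```

With these numbers the input line drops by about $700 and the output line rises by about $700, so the blended total stays near $1,350 and shows neither the win nor the regression.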
What to carry away
LLM cost is dominated by a decision invisible in most code paths: which model handled a given call. That decision can swing cost by roughly two orders of magnitude, which is why model routing is the single highest-leverage lever, ahead of caching or prompt tuning. Prompt caching is close to free savings for static prefixes but only discounts input tokens; the output side, where costs run roughly 4x input at the median and up to 8x for premium reasoning models, is a generation-length and prompt-design problem that no caching scheme touches.
Apply the same FinOps loop already proven on cloud compute — visibility, attribution, optimization, accountability — to token spend specifically, because an LLM bill without per-call, per-feature attribution tells you a total and nothing about where to act. And track input and output cost separately: they respond to different levers, and a win on one side can hide a silent regression on the other if you're only watching the blended number.