Fine-Tune vs RAG vs Prompt: How to Actually Decide

The most expensive mistake I see teams make with LLMs is reaching for fine-tuning to solve a problem that fine-tuning can't solve. "The model doesn't know our products, so let's fine-tune it on our catalog." Six weeks and a GPU bill later, the model still hallucinates product details — because they tried to teach facts with a technique that changes behavior. The three ways to adapt an LLM — prompting, retrieval-augmented generation, and fine-tuning — are not a quality ladder where fine-tuning is the top rung. They fix different problems, and picking the wrong one is how you burn a quarter. This is the decision framework I actually use.

The single distinction that resolves most of these debates: RAG and prompting change what the model knows in the moment; fine-tuning changes how the model behaves in general. Knowledge is retrieval. Behavior is training. Confuse the two and nothing works.

The three techniques, precisely

Prompt engineering

You change the instructions and examples you send at inference time — the system prompt, few-shot examples, output-format directives, chain-of-thought scaffolding. Nothing about the model changes; you're steering a fixed model with better inputs. It's the cheapest, fastest thing to try, it requires no infrastructure, and in 2026 — with large context windows and strong instruction-following — it solves far more than people expect. Always the first move. If a clearer prompt fixes it, you're done.

Retrieval-augmented generation (RAG)

You retrieve relevant documents at query time and put them in the prompt, so the model answers from supplied context rather than its parametric memory. I've written about RAG end to end; the role it plays here is specific: RAG is how you give a model knowledge it didn't have — your documents, fresh data, private data — without retraining. The knowledge stays in an external store you can update instantly, and answers can cite their sources. RAG is the right tool whenever the problem is "the model doesn't know something."

Fine-tuning (and why LoRA changed the math)

You continue training the model on your examples, adjusting its weights so it internalizes a behavior: a tone, a format, a structured-output schema, a domain's style, a classification task. Crucially, full fine-tuning (updating every weight) is expensive and rarely necessary anymore. PEFT (parameter-efficient fine-tuning), and especially LoRA (Low-Rank Adaptation), changed the economics: instead of updating billions of weights, LoRA freezes the base model and trains small low-rank "adapter" matrices — a tiny fraction of the parameters. You get most of the behavioral benefit at a fraction of the compute and storage, and you can swap adapters per task. Fine-tuning is the right tool when the problem is "the model doesn't behave the way I need," consistently, in a way no prompt reliably enforces.

Do not fine-tune to inject knowledge. It's the costly misconception that wastes the most money in applied LLM work. Fine-tuning teaches patterns and behavior, not facts — and worse, it bakes whatever facts it does absorb into weights that go stale the moment your data changes, with no way to cite a source and a real risk of confidently hallucinating around the edges of what it half-learned. If your company's policy changes, a fine-tuned model keeps reciting the old one until you retrain; a RAG system is correct the instant you update the document. So when someone says "fine-tune it on our knowledge base," the answer is almost always RAG. Reserve fine-tuning for behavior the model should always exhibit regardless of which documents are in front of it.

The decision: match the technique to the problem

graph TD
    START["The model isn't doing what I need"]
    Q1{"Is it a KNOWLEDGE gap
or a BEHAVIOR gap?"} KNOW["Knowledge: it doesn't know
facts / docs / fresh data"] BEHAVE["Behavior: tone, format,
style, a specific task"] Q2{"Did a better prompt
already fix it?"} PROMPT["Prompt engineering
(start here, always)"] RAG["RAG
(external, updatable,
citable knowledge)"] Q3{"Can prompting +
few-shot enforce it
reliably enough?"} FT["Fine-tune (LoRA/PEFT)
(internalize the behavior)"] START --> Q1 Q1 -->|knowledge| KNOW --> RAG Q1 -->|behavior| BEHAVE --> Q2 Q2 -->|yes| PROMPT Q2 -->|no| Q3 Q3 -->|yes| PROMPT Q3 -->|no| FT

The flow I run every time. First classify the gap as knowledge or behavior — that one fork eliminates most wrong turns. Knowledge gaps go to RAG (or a bigger/fresher prompt), never to fine-tuning. Behavior gaps start with prompting and few-shot, and only escalate to fine-tuning when prompting demonstrably can't enforce the behavior reliably enough at the cost and latency you need. Fine-tuning is the last resort, not the prestige option.

PromptingRAGFine-tuning (LoRA)
ChangesThe inputThe input (with retrieved context)The model's weights
FixesQuick behavior & format nudgesKnowledge gapsPersistent behavior / style / task
Fresh / changing dataManualYes — update the store, instantNo — stale until retrained
Citations / provenanceNoYesNo
Cost & effortLowestMedium (retrieval infra)Highest (data + training + serving)
Latency / token cost at inferenceLowHigher (context tokens)Low (behavior is in weights)

When fine-tuning genuinely wins

I don't want to talk anyone out of fine-tuning where it's right — it's powerful for the cases it fits. Reach for it when:

  • You need a consistent behavior no prompt reliably enforces: a strict output schema, a specific voice, a domain's terminology and reasoning style, refusing certain requests a particular way.
  • You're doing a narrow, high-volume task (classification, extraction, routing) where a small fine-tuned model beats a big prompted one on cost and latency — fine-tuning a smaller model to match a larger one's behavior is a classic, excellent cost play.
  • Prompt bloat is hurting you: if you're spending thousands of tokens of few-shot examples on every call to coax a behavior, fine-tuning bakes that in and shrinks (and speeds up) every request.
  • Latency matters and RAG/long prompts are too slow: behavior in weights costs no extra input tokens at inference.

Notice none of those is "the model needs to know our data." That's always RAG's job.

They combine — and usually should

The framing as a three-way choice is a simplification for deciding where to start. In production these compose, and the strongest systems use all three. A common, genuinely good architecture: RAG for the knowledge, fine-tuning for the behavior, prompting to orchestrate. You fine-tune a model (often a smaller one) to reliably produce your output format and domain style, you feed it retrieved context via RAG so its facts are fresh and citable, and you steer the whole thing with a well-engineered prompt. The fine-tuned model knows how to answer; RAG supplies what to answer from.

Climb the ladder in cost order, and stop as soon as it works. Prompt first — it's free and fixes more than you'd think. Add RAG when the gap is knowledge. Fine-tune only when you've proven a behavior gap that prompting can't close, or when a fine-tuned small model is a deliberate cost/latency win over a prompted large one. Every rung up adds infrastructure, evaluation burden, and a thing that can rot — so the discipline isn't "use the most powerful technique," it's "use the cheapest technique that actually solves this problem." And whichever you pick, you can't tell if it worked without evals — adaptation without measurement is just vibes.

What to carry away

Prompting, RAG, and fine-tuning aren't a quality ranking — they fix different problems, and the master distinction is knowledge versus behavior. RAG and prompting change what the model knows and does in the moment; fine-tuning (now cheap and practical thanks to LoRA/PEFT) changes how it behaves in general. The expensive, recurring mistake is fine-tuning to inject knowledge: it bakes facts into weights that go stale and can't be cited, when RAG would have been correct the instant you updated a document.

So decide by classifying the gap. Knowledge gap → RAG (or a better prompt). Behavior gap → prompt and few-shot first, fine-tune only when that demonstrably isn't enough or when a fine-tuned small model is a deliberate cost win. Then remember they compose: the best production systems fine-tune for behavior, retrieve for knowledge, and prompt to orchestrate. Climb the ladder in cost order, prove each step with evals, and stop the moment the problem is solved — which is more often at the prompt rung than anyone's GPU vendor would like you to believe.