# Model Serving in Production: KServe, Seldon, and BentoML

The model is trained, the metrics are good, the registry has the winner. Then someone says "great — now make it an API," and the work that's left turns out to be bigger than the work that's done. A `predict()` call in a notebook and a service that answers thousands of requests a second, stays up, scales with traffic, doesn't bankrupt you on idle GPUs, and lets you ship a new version without taking the old one down — those are different universes. **Model serving** is the discipline of crossing that gap, and it's where a lot of promising ML quietly dies because nobody owned the last mile.

Model serving means taking a trained model and running it as a production service that is reliable, scalable, observable, and updatable. The pieces it has to provide are a serving **runtime**, **autoscaling** (including the GPU-cost-saving trick of scale-to-zero), **batching** for throughput, and safe **rollout strategies** like canary and shadow. I'll walk through those, then show how **KServe**, **Seldon Core**, and **BentoML** split the work differently.

## Why "wrap it in Flask" isn't the answer

Everyone's first model API is a Flask app with a `/predict` route, and it works right up until it has to be real. Left on Flask's built-in development server, it isn't meant for production load, and it has no autoscaling, no health checks, no versioning story, no batching, no metrics, and it pins a GPU whether or not anyone's calling it. Each of those is a production requirement, and rebuilding them per model is exactly the wasted, error-prone effort serving frameworks exist to absorb. The point of a serving framework is that these concerns are solved once, consistently, so you supply the model and get the production behavior.

## The serving runtime and the inference graph

At the core is the **runtime**: the process that loads your model into memory and exposes a standard inference endpoint (typically HTTP and gRPC, increasingly following a shared "inference protocol" so clients are portable across frameworks). Two ways to get one: *pre-built model servers* that already know how to load a standard format — an MLflow model, a scikit-learn or XGBoost pickle, a Triton-served ONNX/TensorRT model — so you deploy without writing serving code; or a *custom runtime* when your inference needs arbitrary Python (preprocessing, business logic, multiple models).
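To make the shared inference protocol concrete, here is a minimal sketch of a request body in the style of the Open Inference Protocol (often called the V2 protocol), which KServe and Triton implement. The model name `iris`, the tensor name `input-0`, and the helpers `build_infer_request` and `infer_path` are hypothetical, chosen only for illustration.

```python
import json


def build_infer_request(rows, tensor_name="input-0"):
    """Build a V2-style inference request body from equal-length rows of floats."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows must be non-empty and rectangular")
    return {
        "inputs": [
            {
                "name": tensor_name,  # hypothetical tensor name
                "shape": [len(rows), len(rows[0])],
                "datatype": "FP32",
                # Tensor data is sent flattened in row-major order.
                "data": [float(x) for row in rows for x in row],
            }
        ]
    }


def infer_path(model_name):
    """Path of the inference endpoint for a named model."""
    return f"/v2/models/{model_name}/infer"


body = build_infer_request([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(infer_path("iris"))
print(json.dumps(body))
```

Because the payload describes tensors rather than anything framework-specific, the same client code can in principle talk to any runtime that speaks the protocol.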

Real inference is rarely just the model. There's preprocessing (tokenize, normalize, fetch features), the prediction, and postprocessing (threshold, format, enrich). Serving platforms model this as an **inference graph** — a pipeline of steps, sometimes including a **transformer** stage in front of the predictor and an **explainer** alongside it. Keeping these as declared stages rather than tangled in one script is what keeps a serving setup maintainable.

```mermaid
graph LR
    REQ["Request"]
    PRE["Transformer (preprocess / fetch features)"]
    PRED["Predictor (the model runtime)"]
    POST["Postprocess (threshold, format)"]
    RESP["Response"]
    REQ --> PRE --> PRED --> POST --> RESP
    PRED -.->|"in parallel"| EXP["Explainer (optional)"]
```

A serving inference graph. The request flows through a transformer (preprocessing, feature lookup), the predictor (the model runtime), and postprocessing — with an optional explainer for interpretability. Modeling these as declared, separable stages rather than one monolithic handler is what makes a production serving setup observable and maintainable.
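As a rough illustration of declared stages, the sketch below wires a transformer, a predictor, and a postprocessor into one callable. Everything here is hypothetical: `InferenceGraph`, `normalize`, `toy_model`, and `threshold` are stand-ins, and the optional explainer is left out. It shows the shape of the idea, not any platform's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

Stage = Callable[[Any], Any]


@dataclass
class InferenceGraph:
    """Three declared stages, each replaceable and testable on its own."""
    transformer: Stage
    predictor: Stage
    postprocessor: Stage

    def __call__(self, request: Any) -> Any:
        features = self.transformer(request)
        raw_prediction = self.predictor(features)
        return self.postprocessor(raw_prediction)


def normalize(request):
    # Preprocessing: scale raw 0-255 values into [0, 1].
    return [value / 255.0 for value in request["values"]]


def toy_model(features):
    # Stand-in for a real model: the mean of the features acts as a score.
    return sum(features) / len(features)


def threshold(score):
    # Postprocessing: turn the raw score into a labelled response.
    return {"score": round(score, 3), "label": "positive" if score >= 0.5 else "negative"}


graph = InferenceGraph(normalize, toy_model, threshold)
result = graph({"values": [255, 255, 0, 255]})
print(result)
```

Because each stage is a separate function, you can time, log, or swap one without touching the others, which is the maintainability argument in miniature.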

## Autoscaling and scale-to-zero

Inference traffic is bursty, and models — especially on GPUs — are expensive to keep running idle. So serving platforms autoscale on load, adding replicas under traffic and removing them when it drops. The feature that matters most for cost is **scale-to-zero**: when a model gets no traffic, scale it down to *no* replicas and stop paying for it entirely; when a request arrives, spin a replica back up. For the long tail of models that are used occasionally — internal tools, low-traffic endpoints, the dozens of models a platform team hosts — scale-to-zero is the difference between an affordable platform and a GPU bill that gets the whole effort cancelled.

**Scale-to-zero's bill comes due as the cold start.** When a request hits a scaled-to-zero model, someone waits for a pod to schedule, pull a multi-gigabyte image, load the model into memory (and onto the GPU), and warm up — seconds to minutes. That's fine for a batch or internal tool and unacceptable for a user-facing, latency-sensitive endpoint. The rule I use: scale-to-zero for spiky, latency-tolerant, or rarely-used models; a warm minimum replica count for anything a user waits on synchronously. Turning it on everywhere to save money is how you ship a "fast" model that takes 40 seconds to answer the first request after lunch.
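The warm-floor rule can be sketched as a replica calculation. This is a simplified, assumed model of concurrency-based autoscaling, not any platform's real algorithm: production autoscalers average load over a window and add stabilization logic. The function `desired_replicas` and its parameters are hypothetical.

```python
import math


def desired_replicas(in_flight, target_per_replica, min_replicas=0, max_replicas=10):
    """Replicas needed so each one handles about target_per_replica concurrent requests.

    min_replicas=0 allows scale-to-zero; min_replicas>=1 keeps a warm floor.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Rarely-used internal model: allowed to scale to zero when idle.
idle_internal = desired_replicas(in_flight=0, target_per_replica=4)

# User-facing endpoint: a warm floor of one replica avoids the cold start.
idle_user_facing = desired_replicas(in_flight=0, target_per_replica=4, min_replicas=1)

# Under load, both scale out, capped at max_replicas.
busy = desired_replicas(in_flight=9, target_per_replica=4)
spike = desired_replicas(in_flight=100, target_per_replica=4)
print(idle_internal, idle_user_facing, busy, spike)
```

The only difference between the cheap configuration and the latency-safe one is the value of the floor, which is why it is worth setting per model rather than platform-wide.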

## Batching for throughput

Models — neural networks especially — are far more efficient predicting on a batch than on single inputs, because the hardware is built for parallel matrix math. **Dynamic (adaptive) batching** exploits this without changing your client: the server briefly holds incoming requests (a few milliseconds) to group them into a batch, runs one inference, and splits the results back to the callers. You trade a little per-request latency for a large throughput gain and much better hardware utilization. It's the serving-side cousin of the batching that dominates [LLM inference](llm-inference-internals), and the same tension applies — bigger batches mean more throughput but more latency, so the batch window is a knob you tune to your SLA.
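Here is a minimal sketch of that hold-and-group loop using `asyncio`. The class `DynamicBatcher` and its parameters `max_batch_size` and `max_wait_s` are hypothetical names, and the lambda stands in for a real batched model call. A production batcher would also propagate model errors to waiting callers, which this sketch omits.

```python
import asyncio


class DynamicBatcher:
    """Group concurrent single-item requests into batches for one model call."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.005):
        self.predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s        # the batch window: the latency/throughput knob
        self.queue = asyncio.Queue()
        self.batch_sizes = []               # recorded only for inspection

    async def predict(self, item):
        # Each caller enqueues its input and waits on its own future.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.predict_batch([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            # Split the batched result back to the individual callers.
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def main():
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(5)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)
```

Callers still see a one-input, one-output call. Raising `max_wait_s` or `max_batch_size` gives the model larger batches at the cost of each request waiting longer, which is the SLA trade-off described above.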

## Rolling out a new version safely

The reason serving frameworks beat a hand-rolled service most decisively is deployment strategy. Swapping a model in production is risky — the new one might be worse on live traffic in ways your offline eval missed. Two patterns de-risk it, and both are first-class in serving platforms:

- **Canary** — route a small slice of live traffic (say 10%) to the new version while the rest stays on the proven one. Watch metrics; if it holds up, shift more; if not, roll back instantly. You limit the blast radius of a bad model.

- **Shadow (mirror)** — send the new version a *copy* of real traffic but don't return its responses to users. You see exactly how it behaves on production inputs — latency, errors, prediction distribution — with zero user risk, then promote once you trust it.

Both are about confronting a model with real production data before trusting it, because offline metrics never tell the whole story — the same lesson the [training/serving skew](feature-stores-feast-tecton) problem teaches from the data side.
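Both patterns reduce to small routing decisions, sketched below under stated assumptions. The helpers `bucket`, `route_canary`, and `serve_with_shadow` are hypothetical. The shadow call is made synchronously for simplicity, whereas a real platform would typically mirror traffic asynchronously so the shadow adds no user-facing latency.

```python
import hashlib


def bucket(request_id):
    """Map a request ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route_canary(request_id, canary_percent):
    """Send roughly canary_percent of traffic to the new version, deterministically."""
    return "canary" if bucket(request_id) < canary_percent else "stable"


def serve_with_shadow(request, stable_model, shadow_model, shadow_log):
    """Answer from the stable model; record the shadow's output without returning it."""
    response = stable_model(request)
    try:
        shadow_log.append({"request": request, "stable": response, "shadow": shadow_model(request)})
    except Exception as exc:
        # A failing shadow must never affect the user-facing response.
        shadow_log.append({"request": request, "stable": response, "shadow_error": repr(exc)})
    return response


ids = [f"req-{i}" for i in range(10_000)]
canary_share = sum(route_canary(i, 10) == "canary" for i in ids) / len(ids)

log = []
answer = serve_with_shadow(3, lambda x: x + 1, lambda x: x / 0, log)
print(round(canary_share, 3), answer, log)
```

Hashing the request ID keeps routing stable, so the same ID always lands on the same version. Rolling back a canary amounts to setting the percentage to zero.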

## KServe vs Seldon vs BentoML

|  | KServe | Seldon Core | BentoML |
| --- | --- | --- | --- |
| Runs on | Kubernetes (Knative for its serverless mode) | Kubernetes | Anywhere (its own packaging; deploy to many targets) |
| Sweet spot | Standardized serverless inference on K8s | Complex inference graphs, governance | Packaging & serving DX, not tied to K8s |
| Scale-to-zero | Yes (via Knative) | Available | Depends on deployment target |
| Strength | Standard inference protocol, autoscaling out of the box | Rich multi-step graphs, A/B, explainers, monitoring | "Bento" bundle = model + code + deps; great developer flow |
| You bring | A model in a supported format (or custom runtime) | Components wired into a graph | A service definition in Python |

The honest framing: **KServe** standardizes serverless model serving on Kubernetes — pre-built runtimes, a common inference protocol, autoscaling and scale-to-zero via Knative — and is the natural choice when you're K8s-native and want consistency across many models. **Seldon Core** leans into complex inference graphs, A/B and canary, explainers, and monitoring — strong when serving is more than one model in a line. **BentoML** approaches from the developer-experience and packaging angle: bundle the model, code, and dependencies into a portable "Bento" and deploy it to many targets, without committing to Kubernetes. Choose by where your complexity lives — orchestration, inference-graph richness, or packaging and portability.

These pair naturally with the rest of the stack rather than replacing it. The [MLflow Model Registry](mlflow-experiment-tracking-registry) is the handoff: serving pulls "the Production version" from the registry, so promotion is a registry transition and serving picks it up. And serving is exactly where you wire in [monitoring](llm-observability) — latency, throughput, error rate, and the prediction distribution that reveals drift. Serving without that telemetry is how a model silently degrades for weeks before anyone notices.
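The four signals named above can be computed from a plain request log. The sketch below assumes a hypothetical record shape (`latency_ms`, `error`, `prediction`) and a hypothetical `summarize` helper. A real deployment would export these through its metrics system rather than compute them in application code.

```python
from collections import Counter


def summarize(records, window_seconds):
    """Summarize one window of request logs into basic serving telemetry."""
    if not records:
        return {"throughput_rps": 0.0, "error_rate": 0.0,
                "p95_latency_ms": None, "prediction_share": {}}
    latencies = sorted(r["latency_ms"] for r in records)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in records if r["error"])
    labels = Counter(r["prediction"] for r in records if not r["error"])
    served = sum(labels.values())
    return {
        "throughput_rps": len(records) / window_seconds,
        "error_rate": errors / len(records),
        "p95_latency_ms": latencies[rank - 1],
        # Share of each predicted label; a shift over time is a drift signal.
        "prediction_share": {label: n / served for label, n in labels.items()},
    }


# Hypothetical log: 20 requests in a 10-second window, the last one failed.
records = [
    {"latency_ms": i, "error": i == 20,
     "prediction": None if i == 20 else ("fraud" if i % 5 == 0 else "ok")}
    for i in range(1, 21)
]
stats = summarize(records, window_seconds=10)
print(stats)
```

Comparing `prediction_share` between windows is the cheapest drift check available: if the share of one label moves sharply with no matching change in the business, something upstream changed.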

## What to carry away

Model serving is the last mile that decides whether a good model becomes a useful product. The gap from notebook to service is real work: a **runtime** exposing standard endpoints (often as a multi-stage inference graph), **autoscaling** with **scale-to-zero** to keep idle GPUs from sinking the budget, **dynamic batching** to trade a little latency for a lot of throughput, and **canary/shadow rollouts** to confront a new version with real traffic before trusting it. Rebuilding those per model is the waste that serving frameworks eliminate.

**KServe** for standardized serverless inference on Kubernetes, **Seldon Core** for rich inference graphs and governance, **BentoML** for packaging and portability beyond K8s — pick by where your complexity actually sits. Mind the cold-start cost of scale-to-zero and keep a warm floor for anything users wait on. Wire it to the [registry](mlflow-experiment-tracking-registry) for clean promotion and to [observability](llm-observability) for the drift you can't see offline, and the model you trained finally does its job where it counts — in production.
