# MLflow Deep Dive: Experiment Tracking, the Model Registry, and Reproducibility

Every team that does machine learning hits the same wall around model number forty. Someone asks "which version is in production, and can we reproduce it?" and the honest answer is a shrug — the good run was a notebook that's been edited since, the hyperparameters lived in a variable that got overwritten, the training data was "the usual export," and the model file is `model_final_v2_REAL.pkl` on somebody's laptop. ML is experimental by nature: you run hundreds of variations, and almost all the bookkeeping that a normal software project gets from version control simply doesn't exist for experiments, parameters, metrics, and the resulting model artifact. **MLflow** exists to fix that, and it's become the default open-source tool for it.

MLflow is an open-source platform for managing the machine learning lifecycle. It's organized into four components — **Tracking**, **Projects**, **Models**, and the **Model Registry** — and the reason to understand them as a set is that together they answer one question: *can you reproduce, compare, and safely promote a model?* I'll take each component, then the registry's staging model, then the reproducibility problem underneath it all.

## Tracking: the lab notebook you'll actually keep

**MLflow Tracking** records what happened in each training attempt so you can compare and reproduce them. The central unit is a **run** — one execution of your training code — and against each run you log four kinds of things:

- **Parameters** — the inputs you chose: learning rate, max depth, feature set, model type. The knobs.

- **Metrics** — the outcomes: accuracy, AUC, loss, RMSE. These can be logged over time (per epoch), so you get curves, not just final numbers.

- **Artifacts** — output files: the serialized model, plots, a confusion matrix, sample predictions.

- **Metadata** — source code version (the Git commit), start/end time, who ran it, and tags.

Runs group into **experiments** (say, "churn-model"), and the tracking server gives you a UI to sort and compare runs across metrics — which is where the value lands. Instead of guessing whether last week's run was better, you sort the experiment by AUC and read off the parameters that produced the top one. The logging is a few lines dropped into existing training code.

```python
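import mlflow
import mlflow.sklearn  # explicit import for the sklearn flavor

params = {"max_depth": 8, "n_estimators": 400, "features": "v3"}

mlflow.set_experiment("churn-model")
with mlflow.start_run():
    mlflow.log_params(params)                # the knobs
    # `train` stands in for your existing training function; it is assumed
    # here to return the fitted model plus its evaluation metrics.
    model, auc, val_loss = train(params)
    mlflow.log_metric("auc", auc)            # outcomes
    mlflow.log_metric("val_loss", val_loss)
    mlflow.sklearn.log_model(model, "model") # the artifact, in a standard layout
    # Run timestamps are recorded automatically; the Git commit is too when
    # the script runs from inside a Git checkout.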
```

```mermaid
graph TD
    EXP["Experiment: churn-model"]
    R1["Run 1<br/>params + metrics + artifacts"]
    R2["Run 2<br/>params + metrics + artifacts"]
    R3["Run 3 (best AUC)<br/>params + metrics + artifacts"]
    REG["Model Registry<br/>(register the winner)"]
    EXP --> R1
    EXP --> R2
    EXP --> R3
    R3 -->|"register this run's model"| REG
```

Tracking organizes runs under an experiment, each capturing the params, metrics, artifacts, and code version that produced it. You compare runs to find the winner, then register that specific run's model into the Registry. The link from a registered model back to the exact run (its Git commit, its parameters, and whatever data pointer you logged) is what makes the model reproducible later.

## Models and Projects: standard packaging

The next two components are about packaging, and they're what let a model and its training move between tools. **MLflow Models** defines a standard format for a saved model with the notion of **flavors** — a single saved model can be described in multiple ways (a "sklearn" flavor, a generic "python_function" flavor) so that downstream tools don't need to know which library trained it. The `python_function` flavor is the important one: anything that supports it can load the model and call `predict()` the same way, whether it came from scikit-learn, PyTorch, or XGBoost. That uniform interface is what makes a serving layer able to deploy "any MLflow model" without special-casing every framework.

**MLflow Projects** packages the *training* code with its dependencies (a conda or pip environment) and an entry point, so a run can be re-executed elsewhere with the same code, the same environment, and the same parameters. It's the piece that turns "it worked on my machine" into "run this project and get the same result," provided the input data and any random seeds are pinned as well.

## The Model Registry: from "best run" to "in production"

Tracking finds your best model; the **Model Registry** governs what happens to it next. It's a central store of named, versioned models with a lifecycle. You register a model from a run, and it gets a version number under a model name (e.g. `churn-model` version 7). Each version can move through **stages**, typically *None → Staging → Production → Archived*, and the registry keeps a record of those transitions. Approval gates on a transition are not part of the open-source API itself; where you see them, they come from a managed offering or from your own process around the registry.

The reason this matters operationally: it decouples the model name your serving system asks for from the specific version behind it. Production code requests "the Production version of `churn-model`," and promoting a new model is a stage transition in the registry, not a code change or redeploy. Rollback is the same — move the previous version back to Production. The registry becomes the single source of truth for "what's live," with an audit trail of how it got there.

| Component | Answers | Unit |
| --- | --- | --- |
| Tracking | What did each experiment do, and which was best? | Run (params, metrics, artifacts) |
| Projects | Can I re-run this training reproducibly? | Packaged code + environment |
| Models | Can any tool load and serve this model? | Saved model + flavors |
| Registry | Which version is in production, and how did it get there? | Named, versioned, staged model |

The registry is where MLOps starts to look like the data governance you already practice. Reference a model by `models:/churn-model/Production` in your serving code and you've created the same indirection that a [catalog](unity-catalog) gives data assets — a stable name over a governed, versioned, audited thing. Promotions and rollbacks become metadata operations with a paper trail, which is exactly what you need when someone eventually asks "why did the model change on the 14th?"

## The real problem: reproducibility

Step back and the whole tool is aimed at one hard thing. A model's behavior is determined by the *combination* of code, parameters, and training data — and in research-style ML workflows all three drift constantly and none are captured by default. Reproducibility means being able to point at a production model and recover the exact code commit, the exact parameters, and (ideally) the exact data snapshot that produced it. MLflow nails two of those three directly: it logs parameters and the code version with every run, and links the registered model back to that run.

**MLflow does not version your training data, and that's the gap that breaks reproducibility most often.** It records the code commit and the parameters, but "which exact rows did this train on?" is on you — and a model trained on last month's data is a different model even with identical code and hyperparameters. Pair MLflow with real data versioning: log a dataset hash or snapshot id as a run parameter, train against immutable inputs (a time-traveled table or a pinned data version), and record the pointer. Skip this and you'll reproduce everything about a model except the part that actually changed its predictions. Tracking the code without the data is a reproducibility story with the ending torn out.

One note on the moment: MLflow 2.0 has just landed (late 2022), adding pipeline-style "recipes" and refinements on top of these four components — but the core mental model of Tracking / Projects / Models / Registry is unchanged and is what you should learn first.

## What to carry away

MLflow brings software-engineering discipline to the inherently messy process of building models, through four components that together answer "can you reproduce, compare, and safely promote a model?" **Tracking** logs every run's parameters, metrics, artifacts, and code version so you can compare experiments and find the winner. **Projects** and **Models** package training and the model in standard, framework-agnostic forms so they move between tools. The **Model Registry** versions and stages models, decoupling "the Production model" from any specific version so promotion and rollback are governed metadata operations with an audit trail.

Adopt it for the experiment-tracking value first — it pays for itself the moment you can sort runs by metric instead of guessing — then grow into the registry as you start shipping models to production. And do the one thing MLflow won't do for you: version your training data alongside the code, or your reproducibility story has a hole exactly where it matters. The registry's named-and-staged model is the natural handoff point to a [serving](vector-databases) layer and to the [observability](llm-observability) you'll need once it's live.
