# Shipping ML on Azure Machine Learning: An MLOps War Story

The model was the easy part. We had a gradient-boosted classifier that scored well in a notebook, the data scientist was happy, and the business wanted it live. What followed was the part nobody puts in the demo: turning that pickle file into a service that retrains on a schedule, deploys without downtime, can be rolled back in seconds when a release goes bad, and tells you when the incoming data has drifted away from what the model was trained on. We did it on **Azure Machine Learning**, and most of what I learned was about everything *around* the model. This is that story — what Azure ML gives you, the workflow that actually held up, and the lessons that cost us real time.

One framing up front, because it's the thing I'd tell my past self: Azure ML is not a place you click buttons to train models. It's a control plane for orchestrating compute, data, and deployments in your subscription, driven from code. The teams that treat the Studio UI as the product struggle; the teams that treat it as a YAML-and-CLI platform with a UI for inspection ship.

## The pieces, and how they fit

Azure ML organizes everything under a **Workspace** — the top-level container that ties together your compute, data, models, and endpoints, backed by a storage account, a key vault, and a container registry it provisions alongside. Inside it, the assets you actually work with are a small, learnable set.

| Asset | What it is | Why it matters in production |
| --- | --- | --- |
| **Compute cluster** (AmlCompute) | An autoscaling pool of VMs for training/batch jobs | Scale to zero when idle; the difference between a sane bill and a shocking one |
| **Data asset** | A versioned reference to data in a datastore (blob/ADLS) | Reproducibility — a job pins an exact data version, not "whatever's in the folder today" |
| **Environment** | A versioned Docker image + conda/pip dependencies | Training and serving run the *same* dependencies — kills "works on my machine" |
| **Job / Component** | A unit of work (command or pipeline) defined in YAML | Training and pipelines as code, reproducible and parameterized |
| **Registered model** | A versioned, named model artifact (MLflow-flavored or custom) | The thing your deployment references and your CI/CD promotes |
| **Managed online endpoint** | A managed, autoscaling HTTPS endpoint with deployments behind it | Real-time serving with blue/green traffic control — no cluster to babysit |

The mental model that made it click for me: **train as a job on a compute cluster, register the resulting model, then deploy that registered model to an endpoint.** Each handoff in that sentence is versioned and auditable, and that's exactly what you want when an auditor or an incident asks "what was running, trained on what, in what environment?"

```mermaid
graph LR
    DATA[("Data asset<br/>(versioned, in ADLS)")]
    JOB["Training job<br/>(command/pipeline,<br/>on a compute cluster)"]
    MLF["MLflow tracking<br/>(metrics, params,<br/>the model artifact)"]
    REG["Registered model<br/>(named + versioned)"]
    EP["Managed online endpoint"]
    BLUE["blue deployment<br/>(100% traffic)"]
    GREEN["green deployment<br/>(new version, 0% then ramp)"]
    DATA --> JOB --> MLF --> REG --> EP
    EP --> BLUE
    EP --> GREEN
```

The end-to-end path. A versioned data asset feeds a training job on an autoscaling cluster; MLflow captures metrics and the model artifact; the model is registered; the registered model is deployed behind a managed online endpoint. The endpoint holds two deployments — blue serving live traffic and green carrying the new version at 0% — so promotion is a traffic-percentage change, and rollback is the same change in reverse.

## Everything as YAML (and why the UI is a trap)

The single highest-leverage decision was committing to the **v2 CLI/SDK and YAML** for every asset, and using the Studio only to look at results. A training job is a YAML file. The compute, the environment, the data inputs, the deployment — all YAML, all in git. Here's a command job, which is representative of the whole style:

```yaml
# train-job.yml — a training job defined as code
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py --data ${{inputs.training_data}} --reg ${{inputs.reg_rate}}
code: ./src
inputs:
  training_data:
    type: uri_folder
    path: azureml:churn_features:3   # pinned data asset VERSION, not "latest"
  reg_rate: 0.01
environment: azureml:churn-train-env:5   # pinned environment version
compute: azureml:cpu-cluster
experiment_name: churn-classifier
```

Submitting it is one line — `az ml job create -f train-job.yml` — and because the data and environment are pinned to versions, that job is reproducible months later. Compare that to the alternative, where someone configured a run by clicking through the Studio: there's no diff, no review, no way to recreate it, and the configuration evaporates the moment they leave the team.
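Pinning only helps if nobody quietly slips an unpinned reference into a job file. One way to guard against that, sketched here as an assumption rather than anything Azure ML ships, is a small lint in CI. `is_pinned` and `unpinned_refs` are hypothetical helpers; the sketch treats `azureml:<name>:<version>` as pinned and both `azureml:<name>@latest` and a bare `azureml:<name>` as unpinned.

```python
import re

# Matches "azureml:<name>:<version>", e.g. "azureml:churn_features:3".
# The character set allowed in names and versions is an assumption for this sketch.
_PINNED = re.compile(r"^azureml:[A-Za-z0-9_.\-]+:[A-Za-z0-9_.\-]+$")


def is_pinned(ref: str) -> bool:
    """Return True if an asset reference names an explicit version."""
    return bool(_PINNED.match(ref))


def unpinned_refs(refs: dict[str, str]) -> list[str]:
    """Return the keys whose asset references are not pinned to a version."""
    return sorted(key for key, ref in refs.items() if not is_pinned(ref))


# Hypothetical references pulled out of a job file by your own YAML loading code.
refs = {
    "training_data": "azureml:churn_features:3",
    "environment": "azureml:churn-train-env@latest",
}
print(unpinned_refs(refs))
```

Only feed it the references that are supposed to carry a version (data assets, environments, models). A compute target such as `azureml:cpu-cluster` has no version and would be flagged.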

**The UI-versus-code drift is a real trap, not a style preference.** The Studio will happily let a teammate tweak a deployment's instance count or an environment inline, and now your git YAML no longer describes reality — the next `az ml ... create` from CI silently reverts their change, or worse, your "infrastructure as code" is quietly lying. Pick one source of truth. We made it code, gave most people read-only Studio access for inspection, and routed every change through a pull request. Configuration that only exists because someone clicked is configuration you will lose.
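A drift check can make that trap visible before CI overwrites anything. The sketch below is a minimal illustration, not a tool we shipped: it assumes you have already loaded the declared YAML and the live configuration (for example, the output of a `show` command parsed into a dict) and simply diffs the two. The function name and the field names in the usage are hypothetical.

```python
from typing import Any


def config_drift(declared: dict[str, Any], live: dict[str, Any], prefix: str = "") -> list[str]:
    """Return dotted paths where the live config differs from the declared one."""
    drift = []
    for key in sorted(set(declared) | set(live)):
        path = f"{prefix}{key}"
        if key not in live:
            drift.append(f"{path}: missing in live")
        elif key not in declared:
            drift.append(f"{path}: not declared in git")
        elif isinstance(declared[key], dict) and isinstance(live[key], dict):
            # recurse into nested sections and extend the dotted path
            drift.extend(config_drift(declared[key], live[key], prefix=f"{path}."))
        elif declared[key] != live[key]:
            drift.append(f"{path}: declared {declared[key]!r}, live {live[key]!r}")
    return drift


# Hypothetical example: someone bumped the instance count in the Studio.
declared = {"instance_count": 2, "environment": {"name": "churn-train-env", "version": "5"}}
live = {"instance_count": 4, "environment": {"name": "churn-train-env", "version": "5"}}
print(config_drift(declared, live))
```

In practice the live side carries server-populated fields that git never declares, so you would likely filter it down to the declared keys before comparing.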

## MLflow is the tracking layer — lean on it

Azure ML adopted [MLflow](mlflow-experiment-tracking-registry) as its tracking and model-format standard, which is genuinely good news: you log with the open MLflow API and it lands in your workspace. In practice that means `mlflow.autolog()` in the training script captures params, metrics, and the model with almost no code, and the model gets saved in the MLflow format that the endpoints know how to serve *without you writing a scoring script at all*. That last part is underrated — an MLflow-flavored model deploys with no custom inference code, because the flavor already knows how to load and predict.

```python
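import mlflow

mlflow.autolog()  # params, metrics, and the model, captured automatically

# An explicit run keeps the run ID available after training finishes;
# with autolog alone, mlflow.active_run() can be None once training returns.
with mlflow.start_run() as run:
    model = train(X_train, y_train)  # your normal training (train() is your own function)

# the run now holds an MLflow model you can register and deploy as-is
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-classifier")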
```

## Real-time vs batch: pick the right endpoint

Azure ML offers two endpoint types, and choosing wrong is a common, expensive mistake. **Managed online endpoints** are for synchronous, low-latency requests — a service calls them and waits for a prediction. **Batch endpoints** are for scoring large volumes asynchronously — you point them at a folder of inputs and they spin up a cluster, score everything, write results, and scale back down.

|  | Managed online endpoint | Batch endpoint |
| --- | --- | --- |
| Pattern | Synchronous request/response | Asynchronous bulk scoring |
| Latency | Milliseconds, always warm | Minutes — spins up compute per run |
| Compute | Always-on instances (you pay 24/7) | Compute cluster, scales to zero between runs |
| Use it for | Live fraud scoring, in-app predictions | Nightly churn scores, large file scoring |

The trap: putting a workload that runs once a night behind an always-on online endpoint, paying for idle instances around the clock to serve one batch job. If nothing is waiting synchronously for the answer, it's a batch job.
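The arithmetic behind that trap is worth writing down once. The sketch below counts billable instance-hours only; the one-hour nightly run and the single instance are hypothetical numbers, and it ignores cluster spin-up time and per-SKU pricing.

```python
HOURS_PER_DAY = 24


def monthly_instance_hours(instances: int, hours_per_day: float, days: int = 30) -> float:
    """Billable instance-hours for a month at a given daily running time."""
    return instances * hours_per_day * days


# Hypothetical workload: one instance, and a nightly scoring run that takes one hour.
always_on = monthly_instance_hours(instances=1, hours_per_day=HOURS_PER_DAY)  # online endpoint
nightly = monthly_instance_hours(instances=1, hours_per_day=1)                # batch endpoint

print(f"online: {always_on:.0f} h, batch: {nightly:.0f} h, ratio: {always_on / nightly:.0f}x")
# prints "online: 720 h, batch: 30 h, ratio: 24x"
```

Under those assumptions the same work costs 24 times as many instance-hours behind an online endpoint, before you account for any difference in SKU.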

### Blue/green deployments are the headline feature

The reason I reach for managed online endpoints is the deployment model. An **endpoint** is a stable URL; behind it sit one or more **deployments**, and the endpoint splits traffic across them by percentage. That makes safe releases mechanical:

```bash
# green carries the new model at 0% traffic: created, warmed, but unseen by users
# (a new deployment receives no traffic unless you pass --all-traffic, so leave that flag off)
az ml online-deployment create -f green-deployment.yml

# send 10% to green, watch metrics, then ramp — or roll back instantly to 0
az ml online-endpoint update --name churn-ep --traffic "blue=90 green=10"
az ml online-endpoint update --name churn-ep --traffic "blue=0 green=100"  # full cutover
# rollback is the same command in reverse — no redeploy, just a traffic flip
```

Rollback being a traffic percentage rather than a redeploy is the whole point: when a release misbehaves at 10%, you're back to safety in seconds, not in however long a redeploy takes. We standardized on create-green-at-zero, ramp, then retire blue — and never deployed straight to 100%.
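The create-at-zero, ramp, retire routine is simple enough to encode so that nobody improvises percentages during a release. This is a sketch under stated assumptions: the ramp steps `(10, 50, 100)` and the error-rate threshold are hypothetical choices, and `traffic_arg` only formats the string that the `--traffic` argument above expects. It does not call Azure.

```python
def traffic_arg(green_percent: int) -> str:
    """Format a blue/green split as the value passed to --traffic."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    return f"blue={100 - green_percent} green={green_percent}"


def next_step(current_green: int, error_rate: float, max_error_rate: float,
              steps: tuple[int, ...] = (10, 50, 100)) -> int:
    """Return the next green percentage: advance if healthy, else roll back to 0."""
    if error_rate > max_error_rate:
        return 0  # rollback is just a traffic flip back to blue
    for step in steps:
        if step > current_green:
            return step
    return current_green  # already at full cutover


# Healthy at 10%: move on to the next step. Unhealthy at 10%: back to blue.
print(traffic_arg(next_step(10, error_rate=0.001, max_error_rate=0.01)))  # blue=50 green=50
print(traffic_arg(next_step(10, error_rate=0.05, max_error_rate=0.01)))   # blue=100 green=0
```

Keeping the decision separate from the command means a human or a pipeline can apply the same rule, and the rollback path gets exercised by the same code as the ramp.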

## The CI/CD pipeline that tied it together

The glue was a GitHub Actions pipeline (Azure DevOps supports the same flow) that authenticates to Azure through workload-identity federation on a service principal, with a managed identity on the workspace side. The flow: on a merge to main, build and register the environment if it changed, submit the training pipeline job, evaluate the candidate against the current production model, and, only if it wins on the agreed metric, register it and deploy it to green at 0% for a human to ramp. The model-promotion gate is the part that separates MLOps from "a script that deploys whatever trained last."
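The gate itself can be very small. The sketch below is one plausible shape for it, not our exact implementation: `should_promote` is a hypothetical helper, the metric values are whatever your evaluation step produces, and `min_improvement` is an assumed knob for ignoring wins that are within noise.

```python
from typing import Optional


def should_promote(candidate: float, production: Optional[float],
                   min_improvement: float = 0.0, higher_is_better: bool = True) -> bool:
    """Decide whether a candidate model beats production on the agreed metric."""
    if production is None:
        return True  # nothing in production yet, so the first model is promoted
    delta = candidate - production if higher_is_better else production - candidate
    return delta > min_improvement


# Hypothetical AUC values from the evaluation step.
print(should_promote(candidate=0.91, production=0.89, min_improvement=0.01))   # True
print(should_promote(candidate=0.895, production=0.89, min_improvement=0.01))  # False
```

A tie does not promote: the comparison is strict, so an equal score leaves production untouched.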

**Use managed identity, never keys.** Give the workspace and its compute a managed identity and grant it RBAC on the storage and key vault, so jobs read data and secrets without a single connection string in your code or pipeline. It removes the most common Azure ML security smell — datastore credentials pasted into notebooks — and it's less work once it's set up, not more. Authenticate CI to Azure with federated credentials (OIDC) rather than a long-lived service-principal secret for the same reason.

## The lessons that cost us time

- **GPU quota is the silent blocker — request it early.** New subscriptions have near-zero quota for the GPU SKUs you want, and an increase request can take a day or more to approve. We discovered this the afternoon before a deadline. Check and raise quota the day you start, per region and per SKU family, not the day you deploy.

- **Environment image builds are slow, and they're a dependency you version.** The first build of a custom environment image takes many minutes, and every change to the `conda` spec means a new image build, so a spec that keeps churning rebuilds from scratch constantly. Start from a curated Azure ML environment, pin versions, and treat the environment as a versioned artifact you rebuild deliberately, not something CI rebuilds on every run.

- **Compute instances left running are pure waste.** A compute *instance* (the personal dev box) bills while it's on, idle or not. Set auto-shutdown, and prefer compute *clusters* with `min_instances: 0` for jobs so they scale to zero between runs. Most surprise Azure ML bills are idle compute, not training.

- **Online-endpoint deploys are not instant.** Creating or updating a deployment provisions instances and pulls the image — expect minutes, sometimes more on first deploy. Build that latency into your release plan; it's why create-green-early-then-ramp beats deploy-on-demand.

- **Wire up data drift monitoring before you need it.** A model silently degrading because the world moved is the failure you won't catch from infra metrics. Azure ML's monitoring on endpoints (comparing serving data to the training baseline) is the thing that tells you *why* accuracy fell — set it up at launch, not after the first bad quarter.
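To make the drift lesson concrete, here is what "comparing serving data to the training baseline" can look like for a single numeric feature. This is a plain-Python sketch of the Population Stability Index, a common drift statistic. It is an illustration of the idea, not Azure ML's monitoring implementation, and the bin count and the small floor for empty bins are assumptions.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], serving: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training baseline and serving data."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when the baseline is constant

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # clamp out-of-range serving values into the first or last bin
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # a small floor keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = proportions(baseline), proportions(serving)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Synthetic data: the serving distribution is the baseline shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

Identical distributions score zero and the score grows as the serving data moves away from the baseline. A commonly quoted rule of thumb treats values above roughly 0.2 as worth investigating, but the threshold is something to calibrate per feature rather than take on faith.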

## What to carry away

Azure Machine Learning is best understood as a code-driven control plane, not a UI: define your data, environments, jobs, and deployments as versioned YAML in git, use the Studio to inspect rather than to configure, and you get reproducibility and auditability for free. Train as a job on an autoscaling compute cluster, let MLflow capture the run and the model, register the model, and deploy it behind a managed online endpoint where blue/green traffic splits make releases and rollbacks a percentage change rather than a redeploy.

The rest is operational discipline that the tutorials skip: request GPU quota on day one, treat environment images as slow-to-build versioned artifacts, scale compute to zero so idle resources don't quietly drain the budget, use managed identity instead of keys, and turn on drift monitoring before the model degrades rather than after. Get the platform mechanics right and the model — the part everyone obsesses over — really does turn out to be the easy part. For the broader frame this sits inside, the [serving stage and the DataOps undercurrent](fundamentals-data-engineering-lifecycle) are exactly what this is: production software discipline applied to ML.
