# Building a Data Platform for Robot Fleets on Azure: IoT Operations, Digital Twins, and Fabric

A client asked me last quarter why their robot-fleet dashboard could tell them a forklift's battery percentage but not which dock door it was blocking, which shift crew was responsible for it, or what its maintenance history looked like relative to the three other units in the same aisle. The telemetry pipeline was fine. What was missing was a model of the fleet as a system of related things, not just a firehose of sensor readings with a robot ID attached. That gap is where Azure's actual story diverges from AWS's and GCP's, and it's the reason I'd point a Microsoft-shop client at this architecture rather than just translating the AWS pattern service-for-service.

I've written the edge-capture fundamentals and the bandwidth-versus-completeness tension already, in the [AWS version of this architecture](robotics-data-platform-aws) and the [GCP version](robotics-data-platform-gcp): a robot fleet running ROS 2 with cameras, lidar, and joint-state telemetry generates more data than you should ship continuously, full stop, regardless of cloud. I won't re-derive that here. What I will do is walk through where Azure is genuinely different, not just relabeled: a graph-based digital twin service with no real analog in this blog's other two cloud comparisons, an edge runtime built on Kubernetes and Arc rather than a lightweight standalone agent, and an analytics layer, Microsoft Fabric, that I've covered in enough depth elsewhere on this blog that I can lean on that coverage rather than describe it generically.

## How does Azure IoT Hub anchor device connectivity and telemetry?

**Azure IoT Hub** is Microsoft's managed cloud gateway for bidirectional communication between IoT devices and the cloud, and in this architecture it's the connectivity backbone every robot (or its edge agent) talks to first. Each robot authenticates to IoT Hub with per-device credentials, publishes telemetry (battery state, mission status, sensor health, fault codes) over MQTT or AMQP, and receives commands and configuration pushed down from the cloud side — the same request/response shape you'd expect from any managed IoT gateway, scaled to the message-per-second volumes a few hundred robots actually produce.

The detail that matters more than the transport protocol is **device twins**. A device twin is a JSON document IoT Hub maintains per device, holding reported properties (state the device pushes up — firmware version, current battery health, last-known position) and desired properties (configuration the cloud side wants applied — a new capture-filtering threshold, a target firmware version) separately, and reconciling the two asynchronously rather than requiring the robot to be online at the exact moment you want to change its configuration. That distinction is the whole value: you set a desired property while a robot is in a dead zone in the back of a warehouse, and IoT Hub applies it the moment connectivity returns, instead of that change silently failing or you having to build your own retry-on-reconnect logic. It's a smaller, flatter version of what a full digital twin does — which is exactly why Azure Digital Twins, covered below, exists as a separate, richer layer rather than trying to cram graph relationships into the device twin model itself.
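To make the desired-versus-reported split concrete, here is a toy in-memory model of the reconciliation. It deliberately does not use the IoT Hub SDK: `ToyTwin`, `pending_changes`, and the property names are all made up for illustration, and a real twin carries versioning and metadata that this sketch ignores.

```python
from copy import deepcopy


def pending_changes(desired: dict, reported: dict) -> dict:
    """Return the desired properties the device has not yet confirmed."""
    return {key: value for key, value in desired.items()
            if reported.get(key) != value}


class ToyTwin:
    """In-memory stand-in for one robot's twin document (not the IoT Hub SDK)."""

    def __init__(self, device_id: str):
        self.device_id = device_id
        self.desired = {}
        self.reported = {}

    def set_desired(self, **props):
        # Cloud side: record intent whether or not the robot is online.
        self.desired.update(props)

    def on_reconnect(self) -> dict:
        # Device side: fetch the outstanding delta, apply it, report it back.
        delta = pending_changes(self.desired, self.reported)
        self.reported.update(deepcopy(delta))
        return delta


twin = ToyTwin("r-0472")
twin.reported.update(firmware="1.4.2", capture_threshold=0.8)
twin.set_desired(capture_threshold=0.6)   # robot is offline in a dead zone
applied = twin.on_reconnect()             # connectivity returns
print(applied)                            # → {'capture_threshold': 0.6}
```

The point of the shape is that `set_desired` never needs to know whether the robot is reachable; the delta simply waits until the device asks for it.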

## What does Azure IoT Operations actually run at the edge, and how is it different from Greengrass?

**Azure IoT Operations** is Microsoft's edge data plane, generally available since late 2024 and built on **Azure Arc** and Kubernetes, with an MQTT broker at its core handling local pub/sub between whatever's running on the edge cluster: inference workloads, protocol translators, and the components that decide what telemetry gets forwarded to IoT Hub versus buffered or discarded locally. If you've read the AWS piece, this is the Azure analog to Greengrass, and I bridged this same ROS 2-to-cloud problem years ago on the Greengrass side in [an earlier piece on ROS 2 and AWS IoT Core](ros2-greengrass-iot-core). The underlying problem (get selected sensor data off a robot reliably, run local inference, close an OTA loop) doesn't change by vendor, but the shape of the solution does here, in a way worth being honest about.

The honest difference: Greengrass is a lightweight, standalone edge agent you can run directly on a robot's onboard compute with a relatively small footprint. Azure IoT Operations assumes you're running an Arc-enabled Kubernetes cluster at the edge — which means real container orchestration, real cluster management overhead, and a heavier resource footprint than a single-process edge agent needs to carry. That's not automatically a bad trade. If your organization already runs Arc-managed Kubernetes at other edge sites — a factory floor, a distribution center — putting the robot fleet's edge compute on the same operational model means one set of cluster-management practices, one GitOps deployment pipeline, one observability stack, instead of a robotics-specific edge runtime living outside that world. But if your onboard compute is a single small board with no appetite for running a Kubernetes control plane, IoT Operations is asking for more than the job strictly requires, and you'll feel that mismatch immediately in resource budget on the robot itself.
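Whatever runtime hosts it, the forward-versus-buffer-versus-discard decision itself is small. The sketch below is one hypothetical policy, not anything IoT Operations ships: the topic prefix, the size limit, and the `Message` fields are assumptions chosen only to show the three outcomes.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FORWARD = "forward"   # send upstream now
    BUFFER = "buffer"     # keep locally for later upload
    DISCARD = "discard"   # drop at the edge


@dataclass(frozen=True)
class Message:
    topic: str
    payload_bytes: int
    is_fault: bool = False


def route(msg: Message, uplink_ok: bool, max_forward_bytes: int = 64_000) -> Route:
    """Decide what happens to one local MQTT message (illustrative rules only)."""
    if msg.is_fault:
        # Faults are never dropped; they wait in the buffer if the link is down.
        return Route.FORWARD if uplink_ok else Route.BUFFER
    if msg.topic.startswith("telemetry/"):
        if msg.payload_bytes > max_forward_bytes:
            return Route.BUFFER   # too large for the live path
        return Route.FORWARD if uplink_ok else Route.BUFFER
    # Anything else (for example raw debug topics) stays on the robot.
    return Route.DISCARD


print(route(Message("telemetry/battery", 120), uplink_ok=True))   # → Route.FORWARD
```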

```mermaid
graph TD
    R["Robot fleet<br/>ROS 2 / MCAP sensor streams"] --> IOTOPS["Azure IoT Operations<br/>Arc/Kubernetes edge, MQTT broker"]
    IOTOPS -->|"filtered telemetry"| HUB["Azure IoT Hub<br/>device twins, per-device state"]
    HUB --> ADT["Azure Digital Twins<br/>DTDL graph model"]
    HUB --> ADLS["ADLS Gen2 raw lake<br/>partitioned by fleet/robot/date/session"]
    ADLS --> FABRIC["Microsoft Fabric<br/>OneLake, Lakehouse, Real-Time Intelligence"]
    FABRIC --> AML["Azure Machine Learning<br/>policy / VLA training"]
    AML -->|"updated policy"| IOTOPS
    IOTOPS -->|"deployed to fleet"| R
```

Device twins in IoT Hub hold per-robot state; Azure Digital Twins builds the relationship graph on top of that same telemetry. Fabric's OneLake sits underneath both the streaming analytics path and the training-data ETL, which is the layer this architecture leans on hardest if you're already invested in that ecosystem.

**Azure IoT Central is the third data point in a pattern worth naming explicitly.** Microsoft stopped allowing new application resources on **Azure IoT Central** starting April 1, 2024, and the service is on a path to full retirement on March 31, 2027. IoT Central was Microsoft's higher-level, "batteries-included" managed IoT application platform: built on top of IoT Hub, meant to get you to a working device-management app faster than assembling IoT Hub, device provisioning, and a UI yourself. It's the same category of lesson as AWS RoboMaker's September 2025 discontinuation and AWS IoT FleetWise's April 2026 close to new customers: the higher-level, more opinionated IoT and robotics products across the major clouds have a rockier survival record than the lower-level primitives sitting underneath them. The boring, general-purpose building blocks (IoT Hub, IoT Edge and its successor IoT Operations, S3 and its equivalents ADLS Gen2 and GCS) keep getting investment for a decade. The higher-level product wrapping them gets reconsidered every few years as the vendor's strategy shifts. Three data points across two vendors is a pattern, not a coincidence, and it should genuinely shift how much you're willing to depend on a managed product versus the primitive underneath it when you're picking what to build your core pipeline on.

## What makes Azure Digital Twins a genuine differentiator here?

**Azure Digital Twins** is Microsoft's graph-based digital twin service, and it's the piece of this architecture I don't have a clean equivalent for on the AWS or GCP side of this blog's comparisons. AWS has IoT TwinMaker, but it's a narrower, less central piece of AWS's robotics story than Digital Twins is for Azure's. GCP doesn't have a first-party graph-based twin modeling service with comparable prominence at all. What Digital Twins actually gives you is a live, queryable graph: you model a robot, its components (battery, arm, sensor payload), its physical location (a specific dock, a specific aisle, a specific site), and the relationships between all of them — "robot r-0472 is located in aisle-12," "robot r-0472 has-component battery-pack-viii" — using **DTDL**, the Digital Twins Definition Language, a JSON-LD-based schema format for describing twin types and their relationships.

The twin graph updates in real time from IoT Hub telemetry — a robot's reported battery percentage flows into its corresponding twin's property, and because the twin is a node in a graph with explicit relationships to other twins, you can ask questions that are structurally awkward against a flat telemetry table: "which robots share a charging station with r-0472, and are any of them also showing elevated fault rates this shift." That's a graph traversal, not a join across five tables that happen to share a foreign key you invented after the fact. For a fleet where physical topology and equipment relationships actually matter to the questions people ask — which is most fleets operating in a real facility rather than an open field — that's a real capability, not a marketing distinction.
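The shape of that traversal is easy to see in a few lines of plain Python. This is an in-memory stand-in, not the Azure Digital Twins query language or SDK; the `chargesAt` relationship, the twin IDs, the fault counts, and the threshold are all invented for the example.

```python
from collections import defaultdict

# (source twin, relationship name, target twin): all IDs are illustrative.
EDGES = [
    ("r-0472", "chargesAt", "station-3"),
    ("r-0511", "chargesAt", "station-3"),
    ("r-0498", "chargesAt", "station-3"),
    ("r-0620", "chargesAt", "station-7"),
]
# Faults recorded per robot this shift (made-up numbers).
FAULTS_THIS_SHIFT = {"r-0472": 1, "r-0511": 9, "r-0498": 0, "r-0620": 12}


def build_indexes(edges):
    """Index edges in both directions so each hop is a dictionary lookup."""
    outgoing = defaultdict(set)   # (source, relationship) -> targets
    incoming = defaultdict(set)   # (target, relationship) -> sources
    for source, rel, target in edges:
        outgoing[(source, rel)].add(target)
        incoming[(target, rel)].add(source)
    return outgoing, incoming


def station_mates(robot_id, edges, rel="chargesAt"):
    """Robots reachable via robot -> station <- other robot."""
    outgoing, incoming = build_indexes(edges)
    mates = set()
    for station in outgoing[(robot_id, rel)]:
        mates |= incoming[(station, rel)]
    mates.discard(robot_id)
    return mates


def flagged_mates(robot_id, edges, faults, threshold=5):
    """Station mates whose fault count this shift meets the threshold."""
    return sorted(r for r in station_mates(robot_id, edges)
                  if faults.get(r, 0) >= threshold)


print(flagged_mates("r-0472", EDGES, FAULTS_THIS_SHIFT))  # → ['r-0511']
```

Note that `r-0620` has the highest fault count but never shows up, because it sits on a different station: the relationship does the filtering, not a join key bolted on afterward.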

```json
{
  "@id": "dtmi:fleet:Robot;1",
  "@type": "Interface",
  "displayName": "Robot",
  "contents": [
    { "@type": "Property", "name": "batteryPercentage", "schema": "double" },
    { "@type": "Property", "name": "faultCode", "schema": "string" },
    { "@type": "Relationship", "name": "locatedIn", "target": "dtmi:fleet:Aisle;1" },
    { "@type": "Relationship", "name": "hasComponent", "target": "dtmi:fleet:BatteryPack;1" }
  ],
  "@context": "dtmi:dtdl:context;2"
}
```

Be honest with yourself about the cost side of this, too: modeling a fleet as a DTDL graph is real upfront design work. You have to decide what your interfaces are and which relationships matter, and then maintain that model as the fleet's physical layout and equipment mix change. It's not something you get for free by pointing Digital Twins at your existing telemetry. Teams that skip the modeling exercise and just dump flat telemetry into it get a marginally fancier IoT Hub, not the graph-query capability that's the actual point.

## How does Azure Data Lake Storage Gen2 fit as the raw sensor lake?

**Azure Data Lake Storage Gen2** is Azure's hierarchical-namespace object storage, and it plays the same role here that S3 plays in the AWS piece and GCS plays in the GCP piece: the durable landing zone for raw ROS 2 bag and MCAP sessions before conversion to training-ready Parquet. The partitioning discussion doesn't change by cloud — `fleet_id / robot_id / date / session_id` as the composite key, so "give me robot 47's sessions from a specific date" is a prefix scan rather than a full-container listing, and a lifecycle policy that moves cold sessions to cooler storage tiers automatically rather than requiring someone to reclassify data by hand. ADLS Gen2's hierarchical namespace gives you real directory semantics on top of blob storage, which matters slightly more here than on S3/GCS because Fabric's OneLake, covered next, is built directly on ADLS Gen2 — so getting the raw lake's structure right isn't just about query performance, it's about what Fabric sees when it reads that same data as a lakehouse source.
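A minimal sketch of that composite key as path-building code, to show why the date-level lookup becomes a prefix match. The `key=value` segment style, the `session.mcap` filename, and the example IDs are my assumptions; the only thing taken from the layout above is the `fleet_id / robot_id / date / session_id` ordering.

```python
from datetime import date
from pathlib import PurePosixPath


def session_path(fleet_id: str, robot_id: str, day: date, session_id: str,
                 filename: str = "session.mcap") -> str:
    """Build the blob path for one recorded session."""
    return str(PurePosixPath(
        f"fleet_id={fleet_id}",
        f"robot_id={robot_id}",
        f"date={day.isoformat()}",
        f"session_id={session_id}",
        filename,
    ))


def day_prefix(fleet_id: str, robot_id: str, day: date) -> str:
    """Prefix that selects every session for one robot on one date."""
    return f"fleet_id={fleet_id}/robot_id={robot_id}/date={day.isoformat()}/"


path = session_path("west-dc", "r-0047", date(2025, 3, 14), "s-0001")
prefix = day_prefix("west-dc", "r-0047", date(2025, 3, 14))
print(path)
# → fleet_id=west-dc/robot_id=r-0047/date=2025-03-14/session_id=s-0001/session.mcap
print(path.startswith(prefix))  # → True
```

Because the most selective common filters (fleet, robot, date) sit leftmost, the lookup for one robot's day never has to list anything outside that prefix.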

## Why does Microsoft Fabric do more work here than "just another data warehouse"?

**Microsoft Fabric** is Microsoft's unified analytics platform, and I've gone deep on its internals and its migration path from classic Azure data services [elsewhere on this blog](ms-fabric-internals) — I won't repeat that architecture lesson here, just apply it to robot fleet data specifically. The piece that matters most for this pipeline is **OneLake**, Fabric's single logical data lake sitting on top of ADLS Gen2: your raw MCAP sessions and your converted training Parquet can live in the same underlying storage that every Fabric workload — Lakehouse, Warehouse, Real-Time Intelligence — reads from directly, without a separate copy step between "the data lake" and "the thing Fabric queries."

**Fabric Real-Time Intelligence**, built around Eventstream for ingesting continuous telemetry, is where the fleet-health analytics workload lives — the equivalent of the BigQuery streaming path in the GCP architecture, except here it's native to the same platform your training-data ETL runs on, not a separate warehouse you're bridging to. An ops lead asking "which robots in the west facility had elevated torque faults in the last six hours" is a KQL query against an Eventstream-fed table, not a cross-platform data pull. **Fabric Lakehouse and Warehouse** handle the heavier lifting: the MCAP-to-Parquet ETL for training data, cross-session aggregation, and the kind of ad-hoc fleet-wide analysis a data science team runs against months of accumulated sessions. If your organization already has Power BI dashboards and a Fabric-based data estate for the rest of the business, this is the argument for Azure that has nothing to do with robotics specifically: one platform, one governance model, one set of skills your team already has, rather than a second analytics stack purpose-built for the robot fleet alone.

| Layer | Azure service | Job |
| --- | --- | --- |
| Device connectivity | Azure IoT Hub | Per-device auth, telemetry ingestion, device twins for async state/config |
| Edge compute | Azure IoT Operations | Arc/Kubernetes-native MQTT broker, local inference, selective capture |
| Fleet graph model | Azure Digital Twins | DTDL-modeled graph of robots, components, locations, relationships |
| Raw sensor lake | Azure Data Lake Storage Gen2 | Partitioned rosbag/MCAP storage, tiered lifecycle by session age |
| Unified analytics | Microsoft Fabric (OneLake, Real-Time Intelligence, Lakehouse) | Streaming fleet-health analytics, training-data ETL, ad-hoc analysis |
| Training | Azure Machine Learning | Policy/VLA training on GPU compute, pipelines for retrain-validate-redeploy |

## How does Azure Machine Learning close the retrain-and-redeploy loop?

**Azure Machine Learning** is Azure's managed ML platform, and for this architecture it plays the same role SageMaker plays in the AWS piece and Vertex AI plays in the GCP piece: GPU compute for training the policy or VLA model, plus a pipelines layer (Azure ML pipelines) for turning retraining into a repeatable, versioned workflow rather than a script someone runs from a laptop. The loop has the same shape it has on the other two clouds: labeled sessions accumulate in the Fabric-managed training dataset, a pipeline triggers a retrain once enough new data has landed, the retrained policy gets validated in simulation before touching real hardware, and only a policy that clears validation gets pushed back down through Azure IoT Operations to the fleet, staged as a canary rollout to a subset of robots before the rest. The mechanics here are genuinely comparable in depth to SageMaker and Vertex AI Pipelines. Azure ML doesn't do anything meaningfully different at this layer, which is a fair thing to say plainly rather than manufacturing a distinction that isn't there.

```python
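# Conceptual shape of an Azure ML pipeline retrain trigger, not a runnable job.
# check_new_labeled_data, train_policy, validate_in_simulation, and
# deploy_to_fleet stand in for components you would define or load yourself.
from azure.ai.ml import dsl

@dsl.pipeline(name="fleet-policy-retrain")
def retrain_pipeline(min_new_sessions: int = 5000):
    # Gate 1: only retrain once enough new labeled sessions have landed.
    check = check_new_labeled_data(threshold=min_new_sessions)
    # NOTE: the `with dsl.condition(...)` blocks are pseudocode for gating;
    # check the SDK's current control-flow support before relying on this form.
    with dsl.condition(check.outputs.ready, equal_to="true"):
        train = train_policy(dataset=check.outputs.dataset_uri)
        # Gate 2: a policy reaches the fleet only after passing simulation.
        validate = validate_in_simulation(model=train.outputs.model)
        with dsl.condition(validate.outputs.passed, equal_to="true"):
            deploy_to_fleet(model=train.outputs.model)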
```
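The canary stage at the end of that loop can be sketched without any cloud SDK at all. This is one possible approach, assuming you want a deterministic subset that stays stable for a given policy version; the hashing scheme, the function names, and the fleet IDs are illustrative, not anything Azure ML or IoT Operations provides.

```python
import hashlib


def canary_bucket(robot_id: str, policy_version: str) -> float:
    """Map a robot to a stable value in [0, 1) for a given policy version."""
    digest = hashlib.sha256(f"{policy_version}:{robot_id}".encode()).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def select_canaries(robot_ids, policy_version: str, fraction: float):
    """Pick roughly `fraction` of the fleet, deterministically."""
    return sorted(r for r in robot_ids
                  if canary_bucket(r, policy_version) < fraction)


fleet = [f"r-{i:04d}" for i in range(200)]
first_wave = select_canaries(fleet, "policy-v7", 0.10)
second_wave = select_canaries(fleet, "policy-v7", 0.50)

# Widening the fraction only adds robots; nobody in the first wave drops out.
print(set(first_wave) <= set(second_wave))  # → True
```

Keying the hash on the policy version means each rollout draws a different canary set, so the same handful of robots doesn't absorb the risk of every release.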

## Where does Azure actually win against AWS and GCP for this workload, and where is it weaker?

Azure's real edge is Digital Twins plus Fabric, and I mean that as a genuine, not a diplomatic, answer. Nothing in this blog's AWS or GCP comparisons gives you a first-party, prominent, graph-based fleet model the way Azure Digital Twins does. If physical topology, equipment relationships, and cross-robot context are load-bearing for the questions your ops team actually asks, that's a real capability gap in the other two clouds' stories, not a marketing difference. And if your organization is already deep in Microsoft Fabric and Power BI for the rest of its data estate, running the robot fleet's analytics on the same platform instead of standing up BigQuery or Athena as a second stack is a genuine operational win that has nothing to do with robotics and everything to do with not maintaining two data platforms.

The honest weakness is Azure IoT Operations. It's the newest edge runtime of the three, generally available only since late 2024 and built on a fundamentally heavier architectural model (Arc-managed Kubernetes) than Greengrass's standalone agent or the general-purpose edge compute pattern GCP leans on, and it doesn't have the multi-year production track record that either of the other two options has. If you're already running Arc-managed Kubernetes at your edge sites for other reasons, that's a real point in Azure's favor, not against it: you're extending an operational model you've already paid the cost of adopting. If you're not, you're taking on a container-orchestration dependency at the edge that the other two clouds don't require for the equivalent job. And IoT Central's retirement (no new application resources since April 2024, full shutdown by March 2027) is the third data point in the "don't bet your core pipeline on the high-level managed IoT product" lesson, following RoboMaker and FleetWise. It should weigh on you a little more, not less, that this pattern now has three data points across two different vendors.

## What to carry away

The Azure-native version of this architecture runs IoT Hub and device twins for connectivity and per-device state, Azure IoT Operations as the Arc/Kubernetes-native edge runtime, Azure Digital Twins as a genuine graph-based fleet model neither AWS nor GCP matches in prominence, ADLS Gen2 as the raw sensor lake, Microsoft Fabric as the unified analytics layer spanning streaming fleet-health metrics and training-data ETL, and Azure Machine Learning for the training and retrain-redeploy loop. Pick this stack deliberately for one of two reasons, and only those two: you need graph-based fleet modeling to answer questions a flat telemetry table can't answer cleanly, or you're already standardized on Fabric and Power BI and don't want a second analytics platform just for the robots. Pick it reluctantly if you need the most battle-tested edge runtime available today. That's still Greengrass or a general-purpose edge-compute pattern, not IoT Operations, at least until it accumulates a few more years of production mileage. And factor the IoT Central retirement into how much confidence you extend to any single vendor's high-level managed IoT product going forward: three cautionary tales, counting RoboMaker and FleetWise, is enough data to change your default assumption, not just a one-off. For the full comparison, including NVIDIA's blueprint, see the [capstone comparison piece](robotics-data-factory-design-comparison), now updated to include this architecture as a fourth option.
