# Designing a Robotics Data Factory: Comparing the AWS, GCP, Azure, and NVIDIA Reference Architectures

Seven articles into this topic area, I keep getting asked the same question in a different wrapper: "just tell me which one to build." Fair enough. I've now written the DIY reference architecture on AWS, GCP, and Azure, walked through NVIDIA's opinionated Cosmos/OSMO blueprint, and covered the real-world teleoperation data collection that underlies all of it. This is the piece where I stop being even-handed and actually tell you what I'd build, and when I'd build something else instead.

## What are the four approaches actually being compared?

Four distinct architectural bets, each covered in its own article on this blog.

The **AWS DIY reference architecture** (Greengrass edge capture, S3 partitioned raw lake, Batch/EMR ETL to Glue-cataloged Parquet, SageMaker Ground Truth labeling, EC2/SageMaker training) assembles durable, general-purpose AWS primitives into a robotics pipeline, deliberately avoiding dependence on narrow robotics-branded managed services after watching AWS RoboMaker get discontinued and IoT FleetWise close to new customers.

The **GCP DIY reference architecture** (Pub/Sub streaming ingestion, Dataflow ETL, GCS raw lake, BigQuery for fleet-health analytics, Vertex AI Pipelines for retrain-and-redeploy) makes the same kind of general-purpose-primitives bet on GCP's stack instead, with a genuinely different ingestion philosophy (streaming-native rather than batch-first) and a stronger ad-hoc analytics story through BigQuery.

The **[Azure DIY reference architecture](robotics-data-platform-azure)** (IoT Hub and device twins for connectivity, Azure IoT Operations as the Arc/Kubernetes-native edge runtime, Azure Digital Twins for graph-based fleet modeling, ADLS Gen2 as the raw lake, Microsoft Fabric for unified analytics, Azure Machine Learning for training) is the same general-purpose-primitives philosophy applied a third way, distinguished by a genuine differentiator (a first-party graph-based digital twin service neither AWS nor GCP matches in prominence) and a genuine weakness (the newest, least battle-tested edge runtime of the three).

**NVIDIA's Open Physical AI Data Factory Blueprint** (Cosmos Curator, Cosmos Transfer, Cosmos Evaluator, orchestrated by OSMO) is the opposite kind of bet: an opinionated, vendor-specific, synthetic-data-centric pipeline purpose-built for exactly this problem, at the cost of depending on NVIDIA's specific stack rather than general-purpose cloud primitives.

Azure's stack deserves a beat of its own before the table, because it isn't a find-replace of the other two clouds' patterns. Its clearest edge is **Azure Digital Twins**: a DTDL-modeled graph of robots, components, and physical locations that lets you ask relationship questions ("which robots share a charging station with this one, and are any also showing elevated fault rates") that are awkward against a flat telemetry table on either AWS or GCP. Paired with **Microsoft Fabric**'s OneLake as a single logical lake underneath both the streaming fleet-health path and the training-data ETL, Azure's pitch is strongest for an organization already standardized on Fabric and Power BI for the rest of its data estate: one analytics platform instead of a second one built just for the robots. Its honest weakness is **Azure IoT Operations**, the Arc/Kubernetes-native edge runtime that reached general availability in late 2024: it's architecturally heavier than Greengrass's standalone agent or GCP's general-purpose edge-compute pattern, and it simply hasn't accumulated the production track record either of the other two has. And Azure adds a third data point to the managed-IoT-retirement pattern. **Azure IoT Central** stopped accepting new application resources in April 2024 and fully retires in March 2027, joining AWS RoboMaker and IoT FleetWise as evidence that the higher-level, more opinionated IoT products on more than one major cloud age worse than the primitives underneath them.
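The charging-station question quoted above is essentially a two-hop graph traversal: robot to station, then station back to its other robots, filtered on a property. Here is a minimal sketch of that shape in plain Python. It deliberately does not use the Azure Digital Twins API or its query language; the robot names, station names, fault rates, and threshold are all hypothetical values made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy fleet: robot -> charging station edges, plus a recent
# fault rate per robot. None of these names or numbers come from a real fleet.
CHARGES_AT = {
    "robot-a": "station-1",
    "robot-b": "station-1",
    "robot-c": "station-1",
    "robot-d": "station-2",
}
FAULT_RATE = {"robot-a": 0.01, "robot-b": 0.09, "robot-c": 0.02, "robot-d": 0.12}


def station_peers_with_faults(robot, charges_at, fault_rate, threshold):
    """Return robots that share `robot`'s charging station and exceed `threshold`."""
    # Invert the robot -> station edges so we can walk station -> robots.
    robots_at = defaultdict(set)
    for r, station in charges_at.items():
        robots_at[station].add(r)

    # Hop 1: robot -> its station. Hop 2: station -> every other robot there.
    peers = robots_at[charges_at[robot]] - {robot}
    return sorted(p for p in peers if fault_rate.get(p, 0.0) > threshold)


print(station_peers_with_faults("robot-a", CHARGES_AT, FAULT_RATE, threshold=0.05))
```

Against a flat telemetry table the same question typically needs a self-join on the station column, which is the awkwardness the graph model is meant to remove.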

And underneath all three sits the piece none of them replace: **real-world teleoperation data collection** — the expensive, human-hours-priced layer that produces the ground-truth demonstration data every one of these pipelines eventually needs to validate against, no matter how much synthetic data a Cosmos Transfer-style pipeline generates.

```mermaid
graph TD
    subgraph DIY["DIY cloud reference architectures"]
        AWS["AWS: Greengrass, S3, Batch/EMR, SageMaker"]
        GCP["GCP: Pub/Sub, Dataflow, GCS, BigQuery, Vertex AI"]
        AZURE["Azure: IoT Hub/Operations, Digital Twins, Fabric, Azure ML"]
    end
    NVIDIA["NVIDIA blueprint: Cosmos Curator/Transfer/Evaluator + OSMO"]
    TELEOP["Real-world teleoperation data collection"]
    DIY --> TRAIN["Trained policy"]
    NVIDIA --> TRAIN
    TELEOP -->|"ground-truth validation and fine-tuning data"| TRAIN
```

Four architecturally different paths to a trained policy, converging on the same requirement: none of them, including NVIDIA's synthetic-data-heavy blueprint, eliminates the need for real-world teleoperated data somewhere in the loop.

## How do they actually compare, axis by axis?

| Axis | AWS DIY | GCP DIY | Azure DIY | NVIDIA Blueprint |
| --- | --- | --- | --- | --- |
| Bandwidth/edge-capture story | Greengrass selective capture, mature edge tooling | Pub/Sub streaming-native, strong for continuous telemetry | IoT Operations, Arc/Kubernetes-native but least battle-tested | Not an edge-capture product — assumes data already exists |
| Synthetic vs. real emphasis | Real-fleet data-centric; synthetic is a separate concern | Real-fleet data-centric; same as AWS in this respect | Real-fleet data-centric; same as AWS/GCP | Synthetic/augmented-data-centric by design |
| Managed-service maturity for this use case | General-purpose primitives, not robotics-specific | General-purpose primitives, not robotics-specific | General-purpose primitives, plus a genuine graph-twin differentiator | Purpose-built specifically for physical AI data generation |
| Vendor lock-in risk | Low — primitives outlast robotics-branded services | Low — same reasoning as AWS | Low on primitives; DTDL modeling work is its own soft lock-in | Higher — real dependency on NVIDIA's Cosmos/OSMO stack |
| Cost model | Storage + compute + labeling headcount, scales with fleet size | Similar, streaming ingestion cost shape differs | Similar, plus Fabric capacity cost if not already provisioned | GPU-compute-heavy, scales with generation volume |
| Operational maturity / support runway | RoboMaker and FleetWise retirements are a real lesson here | No equivalent robotics-specific retirement yet to my knowledge | IoT Central retirement is a third data point in the same pattern | New (2026 announcement) — support runway still unproven |

The retirement row deserves a second look because it's the axis most teams underweight, and it's no longer an AWS-only pattern. Building on durable, general-purpose primitives (S3, Batch, Pub/Sub, BigQuery, IoT Hub, ADLS Gen2) means your core pipeline doesn't depend on a narrow, opinionated managed service that a cloud vendor might discontinue. That is exactly what happened to RoboMaker in September 2025, to IoT FleetWise's new-customer availability in April 2026, and to Azure IoT Central, which stopped accepting new application resources in April 2024 and fully retires in March 2027. Three retirements across two vendors is a pattern, not a coincidence: the higher-level, more "batteries-included" IoT and robotics products consistently have a rockier survival record than the lower-level primitives underneath them. NVIDIA's blueprint is new enough, and different enough in kind (it's a data-generation pipeline, not a simulation-and-fleet-telemetry service), that it's not a direct parallel to any of the three. But it's worth naming plainly: adopting any vendor's opinionated, narrowly scoped product for a core pipeline dependency carries the same category of risk, regardless of which logo is on the box, and three data points now say you should weight that risk higher than a single retirement would suggest.

## When would I actually recommend each one?

If you're pre-product-market-fit on the robotics side and need to move fast on training-data volume without a large existing data-engineering team, start with NVIDIA's blueprint or a similarly opinionated synthetic-data stack. You get a working, purpose-built pipeline faster than assembling one from general-purpose primitives, and at an early stage the lock-in risk matters less than the speed-to-first-useful-model.

If you're already deep in AWS, GCP, or Azure for the rest of your data estate (data warehouse, existing ML infrastructure, existing team expertise) and you need long-term architectural control over a pipeline that's going to be core infrastructure for years, build the DIY reference architecture on whichever cloud you're already standardized on, following the general-purpose-primitives principle the RoboMaker/FleetWise/IoT Central lesson teaches.

Between the three clouds specifically: lean AWS if your fleet's data shape is closer to batch-oriented session capture (which is how most mobile-robot fleets naturally produce data) and you want Greengrass's mature edge tooling; lean GCP if you're already BigQuery-centric for analytics and want a genuinely streaming-native ingestion story rather than adapting a batch-first pattern to streaming data; lean Azure if you're already deep in Microsoft Fabric and Power BI for the rest of your data estate and want first-party, graph-based digital-twin fleet modeling that neither AWS nor GCP offers with comparable prominence, accepting in return the least battle-tested edge runtime of the three.
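The branching above can be written down as a small decision function, which is sometimes a useful way to check that a team agrees on its own inputs. This is only a sketch of the recommendation as stated in this article: the `TeamProfile` fields, the function names, and the returned labels are hypothetical names chosen for the example, and combinations the article doesn't address come back as `"undecided"` rather than being guessed at.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TeamProfile:
    pre_pmf: bool                   # still searching for product-market fit
    has_data_eng_team: bool         # sizeable data-engineering team already in place
    existing_cloud: Optional[str]   # "aws", "gcp", "azure", or None
    needs_long_term_control: bool   # pipeline will be core infrastructure for years


def recommend(team: TeamProfile) -> str:
    """Map a team profile to the starting architecture suggested in the text."""
    if team.pre_pmf and not team.has_data_eng_team:
        return "nvidia-blueprint"
    if team.existing_cloud in {"aws", "gcp", "azure"} and team.needs_long_term_control:
        return f"diy-{team.existing_cloud}"
    # The article does not prescribe a path for other combinations.
    return "undecided"


def lean_cloud(batch_session_capture: bool, bigquery_centric: bool,
               fabric_centric: bool) -> List[str]:
    """List the clouds the tie-break criteria point toward (they can overlap)."""
    leans = []
    if batch_session_capture:
        leans.append("aws")
    if bigquery_centric:
        leans.append("gcp")
    if fabric_centric:
        leans.append("azure")
    return leans


print(recommend(TeamProfile(pre_pmf=False, has_data_eng_team=True,
                            existing_cloud="gcp", needs_long_term_control=True)))
```

`lean_cloud` returns a list rather than a single answer on purpose: a batch-oriented fleet inside a Fabric-centric organization gets two leans, and that conflict is something to resolve by discussion, not by code.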

Whichever path you take, budget real money for teleoperated real-world data collection regardless. No synthetic-data pipeline, including NVIDIA's, gets you a production-ready policy on its own yet — the ground-truth validation and fine-tuning signal still comes from real robots doing real tasks under real human supervision, and every serious program I've seen ends up paying for both, not choosing one over the other.

**If you only read one article in this whole series, make it this recommendation:** build vs. adopt is not a permanent choice, and the right first move for a small team (adopt an opinionated blueprint) is not the right long-term move for an org that's making robotics core to the business (build on durable general-purpose primitives). Plan to migrate from the first to the second as the program matures, and don't let the sunk cost of an early opinionated adoption keep you locked into it past the point where architectural control starts mattering more than speed.

## What actually goes wrong if you pick the "wrong" one?

Less than you'd think, honestly, and that's worth saying because it takes the pressure off treating this as a one-shot, unrecoverable decision. Teams that adopt NVIDIA's blueprint early and later need more architectural control can migrate the training and orchestration layers onto general-purpose cloud primitives without discarding the synthetic-data corpus they've already generated: the data itself is portable even if the generation pipeline isn't. Teams that build the full DIY architecture early and later want NVIDIA's synthetic-data capabilities can adopt Cosmos components as an additional data-generation stage feeding into their existing S3/GCS/ADLS lake, rather than replacing anything. The actual mistake isn't picking AWS over GCP over Azure, or DIY over NVIDIA. It's treating your real-world teleoperation data collection budget as optional past the first year, because that's the one piece that doesn't have a "migrate later" option. You either fund it or your policies plateau on the sim-to-real gap indefinitely.
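One way to keep both migration directions open is to tag every episode in the lake with its provenance, so synthetic data can be added as an extra stage while validation stays anchored to real teleoperation data. The sketch below assumes a hypothetical `Episode` record with a `source` field and a hypothetical `split_episodes` helper; it isn't part of any vendor's pipeline, and the 20% hold-out is an arbitrary default.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    uri: str      # object location in the lake (hypothetical path scheme)
    source: str   # "teleop" for real demonstrations, "synthetic" for generated data


def split_episodes(episodes, val_fraction=0.2, seed=0):
    """Hold out validation episodes from real teleop data only; train on the rest."""
    real = [e for e in episodes if e.source == "teleop"]
    if not real:
        raise ValueError("no real teleoperation episodes available for validation")

    # Shuffle a copy of the real episodes deterministically, then take the hold-out.
    rng = random.Random(seed)
    rng.shuffle(real)
    n_val = max(1, int(len(real) * val_fraction))
    val = real[:n_val]

    # Training gets everything not held out: remaining real plus all synthetic.
    held_out = set(val)
    train = [e for e in episodes if e not in held_out]
    return train, val


episodes = (
    [Episode(f"lake/teleop/ep{i}", "teleop") for i in range(5)]
    + [Episode(f"lake/synthetic/ep{i}", "synthetic") for i in range(5)]
)
train, val = split_episodes(episodes)
print(len(train), len(val))
```

The design choice worth noting is that the split fails loudly when there is no real data at all, which mirrors the argument above: a corpus with nothing real in it has nothing to validate against.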

## What to carry away

The AWS, GCP, and Azure DIY reference architectures are the same underlying bet on three different clouds: durable general-purpose primitives over narrow, opinionated managed services, informed directly by the RoboMaker, FleetWise, and IoT Central retirement lessons. Azure's version earns its place as a genuine option rather than a checkbox: Azure Digital Twins gives you graph-based fleet modeling that nothing else in this blog's cloud comparisons offers, and Microsoft Fabric's unified analytics story is a real reason to pick it if you're already invested in that ecosystem, though you're accepting the newest and least battle-tested edge runtime of the three in return. NVIDIA's Open Physical AI Data Factory Blueprint is a genuinely different bet (opinionated, purpose-built, synthetic-data-centric) that trades architectural control for speed and specificity. Real-world teleoperation data collection isn't a fifth option alongside these four; it's the layer every one of them still needs underneath, because no synthetic pipeline gets you a production-ready policy without real-world validation and fine-tuning data. If you're moving fast pre-PMF, start with an opinionated blueprint. If you're building long-term core infrastructure on a cloud you already run (including Azure, if graph-based fleet modeling or a unified Fabric analytics estate matters to you), build the DIY architecture on that cloud. Either way, fund the teleoperation data collection from day one: it's the one line item in this whole comparison that doesn't have a cheaper substitute.
