Building a Data Platform for Robot Fleets on AWS: From Edge Capture to Training-Ready Data

The question that stops most robotics teams cold isn't "can we train a good policy." It's "how do we get the data off five hundred robots and into a place where anyone can train on it." I've watched teams nail the modeling side — a solid VLA fine-tune, a sim-to-real pipeline that actually works — and then quietly stall for two quarters because nobody designed the boring part: the pipe between a robot's camera and a training job's S3 bucket. That pipe is a data engineering problem wearing a robotics costume, and it's exactly the kind of problem I get called in for. This is the reference architecture I'd actually build on AWS in mid-2026, including two services I'd have recommended eighteen months ago that you should not build on today, and why that matters more than any individual service choice.

I covered the model side of this world in three earlier pieces — what Physical AI actually is, how VLA models turn perception into motor commands, and why simulation is really a synthetic-data pipeline. All three assume a training-ready dataset shows up. This piece is about how it actually gets there, from a real fleet, on real AWS infrastructure.

What does a robot actually log, and why can't you ship all of it?

A single mobile robot running ROS 2, with a few RGB cameras, a lidar, and joint-state telemetry, can generate tens of gigabytes per hour — a stereo camera pair alone at 30fps and modest resolution is a meaningful chunk of that, before you add lidar point clouds and IMU/proprioception at high frequency. Multiply by a fleet of a few hundred robots running multi-hour shifts and you're staring at bandwidth and storage numbers that don't clear a sanity check if your plan is "stream everything continuously to the cloud." This is the same math that sinks connected-vehicle telemetry projects, and it sinks robotics data platforms for the identical reason: nobody budgets for the network cost of raw sensor data until the first month's bill arrives.
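
To put rough numbers on that sanity check, here's a back-of-envelope sketch. Every parameter in it (resolution, compression ratio, lidar and telemetry rates, fleet size, shift length) is an assumption I picked for illustration, not a measurement from a real fleet; swap in your own sensor suite before drawing conclusions.

```python
# Back-of-envelope data volume. Every parameter is an illustrative assumption.

def camera_bytes_per_hour(width, height, fps, bytes_per_pixel=3, compression_ratio=20.0):
    """Compressed bytes per hour for one RGB camera stream."""
    raw_bytes_per_second = width * height * bytes_per_pixel * fps
    return raw_bytes_per_second / compression_ratio * 3600


def fleet_terabytes_per_day(bytes_per_robot_hour, robots, shift_hours):
    """Total fleet volume per day, in decimal terabytes."""
    return bytes_per_robot_hour * robots * shift_hours / 1e12


# Assumed sensor suite: a 720p stereo pair at 30 fps, ~5 MB/s of lidar,
# and ~0.2 MB/s of joint-state and IMU telemetry.
stereo_pair = 2 * camera_bytes_per_hour(1280, 720, 30)
lidar = 5e6 * 3600
telemetry = 0.2e6 * 3600
per_robot_hour = stereo_pair + lidar + telemetry

fleet_per_day = fleet_terabytes_per_day(per_robot_hour, robots=300, shift_hours=8)
print(f"{per_robot_hour / 1e9:.1f} GB per robot-hour, {fleet_per_day:.0f} TB per fleet-day")
```

Under these assumed inputs the sketch lands at roughly 48.6 GB per robot-hour and about 117 TB per day for 300 robots on eight-hour shifts, which is why "stream everything" fails the sanity check long before anyone argues about service choices.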

Robots log this data as ROS 2 bag files (the native recording format for ROS 2 topics) or, increasingly, as MCAP: a container format designed specifically for robotics and multi-modal time-series data, built to be self-describing and streamable, and now the format most new tooling standardizes on because it isn't tied to one middleware. Either way, you're capturing the same thing: camera frames, lidar scans, joint states, and proprioceptive sensor readings, all timestamped and multiplexed across topics. The question you have to answer before you write a line of infrastructure code is: how much of this leaves the robot, when, and at what fidelity? Get that wrong and you're either paying to ship data nobody will ever train on, or you're missing the one failure-mode clip you actually needed.

How does AWS IoT Greengrass fit at the edge?

AWS IoT Greengrass is AWS's edge runtime — software that runs on the robot's onboard compute (or a nearby edge box) and lets you deploy containerized components, run local inference, and manage software and model updates across a fleet without round-tripping every decision through the cloud. In this architecture it does three jobs. First, it runs the on-device inference for whatever policy or perception model is currently deployed, so the robot doesn't need a live cloud connection to operate — which matters because robots are frequently not on a reliable network at the moment they need to act. Second, it's where you put your selective capture logic: a Greengrass component that watches ROS 2 topics and decides, in real time, what gets buffered for later upload versus what gets discarded or heavily downsampled. Third, it's the OTA (over-the-air) channel — the same mechanism that pushes a new perception model down to the fleet is the one that eventually receives the retrained policy coming back out of your training pipeline, closing the loop this whole architecture exists to close.

The practical shape of that middle job — selective capture — is a set of rules running as a Greengrass component: keep full-resolution camera frames and lidar for the thirty seconds around a detected anomaly (a torque spike, an emergency stop, a perception-confidence drop), downsample everything else to a fraction of native resolution and frame rate, and drop redundant proprioceptive samples above the rate anyone will actually use for training. None of this is exotic engineering — it's the same "sample smartly at the edge" pattern every IoT telemetry system eventually reinvents, and reinventing it is exactly what current AWS guidance tells you to do, for a reason covered next.
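
Here's a minimal sketch of that rule set in plain Python. It is not Greengrass API code: the `Frame` and `SelectiveCapture` names, the 15-second pre-roll and post-roll (which together give the thirty-second window), and the keep-every-tenth-frame downsampling are all assumptions for illustration. In a real component, `on_frame` would be driven by ROS 2 subscriber callbacks and the anomaly flag by your own detectors.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    t: float        # capture time in seconds
    payload: bytes  # encoded sensor data


class SelectiveCapture:
    """Full-rate clip around anomalies, plus a continuous low-rate stream."""

    def __init__(self, pre_s=15.0, post_s=15.0, keep_every_n=10):
        self.pre_s = pre_s
        self.post_s = post_s
        self.keep_every_n = keep_every_n
        self._ring = deque()               # pre-roll frames not yet committed
        self._keep_until = float("-inf")   # end of the current event window
        self._count = 0
        self.low_rate = []                 # downsampled stream, always kept
        self.event_clip = []               # full-rate frames around anomalies

    def on_frame(self, frame, anomaly=False):
        self._count += 1
        if self._count % self.keep_every_n == 0:
            self.low_rate.append(frame)

        # Drop pre-roll frames that have aged out of the window.
        while self._ring and self._ring[0].t < frame.t - self.pre_s:
            self._ring.popleft()

        if anomaly:
            # Commit the lead-up and open (or extend) the event window.
            self.event_clip.extend(self._ring)
            self._ring.clear()
            self._keep_until = frame.t + self.post_s

        if frame.t <= self._keep_until:
            self.event_clip.append(frame)
        else:
            self._ring.append(frame)


capture = SelectiveCapture()
for second in range(60):
    capture.on_frame(Frame(float(second), b""), anomaly=(second == 30))
print(len(capture.low_rate), len(capture.event_clip))
```

Keeping the low-rate stream and the event clips as separate outputs is a design choice in this sketch: it avoids de-duplicating frames and lets each stream get its own upload priority.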

Two services you'd once have reached for on this exact architecture are no longer where you should start, and the reasons matter more than the fact. AWS RoboMaker, AWS's managed robotics simulation and development service, reached end of support on September 10, 2025; it had already been closed to new customers before that date, and AWS's own migration guidance for existing customers is to move simulation workloads onto AWS Batch running containerized simulation instead. AWS IoT FleetWise was built for connected-vehicle telemetry, using rule-based conditional collection against the Vehicle Signal Specification (VSS) so you only ship the signal ranges and event windows you actually need. It closed to new customers on April 30, 2026, ahead of this article's publish date; existing customers keep running it, but there's no new feature investment and you can't newly adopt it.

Neither retirement means the underlying pattern was wrong. RoboMaker's actual lesson: don't make your core data and training pipeline depend on a narrow, robotics-branded managed service when the equivalent job runs fine on durable general-purpose primitives (S3, Batch, EC2 GPU instances, containers) that AWS has every incentive to keep investing in for a decade. FleetWise's lesson is subtler: the selective-collection pattern it embodied (rule-based conditional capture, plus a standardized signal schema so different fleets, vehicles, and robots describe events comparably) is genuinely the right architecture. The specific managed service just isn't the vehicle for it anymore if you're starting today. Instead of adopting FleetWise fresh, you'd reimplement that pattern yourself as a custom Greengrass component plus IoT Core rules, or follow one of AWS's partner-published "Guidance for Connected Mobility on AWS"-style reference architectures.

Why does S3 partitioning strategy matter this much for robot-fleet data?

Because the wrong partition scheme turns every downstream query — "give me all camera frames from robot 47's shift on June 3rd" — into a full-bucket scan, and at fleet scale that's the difference between a five-second Athena query and a five-minute one that costs real money every time someone runs it. The partition key that works in practice is a composite: fleet_id / robot_id / date / session_id, with the session boundary defined by something meaningful in robot terms — a shift, a mission, a single continuous operation — rather than an arbitrary time window. This mirrors how you'd partition IoT or clickstream data by device and day, except the "device" here is a physical asset with a maintenance history and the "session" has a start and end condition tied to a real-world task, which is worth encoding in the path rather than inferring later from timestamps.

s3://robot-fleet-raw/
  fleet_id=warehouse-west/
    robot_id=r-0472/
      date=2026-07-14/
        session_id=sess-88213/
          camera_front.mcap
          camera_rear.mcap
          lidar.mcap
          joint_states.mcap
          session_manifest.json
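
The layout above is easiest to keep consistent if exactly one function builds the prefix. A minimal sketch, assuming Hive-style `key=value` path segments as shown; the `session_prefix` helper and its character whitelist are hypothetical, not part of any AWS SDK.

```python
import re
from datetime import date

# Assumed whitelist: keeps IDs from smuggling "/" or "=" into the key.
_SAFE_ID = re.compile(r"^[A-Za-z0-9._-]+$")


def session_prefix(fleet_id: str, robot_id: str, day: date, session_id: str) -> str:
    """Build the fleet/robot/date/session prefix used in the raw lake."""
    for name, value in (
        ("fleet_id", fleet_id),
        ("robot_id", robot_id),
        ("session_id", session_id),
    ):
        if not _SAFE_ID.match(value):
            raise ValueError(f"{name} contains unsafe characters: {value!r}")
    return (
        f"fleet_id={fleet_id}/robot_id={robot_id}/"
        f"date={day.isoformat()}/session_id={session_id}/"
    )


prefix = session_prefix("warehouse-west", "r-0472", date(2026, 7, 14), "sess-88213")
print(prefix + "camera_front.mcap")
```

Using `key=value` segments is what lets Glue and Athena treat each path level as a partition column without a custom mapping.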

Storage class is the second lever, and it's genuinely a tiered decision rather than a single answer. Sessions actively feeding a current training run belong in S3 Standard — you want them fast and cheap to scan repeatedly during iteration. Sessions from the last month or two, still plausibly useful for the next training cycle but not in active rotation, are a fit for S3 Intelligent-Tiering, which shifts them automatically as access patterns change without you hand-managing lifecycle rules per session. The long tail — sessions from six months ago that you're keeping because deleting robot data feels irreversible and you might need it for some future edge case — belongs in Glacier-class storage, accessed rarely and cheaply. The lifecycle policy that actually gets this right transitions by session age automatically, because nobody is going to manually reclassify ten thousand sessions a month, and if the policy requires a human in the loop it won't happen consistently.
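
A lifecycle policy along those lines can be sketched as the configuration dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The 30-day and 180-day thresholds and the rule ID are assumptions, and S3 transitions key off object age, so this only approximates session age if objects are written shortly after the session ends. The client is passed in rather than created here, so the sketch runs without AWS credentials.

```python
def session_lifecycle_rules(prefix="", to_intelligent_tiering_days=30, to_glacier_days=180):
    """Lifecycle configuration that tiers raw sessions by object age."""
    if to_glacier_days <= to_intelligent_tiering_days:
        raise ValueError("Glacier transition must come after Intelligent-Tiering")
    return {
        "Rules": [
            {
                "ID": "tier-raw-sessions-by-age",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},      # "" applies to the whole bucket
                "Transitions": [
                    {"Days": to_intelligent_tiering_days, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": to_glacier_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, configuration):
    """Apply the rules; s3_client is expected to be a boto3 S3 client."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=configuration,
    )


rules = session_lifecycle_rules()
print([t["StorageClass"] for t in rules["Rules"][0]["Transitions"]])
```

Sessions feeding an active training run stay in S3 Standard simply by being younger than the first threshold; if you need to pin older sessions in Standard, that takes a separate prefix or tag filter, which this sketch leaves out.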

How do you get from rosbag/MCAP to training-ready Parquet?

You need a batch ETL layer, because a training job doesn't want to read MCAP files directly; it wants columnar, queryable, schema-consistent data it can filter and join efficiently. The conversion job reads MCAP sessions from the raw S3 lake, demultiplexes the topics, aligns timestamps across sensor streams (camera, lidar, and joint state rarely sample at the same rate, so you need a consistent join key across them), and writes out Parquet partitioned the same way as the raw data. For the compute layer, AWS Batch, the same service AWS's migration guidance points RoboMaker simulation workloads toward, is a good fit: this is a straightforward containerized batch job, one container per session or per day's worth of sessions, and it scales cleanly on Spot instances because an interrupted conversion job simply retries without losing anything. For heavier aggregate transforms (joining across sessions, computing fleet-wide statistics, deduplicating near-identical frames), an EMR/Spark cluster is the better tool, because that's a genuinely distributed join-and-aggregate problem rather than an embarrassingly parallel per-session conversion.
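
The timestamp-alignment step is the part that usually needs the most thought, so here's a minimal sketch of one common approach: nearest-neighbor matching within a tolerance, using only the standard library. The function name, the millisecond units, and the 5 ms tolerance are assumptions for illustration; at scale you'd more likely reach for something like pandas' `merge_asof` with `direction="nearest"`, or the Spark equivalent.

```python
from bisect import bisect_left


def align_nearest(reference_ts, other_ts, tolerance):
    """For each reference timestamp, return the index of the nearest sample
    in other_ts, or None if nothing lies within the tolerance.
    Both inputs must be sorted in ascending order."""
    matches = []
    for t in reference_ts:
        i = bisect_left(other_ts, t)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Integer milliseconds: camera at ~30 Hz, joint states at 100 Hz, lidar at 10 Hz.
camera_ms = [0, 33, 67, 100]
joint_ms = list(range(0, 101, 10))
lidar_ms = [0, 100]

print(align_nearest(camera_ms, joint_ms, tolerance=5))
print(align_nearest(camera_ms, lidar_ms, tolerance=5))
```

Returning `None` instead of forcing a match matters here: a camera frame with no lidar scan within tolerance should show up as a null in the Parquet row, not as a silently stale reading.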

Once the Parquet lands, register it in the AWS Glue Data Catalog so it's queryable via Athena or usable directly as a Spark/EMR data source without anyone having to know the physical S3 layout. This is the step teams skip under deadline pressure and regret within a month, because the alternative is every data scientist re-deriving "which sessions have a valid grasp-success label" by hand from S3 prefixes, which doesn't scale past the second person who needs to ask that question.
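
Once the table is cataloged, a question like "robot 47's shift on June 3rd" becomes a query that filters on partition columns, which is what lets Athena skip every other prefix. A small sketch that builds such a query; the table name `robot_frames`, the selected columns, and the assumption that partition columns are typed as strings are all hypothetical.

```python
import re
from datetime import date

_TABLE = re.compile(r"^[a-z0-9_]+$")
_ID = re.compile(r"^[A-Za-z0-9._-]+$")


def frames_for_robot_day(table: str, fleet_id: str, robot_id: str, day: date) -> str:
    """Build a SELECT that filters only on partition columns."""
    if not _TABLE.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    for value in (fleet_id, robot_id):
        if not _ID.match(value):
            raise ValueError(f"unsafe partition value: {value!r}")
    return (
        f'SELECT session_id, topic, timestamp_ns FROM "{table}" '
        f"WHERE fleet_id = '{fleet_id}' "
        f"AND robot_id = '{robot_id}' "
        f"AND \"date\" = '{day.isoformat()}'"
    )


query = frames_for_robot_day("robot_frames", "warehouse-west", "r-0047", date(2026, 6, 3))
print(query)
```

The whitelist check is there because the values are interpolated into SQL text; if your client library supports parameterized queries, prefer those over string building.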

graph TD
    R["Robot fleet<br/>ROS 2 / MCAP sensor streams"] --> GG["AWS IoT Greengrass<br/>edge inference + selective capture"]
    GG -->|"filtered/compressed<br/>uplink"| S3RAW["S3 raw lake<br/>partitioned by fleet/robot/date/session"]
    S3RAW --> BATCH["AWS Batch / EMR<br/>MCAP to Parquet ETL"]
    BATCH --> GLUE["Glue Data Catalog<br/>queryable via Athena"]
    GLUE --> LABEL["SageMaker Ground Truth<br/>human-in-the-loop labeling"]
    LABEL --> TRAIN["EC2 GPU / SageMaker<br/>policy training"]
    TRAIN -->|"updated policy"| GG
    GG -->|"deployed to fleet"| R

The full loop: selective capture at the edge keeps the uplink affordable, S3 partitioning keeps the raw lake queryable at fleet scale, Batch/EMR does the unglamorous MCAP-to-Parquet conversion, Ground Truth adds the human labels the policy training still needs, and Greengrass is both the capture layer and the OTA channel that closes the loop back to the robot.

Why do you still need human labeling if the model learns from demonstration?

Because raw sensor data plus raw actions isn't automatically the label a VLA or policy-training pipeline needs. You frequently need humans to annotate success/failure outcomes, segment a long session into discrete task attempts, tag objects and grasp points in camera frames, or flag sessions where something went wrong in a way the automated telemetry didn't clearly capture. SageMaker Ground Truth is AWS's managed data-labeling service for exactly this: it manages the labeling workforce (your own team, a vendor workforce, or Mechanical Turk, depending on sensitivity), the labeling UI, and quality-control sampling, and it reads its input from S3 through a manifest that lists the objects to label, so it works against the lake you already have rather than a separate labeling store. The realistic expectation to set here: labeling remains the bottleneck in almost every fleet data pipeline I've seen, not the model architecture and not the compute. You can generate hundreds of thousands of synthetic simulation trajectories in an afternoon (I go through exactly how in the sim-to-real piece), but real-world session labeling is still fundamentally a human-hours problem, and it scales with headcount, not GPU budget.

Where do training and the OTA feedback loop close?

Training runs on GPU compute: EC2 GPU instances (P4/P5-class) for teams that want direct control over the training environment, or SageMaker's managed training jobs for teams that would rather not own the cluster orchestration. Either way, the input is the labeled, Parquet-backed, Glue-cataloged dataset from the steps above, and the output is a retrained or fine-tuned policy. The part that's easy to build once and then neglect is what happens next: getting that policy back onto the fleet. This is where Greengrass's OTA channel, the third of its jobs described earlier, comes back in: the same component-deployment mechanism used to push a new capture-filtering rule can push a new model artifact, with a staged rollout (a canary subset of robots first, the full fleet after validation) rather than a flag-day cutover to every robot at once. Skipping the staged rollout is how one bad checkpoint takes down a whole warehouse shift instead of three test robots.
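
The staged-rollout logic itself is small enough to sketch. This is planning code only, under stated assumptions: the `plan_rollout` and `should_promote` helpers, the salted-hash selection, and the zero-failure promotion gate are all hypothetical, and the actual Greengrass deployment call for each stage is left out.

```python
import hashlib


def plan_rollout(robot_ids, canary_count, salt):
    """Split the fleet into a deterministic canary group and the remainder.

    Ranking by a salted hash keeps the choice stable across runs, and
    changing the salt per release rotates which robots go first."""
    ranked = sorted(
        robot_ids,
        key=lambda rid: hashlib.sha256(f"{salt}:{rid}".encode()).hexdigest(),
    )
    return ranked[:canary_count], ranked[canary_count:]


def should_promote(canary_failures, canary_size, max_failure_rate=0.0):
    """Gate the full-fleet stage on the canary results."""
    if canary_size == 0:
        return False  # nothing was validated, so do not promote
    return canary_failures / canary_size <= max_failure_rate


fleet = [f"r-{n:04d}" for n in range(1, 201)]
canary, remainder = plan_rollout(fleet, canary_count=3, salt="policy-release-example")
print(len(canary), len(remainder), should_promote(0, len(canary)))
```

What counts as a canary failure (a watchdog trip, a drop in task success rate, an operator override) is the part you have to define for your own fleet; the gate is only as good as that signal.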

| Layer | AWS service | Job |
| --- | --- | --- |
| Edge capture & inference | AWS IoT Greengrass | Local policy execution, selective sensor capture, OTA deployment channel |
| Raw sensor lake | Amazon S3 | Partitioned rosbag/MCAP storage, tiered lifecycle by session age |
| Batch ETL | AWS Batch, EMR/Spark | MCAP to Parquet conversion, cross-session aggregation |
| Catalog | AWS Glue Data Catalog | Queryable schema over the Parquet lake, Athena access |
| Labeling | SageMaker Ground Truth | Human-in-the-loop annotation for success/failure, segmentation, grasp points |
| Training | EC2 GPU instances, SageMaker training jobs | Policy/VLA fine-tuning on labeled, cataloged data |
| Simulation (post-RoboMaker) | AWS Batch, containerized simulation | Validation and synthetic data generation, migrated off RoboMaker |

What actually goes wrong running this in production?

Three things, consistently. First, connectivity is worse than anyone's architecture diagram assumes — warehouse WiFi has dead zones, cellular coverage in a yard or a rural site drops out, and a robot that can't currently upload still has to keep operating and keep buffering, which means your edge storage and buffering logic has to handle hours, not seconds, of disconnection gracefully. Second, the labeling bottleneck I mentioned above isn't a footnote, it's usually the actual constraint on how fast you can iterate — teams that budget generously for GPU compute and treat labeling as an afterthought end up GPU-rich and label-poor, sitting on unlabeled sessions nobody has capacity to annotate. Third, there's a real, unresolved tension between "capture everything, because we don't know what future training run will need it" and "capture selectively, because bandwidth and storage cost money" — and I don't think there's a clean answer. The honest approach is to be deliberate about it: define your selective-capture rules explicitly (the FleetWise pattern, rebuilt yourself), review them periodically as your training needs evolve, and accept that you will occasionally regret discarding something. The alternative — capturing everything indefinitely — just moves the regret to the finance review instead.
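
For the connectivity problem, the piece worth sketching is what the edge buffer does when it fills up mid-outage. A minimal sketch, assuming clips carry a priority (event clips above routine downsampled data) and that evicting the least valuable, oldest data first is acceptable; the `Clip` and `UploadBuffer` names and the byte budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    name: str
    size_bytes: int
    priority: int      # higher means more valuable for training
    created_at: float  # seconds since some epoch


class UploadBuffer:
    """Bounded on-robot buffer that survives long disconnections."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.clips = []

    @property
    def used_bytes(self):
        return sum(c.size_bytes for c in self.clips)

    def add(self, clip):
        """Store a clip; evict lowest-priority, oldest clips until it fits.
        The new clip itself may be evicted if it is the least valuable."""
        self.clips.append(clip)
        evicted = []
        while self.clips and self.used_bytes > self.capacity_bytes:
            victim = min(self.clips, key=lambda c: (c.priority, c.created_at))
            self.clips.remove(victim)
            evicted.append(victim)
        return evicted

    def drain(self, upload):
        """When the link is back, upload most valuable first; stop on failure."""
        for clip in sorted(self.clips, key=lambda c: (-c.priority, c.created_at)):
            if not upload(clip):
                break  # link dropped again; keep the rest buffered
            self.clips.remove(clip)


buffer = UploadBuffer(capacity_bytes=100)
buffer.add(Clip("routine-0", 40, priority=0, created_at=1.0))
buffer.add(Clip("estop-1", 40, priority=1, created_at=2.0))
dropped = buffer.add(Clip("torque-2", 40, priority=1, created_at=3.0))
print([c.name for c in dropped], [c.name for c in buffer.clips])
```

Logging what gets evicted matters as much as the eviction itself: it's the only record you'll have of what the fleet decided not to keep.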

What to carry away

The robot-fleet data platform problem on AWS decomposes into pieces this blog already treats as familiar: edge filtering with Greengrass, a partitioned S3 lake, batch ETL with Batch or EMR into Glue-cataloged Parquet, human labeling through Ground Truth, and GPU training on EC2 or SageMaker, with OTA policy deployment back through Greengrass closing the loop. The two retirements worth internalizing are not really about AWS RoboMaker or IoT FleetWise specifically. RoboMaker's end of support on September 10, 2025 says: don't anchor your core pipeline to a narrow managed service when general-purpose primitives do the job and outlast it. FleetWise's closure to new customers on April 30, 2026 teaches the complementary lesson about the same category of decision: a good architectural pattern (rule-based selective collection, standardized signal schemas) can outlive the specific product that popularized it, and you should be willing to rebuild the pattern yourself rather than assume "the managed version" will always be there. Bandwidth cost, connectivity gaps, and the labeling bottleneck are the three things that will actually consume your first two quarters, so plan for them explicitly rather than discovering them in a cost review. If you're evaluating this same problem on GCP, I've laid out the genuinely different (not just relabeled) version of this architecture in the companion piece on Pub/Sub, Dataflow, and Vertex AI.