Building a Data Platform for Robot Fleets on GCP: Pub/Sub, Dataflow, and Vertex AI for Fleet Learning

Every cloud vendor will sell you the same four boxes on a slide — ingest, store, process, train — for a robot fleet data platform, and the boxes really are similar. What isn't similar is which box each cloud is actually good at, and I found that out the hard way porting a client's fleet-data pipeline from a Kinesis-and-Lambda pattern to GCP, expecting a mechanical translation. It wasn't one. GCP's genuine strengths here sit in a different place than AWS's: streaming ingestion that feels native rather than bolted-on, and an analytics layer — BigQuery — that turns "how many robots hit a torque fault last Tuesday" from a data-engineering ticket into a query someone runs while still on the call.

I covered the edge-capture fundamentals (ROS 2 bag files, MCAP, why you can't ship every sensor stream continuously, and the bandwidth economics that force selective capture) in the AWS version of this architecture; that part doesn't change by cloud, so I won't re-derive it here. This piece picks up from "data has left the robot" and follows the GCP-native path from there, plus the on-device options that differ from the Greengrass pattern.

What are the edge/on-device options on GCP?

GCP's edge story for robotics splits into two tiers depending on how much on-device inference you need. For lightweight, power-constrained perception tasks — object detection, basic anomaly flagging — Coral, Google's Edge TPU hardware line, gives you dedicated ML acceleration on a small form factor board that can sit directly on the robot alongside its main compute, running quantized TensorFlow Lite models without needing a live cloud connection. For heavier onboard inference — running an actual VLA policy or a larger perception stack — most fleets still lean on general-purpose edge compute like NVIDIA Jetson-class hardware, with GCP entering the picture as the thing that receives telemetry and pushes updates rather than running directly on the device. That's a real difference from the AWS pattern: Greengrass is explicitly a managed edge-runtime layer AWS wants running on the robot itself, arbitrating both inference and fleet management. GCP's answer is comparatively less prescriptive about what runs on-device and more focused on what happens once data reaches the cloud boundary — which is either a strength or a gap depending on how much you want a vendor opinion about your edge runtime.

Why is Pub/Sub the right ingestion backbone for fleet telemetry?

Pub/Sub is Google Cloud's fully managed, durable publish-subscribe messaging service, and it's the right ingestion layer here because a robot fleet is, structurally, a huge number of independent event producers that need to be decoupled from whatever's consuming their data downstream. Each robot (or its Greengrass-equivalent edge agent) publishes filtered sensor events, health telemetry, and mission-status updates to Pub/Sub topics; consumers — a Dataflow ETL job, a real-time monitoring dashboard, an alerting system — subscribe independently, at their own pace, without the robot needing to know or care who's listening or whether they're keeping up. This decoupling matters specifically for fleets because a robot's network connection is intermittent by nature — a robot going briefly offline in a warehouse dead zone shouldn't block or crash a downstream consumer, and a downstream consumer falling behind during a batch reprocessing job shouldn't cause the robot to drop telemetry it can't retry.

The detail that actually matters for correctness, not just throughput, is ordering keys. Pub/Sub supports per-key ordered delivery: with message ordering enabled on the subscription, and a robot's messages published with the same ordering key in the same region, using robot ID as the ordering key means events from a single robot arrive at a subscriber in the sequence Pub/Sub received them. You need that, because joint-state or mission-status events processed out of order can silently corrupt a session reconstruction downstream. You don't need global ordering across the whole fleet, which would be needless coordination overhead across thousands of independent robots. You need it per-robot, which is exactly what ordering keys give you without paying for a stronger guarantee you don't actually need.
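As a rough illustration of per-robot ordering keys, here is a minimal publisher sketch using the google-cloud-pubsub client. The `TelemetryEvent` shape, the field names, and the helper names are my own placeholders, not a schema this architecture prescribes. The client library is imported lazily so the encoding helper can be exercised without GCP credentials.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical event shape; field names are illustrative only."""
    robot_id: str
    seq: int
    kind: str
    payload: dict


def encode_event(event):
    """Return (message bytes, ordering key) for one event.

    The robot ID doubles as the ordering key, so ordering is per robot.
    """
    data = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return data, event.robot_id


def publish_events(project_id, topic_id, events):
    """Publish events with per-robot ordering keys (needs GCP credentials)."""
    # Imported here so encode_event stays usable without the client library.
    from google.cloud import pubsub_v1

    # Ordering keys are rejected unless the publisher opts in explicitly.
    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(
            enable_message_ordering=True
        )
    )
    topic_path = publisher.topic_path(project_id, topic_id)
    futures = []
    for event in events:
        data, key = encode_event(event)
        futures.append(publisher.publish(topic_path, data, ordering_key=key))
    # Block until every publish is acknowledged by the service.
    for future in futures:
        future.result()
```

The subscription consuming this topic also has to be created with message ordering enabled; the publisher-side flag alone is not enough.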

What does Dataflow actually do in this pipeline?

Dataflow is GCP's fully managed service for running Apache Beam pipelines, and it does both the streaming and batch ETL work in this architecture from a single programming model, which is a genuine ergonomic advantage over stitching together separate streaming and batch tools. On the streaming side, a Dataflow job subscribes to the Pub/Sub telemetry topics and does real-time work: computing fleet-health aggregates over sliding windows (average battery drain per hour, torque-fault rate per fleet per shift), flagging anomalies as they happen, and writing structured telemetry into BigQuery for immediate querying. On the batch side, a separate Dataflow job, which can share transforms with the streaming one under Beam's unified model, processes the accumulated raw sensor sessions out of GCS: converting MCAP sessions into training-ready Parquet and aligning timestamps across camera/lidar/proprioception streams. It's the same conversion job the AWS piece runs on Batch or EMR, just expressed as a Beam pipeline instead.
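To make the "aligning timestamps" step concrete, here is a small plain-Python sketch of nearest-timestamp matching between two streams, the sort of logic that could live inside a batch transform. The function names, the nanosecond units, and the tolerance-based policy are assumptions for illustration; a real pipeline might interpolate or use hardware sync instead.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts closest to t (sorted_ts non-empty)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1


def align_streams(reference_ts, other_ts, tolerance_ns):
    """For each reference timestamp, return the index of the closest sample
    in other_ts, or None if nothing falls within tolerance_ns.

    Both inputs are sorted lists of integer timestamps in nanoseconds.
    """
    matches = []
    for t in reference_ts:
        if not other_ts:
            matches.append(None)
            continue
        j = nearest_index(other_ts, t)
        matches.append(j if abs(other_ts[j] - t) <= tolerance_ns else None)
    return matches
```

With camera frames as the reference stream, a `None` entry marks a frame with no lidar sample close enough to pair with, which the conversion job can then drop or flag rather than silently mis-pair.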

Windowing is the Beam concept that does the heavy lifting for fleet-health metrics specifically: a sliding or session window over the Pub/Sub stream lets you compute "rolling fault rate over the last 15 minutes per robot" without maintaining that state yourself, and Beam's watermark handling deals with the reality that telemetry from a robot that was briefly offline arrives late relative to robots that stayed connected — a correctness problem that gets ugly quickly if you hand-roll it.
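A sketch of what the rolling per-robot fault rate could look like in the Beam Python SDK follows. The message fields (`robot_id`, `torque_fault`) and the `event_time_ms` attribute are hypothetical, and the final step just prints where a real job would write to BigQuery. One thing worth flagging: `ReadFromPubSub` uses the publish time as the element timestamp unless you pass `timestamp_attribute`, so telemetry buffered while a robot was offline needs its capture time carried as a message attribute for the windows to reflect event time.

```python
import json

WINDOW_SECONDS = 15 * 60  # rolling 15-minute window
PERIOD_SECONDS = 60       # a new window starts every minute


def to_fault_kv(message_bytes):
    """Map one telemetry message to (robot_id, 1.0 if it is a fault else 0.0)."""
    event = json.loads(message_bytes.decode("utf-8"))
    return event["robot_id"], 1.0 if event.get("torque_fault") else 0.0


def run(subscription, argv=None):
    """Build and run the streaming pipeline (needs apache-beam[gcp])."""
    # Imported here so to_fault_kv can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadTelemetry" >> beam.io.ReadFromPubSub(
                subscription=subscription,
                # Assumes publishers set this attribute to capture time.
                timestamp_attribute="event_time_ms",
            )
            | "ToFaultFlag" >> beam.Map(to_fault_kv)
            | "SlidingWindow" >> beam.WindowInto(
                window.SlidingWindows(WINDOW_SECONDS, PERIOD_SECONDS)
            )
            # Mean of 0/1 flags per robot per window is the fault rate.
            | "FaultRatePerRobot" >> beam.combiners.Mean.PerKey()
            | "Emit" >> beam.Map(print)
        )
```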

graph TD
    R["Robot fleet<br/>ROS 2 / MCAP + Coral/Jetson edge inference"] --> PS["Pub/Sub<br/>per-robot ordering keys"]
    PS --> DFSTREAM["Dataflow (streaming)<br/>windowed fleet-health aggregates"]
    PS --> GCS["GCS raw sensor lake<br/>partitioned by fleet/robot/date/session"]
    DFSTREAM --> BQ["BigQuery<br/>ad-hoc fleet-health analytics"]
    GCS --> DFBATCH["Dataflow (batch)<br/>MCAP to Parquet ETL"]
    DFBATCH --> VERTEX["Vertex AI<br/>policy / VLA training"]
    VERTEX --> PIPE["Vertex AI Pipelines<br/>retrain, validate in sim, redeploy"]
    PIPE -->|"updated policy"| R

Pub/Sub decouples the fleet's event stream from every downstream consumer at once — a streaming Dataflow job for fleet-health metrics into BigQuery, and the raw data landing in GCS for later batch conversion. Vertex AI Pipelines is what turns training into a repeatable, validated loop rather than a one-off job someone runs by hand.

Why is BigQuery the genuine differentiator here?

BigQuery is Google Cloud's serverless, columnar data warehouse, and the honest answer to "why does it matter for robot fleets" is that it collapses the distance between "we have telemetry" and "someone can ask a real question about it right now." A fleet operations lead asking "which robots in the west warehouse had elevated joint torque in the last six hours" shouldn't need a data engineer to write a job. That's a SQL query against a BigQuery table fed continuously by the streaming Dataflow pipeline, and it typically comes back in seconds, even against billions of rows, without anyone provisioning or tuning a warehouse. The AWS-equivalent pattern (Athena over Glue-cataloged Parquet in S3) is a reasonable analog here, but a meaningfully rougher one in practice. Athena-over-S3 is a strong pattern for the training-data lake, where the two clouds are closer to parity. For fleet-health and operational analytics, though, BigQuery's combination of native streaming ingestion (via Dataflow or direct streaming inserts) and interactive-speed SQL over huge, constantly growing telemetry tables is the smoother experience.
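For a sense of what that ops-lead question looks like in practice, here is a sketch using the google-cloud-bigquery client with query parameters. The table ID and the column names (`robot_id`, `site`, `joint_torque_nm`, `event_time`) are invented for illustration; your telemetry schema will differ.

```python
def elevated_torque_sql(table_id):
    """SQL for: which robots at a site exceeded a torque limit in the last 6 hours.

    table_id is a fully qualified `project.dataset.table`; table names cannot
    be bound as query parameters, so it is interpolated here.
    """
    return f"""
        SELECT robot_id,
               MAX(joint_torque_nm) AS peak_torque_nm,
               COUNT(*) AS samples
        FROM `{table_id}`
        WHERE site = @site
          AND joint_torque_nm > @torque_limit_nm
          AND event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
        GROUP BY robot_id
        ORDER BY peak_torque_nm DESC
    """


def robots_with_elevated_torque(table_id, site, torque_limit_nm):
    """Run the query and return (robot_id, peak_torque_nm) pairs (needs GCP credentials)."""
    # Imported here so the SQL builder can be tested without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("site", "STRING", site),
            bigquery.ScalarQueryParameter(
                "torque_limit_nm", "FLOAT64", torque_limit_nm
            ),
        ]
    )
    rows = client.query(elevated_torque_sql(table_id), job_config=job_config).result()
    return [(row["robot_id"], row["peak_torque_nm"]) for row in rows]
```

Binding `site` and the torque limit as parameters rather than formatting them into the string keeps the query safe to expose behind a dashboard filter.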

| Use case | Store |
|---|---|
| Fleet-health dashboard, ad-hoc ops queries | BigQuery, fed by streaming Dataflow |
| Raw sensor session storage | GCS, partitioned by fleet/robot/date/session |
| Training-ready converted data | Parquet in GCS, queryable via BigQuery external tables or direct Vertex AI dataset ingestion |
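One possible way to realize the fleet/robot/date/session partitioning is a deterministic object-path builder like the sketch below. The `raw/` prefix, the Hive-style `key=value` segments, and the `recording.mcap` file name are assumptions on my part; the only thing taken from the layout above is the partition order.

```python
from datetime import datetime, timezone


def session_object_path(fleet_id, robot_id, started_at, session_id):
    """Build the GCS object path for one raw session recording.

    started_at must be timezone-aware; the date partition is always in UTC
    so a session never lands in a different partition depending on where
    the uploader happens to run.
    """
    if started_at.tzinfo is None:
        raise ValueError("started_at must be timezone-aware")
    day = started_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (
        f"raw/fleet={fleet_id}/robot={robot_id}"
        f"/date={day}/session={session_id}/recording.mcap"
    )
```

Key=value segments are a convention that BigQuery external tables can interpret as Hive-partitioned columns, which is why I lean toward them here, but a bare `fleet/robot/date/session` path works just as well for the batch job.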

How does Vertex AI close the retrain-and-redeploy loop?

Vertex AI is Google Cloud's managed ML platform, covering GPU/TPU training, model registry, and serving, and for this architecture the piece that matters most is Vertex AI Pipelines — a managed orchestration layer (built on Kubeflow Pipelines) for defining the training workflow as a repeatable, versioned DAG rather than a script someone runs from their laptop. The loop that actually matters for a fleet that's continuously collecting new data looks like: new labeled sessions accumulate in the training dataset, a Vertex AI Pipeline triggers a retrain (scheduled or threshold-based, once enough new data has accumulated), the retrained policy gets validated against a simulation environment before touching real hardware — the same sim-validation discipline I covered in the sim-to-real piece, cloud-agnostic in principle but naturally expressed here as a pipeline stage — and only a policy that clears validation gets pushed out to the fleet. Vertex AI Pipelines is what turns that from a manual, easy-to-skip process into something that runs the same way every time, with lineage you can actually audit when someone asks which dataset a deployed policy was trained on.

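# Conceptual shape of a Vertex AI Pipeline retrain trigger (KFP v2 SDK).
# check_new_labeled_data, train_policy, validate_in_simulation, and
# deploy_to_fleet are placeholder components assumed to be defined elsewhere
# with @dsl.component; the output name "status" is likewise illustrative.
from kfp import dsl

@dsl.pipeline(name="fleet-policy-retrain")
def retrain_pipeline(min_new_sessions: int = 5000):
    # Gate 1: only retrain once enough new labeled sessions have accumulated.
    check = check_new_labeled_data(threshold=min_new_sessions)
    # The check step has two outputs, so each is read by name via .outputs;
    # .output is only valid on a step with exactly one output.
    with dsl.If(check.outputs["status"] == "ready"):
        train = train_policy(dataset=check.outputs["dataset_uri"])
        validate = validate_in_simulation(model=train.outputs["model"])
        # Gate 2: only a policy that clears sim validation reaches the fleet.
        with dsl.If(validate.output == "pass"):
            deploy_to_fleet(model=train.outputs["model"])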
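The validation gate is the step most worth pinning down, so here is a plain-Python sketch of the decision logic a sim-validation component might wrap. The `SimReport` fields, the thresholds, and the "no worse than the deployed baseline" rule are all assumptions for illustration, not criteria from a specific simulator or from Vertex AI itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimReport:
    """Hypothetical summary of one policy's simulation run."""
    episodes: int
    successes: int
    collisions: int


def _rate(count, episodes):
    """Fraction of episodes, treating an empty run as 0.0."""
    return count / episodes if episodes else 0.0


def validation_verdict(candidate, baseline, min_episodes=200, max_regression=0.01):
    """Return "pass" or "fail" for a retrained policy.

    The candidate passes only if it ran enough episodes, its success rate is
    within max_regression of the deployed baseline, and its collision rate is
    no higher than the baseline's.
    """
    if candidate.episodes < min_episodes:
        return "fail"
    cand_success = _rate(candidate.successes, candidate.episodes)
    base_success = _rate(baseline.successes, baseline.episodes)
    if cand_success < base_success - max_regression:
        return "fail"
    cand_collisions = _rate(candidate.collisions, candidate.episodes)
    base_collisions = _rate(baseline.collisions, baseline.episodes)
    if cand_collisions > base_collisions:
        return "fail"
    return "pass"
```

Returning the literal strings "pass" and "fail" keeps the function compatible with a string comparison in a pipeline condition, and comparing against the currently deployed policy rather than a fixed bar means the gate tightens as the fleet's policy improves.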

Where does Intrinsic fit in the GCP-adjacent ecosystem?

Intrinsic is a robotics software company that spun out of X, Alphabet's moonshot factory, as an independent Alphabet-owned company in 2021, building a software platform aimed at making industrial robots easier to program and reconfigure without bespoke integration work for every task. It's worth knowing about specifically as the GCP-adjacent player in this space — Alphabet-owned, not a Google Cloud product itself, but the natural reference point if a client asks "does Google have a robotics play beyond cloud infrastructure." I'm deliberately not going further than that here — treat Intrinsic as an ecosystem data point, not a component of the reference architecture above, since its product roadmap is a separate question from the data-platform pattern this piece is about.

Where I'd actually pick GCP over AWS for this workload, and where I wouldn't

If ad-hoc fleet-health analytics (the kind of question an ops lead asks live, not the kind a data scientist plans a notebook around) is a first-class requirement, BigQuery's ergonomics are the deciding factor for me, and Pub/Sub plus Dataflow feels like it was designed for exactly this streaming-decoupled-from-batch pattern rather than adapted to it. If your priority is a deeper bench of robotics-specific edge/device-management tooling and history, AWS's broader IoT and device-fleet management lineage (Greengrass, IoT Core, the whole IoT service family) gives you more prior art and more third-party integrations to lean on, even acknowledging that AWS's own robotics-specific service, RoboMaker, didn't survive. Neither cloud has a finished, robotics-native "this is the one true stack" story yet; both are general-purpose data and ML platforms with robotics bolted on at the edges, which is exactly why the reference architecture in both pieces looks like a data engineering pattern first and a robotics pattern second.

What to carry away

The GCP-native version of this architecture leans on Pub/Sub for durable, per-robot-ordered streaming ingestion, Dataflow's unified Beam model for both the real-time fleet-health path and the batch MCAP-to-Parquet conversion, GCS as the raw session lake, BigQuery for the ad-hoc analytics workload that's genuinely stronger here than its AWS/Athena equivalent, and Vertex AI Pipelines for a retrain-validate-redeploy loop with real lineage instead of a script someone runs by hand. The edge-capture fundamentals and the underlying bandwidth-versus-completeness tension are identical to the AWS architecture — physics and physical time don't change by cloud vendor. What changes is which layer feels native versus bolted-on, and for fleet-health analytics specifically, that's a real, opinionated reason to pick GCP rather than a coin flip between equivalent services.