Ray for ML: Distributed Compute, Ray Train/Serve/Tune, and Running It on EKS

The job was a hyperparameter sweep across forty model configurations, each needing a full GPU, and the team's instinct was to reach for the tool everyone already had running: Spark. It fit badly. Spark's execution model wants to parallelize a data transformation across partitions of a dataset — it has no clean way to say "run these forty independent, stateful, GPU-bound Python training loops and let me collect the results as they finish." What actually fit was Ray, and the gap between those two tools is the entire reason Ray exists: Spark parallelizes data, Ray parallelizes arbitrary Python — tasks, actors, GPUs, whatever heterogeneous, often stateful work your ML pipeline actually needs, not just a DataFrame transformation.

This is Ray as a working solutions architect actually uses it: the task/actor programming model and cluster architecture underneath it, what Ray Train, Tune, Serve, and Data each solve, when Ray genuinely beats Spark for ML workloads and when it doesn't, and running it in production on Amazon EKS with the KubeRay operator — where most of the real operational pain actually lives.

What is Ray, and why does it exist alongside Spark?

Ray is an open-source distributed compute engine built specifically for AI/ML workloads — a core distributed runtime plus a set of libraries (Train, Tune, Serve, Data) layered on top for the specific stages of an ML pipeline. It was designed from the ground up for reinforcement learning and model training workloads, which explains its shape: instead of Spark's data-parallel model (split a dataset into partitions, run the same transformation on each), Ray's primitive is arbitrary Python functions and classes running as distributed tasks and actors, scheduled across a cluster with no requirement that the work be homogeneous or stateless. That's a genuinely different bet, and it's why Ray tends to win specifically where Spark struggles: reinforcement learning, hyperparameter search, simulation, and deep learning training — workloads that are computation-heavy, often stateful, and don't decompose cleanly into "the same operation on every row."

How does a Ray cluster actually work — tasks, actors, and the object store?

Ray's programming model has exactly two primitives, and the distinction between them is the first thing to internalize. A task is a stateless remote function call — decorate a Python function with @ray.remote, call it, and Ray schedules it on some worker in the cluster, returns a future immediately, and lets you collect the result later. An actor is a stateful remote class — decorate a class instead, instantiate it remotely, and every method call on that actor handle runs against the same persistent process, with state that survives between calls. Tasks are the right fit for embarrassingly parallel, independent work (score these 10,000 images); actors are the right fit for anything that needs to hold state across calls (a model loaded into GPU memory that you want to call predict() against repeatedly, without reloading it every time).

Underneath both primitives, a Ray cluster runs a small set of architectural pieces worth naming. The Global Control Store (GCS), running on the head node, is the cluster's metadata backbone — actor locations, cluster state, system-level coordination. Each node runs a Raylet daemon that handles local scheduling and resource management for that node. And the distributed object store is what makes Ray's data-sharing model fast: objects are stored in shared memory local to the node that produced them, and reading an object from a task running on the same node is a zero-copy operation — no deserialization, no memory copy — which is a real, measurable performance property, not a marketing claim, and it's exactly why data-locality-aware scheduling matters so much in Ray's design.

graph TD
    subgraph HEAD["Head node"]
        GCS["GCS
(cluster metadata,
actor locations)"]
        RL1["Raylet"]
    end
    subgraph W1["Worker node"]
        RL2["Raylet"]
        OS1["Object store
(shared memory)"]
        T1["Tasks / Actors"]
    end
    subgraph W2["Worker node"]
        RL3["Raylet"]
        OS2["Object store
(shared memory)"]
        T2["Tasks / Actors"]
    end
    GCS --> RL1
    RL1 -.->|"schedule"| RL2
    RL1 -.->|"schedule"| RL3
    T1 -->|"zero-copy read
(same node)"| OS1
    T2 -->|"zero-copy read
(same node)"| OS2

Ray's cluster architecture: the GCS on the head node tracks cluster-wide metadata and actor locations, while each node's Raylet handles local scheduling and its own slice of the distributed object store. A task reading an object already sitting in its own node's object store pays no serialization cost — cross-node reads do, which is why co-locating a task with the data it needs is a real, first-order performance decision in Ray, not an afterthought.

What do Ray Train, Tune, Serve, and Data each actually solve?

Ray Core (tasks, actors, the object store) is the substrate — almost nobody builds an ML pipeline directly on it. The four AI libraries layered on top map onto the four stages of an ML workflow, and each solves a specific, named problem rather than being a generic "distributed X" wrapper.

Library	Solves	Typical use
Ray Data	Framework-agnostic, streaming data loading and transformation across training/tuning/prediction	Feeding batches to a distributed training job without materializing the whole dataset in memory first
Ray Train	Distributed training and fine-tuning, abstracting cluster setup for PyTorch/TensorFlow-style distributed training loops	Multi-GPU, multi-node training without hand-rolling the distributed coordination yourself
Ray Tune	Distributed hyperparameter search — scheduling, early stopping, checkpointing best results	Running dozens or hundreds of training configurations in parallel and keeping only what's worth keeping
Ray Serve	Scalable, programmable online model serving from any framework	Production inference APIs, including composing multiple models/business logic into one deployment graph

The reason these compose well together, rather than being four unrelated tools that happen to share a name, is that they all sit on the same Ray Core primitives and the same object store — a Ray Data pipeline can stream batches directly into a Ray Train job running on the same cluster with no serialization hop through an external system, and a Ray Tune sweep is, under the hood, just many parallel Ray Train runs coordinated by Tune's scheduler. This overlaps in scope with, but is architecturally distinct from, dedicated model-serving platforms like KServe, Seldon, and BentoML — Ray Serve's advantage is that it shares infrastructure and code with the rest of a Ray-based training pipeline rather than requiring a hand-off to a separate serving stack, at the cost of being a less specialized, Kubernetes-native serving tool than something purpose-built for that one job.

When does Ray actually beat Spark, and when does Spark still win?

The honest, practitioner-level answer is workload shape, not a blanket "Ray is newer so it's better." Spark's data-parallel execution model is genuinely well-suited to data-centric ETL and preprocessing at scale — its optimizer, shuffle engine, and mature SQL surface are things Ray doesn't try to replicate. Ray earns its place specifically on workloads Spark handles awkwardly: reinforcement learning, hyperparameter search, simulation, and deep learning training, plus anything that needs to orchestrate heterogeneous compute — CPUs for preprocessing, GPUs for training, all in one coordinated pipeline — because Ray's task/actor model doesn't force every stage into the same data-parallel shape a Spark job does. Ray is also the stronger fit for unstructured, multimodal data (video, images, text) precisely because that data doesn't decompose into clean row-wise transformations the way tabular ETL does.

In practice, the two aren't usually a choice between one or the other — the common production pattern is Spark for the data-intensive extraction and preprocessing stage, handing off to Ray for the computation-heavy training, tuning, or inference stage that follows. Treating this as an either/or decision is a common early mistake; treating it as a pipeline with a handoff point is usually the more honest architecture.

How do you actually run Ray in production on EKS?

The standard, supported way to run Ray on Kubernetes — including EKS — is the KubeRay operator, which manages Ray clusters as native Kubernetes custom resources rather than requiring you to hand-manage pods yourself. KubeRay exposes three CRDs, each for a different operational shape: RayCluster manages a long-lived Ray cluster's full lifecycle — creation, deletion, autoscaling, fault tolerance — for workloads you want to keep running and submit work to repeatedly. RayJob is the batch pattern: it creates a RayCluster, submits a job once the cluster is ready, and can be configured to tear the cluster down automatically when the job finishes — the right shape for a one-off training run or a scheduled batch job, not something you want sitting idle burning EC2 spend. RayService wraps a RayCluster plus a Ray Serve deployment graph and adds zero-downtime upgrades and high availability, which is what you actually want for a production inference endpoint rather than the raw RayCluster primitive.

# A minimal RayJob shape on EKS — the operator provisions the cluster,
# runs the entrypoint, and can tear the cluster down when it's done
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: training-sweep
spec:
  entrypoint: python train.py --config sweep.yaml
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.55.1
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 4
        minReplicas: 0
        maxReplicas: 8
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.55.1
                resources:
                  limits:
                    nvidia.com/gpu: 1

Autoscaling on EKS is a two-layer story worth understanding separately rather than assuming it's one mechanism. The Ray Autoscaler, enabled via enableInTreeAutoscaling: true on the KubeRay side, watches Ray-level resource demand (how many actors/tasks are waiting for CPU/GPU) and requests more Ray worker pods when demand exceeds current capacity. If the underlying Kubernetes cluster doesn't have the node capacity to schedule those new pods, the Kubernetes Cluster Autoscaler (or Karpenter) is the second layer that actually provisions new EC2 nodes to satisfy them. Both layers need sensible bounds — minReplicas/maxReplicas on the Ray side and a real node group max on the AWS side — because a misconfigured or absent max on either layer is exactly how a runaway hyperparameter sweep turns into an unbounded EC2 bill.

Node sizing on EKS follows a specific, non-obvious best practice: size each Ray pod to consume an entire Kubernetes node rather than running many small Ray pods per node. Ray's own scheduling and object-store locality assumptions work better with fewer, larger pods than with many small ones sharing a node — packing multiple Ray workers onto one node adds coordination overhead the "one pod per node" pattern avoids. For GPU workloads specifically, that means a dedicated GPU node group (EKS managed node group or Karpenter NodePool scoped to GPU instance types) sized so each node's GPU count matches what a single Ray worker pod requests, with CPU-only preprocessing or Ray Data stages routed to a separate, cheaper node group entirely — mixing GPU and CPU work on the same node group is a common way to pay GPU prices for work that never touches a GPU.

Spot instances are the obvious cost lever for Ray worker node groups, and they're also where I've seen the most painful production incidents — not because spot interruption is unhandled, but because object store state doesn't survive it gracefully by default. When a spot-backed worker node gets reclaimed, any objects that lived only in that node's local object store are gone, and any actor running on it dies with no automatic state recovery unless you've explicitly built checkpointing into the training loop. I've watched a multi-hour training run lose most of its progress to a single spot reclaim because checkpointing was "on the roadmap." Before running training workloads on spot node groups, confirm checkpointing is actually implemented and tested — not assumed — and keep the head node and any stateful coordination on stable, on-demand capacity even if the workers are spot; head node loss is a much worse failure mode than losing one worker.

What's the most common Ray production failure mode outside of spot interruption?

Object store pressure and out-of-memory kills. Ray's object store spills to disk automatically when it fills up, which is the correct behavior, but spilling that's too slow relative to the rate objects are being created — or an object store that's become fragmented — can still lead to OOM even with spilling enabled, because the fallback allocation path itself can fail under sustained pressure. The practical symptom is a worker or raylet process getting killed by the OS, which shows up as a task or actor failure with a confusing error rather than an obvious "out of memory" message. The fix is rarely "add more memory" as a first move — it's auditing for unreleased object references (objects a task is still holding onto long after it's done with them) and confirming spilling is actually configured with fast enough local storage to keep up, before reaching for bigger nodes as the default answer.

What to carry away

Ray exists because Spark's data-parallel model is a poor fit for computation-heavy, often stateful ML workloads — reinforcement learning, hyperparameter search, deep learning training, and heterogeneous CPU/GPU pipelines — and Ray's task/actor primitives, backed by a zero-copy local object store, are built specifically for that shape of work. Ray Data, Train, Tune, and Serve aren't four unrelated tools; they're the same object store and scheduling substrate applied to each stage of an ML pipeline, which is why they compose cleanly into one end-to-end workflow rather than requiring hand-offs between disconnected systems.

Running Ray in production on EKS means understanding KubeRay's three CRDs — RayCluster for long-lived clusters, RayJob for batch work that should tear itself down, RayService for production inference with zero-downtime upgrades — and treating autoscaling as the two-layer system it actually is (Ray Autoscaler requesting pods, Kubernetes/Karpenter provisioning the nodes underneath them), both bounded deliberately. Size Ray pods to fill whole nodes, isolate GPU and CPU work into separate node groups, and before putting training workers on spot instances for the cost savings, confirm checkpointing is real — not assumed — because object store state and unchecked actors do not survive a spot reclaim gracefully on their own.