Every quarter a team somewhere asks the same question: should we build the new ML platform on Ray or on Spark? The answer is rarely one or the other. The interesting answer is which workloads each one wins, and what it costs to operate the loser.
This is a head-to-head built from production deployments across vision training, LLM serving, and petabyte ETL. No marketing.
The one-sentence summary
Ray is a fine-grained Python-first task and actor system designed for ML. Spark is a coarse-grained data-parallel JVM engine designed for analytics. They overlap in the middle, on batch ML and feature engineering, and that is where most of the confusion lives.
Architectural differences that matter
| Dimension | Ray 2.x | Spark 3.5 |
|---|---|---|
| Core language | C++ with Python bindings | Scala on the JVM |
| Scheduling unit | Microsecond task or actor call | Stage of partitioned RDD |
| State model | Stateful actors are first-class | Stateless RDD or DataFrame |
| GPU scheduling | Native fractional GPU | Coarse via barrier execution |
| Shuffle | Object store backed | Disk and network shuffle |
| Fault tolerance | Lineage plus actor reconstruction | Lineage on RDD only |
| Typical cluster | 8 to 1,000 nodes | 50 to 10,000 nodes |
| API surface | Python | SQL, Scala, Python, R |
The big behavioral consequence is task overhead. Ray dispatches a task in roughly 200 microseconds. Spark stage launch is 1 to 3 seconds. For a workload made of millions of tiny tasks — graph algorithms, RL rollouts, hyperparameter trials — Ray finishes while Spark is still planning.
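To make the gap concrete, here is a back-of-the-envelope model in pure Python using the dispatch figures quoted above. The 10,000-trial sweep size and the one-stage-per-trial Spark pattern are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope scheduling overhead for a 10,000-trial
# hyperparameter sweep, using the dispatch figures quoted above.
RAY_DISPATCH_S = 200e-6        # ~200 microseconds per Ray task
SPARK_STAGE_LAUNCH_S = 2.0     # midpoint of the 1-3 s stage launch range

n_trials = 10_000

# Ray: each trial is one task, paying only per-task dispatch.
ray_overhead_s = n_trials * RAY_DISPATCH_S

# Spark: if each dynamically generated trial is submitted as its own
# stage (an assumed pattern for heterogeneous trials, not a
# measurement), launch cost dominates.
spark_overhead_s = n_trials * SPARK_STAGE_LAUNCH_S

print(f"Ray:   {ray_overhead_s:.1f} s of scheduling overhead")
print(f"Spark: {spark_overhead_s:.0f} s (~{spark_overhead_s / 3600:.1f} h)")
```

Two seconds of overhead versus several hours: this is why the "millions of tiny tasks" shape is the clearest dividing line between the two systems.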
Throughput on representative ML workloads
Relative throughput on a 16-node cluster of A100 80GB nodes.
| Workload | Ray | Spark |
|---|---|---|
| Distributed PyTorch (BERT) | 30 | 6 |
| RL rollouts (PPO, 256 envs) | 29 | 2 |
| Hyperparameter sweep (200 trials) | 29 | 7 |
| ETL, 4 TB Parquet | 18 | 30 |
| SQL aggregations, 8 TB | 21 | 30 |
| LLM batch inference | 30 | 12 |
Scores are normalized to 30, where 30 means the system ran at full cluster utilization on the workload it handles best. Sourced from Anyscale's 2025 ML benchmark and the Databricks Ray-on-Spark blog; see Anyscale's Ray vs Spark benchmark for the underlying methodology.
When Spark still wins
Spark is the right choice when:
- The workload is SQL-heavy. Catalyst is a real query planner; Ray Data is not.
- You are integrating with Delta Lake, Iceberg, or Hudi tables. Spark is the reference engine for all three.
- You need predicate pushdown into Parquet at petabyte scale.
- Your team already runs Databricks or EMR and the operational sunk cost is enormous.
A 2024 survey by Databricks found that 78 percent of Spark workloads are still SQL or DataFrame ETL, not ML. That is the part Spark should keep.
When Ray wins by a landslide
Ray dominates when:
- The unit of work is a Python function or a stateful object.
- You need fractional GPUs. Ray supports num_gpus=0.25; Spark does not.
- You are doing reinforcement learning, hyperparameter search, or distributed deep learning.
- You want to mix training and serving on one cluster.
- Your engineers write Python and resent JVM tuning.
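The fractional-GPU point is easy to quantify. With num_gpus=0.25 (declared via Ray's @ray.remote decorator), four actors share one physical GPU. A quick packing calculation in pure Python; the cluster shape here is an illustrative assumption:

```python
# Fractional-GPU packing with num_gpus=0.25, as declared in
# @ray.remote(num_gpus=0.25). Cluster shape is an assumption.
nodes = 16
gpus_per_node = 8
gpus_per_actor = 0.25

total_gpus = nodes * gpus_per_node          # 128 physical GPUs
actors = int(total_gpus / gpus_per_actor)   # fractional packing
coarse_slots = total_gpus                   # one task per GPU, Spark-style

print(f"{actors} quarter-GPU actors vs {coarse_slots} coarse GPU slots")
```

For memory-light workloads like batch inference on small models, that 4x slot multiplier translates directly into utilization.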
For a deeper look at when Ray is the right tool, see our pillar on Ray and alternatives and the Ray Tune hyperparameter guide.
The boring middle: feature engineering
The honest middle ground is batch feature engineering on tabular data. Both systems handle it competently. Picking is mostly a culture question.
| Question | Pick Ray | Pick Spark |
|---|---|---|
| Most engineers comfortable in | Python | Scala or SQL |
| Existing data lake | Iceberg with Python access | Delta Lake on Databricks |
| Downstream training framework | PyTorch or JAX | Spark MLlib |
| Cluster lifecycle | Ephemeral per job | Long-lived shared |
| Streaming requirements | Async, tolerant | Exactly-once, structured streaming |
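The table above can be collapsed into a crude decision function. A sketch; the question names and equal weighting are my own framing, not a formal rubric:

```python
def pick_engine(answers: dict) -> str:
    """Tally the feature-engineering decision table.

    `answers` maps each question to 'ray' or 'spark', e.g.
    {'language': 'ray', 'data_lake': 'spark', ...}.
    """
    ray_votes = sum(1 for v in answers.values() if v == "ray")
    spark_votes = len(answers) - ray_votes
    if ray_votes == spark_votes:
        return "either"  # the honest middle ground
    return "ray" if ray_votes > spark_votes else "spark"

# Example: Python shop on Delta Lake, training with PyTorch.
print(pick_engine({
    "language": "ray",      # most engineers comfortable in Python
    "data_lake": "spark",   # Delta Lake on Databricks
    "training": "ray",      # PyTorch downstream
    "lifecycle": "ray",     # ephemeral per-job clusters
    "streaming": "spark",   # exactly-once requirements
}))
```

A tie is a genuine answer here: when the votes split evenly, either engine will do the job and the deciding factor is whichever one your team already operates.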
Operational cost in the wild
Self-managed on AWS, May 2026 spot pricing, 16-node cluster running 24/7.
| Component | Ray plus KubeRay | Databricks Spark |
|---|---|---|
| Compute (spot) | $58,000 / yr | $58,000 / yr |
| Software license | 0 | $40,000 / yr |
| Operator headcount | 0.4 FTE | 0.6 FTE |
| Average cluster utilization | 64 percent | 41 percent |
| Net cost per useful core-hour | $0.12 | $0.27 |
Ray's higher utilization comes from its autoscaler-v2 plus actor lifetime control, which release capacity as soon as actors exit. Spark utilization suffers because driver and executor sizing assumes coarse stages.
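The net-cost row is simple arithmetic once you price the people. A sketch with assumed inputs: the $200k loaded cost per operator FTE and 64-vCPU nodes are my assumptions, so the absolute figures differ from the table (whose internal assumptions are not published), but the roughly 2x gap is reproduced:

```python
def cost_per_useful_core_hour(compute, license_fee, fte, utilization,
                              nodes=16, cores_per_node=64,
                              fte_cost=200_000):
    """Annual dollars per core-hour actually doing work.

    fte_cost ($200k loaded per operator) and cores_per_node (64-vCPU
    instances) are illustrative assumptions, not from the article.
    """
    total_cost = compute + license_fee + fte * fte_cost
    core_hours = nodes * cores_per_node * 24 * 365
    useful = core_hours * utilization
    return total_cost / useful

ray_cost = cost_per_useful_core_hour(58_000, 0, 0.4, 0.64)
spark_cost = cost_per_useful_core_hour(58_000, 40_000, 0.6, 0.41)
print(f"Ray: ${ray_cost:.3f}  Spark: ${spark_cost:.3f}")
```

Note that the utilization denominator does most of the work: even with identical compute bills, paying for idle cores is what separates the two columns.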
Hybrid: Ray on Spark, Spark on Ray
Both systems offer interop, and both interop modes work well enough to ship.
- Ray on Spark: Databricks Runtime 15+ ships Ray as a first-class library. Run a Spark DataFrame ETL, then ray.init() in the same notebook for distributed training on the result. This is the path of least resistance for Databricks shops.
- Spark on Ray: RayDP runs PySpark on top of Ray. Useful for teams who want a single Ray-native control plane and need Spark for an existing SQL workload.
- Dask on Ray: Drop-in scheduler swap. See the Ray Data documentation for the recommended pattern.
For most teams, the right answer is Spark for what it is good at, Ray for what it is good at, and a shared object store (S3 or GCS) between them.
Migration story: a vision team moving off Spark
A computer-vision team we worked with had a 2018-vintage Spark MLlib pipeline doing image preprocessing plus a Horovod training step driven by Spark barrier mode. The numbers before migration:
- Training run: 11 hours on 8 V100s
- Cluster utilization: 38 percent
- 60 percent of wall-clock spent in driver-executor coordination
After migrating preprocessing to Ray Data and training to Ray Train:
- Training run: 4.2 hours on the same 8 V100s
- Cluster utilization: 71 percent
- Engineering team dropped 1,400 lines of Scala wrapper code
The Spark code that survived is the upstream ETL, which still runs nightly to produce the Parquet that Ray Data reads.
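The before-and-after numbers imply some easy follow-on arithmetic. Pure Python; the $3/hr V100 price is an illustrative assumption, not from the engagement:

```python
# Wall-clock speedup and GPU-hours saved per training run,
# from the migration numbers above.
before_h, after_h = 11.0, 4.2
gpus = 8

speedup = before_h / after_h                   # ~2.6x
gpu_hours_saved = gpus * (before_h - after_h)  # 54.4 GPU-hours per run

# At an assumed $3/hr per V100 (illustrative on-demand price):
dollars_saved_per_run = gpu_hours_saved * 3.0

print(f"{speedup:.1f}x speedup, {gpu_hours_saved:.1f} GPU-hours "
      f"(~${dollars_saved_per_run:.0f}) saved per run")
```

Multiply by runs per week and the migration pays for its engineering time quickly, which is the argument that usually lands with finance.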
A note on managed inference
Distributed compute frameworks decide where training happens. Production inference is a different problem. Many teams run training on Ray and inference on a mix of self-hosted and managed APIs, with a gateway like Swfte Connect routing requests by cost and quality. That keeps the Ray cluster focused on training rather than handling unpredictable production load.
For inference patterns specifically see our batch inference cost optimization guide.
What to do this quarter
- List every Spark job currently in your platform. Tag each as SQL ETL, batch ML, streaming, or Python compute. The Python compute and batch ML rows are Ray candidates.
- Pick one Python-heavy Spark job and rewrite it on Ray. Measure wall-clock and cost. Most teams see 2x to 3x speedups.
- Keep Spark for SQL ETL. Do not rewrite working ETL.
- If you are a Databricks shop, enable Ray on Databricks Runtime and run training there. No new cluster needed.
- Document the cluster boundary: which workloads go where, and how data moves between them. Most production failures come from blurred boundaries, not from the engines themselves.
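Step 1 of the checklist is tractable as a small script. A sketch, assuming you can export job metadata as dicts; the field names here are hypothetical, not a Spark or Databricks API:

```python
def tag_job(job: dict) -> str:
    """Classify a Spark job for the Ray-candidate audit.

    `job` is a hypothetical metadata dict with 'name', 'uses_sql',
    'is_streaming', and 'uses_python_udfs' fields.
    """
    if job.get("is_streaming"):
        return "streaming"
    if job.get("uses_python_udfs"):
        return "python_compute"   # Ray candidate
    if job.get("uses_sql"):
        return "sql_etl"          # keep on Spark
    return "batch_ml"             # Ray candidate

jobs = [
    {"name": "nightly_orders", "uses_sql": True},
    {"name": "image_preproc", "uses_python_udfs": True},
    {"name": "clickstream", "is_streaming": True},
    {"name": "churn_model"},
]
ray_candidates = [j["name"] for j in jobs
                  if tag_job(j) in ("python_compute", "batch_ml")]
print(ray_candidates)  # ['image_preproc', 'churn_model']
```

The output is your shortlist for step 2: pick the smallest job on it and measure before committing to a broader migration.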