Every quarter a team somewhere asks the same question: should we build the new ML platform on Ray or on Spark? The answer is rarely one or the other. The interesting answer is which workloads each one wins, and what it costs to operate the loser.
This is a head-to-head built from production deployments across vision training, LLM serving, and petabyte ETL. No marketing.
The one-sentence summary
Ray is a fine-grained Python-first task and actor system designed for ML. Spark is a coarse-grained data-parallel JVM engine designed for analytics. They overlap in the middle, on batch ML and feature engineering, and that is where most of the confusion lives.
Architectural differences that matter
| Dimension | Ray 2.x | Spark 3.5 |
|---|---|---|
| Core language | C++ with Python bindings | Scala on the JVM |
| Scheduling unit | Microsecond task or actor call | Stage of partitioned RDD |
| State model | Stateful actors are first-class | Stateless RDD or DataFrame |
| GPU scheduling | Native fractional GPU | Coarse via barrier execution |
| Shuffle | Object store backed | Disk and network shuffle |
| Fault tolerance | Lineage plus actor reconstruction | Lineage on RDD only |
| Typical cluster | 8 to 1,000 nodes | 50 to 10,000 nodes |
| API surface | Python | SQL, Scala, Python, R |
The big behavioral consequence is task overhead. Ray dispatches a task in roughly 200 microseconds. Spark stage launch is 1 to 3 seconds. For a workload made of millions of tiny tasks — graph algorithms, RL rollouts, hyperparameter trials — Ray finishes while Spark is still planning.
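To make the gap concrete, here is a back-of-the-envelope model in pure Python using the dispatch figures quoted above. The 10,000-trial sweep size and the one-stage-per-trial Spark pattern are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope scheduling overhead for a 10,000-trial
# hyperparameter sweep, using the dispatch figures quoted above.
RAY_DISPATCH_S = 200e-6        # ~200 microseconds per Ray task
SPARK_STAGE_LAUNCH_S = 2.0     # midpoint of the 1-3 s stage launch range

n_trials = 10_000

# Ray: each trial is one task, paying only per-task dispatch.
ray_overhead_s = n_trials * RAY_DISPATCH_S

# Spark: if each dynamically generated trial is submitted as its own
# stage (an assumed pattern for heterogeneous trials, not a
# measurement), launch cost dominates.
spark_overhead_s = n_trials * SPARK_STAGE_LAUNCH_S

print(f"Ray:   {ray_overhead_s:.1f} s of scheduling overhead")
print(f"Spark: {spark_overhead_s:.0f} s (~{spark_overhead_s / 3600:.1f} h)")
```

Two seconds of overhead versus several hours: this is why the "millions of tiny tasks" shape is the clearest dividing line between the two systems.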
Throughput on representative ML workloads
Relative throughput on a 16-node cluster of A100 80GB nodes.
| Workload | Ray | Spark |
|---|---|---|
| Distributed PyTorch (BERT) | 30 | 6 |
| RL rollouts (PPO, 256 envs) | 29 | 2 |
| Hyperparameter sweep (200 trials) | 29 | 7 |
| ETL, 4 TB Parquet | 18 | 30 |
| SQL aggregations, 8 TB | 21 | 30 |
| LLM batch inference | 30 | 12 |
Scores are normalized to 30, where 30 means the system ran at full cluster utilization on the workload it handles best. Sourced from Anyscale's 2025 ML benchmark and the Databricks Ray-on-Spark blog; see Anyscale's Ray vs Spark benchmark for the underlying methodology.
When Spark still wins
Spark is the right choice when:
- The workload is SQL-heavy. Catalyst is a real query planner; Ray Data is not.
- You are integrating with Delta Lake, Iceberg, or Hudi tables. Spark is the reference engine for all three.
- You need predicate pushdown into Parquet at petabyte scale.
- Your team already runs Databricks or EMR and the operational sunk cost is enormous.
A 2024 survey by Databricks found that 78 percent of Spark workloads are still SQL or DataFrame ETL, not ML. That is the part Spark should keep.
When Ray wins by a landslide
Ray dominates when:
- The unit of work is a Python function or a stateful object.
- You need fractional GPUs. Ray supports num_gpus=0.25; Spark does not.
- You are doing reinforcement learning, hyperparameter search, or distributed deep learning.
- You want to mix training and serving on one cluster.
- Your engineers write Python and resent JVM tuning.
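The fractional-GPU point is easy to quantify. With num_gpus=0.25 (declared via Ray's @ray.remote decorator), four actors share one physical GPU. A quick packing calculation in pure Python; the cluster shape here is an illustrative assumption:

```python
# Fractional-GPU packing with num_gpus=0.25, as declared in
# @ray.remote(num_gpus=0.25). Cluster shape is an assumption.
nodes = 16
gpus_per_node = 8
gpus_per_actor = 0.25

total_gpus = nodes * gpus_per_node          # 128 physical GPUs
actors = int(total_gpus / gpus_per_actor)   # fractional packing
coarse_slots = total_gpus                   # one task per GPU, Spark-style

print(f"{actors} quarter-GPU actors vs {coarse_slots} coarse GPU slots")
```

For memory-light workloads like batch inference on small models, that 4x slot multiplier translates directly into utilization.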
For a deeper look at when Ray is the right tool, see our pillar on Ray and alternatives and the Ray Tune hyperparameter guide.
The boring middle: feature engineering
The honest middle ground is batch feature engineering on tabular data. Both systems handle it competently. Picking is mostly a culture question.
| Question | Pick Ray | Pick Spark |
|---|---|---|
| Most engineers comfortable in | Python | Scala or SQL |
| Existing data lake | Iceberg with Python access | Delta Lake on Databricks |
| Downstream training framework | PyTorch or JAX | Spark MLlib |
| Cluster lifecycle | Ephemeral per job | Long-lived shared |
| Streaming requirements | Async, tolerant | Exactly-once, structured streaming |
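The table above can be collapsed into a crude decision function. A sketch; the question names and equal weighting are my own framing, not a formal rubric:

```python
def pick_engine(answers: dict) -> str:
    """Tally the feature-engineering decision table.

    `answers` maps each question to 'ray' or 'spark', e.g.
    {'language': 'ray', 'data_lake': 'spark', ...}.
    """
    ray_votes = sum(1 for v in answers.values() if v == "ray")
    spark_votes = len(answers) - ray_votes
    if ray_votes == spark_votes:
        return "either"  # the honest middle ground
    return "ray" if ray_votes > spark_votes else "spark"

# Example: Python shop on Delta Lake, training with PyTorch.
print(pick_engine({
    "language": "ray",      # most engineers comfortable in Python
    "data_lake": "spark",   # Delta Lake on Databricks
    "training": "ray",      # PyTorch downstream
    "lifecycle": "ray",     # ephemeral per-job clusters
    "streaming": "spark",   # exactly-once requirements
}))
```

A tie is a genuine answer here: when the votes split evenly, either engine will do the job and the deciding factor is whichever one your team already operates.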
Operational cost in the wild
Self-managed on AWS, May 2026 spot pricing, 16-node cluster running 24/7.
| Component | Ray plus KubeRay | Databricks Spark |
|---|---|---|
| Compute (spot) | $58,000 / yr | $58,000 / yr |
| Software license | 0 | $40,000 / yr |
| Operator headcount | 0.4 FTE | 0.6 FTE |
| Average cluster utilization | 64 percent | 41 percent |
| Net cost per useful core-hour | $0.12 | $0.27 |
Ray's higher utilization comes from its autoscaler-v2 plus actor lifetime control, which release capacity as soon as actors exit. Spark utilization suffers because driver and executor sizing assumes coarse stages.
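The net-cost row is simple arithmetic once you price the people. A sketch with assumed inputs: the $200k loaded cost per operator FTE and 64-vCPU nodes are my assumptions, so the absolute figures differ from the table (whose internal assumptions are not published), but the roughly 2x gap is reproduced:

```python
def cost_per_useful_core_hour(compute, license_fee, fte, utilization,
                              nodes=16, cores_per_node=64,
                              fte_cost=200_000):
    """Annual dollars per core-hour actually doing work.

    fte_cost ($200k loaded per operator) and cores_per_node (64-vCPU
    instances) are illustrative assumptions, not from the article.
    """
    total_cost = compute + license_fee + fte * fte_cost
    core_hours = nodes * cores_per_node * 24 * 365
    useful = core_hours * utilization
    return total_cost / useful

ray_cost = cost_per_useful_core_hour(58_000, 0, 0.4, 0.64)
spark_cost = cost_per_useful_core_hour(58_000, 40_000, 0.6, 0.41)
print(f"Ray: ${ray_cost:.3f}  Spark: ${spark_cost:.3f}")
```

Note that the utilization denominator does most of the work: even with identical compute bills, paying for idle cores is what separates the two columns.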
Hybrid: Ray on Spark, Spark on Ray
Both systems offer interop, and both interop modes work well enough to ship.
- Ray on Spark: Databricks Runtime 15+ ships Ray as a first-class library. Run a Spark DataFrame ETL, then ray.init() in the same notebook for distributed training on the result. This is the path of least resistance for Databricks shops.
- Spark on Ray: RayDP runs PySpark on top of Ray. Useful for teams who want a single Ray-native control plane and need Spark for an existing SQL workload.
- Dask on Ray: Drop-in scheduler swap. See the Ray Data documentation for the recommended pattern.
For most teams, the right answer is Spark for what it is good at, Ray for what it is good at, and a shared object store (S3 or GCS) between them.
Migration story: a vision team moving off Spark
A computer-vision team we worked with had a 2018-vintage Spark MLlib pipeline doing image preprocessing plus a Horovod training step driven by Spark barrier mode. The numbers before migration:
- Training run: 11 hours on 8 V100s
- Cluster utilization: 38 percent
- 60 percent of wall-clock spent in driver-executor coordination
After migrating preprocessing to Ray Data and training to Ray Train:
- Training run: 4.2 hours on the same 8 V100s
- Cluster utilization: 71 percent
- Engineering team dropped 1,400 lines of Scala wrapper code
The Spark code that survived is the upstream ETL, which still runs nightly to produce the Parquet that Ray Data reads.
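The before-and-after numbers imply some easy follow-on arithmetic. Pure Python; the $3/hr V100 price is an illustrative assumption, not from the engagement:

```python
# Wall-clock speedup and GPU-hours saved per training run,
# from the migration numbers above.
before_h, after_h = 11.0, 4.2
gpus = 8

speedup = before_h / after_h                   # ~2.6x
gpu_hours_saved = gpus * (before_h - after_h)  # 54.4 GPU-hours per run

# At an assumed $3/hr per V100 (illustrative on-demand price):
dollars_saved_per_run = gpu_hours_saved * 3.0

print(f"{speedup:.1f}x speedup, {gpu_hours_saved:.1f} GPU-hours "
      f"(~${dollars_saved_per_run:.0f}) saved per run")
```

Multiply by runs per week and the migration pays for its engineering time quickly, which is the argument that usually lands with finance.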
A note on managed inference
Distributed compute frameworks decide where training happens. Production inference is a different problem. Many teams run training on Ray and inference on a mix of self-hosted and managed APIs, with a gateway like Swfte Connect routing requests by cost and quality. That keeps the Ray cluster focused on training rather than handling unpredictable production load.
For inference patterns specifically see our batch inference cost optimization guide.
What to do this quarter
- List every Spark job currently in your platform. Tag each as SQL ETL, batch ML, streaming, or Python compute. The Python compute and batch ML rows are Ray candidates.
- Pick one Python-heavy Spark job and rewrite it on Ray. Measure wall-clock and cost. Most teams see 2x to 3x speedups.
- Keep Spark for SQL ETL. Do not rewrite working ETL.
- If you are a Databricks shop, enable Ray on Databricks Runtime and run training there. No new cluster needed.
- Document the cluster boundary: which workloads go where, and how data moves between them. Most production failures come from blurred boundaries, not from the engines themselves.
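Step 1 of the checklist is tractable as a small script. A sketch, assuming you can export job metadata as dicts; the field names here are hypothetical, not a Spark or Databricks API:

```python
def tag_job(job: dict) -> str:
    """Classify a Spark job for the Ray-candidate audit.

    `job` is a hypothetical metadata dict with 'name', 'uses_sql',
    'is_streaming', and 'uses_python_udfs' fields.
    """
    if job.get("is_streaming"):
        return "streaming"
    if job.get("uses_python_udfs"):
        return "python_compute"   # Ray candidate
    if job.get("uses_sql"):
        return "sql_etl"          # keep on Spark
    return "batch_ml"             # Ray candidate

jobs = [
    {"name": "nightly_orders", "uses_sql": True},
    {"name": "image_preproc", "uses_python_udfs": True},
    {"name": "clickstream", "is_streaming": True},
    {"name": "churn_model"},
]
ray_candidates = [j["name"] for j in jobs
                  if tag_job(j) in ("python_compute", "batch_ml")]
print(ray_candidates)  # ['image_preproc', 'churn_model']
```

The output is your shortlist for step 2: pick the smallest job on it and measure before committing to a broader migration.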