
Every quarter a team somewhere asks the same question: should we build the new ML platform on Ray or on Spark? The answer is rarely one or the other. The interesting answer is which workloads each one wins, and what it costs to operate the loser.

This is a head-to-head built from production deployments across vision training, LLM serving, and petabyte ETL. No marketing.

The one-sentence summary

Ray is a fine-grained Python-first task and actor system designed for ML. Spark is a coarse-grained data-parallel JVM engine designed for analytics. They overlap in the middle, on batch ML and feature engineering, and that is where most of the confusion lives.

Architectural differences that matter

Dimension          Ray 2.x                            Spark 3.5
Core language      C++ with Python bindings           Scala on the JVM
Scheduling unit    Microsecond task or actor call     Stage of partitioned RDD
State model        Stateful actors are first-class    Stateless RDD or DataFrame
GPU scheduling     Native fractional GPU              Coarse via barrier execution
Shuffle            Object store backed                Disk and network shuffle
Fault tolerance    Lineage plus actor reconstruction  Lineage on RDD only
Typical cluster    8 to 1,000 nodes                   50 to 10,000 nodes
API surface        Python                             SQL, Scala, Python, R

The big behavioral consequence is task overhead. Ray dispatches a task in roughly 200 microseconds. Spark stage launch is 1 to 3 seconds. For a workload made of millions of tiny tasks — graph algorithms, RL rollouts, hyperparameter trials — Ray finishes while Spark is still planning.
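
To make the fine-grained model concrete, here is a minimal Ray sketch; the task body is a stand-in for a real rollout or trial:

```python
import ray

ray.init()

# Dispatching a Ray task costs on the order of hundreds of microseconds,
# so fanning work out as many small tasks is idiomatic rather than wasteful.
@ray.remote
def rollout(seed: int) -> float:
    return float(seed % 7)  # placeholder for one short simulation or trial

# 10,000 fine-grained tasks dispatched and gathered in a few seconds.
results = ray.get([rollout.remote(i) for i in range(10_000)])
print(sum(results))
```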

Throughput on representative ML workloads

Relative throughput on a cluster of 16 nodes, each with A100 80GB GPUs.

Distributed PyTorch (BERT)   Ray  |##############################| 30
                             Spark|######                        |  6
RL rollouts (PPO 256 envs)   Ray  |############################# | 29
                             Spark|##                            |  2
Hyperparameter sweep (200)   Ray  |############################# | 29
                             Spark|#######                       |  7
ETL Parquet 4TB              Ray  |##################            | 18
                             Spark|##############################| 30
SQL aggregations 8TB         Ray  |#####################         | 21
                             Spark|##############################| 30
LLM batch inference          Ray  |##############################| 30
                             Spark|############                  | 12

Numbers are normalized: 30 means the system ran at full cluster utilization on the workload it was best at. Sourced from Anyscale's 2025 ML benchmark and the Databricks Ray-on-Spark blog; the Anyscale benchmark writeup covers the underlying methodology.

When Spark still wins

Spark is the right choice when:

  • The workload is SQL-heavy. Catalyst is a real query planner; Ray Data is not.
  • You are integrating with a Delta Lake, Iceberg, or Hudi table. Spark is the reference engine.
  • You need predicate pushdown into Parquet at petabyte scale; see the sketch after this list.
  • Your team already runs Databricks or EMR and the operational sunk cost is enormous.
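
A minimal PySpark sketch of the pushdown point; the bucket path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

# Catalyst pushes these filters into the Parquet scan, so only the
# matching row groups are read off object storage.
df = (
    spark.read.parquet("s3://bucket/events/")  # hypothetical path
    .filter("event_date >= '2026-01-01' AND country = 'DE'")
    .groupBy("country")
    .count()
)
df.explain()  # look for PushedFilters in the physical plan
df.show()
```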

A 2024 survey by Databricks found that 78 percent of Spark workloads are still SQL or DataFrame ETL, not ML. That is the part Spark should keep.

When Ray wins by a landslide

Ray dominates when:

  • The unit of work is a Python function or a stateful object.
  • You need fractional GPUs. Ray supports num_gpus=0.25; Spark does not. See the sketch after this list.
  • You are doing reinforcement learning, hyperparameter search, or distributed deep learning.
  • You want to mix training and serving on one cluster.
  • Your engineers write Python and resent JVM tuning.
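
A minimal sketch of the fractional-GPU point; the trial body is a placeholder, and the tasks will stay pending unless the cluster actually has a GPU to subdivide:

```python
import ray

ray.init()

# Four of these tasks can share one physical GPU. Ray enforces the
# quarter-GPU reservation; the body is only a stand-in for real training.
@ray.remote(num_gpus=0.25)
def run_trial(lr: float) -> float:
    return lr * 0.9  # placeholder for a training step and validation score

scores = ray.get([run_trial.remote(lr) for lr in (1e-4, 3e-4, 1e-3, 3e-3)])
print(scores)
```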

For a deeper look at when Ray is the right tool, see our pillar on Ray and alternatives and the Ray Tune hyperparameter guide.

The boring middle: feature engineering

The honest middle ground is batch feature engineering on tabular data. Both systems handle it competently. Picking is mostly a culture question.

Question                        Pick Ray                      Pick Spark
Most engineers comfortable in   Python                        Scala or SQL
Existing data lake              Iceberg with Python access    Delta Lake on Databricks
Downstream training framework   PyTorch or JAX                Spark MLlib
Cluster lifecycle               Ephemeral per job             Long-lived shared
Streaming requirements          Async, tolerant               Exactly-once, structured streaming

Operational cost in the wild

Self-managed on AWS, May 2026 spot pricing, 16-node cluster running 24/7.

Component                       Ray plus KubeRay   Databricks Spark
Compute (spot)                  $58,000 / yr       $58,000 / yr
Software license                $0                 $40,000 / yr
Operator headcount              0.4 FTE            0.6 FTE
Average cluster utilization     64 percent         41 percent
Net cost per useful core-hour   $0.12              $0.27

Ray's higher average utilization comes from its autoscaler v2 plus actor lifetime control, which let the cluster shed idle capacity. Spark's utilization suffers because driver and executor sizing assumes coarse stages.
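
The last row folds the others together. A sketch of the arithmetic, where the FTE dollar rate and the core-hour denominator are our assumptions, not figures from the table:

```python
# Sketch of the "net cost per useful core-hour" metric. The $200k loaded
# FTE rate is an assumed number for illustration.
def cost_per_useful_core_hour(compute_usd, license_usd, operator_fte,
                              core_hours_per_year, utilization,
                              fte_rate_usd=200_000):
    total = compute_usd + license_usd + operator_fte * fte_rate_usd
    return total / (core_hours_per_year * utilization)
```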

Hybrid: Ray on Spark, Spark on Ray

Both systems offer interop, and each of the modes below works well enough to ship.

  • Ray on Spark: Databricks Runtime 15+ ships Ray as a first-class library. Run a Spark DataFrame ETL, then ray.init() in the same notebook for distributed training on the result. This is the path of least resistance for Databricks shops; a sketch follows this list.
  • Spark on Ray: RayDP runs PySpark on top of Ray. Useful for teams who want a single Ray-native control plane and need Spark for an existing SQL workload.
  • Dask on Ray: Drop-in scheduler swap. See the Ray Data documentation for the recommended pattern.
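
A minimal sketch of the Ray-on-Spark path, assuming a Databricks notebook where the `spark` session already exists; the table name and paths are hypothetical:

```python
import ray

# Spark does the set-oriented ETL it is good at.
features = spark.read.table("feature_store.daily").filter("ds = '2026-05-01'")
features.write.mode("overwrite").parquet("/dbfs/tmp/features")  # shared storage

# Then Ray takes over in the same notebook for Python-native training.
ray.init()  # Databricks Runtime 15+ ships Ray as a first-class library
ds = ray.data.read_parquet("/dbfs/tmp/features")
print(ds.count())  # hand `ds` to Ray Train from here
```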

For most teams, the right answer is Spark for what it is good at, Ray for what it is good at, and a shared object store (S3 or GCS) between them.

Migration story: a vision team moving off Spark

A computer-vision team we worked with had a 2018-vintage Spark MLlib pipeline doing image preprocessing plus a Horovod training step driven by Spark barrier mode. The numbers before migration:

  • Training run: 11 hours on 8 V100s
  • Cluster utilization: 38 percent
  • 60 percent of wall-clock time spent in driver-executor coordination

After migrating preprocessing to Ray Data and training to Ray Train:

  • Training run: 4.2 hours on the same 8 V100s
  • Cluster utilization: 71 percent
  • Engineering team dropped 1,400 lines of Scala wrapper code

The Spark code that survived is the upstream ETL, which still runs nightly to produce the Parquet that Ray Data reads.
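
The replacement pipeline's shape, roughly; paths, batch sizes, and the training body are illustrative stand-ins:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

ray.init()

# Ray Data replaces the Spark preprocessing; it reads the Parquet that
# the surviving nightly Spark ETL still produces.
ds = ray.data.read_parquet("s3://bucket/vision/preprocessed/")  # hypothetical path

def train_loop(config):
    from ray import train
    shard = train.get_dataset_shard("train")  # each worker streams its shard
    for _ in range(config["epochs"]):
        for batch in shard.iter_batches(batch_size=64):
            pass  # real code: forward/backward pass on the GPU

trainer = TorchTrainer(
    train_loop,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # the 8 V100s
    datasets={"train": ds},
)
trainer.fit()
```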

A note on managed inference

Distributed compute frameworks decide where training happens. Production inference is a different problem. Many teams run training on Ray and inference on a mix of self-hosted and managed APIs, with a gateway like Swfte Connect routing requests by cost and quality. That keeps the Ray cluster focused on training rather than handling unpredictable production load.

For inference patterns specifically, see our batch inference cost optimization guide.

What to do this quarter

  1. List every Spark job currently in your platform. Tag each as SQL ETL, batch ML, streaming, or Python compute. The Python compute and batch ML rows are Ray candidates.
  2. Pick one Python-heavy Spark job and rewrite it on Ray. Measure wall-clock and cost. Most teams see 2x to 3x speedups.
  3. Keep Spark for SQL ETL. Do not rewrite working ETL.
  4. If you are a Databricks shop, enable Ray on Databricks Runtime and run training there. No new cluster needed.
  5. Document the cluster boundary: which workloads go where, and how data moves between them. Most production failures come from blurred boundaries, not from the engines themselves.