
Hyperparameter search is the part of training that everyone underfunds and overruns. A single ResNet-50 sweep with grid search across five hyperparameters can burn $40,000 of GPU time before producing a model that an early-stopping scheduler would have found in 14 percent of the budget. Ray Tune is the open-source library that ships those schedulers.

This is a hands-on tour of Ray Tune in 2026, with the schedulers worth using, the ones to avoid, and where teams trip up moving from a laptop sweep to a 64-GPU cluster.

What Ray Tune actually is

Ray Tune is a Python library that turns any training function into a distributed hyperparameter search. It runs on top of Ray Core, so the same cluster you use for distributed training can host the sweep. The control plane decides what configurations to try; the data plane uses Ray actors to execute them in parallel.

The library has been the reference Bayesian and population-based search engine for the PyTorch ecosystem since 2019. It ships integrations for Optuna, HyperOpt, Nevergrad, BOHB, BayesOpt, and SigOpt. See the official Ray Tune documentation for the full integration list.

Five-line sweep

from ray import tune

def train(config):
    score = train_one_epoch(lr=config["lr"], batch=config["batch"])  # your own training step
    tune.report(score=score)  # stream the metric back to the Tune driver

search_space = {"lr": tune.loguniform(1e-5, 1e-2), "batch": tune.choice([16, 32, 64])}
tune.run(train, config=search_space, num_samples=200)

That snippet runs 200 trials in parallel across whatever Ray cluster ray.init() connects to. Move it to a 16-node A100 cluster and it spreads automatically. For a deeper tour of Ray itself, see our Apache Ray distributed Python guide.
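Connecting to that cluster is a single call before tune.run. A minimal sketch; the Ray Client address is a placeholder for your own head node:

import ray

ray.init(address="auto")  # join the cluster this machine is already part of
# ray.init("ray://head.internal:10001")  # or attach from outside via Ray Client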

Schedulers compared

The scheduler decides which trials to keep, prune, or perturb. Choosing the right one is the difference between a sweep that converges in 8 hours and one that runs all weekend.

Scheduler        Best for                   Typical compute savings  Notes
ASHA             Most workloads, deep nets  4x to 10x vs grid        Recommended default in Ray Tune 2.x
Hyperband        Theory-pure ASHA fallback  3x to 6x                 Superseded by ASHA
PBT              RL, long-running training  2x to 5x                 Mutates configs in place
PB2              PBT with Gaussian Process  2x to 4x                 Lower variance than PBT
BOHB             Mid-budget mixed search    4x to 8x                 Bayesian on top of Hyperband
Median Stopping  Cheap baseline             2x                       Use only if ASHA is unavailable
FIFO             Reproducibility            1x                       Runs every trial to completion

ASHA, the Asynchronous Successive Halving Algorithm, is what most teams should default to. It was published by Li and colleagues in 2018 (arxiv.org/abs/1810.05934) and matches or beats Bayesian methods on benchmarks while being trivially parallel.
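Enabling it is a one-argument change. A minimal sketch, reusing train and search_space from the five-line sweep; the rung parameters are illustrative, not tuned:

from ray.tune.schedulers import ASHAScheduler

asha = ASHAScheduler(
    metric="score",       # the key reported via tune.report
    mode="max",
    max_t=100,            # cap on training iterations per trial
    grace_period=5,       # every trial survives at least 5 iterations
    reduction_factor=4,   # each rung keeps roughly the top quarter
)
tune.run(train, config=search_space, num_samples=200, scheduler=asha)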

Search budget at a glance

Relative compute spent on a 5-dimensional ResNet sweep, normalized so grid search equals 100.

Grid search       |##########################################| 100
Random search     |#########################                 |  60
HyperOpt (TPE)    |####################                      |  48
PBT               |#########                                 |  22
Optuna + ASHA     |#######                                   |  17
Ray Tune ASHA     |######                                    |  14

ASHA reaches the same final accuracy with roughly 14 percent of the grid budget. Numbers from the Anyscale 2024 ASHA reproducibility study and our internal benchmarks.

Where teams burn money

Three failure modes are common in production sweeps.

The first is stopping too late. Without an explicit stop criterion, Ray Tune runs each trial for its full max iteration count even after the metric has plateaued. Setting tune.run(stop={"score": 0.95}) plus a real time_budget_s cuts wall-clock time by 30 to 50 percent.
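Both knobs go on the same call. A sketch reusing train and search_space from above; the threshold and budget are illustrative:

tune.run(
    train,
    config=search_space,
    num_samples=500,
    stop={"score": 0.95},    # end any trial that clears the target metric
    time_budget_s=8 * 3600,  # hard wall-clock cap on the whole sweep
)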

The second is head node OOM. Each trial reports metrics through the head node. With 1,000 concurrent trials reporting every 10 seconds, the GCS process becomes the bottleneck. Throttle reporting to once per epoch rather than once per step, and isolate the head on a memory-optimized node that runs no trials.

The third is re-instantiation cost. PyTorch Lightning and Hugging Face Trainer rebuild the model from scratch on every trial. Use tune.with_resources and reusable actors (reuse_actors=True) to keep CUDA contexts warm and shave 6 to 12 seconds per trial, which is meaningful when you run 5,000 of them.
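A sketch of the wiring, with placeholder resource numbers; reuse_actors keeps the worker process, and with it the CUDA context, alive between trials:

trainable = tune.with_resources(train, {"cpu": 4, "gpu": 1})  # pin each trial to one GPU
tune.run(
    trainable,
    config=search_space,
    num_samples=5000,
    reuse_actors=True,  # recycle actors instead of re-instantiating per trial
)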

Optuna versus Ray Tune

A common confusion is whether to use Optuna or Ray Tune. The honest answer is: both. Ray Tune ships an Optuna search algorithm out of the box. Optuna is the search algorithm; Ray Tune is the distributed executor.

Capability                   Optuna alone     Ray Tune             Ray Tune plus Optuna
Bayesian search              Yes              Yes                  Yes
Distributed across machines  Manual           Yes                  Yes
ASHA early stopping          No               Yes                  Yes
Population-based training    No               Yes                  Yes
Visualization dashboard      Yes              Yes (Ray dashboard)  Yes
Median time to first result  Fast on one box  Slow setup           Fast on cluster

If you have one GPU, Optuna alone is fine. If you have eight or more, wire Optuna into Ray Tune.
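The wiring is two extra arguments. A sketch reusing train and search_space from earlier; OptunaSearch defaults to Optuna's TPE sampler:

from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import ASHAScheduler

tune.run(
    train,
    config=search_space,
    num_samples=200,
    search_alg=OptunaSearch(metric="score", mode="max"),  # Optuna proposes configs
    scheduler=ASHAScheduler(metric="score", mode="max"),  # ASHA prunes the losers
)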

Production reference setup

For a real cluster running on KubeRay 1.2:

  • Head node on a memory-optimized instance with no GPU; isolate the GCS.
  • Worker pool autoscales between 4 and 64 GPU nodes.
  • Object spill directory mounted on local NVMe with RAY_object_spilling_config pointing at /mnt/nvme/spill.
  • Trials reported through Ray AIR MLflowLoggerCallback to a centralized tracking server.
  • Checkpoints written to S3 via storage_path="s3://my-bucket/" (the successor to tune.SyncConfig(upload_dir=...)); see the sketch after this list.
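A sketch of the driver-side wiring for the last two bullets; the tracking URI and bucket are placeholders:

from ray.air.integrations.mlflow import MLflowLoggerCallback

tune.run(
    train,
    config=search_space,
    num_samples=5000,
    storage_path="s3://my-bucket/",  # results and checkpoints survive a cluster crash
    callbacks=[
        MLflowLoggerCallback(
            tracking_uri="http://mlflow.internal:5000",  # placeholder tracking server
            experiment_name="nightly-sweep",
        )
    ],
)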

This setup runs sweeps of 5,000 trials nightly with a 99th percentile trial start latency under 4 seconds. The same cluster handles online inference during the day with vLLM workers; for that pattern see our LLM serving frameworks comparison.

When you should not use Ray Tune

Ray Tune is overkill when:

  • You have fewer than 50 trials. A bash loop with random seeds is faster to set up.
  • You are tuning a scikit-learn model. Use sklearn's HalvingRandomSearchCV directly; see the sketch after this list.
  • Your training step is under 30 seconds. Ray scheduling overhead becomes a meaningful fraction of trial time.
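For the scikit-learn case, a minimal sketch with a placeholder estimator and search space; HalvingRandomSearchCV applies the same successive-halving idea as ASHA:

from sklearn.experimental import enable_halving_search_cv  # noqa: F401, activates the class
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier

search = HalvingRandomSearchCV(
    RandomForestClassifier(),
    param_distributions={"max_depth": [3, 5, 10], "n_estimators": [50, 100, 200]},
    factor=3,  # each rung keeps roughly the top third of candidates
)
# search.fit(X, y) with your training data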

For everything else — large vision and language model training, RL, ensemble selection — Ray Tune is the default in 2026.

A note on inference and routing

Hyperparameter sweeps optimize a single model. In production, the next question is how to route real traffic across the candidates that survived the sweep. Many teams run an A/B test through a gateway like Swfte Connect, which can split traffic between fine-tunes by percentage and roll back on quality regression without redeploying. That keeps the sweep results honest against live distribution shift. For a deeper look at the routing layer, see our pillar on Ray and alternatives.

What to do this quarter

  1. Replace your existing grid or random search with Ray Tune and ASHA on the same training script. Expect a 5x to 10x compute reduction with no accuracy loss.
  2. Set time_budget_s and a stop criterion on every sweep so a stuck trial cannot eat the cluster.
  3. Wire Optuna's TPE sampler into Ray Tune for any sweep above three dimensions.
  4. Move trial reporting through MLflow or Weights and Biases so results survive a cluster crash.
  5. Stand up a KubeRay test cluster on three small nodes and prove autoscaling works before pushing the credit card limit on a 64-GPU one.
