Hyperparameter search is the part of training that everyone underfunds and overruns. A single ResNet-50 sweep with grid search across five hyperparameters can burn $40,000 of GPU time before producing a model that an early-stopping scheduler would have found in 14 percent of the budget. Ray Tune is the open-source library that ships those schedulers.
This is a hands-on tour of Ray Tune in 2026, with the schedulers worth using, the ones to avoid, and where teams trip up moving from a laptop sweep to a 64-GPU cluster.
## What Ray Tune actually is
Ray Tune is a Python library that turns any training function into a distributed hyperparameter search. It runs on top of Ray Core, so the same cluster you use for distributed training can host the sweep. The control plane decides what configurations to try; the data plane uses Ray actors to execute them in parallel.
The library has been the reference Bayesian and population-based search engine for the PyTorch ecosystem since 2019. It ships integrations for Optuna, HyperOpt, Nevergrad, BOHB, BayesOpt, and SigOpt. See the official Ray Tune documentation for the full integration list.
## Five-line sweep

```python
from ray import tune

def train(config):
    score = train_one_epoch(lr=config["lr"], batch=config["batch"])
    tune.report(score=score)

tune.run(
    train,
    config={"lr": tune.loguniform(1e-5, 1e-2), "batch": tune.choice([16, 32, 64])},
    num_samples=200,
)
```
That snippet runs 200 trials in parallel across whatever Ray cluster ray.init() connects to. Move it to a 16-node A100 cluster and it spreads automatically. For a deeper tour of Ray itself, see our Apache Ray distributed Python guide.
## Schedulers compared
The scheduler decides which trials to keep, prune, or perturb. Choosing the right one is the difference between a sweep that converges in 8 hours and one that runs all weekend.
| Scheduler | Best for | Typical compute savings | Notes |
|---|---|---|---|
| ASHA | Most workloads, deep nets | 4x to 10x vs grid | Default in Ray Tune 2.x |
| Hyperband | Synchronous precursor to ASHA | 3x to 6x | Superseded by ASHA |
| PBT | RL, long-running training | 2x to 5x | Mutates configs in place |
| PB2 | PBT with Gaussian-process bandits | 2x to 4x | Lower variance than PBT |
| BOHB | Mid-budget mixed search | 4x to 8x | Bayesian on top of Hyperband |
| Median Stopping | Cheap baseline | 2x | Use only if ASHA is unavailable |
| FIFO | Reproducibility | 1x | Runs every trial to completion |
ASHA, the Asynchronous Successive Halving Algorithm, is what most teams should default to. It was published by Li and colleagues in 2018 (arxiv.org/abs/1810.05934) and matches or beats Bayesian methods on benchmarks while being trivially parallel.
## Search budget at a glance
Relative compute spent on a 5-dimensional ResNet sweep, normalized so grid search equals 100.
Grid search |##########################################| 100
Random search |######################### | 60
HyperOpt (TPE) |#################### | 48
Optuna + ASHA |####### | 17
Ray Tune ASHA |###### | 14
PBT |######### | 22
ASHA reaches the same final accuracy with roughly 14 percent of the grid budget. Numbers from the Anyscale 2024 ASHA reproducibility study and our internal benchmarks.
## Where teams burn money
Three failure modes are common in production sweeps.
The first is stopping too late. Without an explicit stop condition, Ray Tune runs every trial to its full iteration count even when the metric has plateaued. Passing a stop criterion such as `stop={"score": 0.95}` to `tune.run`, plus a real `time_budget_s`, cuts wall-clock time by 30 to 50 percent.
The second is head-node OOM. Every trial reports metrics through the head node, and with 1,000 concurrent trials reporting every 10 seconds the GCS process becomes the bottleneck. Lower the reporting frequency, isolate the head on a memory-optimized node, and keep trials off it entirely. Note that the legacy `RAY_NUM_REDIS_SHARDS` knob predates Ray 2.x, which dropped Redis from the GCS.
The third is re-instantiation cost. PyTorch Lightning and Hugging Face Trainer rebuild the model from scratch on every trial. Use `tune.with_resources` and actor reuse (`reuse_actors=True`) to keep CUDA contexts warm and shave 6 to 12 seconds per trial, which is meaningful when you run 5,000 of them.
## Optuna versus Ray Tune
A common confusion is whether to use Optuna or Ray Tune. The honest answer is: both. Ray Tune ships an Optuna search algorithm out of the box. Optuna is the search algorithm; Ray Tune is the distributed executor.
| Capability | Optuna alone | Ray Tune | Ray Tune plus Optuna |
|---|---|---|---|
| Bayesian search | Yes | Yes | Yes |
| Distributed across machines | Manual | Yes | Yes |
| ASHA early stopping | No | Yes | Yes |
| Population-based training | No | Yes | Yes |
| Visualization dashboard | Yes | Yes (Ray dashboard) | Yes |
| Median time to first result | Fast on one box | Slow setup | Fast on cluster |
If you have one GPU, Optuna alone is fine. If you have eight or more, wire Optuna into Ray Tune.
## Production reference setup
For a real cluster running on KubeRay 1.2:
- Head node on a memory-optimized instance with no GPU; isolate the GCS.
- Worker pool autoscales between 4 and 64 GPU nodes.
- Object spill directory mounted on local NVMe, with `RAY_object_spilling_config` pointing at `/mnt/nvme/spill`.
- Trials reported through Ray AIR's `MLflowLoggerCallback` to a centralized tracking server.
- Checkpoints written to S3 via `tune.SyncConfig(upload_dir="s3://my-bucket/")`.
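The spill setting above is a JSON document passed through an environment variable before `ray.init()`; a minimal sketch, with the path taken from the setup list:

```python
import json
import os

# Point Ray's object spilling at local NVMe. The JSON shape follows
# Ray's object-spilling configuration format; set this before ray.init().
os.environ["RAY_object_spilling_config"] = json.dumps(
    {"type": "filesystem", "params": {"directory_path": "/mnt/nvme/spill"}}
)
```

Spilling to NVMe rather than the root disk keeps object-store pressure from stalling trial starts when thousands of results queue up.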
This setup runs sweeps of 5,000 trials nightly with a 99th percentile trial start latency under 4 seconds. The same cluster handles online inference during the day with vLLM workers; for that pattern see our LLM serving frameworks comparison.
## When you should not use Ray Tune
Ray Tune is overkill when:
- You have fewer than 50 trials. A bash loop with random seeds is faster to set up.
- You are tuning a scikit-learn model. Use sklearn's `HalvingRandomSearchCV` directly.
- Your training step is under 30 seconds. Ray scheduling overhead becomes a meaningful fraction of trial time.
For everything else — large vision and language model training, RL, ensemble selection — Ray Tune is the default in 2026.
## A note on inference and routing
Hyperparameter sweeps optimize a single model. In production, the next question is how to route real traffic across the candidates that survived the sweep. Many teams run an A/B test through a gateway like Swfte Connect, which can split traffic between fine-tunes by percentage and roll back on quality regression without redeploying. That keeps the sweep results honest against live distribution shift. For a deeper look at the routing layer, see our pillar on Ray and alternatives.
## What to do this quarter
- Replace your existing grid or random search with Ray Tune and ASHA on the same training script. Expect a 5x to 10x compute reduction with no accuracy loss.
- Set `time_budget_s` and a stop criterion on every sweep so a stuck trial cannot eat the cluster.
- Wire Optuna's TPE sampler into Ray Tune for any sweep above three dimensions.
- Move trial reporting through MLflow or Weights and Biases so results survive a cluster crash.
- Stand up a KubeRay test cluster on three small nodes and prove autoscaling works before pushing the credit card limit on a 64-GPU one.