Most Ray tutorials stop at @ray.remote on a Fibonacci function. The actual question — what do production Ray workloads look like — is harder to find a clean answer to. This post collects six representative examples we have shipped or seen shipped, with the cluster shape and what to watch out for. Skim the table, jump to the one that matches your work.
## The six examples at a glance
| Example | Ray library | Cluster shape | Code complexity |
|---|---|---|---|
| 1. Distributed PyTorch training | Ray Train | 8 GPU workers + 1 head | Low |
| 2. Hyperparameter sweep | Ray Tune | 4-32 nodes | Low |
| 3. LLM batch inference | Ray Data + vLLM | 4-16 GPU workers | Medium |
| 4. Online LLM serving | Ray Serve + vLLM | 2-8 GPU replicas | Medium |
| 5. Reinforcement learning | RLlib | 16-64 CPU workers + 1 GPU | High |
| 6. Web crawl ETL | Ray Core + Ray Data | 8-32 CPU workers | Low |
## Example 1: distributed PyTorch with Ray Train

```python
import ray
from ray.train import ScalingConfig, RunConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # build_resnet50, get_loader, and step are your own model, data, and
    # training-step helpers.
    model = ray.train.torch.prepare_model(build_resnet50())  # DDP-wraps, moves to GPU
    loader = ray.train.torch.prepare_data_loader(get_loader(config))  # adds the distributed sampler
    for epoch in range(config["epochs"]):
        for batch in loader:
            loss = step(model, batch)
            ray.train.report({"loss": loss.item()})

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    train_loop_config={"epochs": 10},
    run_config=RunConfig(storage_path="s3://my-bucket/runs"),
)
result = trainer.fit()
```
Ray Train wraps the model and dataloader so each worker sees the right data shard, checkpoints flow to the S3 storage_path, and the job is restart-resilient if a worker dies.
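Restart-resilience needs two pieces the minimal loop above omits: report a checkpoint so a replacement worker has something to resume from, and allow retries in the RunConfig. A sketch, reusing the same assumed helpers (build_resnet50, get_loader, step):

```python
import os
import tempfile

import torch
import ray
from ray.train import Checkpoint, FailureConfig, RunConfig

def train_func_with_ckpt(config):
    model = ray.train.torch.prepare_model(build_resnet50())
    loader = ray.train.torch.prepare_data_loader(get_loader(config))
    for epoch in range(config["epochs"]):
        for batch in loader:
            loss = step(model, batch)
        # Persist a checkpoint once per epoch; Ray Train uploads it to the
        # RunConfig storage_path (S3 here).
        with tempfile.TemporaryDirectory() as tmp:
            torch.save(model.state_dict(), os.path.join(tmp, "model.pt"))
            ray.train.report(
                {"loss": loss.item(), "epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp),
            )

run_config = RunConfig(
    storage_path="s3://my-bucket/runs",
    failure_config=FailureConfig(max_failures=3),  # retry the run on worker death
)
```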
Watch out for: NCCL timeouts on large clusters. Set NCCL_ASYNC_ERROR_HANDLING=1 and NCCL_BLOCKING_WAIT=1.
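One way to make those variables reach every training worker is the job-level runtime_env. A minimal sketch (note that PyTorch 2.2+ renamed the first variable to TORCH_NCCL_ASYNC_ERROR_HANDLING):

```python
import ray

# Propagate NCCL settings to all workers started by this job.
ray.init(
    runtime_env={
        "env_vars": {
            "NCCL_ASYNC_ERROR_HANDLING": "1",  # TORCH_NCCL_ASYNC_ERROR_HANDLING on PyTorch >= 2.2
            "NCCL_BLOCKING_WAIT": "1",
        }
    }
)
```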
## Example 2: hyperparameter sweep with Ray Tune

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    # train_one_epoch is your own training helper.
    score = train_one_epoch(lr=config["lr"], wd=config["wd"])
    tune.report({"score": score})  # current Tune API takes a metrics dict

tuner = tune.Tuner(
    trainable,
    param_space={
        "lr": tune.loguniform(1e-5, 1e-1),
        "wd": tune.loguniform(1e-6, 1e-2),
    },
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(metric="score", mode="max"),
        scheduler=ASHAScheduler(metric="score", mode="max"),
        num_samples=200,
    ),
)
tuner.fit()
```
ASHA prunes bad trials early while Optuna steers sampling toward promising regions. In production, 200 trials typically converge within 4 hours on a 16-GPU cluster.
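One caveat: ASHA can only prune a trial that reports intermediate results, and the single-report trainable above gives it nothing to act on. A sketch of a per-epoch variant (train_one_epoch is the same assumed helper, and the epoch count is a placeholder):

```python
def trainable_per_epoch(config):
    # Report once per epoch so ASHA can stop weak trials mid-run.
    for _ in range(20):  # epochs per trial; tune to your budget
        score = train_one_epoch(lr=config["lr"], wd=config["wd"])
        tune.report({"score": score})
```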
For the full guide see our Ray Tune hyperparameter post.
## Example 3: LLM batch inference

```python
import ray
from ray.data import read_parquet
from vllm import LLM, SamplingParams

class LLMActor:
    def __init__(self):
        # One vLLM engine per actor; each actor pins one GPU.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

    def __call__(self, batch):
        prompts = batch["prompt"].tolist()
        outputs = self.llm.generate(prompts, SamplingParams(max_tokens=256))
        batch["completion"] = [o.outputs[0].text for o in outputs]
        return batch

ds = read_parquet("s3://bucket/inputs.parquet")
ds = ds.map_batches(LLMActor, num_gpus=1, concurrency=8, batch_size=64)
ds.write_parquet("s3://bucket/outputs.parquet")
```
Eight vLLM replicas, one GPU each. Ray Data feeds them rows; vLLM's continuous batching handles the per-replica scheduling. The deeper mechanics are in our vLLM continuous batching deep dive.
Watch out for: Parquet input file sizes. Ray Data creates at most one read task per file; if you have one 100 GB file, you get one task. Repartition input files to 100-500 MB before kicking off.
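If you cannot reshape the files upstream, repartitioning after the read restores parallelism for everything downstream of the read itself. A sketch (the block count is an assumption; size it to your cluster):

```python
ds = read_parquet("s3://bucket/inputs.parquet")
ds = ds.repartition(256)  # split into ~256 blocks before map_batches
```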
## Example 4: online LLM serving

```python
from fastapi import FastAPI
from ray import serve
from vllm import LLM, SamplingParams

app = FastAPI()

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class LLMServer:
    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    @app.post("/v1/completions")
    async def complete(self, body: dict):
        # Note: LLM.generate() is synchronous and blocks this replica's event
        # loop; swap in vLLM's async engine for real request concurrency.
        out = self.llm.generate(
            [body["prompt"]],
            SamplingParams(max_tokens=body.get("max_tokens", 256)),
        )
        return {"text": out[0].outputs[0].text}

serve.run(LLMServer.bind(), route_prefix="/")
```
Four GPU replicas behind a Ray Serve ingress. Autoscaling, rolling deploys, health checks all built in. Add an OpenAI-compatible adapter and you have production serving.
Watch out for: cold-start time. A 70B model loads in about 90 seconds. Set min_replicas above zero in autoscaling_config to avoid request timeouts.
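A sketch of the autoscaling variant, keeping one warm replica so a cold 70B load never lands on the request path. The target value is an assumption to tune against your latency SLO, and older Serve versions spell the key target_num_ongoing_requests_per_replica:

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,  # never scale to zero: avoids the cold start
        "max_replicas": 8,
        "target_ongoing_requests": 16,  # per-replica load target
    },
)
class AutoscalingLLMServer:
    ...  # same body as LLMServer above
```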
## Example 5: reinforcement learning with RLlib

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(train_batch_size=4000, lr=3e-4, gamma=0.99)
    .env_runners(num_env_runners=64)  # 64 CPU rollout workers
    .resources(num_gpus=1)            # one GPU for the learner
)
algo = config.build()
for i in range(100):
    result = algo.train()
    print(i, result["env_runners"]["episode_return_mean"])
```
64 environment workers feed rollouts to one GPU learner. PPO converges on CartPole in seconds, on more complex envs (Atari, MuJoCo) in hours.
Watch out for: serialization of custom environments. Custom envs must be pickleable; envs that hold open OS resources (file handles, sockets) crash the workers.
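A pattern that sidesteps the pickling problem is registering an environment factory: only the factory function gets shipped, and each env runner opens its own handles when it calls the factory. A sketch, with gymnasium standing in for your custom env:

```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env

def make_env(env_config):
    # Runs inside each env-runner process: open sockets, file handles,
    # simulators, etc. here rather than in the driver.
    import gymnasium as gym
    return gym.make("CartPole-v1")

register_env("my_env", make_env)
config = PPOConfig().environment("my_env")  # reference by registered name
```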
## Example 6: web crawl ETL

```python
import ray
import requests
from bs4 import BeautifulSoup

ray.init()

@ray.remote(num_cpus=0.25)  # network-bound: pack four tasks per CPU
def fetch_and_parse(url):
    # Return the same keys on both paths so the Parquet write below
    # doesn't hit mixed schemas.
    try:
        r = requests.get(url, timeout=10)
        text = BeautifulSoup(r.text, "html.parser").get_text()
        return {"url": url, "text": text[:8000], "ok": True, "err": ""}
    except Exception as e:
        return {"url": url, "text": "", "ok": False, "err": str(e)}

urls = open("urls.txt").read().splitlines()
results = ray.get([fetch_and_parse.remote(u) for u in urls])
ray.data.from_items(results).write_parquet("s3://bucket/crawl/")
```
num_cpus=0.25 packs four tasks per CPU because the work is network-bound, not compute-bound. 100,000 URLs finish on a 16-CPU cluster in under 30 minutes.
Watch out for: rate-limiting yourself out of the source domains. Add a RateLimiter actor and have tasks acquire a token before fetching.
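A minimal token-bucket sketch (the rate numbers are placeholders, and per-domain buckets are the obvious next step):

```python
import time

import ray
import requests

@ray.remote
class RateLimiter:
    """Token bucket: refill `rate` tokens per second, capped at `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        # Actor methods run one at a time, so sleeping here throttles every
        # waiting caller, which is the point.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

limiter = RateLimiter.remote(rate=50, capacity=50)  # ~50 fetches/s cluster-wide

@ray.remote(num_cpus=0.25)
def fetch_with_limit(url):
    ray.get(limiter.acquire.remote())  # wait for a token before touching the network
    return requests.get(url, timeout=10).text  # then parse as in fetch_and_parse
```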
## Cluster sizing summary
| Workload | Min nodes | Sweet spot | Sample throughput |
|---|---|---|---|
| PyTorch training (8B model) | 4 GPU | 8 GPU | 12 hrs to converge |
| Hyperparameter sweep (200 trials) | 4 GPU | 16 GPU | 4 hrs total |
| LLM batch inference | 4 GPU | 16 GPU | 12k tok/s/GPU |
| LLM online serving | 2 GPU | 4-8 GPU | 6k tok/s/replica |
| RL (simple env) | 4 CPU | 32 CPU | 1M steps/min |
| Web crawl | 2 CPU | 16 CPU | 100k URLs / 30 min |
## Where these examples sit in the bigger picture
These are the canonical Ray workloads. The interesting architectural question is which ones to keep self-hosted versus push to managed APIs. For a feature-by-feature look at the inference engines we used (vLLM, SGLang, TGI), see our LLM serving frameworks 2026 comparison and the broader Ray vs alternatives pillar.
For inference cost specifically, our batch inference cost optimization post walks through the spot-instance math.
## Routing example: hybrid Ray and managed
A working production pattern: Ray Serve runs your local Llama 70B for routine traffic, and a gateway like Swfte Connect sits in front. The gateway routes simple chat to your Ray cluster, complex reasoning to Anthropic, vision queries to Gemini, and falls back to OpenAI when local capacity saturates. Ray handles compute; the gateway handles cross-provider orchestration.
## Canonical references
- Ray examples gallery: docs.ray.io/en/latest/ray-overview/use-cases.html.
- vLLM project: github.com/vllm-project/vllm.
- RLlib documentation: docs.ray.io/en/latest/rllib/index.html.
- The original Ray OSDI paper: arxiv.org/abs/1712.05889.
## What to do this quarter
- Pick the example above closest to your current workload and copy it. Get it running on a 3-node Ray cluster end to end. Ship something small.
- Stand up monitoring (Prometheus plus the Ray dashboard) before you scale. Production scaling without metrics is a guessing game.
- If the LLM batch or online example matches you, plan to consolidate onto Ray Serve plus vLLM. The vendor lock-in to OpenAI batch is real but reversible.
- Externalize the GCS (Ray's Global Control Service) to managed Redis on day one, so the head node can restart without losing cluster state. Skipping this is the most expensive shortcut in Ray.
- Move from notebooks to declarative cluster YAML. Notebook-driven Ray clusters work for prototyping and break in production.