
Most Ray tutorials stop at @ray.remote on a Fibonacci function. The harder question — what production Ray workloads actually look like — rarely gets a clean answer. This post collects six representative examples we have shipped or seen shipped, with the cluster shape and what to watch out for. Skim the table, then jump to the one that matches your work.

The six examples at a glance

Example                          Ray library          Cluster shape              Code complexity
1. Distributed PyTorch training  Ray Train            8 GPU workers + 1 head     Low
2. Hyperparameter sweep          Ray Tune             4-32 nodes                 Low
3. LLM batch inference           Ray Data + vLLM      4-16 GPU workers           Medium
4. Online LLM serving            Ray Serve + vLLM     2-8 GPU replicas           Medium
5. Reinforcement learning        RLlib                16-64 CPU workers + 1 GPU  High
6. Web crawl ETL                 Ray Core + Ray Data  8-32 CPU workers           Low

Example 1: distributed PyTorch with Ray Train

import ray
from ray.train import ScalingConfig, RunConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    import torch
    model = build_resnet50()  # build_resnet50/get_loader/step: your own helpers
    model = ray.train.torch.prepare_model(model)
    loader = ray.train.torch.prepare_data_loader(get_loader(config))
    for epoch in range(config["epochs"]):
        for batch in loader:
            loss = step(model, batch)
        ray.train.report({"loss": loss.item()})

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    train_loop_config={"epochs": 10},
    run_config=RunConfig(storage_path="s3://my-bucket/runs"),
)
result = trainer.fit()

Ray Train wraps the model and dataloader so each worker sees its own shard of the data. Checkpoints flow to S3, and the run survives worker failures by restarting from the latest checkpoint.

Watch out for: NCCL timeouts on large clusters. Set NCCL_ASYNC_ERROR_HANDLING=1 and NCCL_BLOCKING_WAIT=1.
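One place to apply those settings cluster-wide is the job-level runtime_env — a sketch, assuming Ray's env-var propagation; your cluster YAML or launcher may be the better home for them:

```python
# NCCL settings that make collectives fail fast instead of hanging silently.
nccl_env = {
    "NCCL_ASYNC_ERROR_HANDLING": "1",  # surface async NCCL errors to the job
    "NCCL_BLOCKING_WAIT": "1",         # make stuck collectives raise, not stall
}

# Hand them to every Ray worker at job start:
# ray.init(runtime_env={"env_vars": nccl_env})
```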

Example 2: hyperparameter sweep with Ray Tune

from ray import tune
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    score = train_one_epoch(lr=config["lr"], wd=config["wd"])
    tune.report({"score": score})

tuner = tune.Tuner(
    trainable,
    param_space={
        "lr": tune.loguniform(1e-5, 1e-1),
        "wd": tune.loguniform(1e-6, 1e-2),
    },
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(metric="score", mode="max"),
        scheduler=ASHAScheduler(metric="score", mode="max"),
        num_samples=200,
    ),
)
tuner.fit()

ASHA prunes bad trials early; Optuna steers sampling toward promising regions. In production, 200 trials typically converge within 4 hours on a 16-GPU cluster.

For the full guide see our Ray Tune hyperparameter post.

Example 3: LLM batch inference

import ray
from ray.data import read_parquet
from vllm import LLM, SamplingParams

class LLMActor:
    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

    def __call__(self, batch):
        prompts = batch["prompt"].tolist()
        outputs = self.llm.generate(prompts, SamplingParams(max_tokens=256))
        batch["completion"] = [o.outputs[0].text for o in outputs]
        return batch

ds = read_parquet("s3://bucket/inputs.parquet")
ds = ds.map_batches(LLMActor, num_gpus=1, concurrency=8, batch_size=64)
ds.write_parquet("s3://bucket/outputs.parquet")

Eight vLLM replicas, each one GPU. Ray Data feeds them rows; vLLM's continuous batching handles the per-replica scheduling. The deeper mechanics are in our vLLM continuous batching deep dive.

Watch out for: Parquet input file sizes. Ray Data reads one file per task at minimum; if you have one 100 GB file, you get one task. Repartition input files to 100-500 MB before kicking off.
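A rough sizing helper for that repartitioning step — a sketch, where the 256 MB default is an assumption you should tune per format and compression:

```python
def target_partitions(total_bytes: int, target_mb: int = 256) -> int:
    """Number of output files so each lands in the 100-500 MB sweet spot."""
    return max(1, round(total_bytes / (target_mb * 1024 * 1024)))

# ds = read_parquet("s3://bucket/inputs.parquet")
# ds = ds.repartition(target_partitions(ds.size_bytes()))  # then map_batches
```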

Example 4: online LLM serving

from ray import serve
from vllm import LLM, SamplingParams
from fastapi import FastAPI

app = FastAPI()

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class LLMServer:
    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    @app.post("/v1/completions")
    async def complete(self, body: dict):
        # Note: LLM.generate is vLLM's blocking offline API, so requests
        # serialize within a replica; production code would use vLLM's
        # async engine instead.
        out = self.llm.generate(body["prompt"], SamplingParams(max_tokens=body.get("max_tokens", 256)))
        return {"text": out[0].outputs[0].text}

serve.run(LLMServer.bind(), route_prefix="/")

Four GPU replicas behind a Ray Serve ingress. Autoscaling, rolling deploys, and health checks are all built in. Add an OpenAI-compatible adapter and you have production serving.
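A minimal sketch of that adapter, wrapping raw generated text in the legacy OpenAI completions response shape (the field subset here is illustrative, not the full schema):

```python
import time
import uuid

def to_openai_completion(text: str, model: str) -> dict:
    """Shape one generation into an OpenAI-style completions response."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex[:24]}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "text": text, "finish_reason": "stop"}],
    }
```

Inside the `complete` handler you would return `to_openai_completion(out[0].outputs[0].text, "llama-3.1-8b")` instead of the bare dict.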

Watch out for: cold-start time. A 70B model loads in about 90 seconds. Set min_replicas above zero in autoscaling_config to avoid request timeouts.
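A sketch of that autoscaling config — the key names follow Ray Serve's autoscaling_config, but the target-requests key has been renamed across Ray versions, so check your release:

```python
# Keep warm replicas around so a ~90-second model load never sits on the
# request path.
autoscaling = {
    "min_replicas": 2,               # never scale to zero: cold start is ~90 s
    "max_replicas": 8,
    "target_ongoing_requests": 16,   # per-replica load target for scale-up
}

# @serve.deployment(autoscaling_config=autoscaling,
#                   ray_actor_options={"num_gpus": 1})
```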

Example 5: reinforcement learning with RLlib

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(train_batch_size=4000, lr=3e-4, gamma=0.99)
    .env_runners(num_env_runners=64)
    .resources(num_gpus=1)
)
algo = config.build()
for i in range(100):
    result = algo.train()
    print(i, result["env_runners"]["episode_return_mean"])

64 environment workers feed rollouts to one GPU learner. PPO converges on CartPole in seconds; on more complex envs (Atari, MuJoCo), expect hours.

Watch out for: serialization of custom environments. Custom envs must be picklable; an env that holds open OS resources (file handles, sockets) in its state will crash the workers when Ray ships it to them.

Example 6: web crawl ETL

import ray
import requests
from bs4 import BeautifulSoup

ray.init()

@ray.remote(num_cpus=0.25)
def fetch_and_parse(url):
    try:
        r = requests.get(url, timeout=10)
        text = BeautifulSoup(r.text, "html.parser").get_text()
        return {"url": url, "text": text[:8000], "ok": True}
    except Exception as e:
        return {"url": url, "ok": False, "err": str(e)}

urls = open("urls.txt").read().splitlines()
results = ray.get([fetch_and_parse.remote(u) for u in urls])
ray.data.from_items(results).write_parquet("s3://bucket/crawl/")

At num_cpus=0.25, four tasks share each CPU, which is fine because the work is network-bound. 100,000 URLs finish on a 16-CPU cluster in under 30 minutes.

Watch out for: rate-limiting yourself out of the source domains. Add a RateLimiter actor and have tasks acquire a token before fetching.
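A sketch of that limiter as a token bucket — wrap it in @ray.remote so every fetch task shares one instance; the rates below are placeholders:

```python
import time

class RateLimiter:
    """Token bucket. Decorate with @ray.remote and have each fetch task
    call limiter.acquire.remote() before issuing its request."""

    def __init__(self, rate_per_sec: float, burst: int = 10):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)

# limiter = ray.remote(RateLimiter).remote(rate_per_sec=50)
# inside fetch_and_parse: ray.get(limiter.acquire.remote())
```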

Cluster sizing summary

Workload                        Min size   Sweet spot   Indicative numbers
PyTorch training (8B model)     4 GPU      8 GPU        12 hrs to converge
Hyperparameter sweep (200)      4 GPU      16 GPU       4 hrs total
LLM batch inference             4 GPU      16 GPU       12k tok/s/GPU
LLM online serving              2 GPU      4-8 GPU      6k tok/s/replica
RL (simple env)                 4 CPU      32 CPU       1M steps/min
Web crawl                       2 CPU      16 CPU       100k URLs/30min

Where these examples sit in the bigger picture

These are the canonical Ray workloads. The interesting architectural question is which ones to keep self-hosted versus push to managed APIs. For a feature-by-feature look at the inference engines we used (vLLM, SGLang, TGI), see our LLM serving frameworks 2026 comparison and the broader Ray vs alternatives pillar.

For inference cost specifically, our batch inference cost optimization post walks through the spot-instance math.

Routing example: hybrid Ray and managed

A working production pattern: Ray Serve runs your local Llama 70B for routine traffic, and a gateway like Swfte Connect sits in front. The gateway routes simple chat to your Ray cluster, complex reasoning to Anthropic, vision queries to Gemini, and falls back to OpenAI when local capacity saturates. Ray handles compute; the gateway handles cross-provider orchestration.
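The policy reads as a few lines of routing logic — thresholds and provider names here are illustrative, not the gateway's actual rules:

```python
def route(request_kind: str, local_load: float) -> str:
    """Pick a backend for one request given its kind and local utilization."""
    if local_load > 0.9:
        return "openai"          # local capacity saturated: fall back
    if request_kind == "vision":
        return "gemini"
    if request_kind == "reasoning":
        return "anthropic"
    return "ray-serve-llama70b"  # routine chat stays on the local cluster
```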

What to do this quarter

  1. Pick the example above closest to your current workload and copy it. Get it running on a 3-node Ray cluster end to end. Ship something small.
  2. Stand up monitoring (Prometheus plus the Ray dashboard) before you scale. Production scaling without metrics is a guessing game.
  3. If the LLM batch or online example matches you, plan to consolidate onto Ray Serve plus vLLM. The vendor lock-in to OpenAI batch is real but reversible.
  4. Externalize the GCS (Ray's Global Control Store) to managed Redis on day one. Skipping this is the most expensive shortcut in Ray: without it, losing the head node loses the cluster.
  5. Move from notebooks to declarative cluster YAML. Notebook-driven Ray clusters work for prototyping and break in production.
