If your team is using Claude Code seriously in 2026, two things are true at the same time. The first is that your engineers are shipping more code than they ever have, and shipping it at a quality bar that even the most generous reviewer in 2023 would have called senior-level work. The second is that your AI bill is climbing fast enough to attract attention from finance, and the rate of climb is not linear with the velocity improvement; it is super-linear. The reason is structural, and it is the same structural reason behind every cost overrun on agentic AI: the model is doing a lot of expensive thinking about coordination work that does not actually require a frontier model.

The pattern that fixes this — without giving up the velocity gain that Claude Code provides — is to let Swfte handle orchestration and let Claude handle composition. The cost reduction we see in practice is in the 60-80% range, on real production codebases, with no measurable hit to delivery quality. This post is the architecture, the math, and the migration playbook.

What Claude Code is great at, and what it is wasteful at

Claude Code is a frontier-model agent that runs a tight loop: read context, plan, edit, run, observe, iterate. It is genuinely excellent at the composition parts of that loop — writing code, refactoring, reasoning about subtle bugs, generating tests, drafting docstrings. These are the parts where the frontier model's quality matters and where the expensive tokens are well spent.

It is much less impressive, and much more expensive, at the orchestration parts of the same loop. Reading directory listings to find the right file. Re-reading a file that was just edited to confirm the change. Running the test suite for the third time after a small fix. Deciding which of seventeen files to touch next. Re-loading the same package.json for the fifth time in a session. Each of these is a frontier-model token spend on a task that is, charitably, clerical.

A typical Claude Code session decomposes roughly as:

  • 40-55% of tokens spent on orchestration (file enumeration, context re-loading, decision-making about which file to touch next, redundant test runs, status checking)
  • 30-45% of tokens spent on composition (the actual code generation, refactoring, reasoning about behaviour)
  • 5-15% of tokens spent on planning (the high-level decomposition, which is genuinely high-value frontier-model work)

The composition tokens are the ones you want to be paying for. The orchestration tokens are the ones you would rather not be.

The Swfte-orchestrates / Claude-composes pattern

The pattern, in one sentence: let a workflow runtime do the deterministic loop work, and call Claude only when actual reasoning or composition is needed.

In practice, this looks like a Swfte workflow that:

  1. Takes a high-level intent (a prompt like 'implement the new export-CSV feature for the reports module').
  2. Plans the task into discrete steps (the planning step itself can use Claude, but only once, on a tight prompt).
  3. For each step, runs a workflow node that does the deterministic work (find files, read their current state, run tests, parse outputs) and only calls Claude when that step actually needs composition (writing the diff, reasoning about a failure, generating the test).
  4. Routes the easy composition work (renaming, simple edits, mechanical refactors) to a cheaper model (Claude Haiku, DeepSeek V4 Flash, Nova Lite) and the hard composition work (architecture changes, subtle bug fixes) to Claude Sonnet or Opus.
  5. Maintains its own state about the codebase between steps, so context is loaded once and reused, rather than re-loaded by Claude every time the agent forgets.

The orchestrator is doing what an orchestrator is good at: deterministic state management, branching, retry, persistence. Claude is doing what Claude is good at: composing the next bit of code given clean inputs.

The math, with real numbers

Take a representative engineering session: implement a new feature touching four files, generate tests, run the suite, fix one failure, commit. We benchmarked this against three configurations:

Configuration A — Claude Code, default. The agent runs end-to-end, reading directories, loading files, running tests, deciding next steps, all using Claude Sonnet 4.6 ($3 input / $15 output per 1M tokens). Total session: 850K input, 95K output. Cost: $2.55 input + $1.43 output = $3.98 per session.

Configuration B — Claude Code, optimised. Same, but with explicit caching enabled (Anthropic's prompt caching) and careful context management. Total session: 320K fresh input plus 240K cached input reads, 95K output. Cost: $0.96 (fresh input) + $0.072 (cached reads at 90% off) + $1.43 (output) = $2.46 per session. A 38% improvement just from caching.

Configuration C — Swfte orchestrates, Claude composes. Workflow runtime handles file enumeration, test runs, context loading, retry decisions. Composition steps call Claude Sonnet for hard work (~30% of steps) and Claude Haiku for easy work (~70%). Total Claude usage: 90K input, 35K output on Sonnet; 60K input, 25K output on Haiku. Cost: $0.27 input + $0.525 output (Sonnet) + $0.05 input + $0.025 output (Haiku) = $0.87 per session.

Configuration C is 78% cheaper than Configuration A and 65% cheaper than the already-optimised Configuration B, with no measurable difference in delivered code quality on a 50-session held-out evaluation.
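
If you want to check the arithmetic yourself, a minimal sketch in Python reproduces the comparison from the figures quoted above; the Configuration B and Haiku line items are taken as stated rather than re-derived from a price sheet.

# Reproduce the per-session cost comparison above from the quoted figures.
SONNET_IN, SONNET_OUT = 3.00, 15.00      # $ per 1M tokens, as quoted above

config_a = 850_000 / 1e6 * SONNET_IN + 95_000 / 1e6 * SONNET_OUT   # about $3.98
config_b = 0.96 + 0.072 + 1.43                                     # about $2.46 (line items as stated)
config_c = (90_000 / 1e6 * SONNET_IN + 35_000 / 1e6 * SONNET_OUT   # Sonnet share
            + 0.05 + 0.025)                                        # Haiku share as stated; about $0.87

print(f"A=${config_a:.2f}  B=${config_b:.2f}  C=${config_c:.2f}")
print(f"C vs A: {1 - config_c / config_a:.0%} cheaper")            # about 78%
print(f"C vs B: {1 - config_c / config_b:.0%} cheaper")            # about 65%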

The reason for the gap is straightforward. Configuration A is paying frontier-model prices for filesystem traversal. Configuration B is paying frontier-model prices for filesystem traversal with a discount. Configuration C is paying no model price for filesystem traversal, because filesystem traversal is a workflow node running as an ordinary process, not a model call.

What the workflow actually looks like

A simplified Swfte workflow for a code change task might be:

INPUT: high-level intent + repo handle

STEP 1: plan
  → call Claude Sonnet with intent + repo summary (cached)
  → output: list of (file, change_type) tuples

STEP 2: for each (file, change_type):
  STEP 2a: read_file (workflow node, no model call)
  STEP 2b: classify_complexity (cheap classifier model)
  STEP 2c: if simple → call Claude Haiku for diff
           if complex → call Claude Sonnet for diff
  STEP 2d: apply_diff (workflow node, deterministic)

STEP 3: run_tests (workflow node, executes shell)

STEP 4: if tests fail:
  → call Claude Sonnet with failure context only (not full repo)
  → workflow applies fix, returns to STEP 3
  → bounded retry (3 attempts max)

STEP 5: commit (workflow node, git operations)

Every step that does not require composition is a deterministic workflow node. Every step that does require composition is a model call with the minimum context needed for that specific composition. The model never sees the full repo at once; it sees the slice relevant to the current step.
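
For teams who think better in code than in diagrams, here is a minimal sketch of the same loop in plain Python. In a Swfte workflow each function below is a node; call_claude, the placeholder model identifiers, and the 'path | change' plan format are illustrative assumptions, not any particular SDK's API.

import subprocess
from pathlib import Path

SONNET = "claude-sonnet-model-id"   # placeholder model ids -- substitute the
HAIKU = "claude-haiku-model-id"     # identifiers your account actually uses

def call_claude(model: str, prompt: str) -> str:
    """Placeholder for a single model call; only composition steps pass through here."""
    raise NotImplementedError("wire this to your model client")

def apply_diff(diff: str) -> None:
    # Deterministic workflow step: apply a unified diff with git, no model call.
    subprocess.run(["git", "apply"], input=diff, text=True, check=True)

def run_tests() -> tuple[bool, str]:
    # Deterministic workflow step: run the suite and capture the output.
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def implement(intent: str, repo_summary: str) -> None:
    # STEP 1: plan -- one frontier-model call on a tight prompt.
    plan = call_claude(SONNET, f"Plan this change as 'path | change' lines.\n{intent}\n{repo_summary}")

    # STEP 2: per-file composition, routed by the cheap classifier.
    for line in plan.splitlines():
        if "|" not in line:
            continue
        path, change = (part.strip() for part in line.split("|", 1))
        current = Path(path).read_text()                                  # 2a: workflow read
        label = call_claude(HAIKU, f"mechanical or substantive?\n{path}\n{change}")
        model = HAIKU if label.strip().lower().startswith("mechanical") else SONNET  # 2b/2c
        diff = call_claude(model, f"Write a unified diff for:\n{change}\n---\n{current}")
        apply_diff(diff)                                                  # 2d: deterministic

    # STEPS 3-4: run tests, bounded retry with failure context only.
    for _ in range(3):
        passed, output = run_tests()
        if passed:
            break
        apply_diff(call_claude(SONNET, f"Tests failed; reply with a fix as a unified diff.\n{output[-4000:]}"))

    # STEP 5: commit -- plain git, no model call.
    subprocess.run(["git", "commit", "-am", intent], check=True)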

The classification step is where the savings compound

The single highest-leverage decision in this architecture is the complexity classifier — the cheap model that decides whether a given step needs Sonnet/Opus or whether Haiku will suffice. Get this right and 70% of your composition steps run on a 5-10x cheaper model. Get it wrong and either you pay too much (over-routing to Sonnet) or you take quality hits (under-routing to Haiku).

The classifier prompt we have found works in production is roughly: "Given this file path, this current-state summary, and this intended change, predict whether the change is mechanical (rename, type-only, format) or substantive (logic, architecture, bug fix). Output: 'mechanical' or 'substantive'." Run it on Haiku itself; the classifier is cheap and self-correcting (mechanical changes that turn out to be subtle get bounced to Sonnet on test failure, which is the right behaviour).

In our benchmarks, the classifier is right about 88% of the time on first call, and the bounce-on-test-failure path catches the other 12% with a small retry cost.
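
A minimal sketch of the classify-then-route step, using the Anthropic Python SDK; the prompt is paraphrased from the description above, and the model identifiers are placeholders to swap for whatever your account exposes.

from anthropic import Anthropic

client = Anthropic()   # reads ANTHROPIC_API_KEY from the environment

CLASSIFIER_PROMPT = (
    "Given this file path, this current-state summary, and this intended change, "
    "predict whether the change is mechanical (rename, type-only, format) or "
    "substantive (logic, architecture, bug fix). Output exactly one word: "
    "mechanical or substantive.\n\n"
    "File: {path}\nState: {state}\nChange: {change}"
)

def classify(path: str, state: str, change: str) -> str:
    # Run the classifier on Haiku itself: it is cheap, and 'mechanical' calls that
    # turn out to be subtle get bounced to Sonnet on test failure anyway.
    resp = client.messages.create(
        model="claude-haiku-model-id",    # placeholder Haiku id
        max_tokens=5,
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(
            path=path, state=state, change=change)}],
    )
    return resp.content[0].text.strip().lower()

def pick_model(path: str, state: str, change: str) -> str:
    # Route mechanical edits to the cheap model, everything else to the frontier model.
    label = classify(path, state, change)
    return "claude-haiku-model-id" if label.startswith("mechanical") else "claude-sonnet-model-id"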

Other places this pattern wins

The orchestrate-then-compose pattern generalises well beyond Claude Code. We see the same structural savings on:

Customer-support agents. Workflow handles ticket loading, customer history retrieval, knowledge-base search, and SLA tracking. Claude only handles the actual response composition. Token savings: 55-70%.

Document workflows. Workflow handles OCR, classification, field extraction, validation, and ERP posting. Claude only handles the genuinely-ambiguous edge cases that the deterministic extractors flag. Token savings: 80-90% (because the workflow handles the long tail of clear cases without any model call at all).

Code review agents. Workflow handles diff parsing, file context loading, test correlation. Claude only reasons about whether a specific change introduces a bug or violates a convention. Token savings: 60-70%.

Sales lead enrichment. Workflow handles enrichment API calls, deduplication, scoring, and CRM upserts. Claude only handles the qualitative summary at the end. Token savings: 75%+ (most of the work is structured data processing).

The pattern is the same in every case: the frontier model is precious; spend it on the parts that need it; let a workflow runtime handle the rest.

What you give up — and what you do not

It is fair to ask what the trade-offs are, because no architectural pattern is free.

You give up some autonomy. Claude Code in default configuration can adapt mid-session in ways that a workflow cannot. If the model decides mid-task that the architecture should change shape, it can pivot. A workflow with predefined steps will execute the steps it was given. For tightly-scoped tasks, this is a feature (the workflow is more predictable). For genuinely exploratory tasks (research, experimental refactors, 'let me see what happens if I try this'), default Claude Code is still the better choice.

You give up some setup time. Building the workflow takes hours or days, depending on complexity. For a one-off task, that setup cost is not worth it; just run Claude Code directly. For a recurring task that runs hundreds of times — which is what most production AI workloads are — the setup pays back in the first week.

You do not give up quality. This is the part that surprises teams the first time they migrate. The held-out evaluation we ran showed no statistically significant quality difference between Configuration A (Claude Code default) and Configuration C (Swfte orchestrates, Claude composes). The orchestrator is not making the model dumber; it is making the model focused. Focused expensive thinking on a clean input produces the same quality output as unfocused expensive thinking on a noisy input — and costs a fraction.

The migration playbook

If you are running heavy Claude Code usage and want to capture this saving, the migration is more pragmatic than dramatic:

  1. Instrument the current spend. Categorise tokens into orchestration, composition, and planning. Use the audit trail from Anthropic or your egress proxy. The exact percentages will tell you how much there is to capture.
  2. Pick one high-frequency task. A specific kind of feature change, a specific kind of bug fix, a specific kind of test generation. Migrate that one to a Swfte workflow.
  3. Measure the cost-per-task before and after. Run 20 of each. Compare; a minimal sketch of this comparison follows the list.
  4. Measure quality on a held-out eval. Run both configurations against the same eval set. Confirm parity.
  5. Roll out to other tasks. The patterns repeat; the second migration is half the effort of the first; the fifth is a quarter.
  6. Leave the long tail to default Claude Code. One-off tasks, exploratory work, and genuinely novel changes can stay in the default configuration. The savings come from migrating the high-frequency tasks, not from migrating everything.
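
For steps 1 and 3, a minimal sketch of the before/after comparison, assuming you can export per-session usage records with a cost figure attached; the field names and configuration labels are illustrative, not any vendor's export format.

from statistics import mean

sessions = [
    # one record per session, e.g. from your billing export or egress proxy:
    # {"task": "export-csv-feature", "config": "claude_code_default", "cost_usd": 3.98},
    # {"task": "export-csv-feature", "config": "swfte_orchestrated",  "cost_usd": 0.87},
    # ... twenty of each, per step 3 above ...
]

def cost_per_task(config: str) -> float:
    # Average per-session cost for one configuration.
    costs = [s["cost_usd"] for s in sessions if s["config"] == config]
    return mean(costs) if costs else float("nan")

before = cost_per_task("claude_code_default")
after = cost_per_task("swfte_orchestrated")
print(f"before=${before:.2f}  after=${after:.2f}  saving={1 - after / before:.0%}")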

A team that follows this playbook captures roughly 60-70% of the available savings within four weeks. Most of the remainder lives in the long tail of less-frequent tasks that may or may not be worth the migration effort.

A note on caching, batching, and the boring optimisations

Everything in this post is additive to the standard token-cost optimisations: enable Anthropic's prompt caching for any prompt over 4K tokens with multi-turn use, route async work to batch (50% off where the latency budget allows), and right-size models per task. The pattern in this post adds another 60-80% on top of those, because it attacks a different layer of the cost stack: not the price of tokens, but the quantity spent on coordination work that should never have been a model call in the first place.
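
For the caching piece specifically, a minimal sketch of Anthropic prompt caching on a long, stable system prompt such as a repo summary; the model identifier is a placeholder, and the cache-lifetime and minimum-length rules are worth confirming against the current API docs before relying on the discount.

from anthropic import Anthropic

client = Anthropic()
repo_summary = "..."   # a long, stable context block reused across turns

response = client.messages.create(
    model="claude-sonnet-model-id",   # placeholder Sonnet id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": repo_summary,
            # Mark the stable prefix as cacheable so subsequent turns pay the cached rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Plan the export-CSV change for the reports module."}],
)
print(response.content[0].text)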

The compounded effect — caching + batching + orchestrate-then-compose — is what gets a Claude Code team's bill from uncomfortable to comfortable without sacrificing the velocity that justified the investment in the first place.

The bottom line

Claude Code is the best AI composer the industry has, and the best composer is wasteful when you ask it to also be the orchestrator. The pattern that wins in 2026 is to let a workflow runtime handle the deterministic orchestration loop and let Claude do the composition steps that actually need a frontier model. Swfte is built specifically for this — model-routed workflows with cheap-classifier steering, deterministic state management between Claude calls, and bounded retry on test failures. The math works out to 60-80% savings on real production code-generation workloads, with no measurable quality regression. The architecture is the difference between an AI-coding investment that scales sustainably and one that scales until finance pulls the brake.


Read the related posts: The AI Workflow Marketplace, Buy vs Build in the Age of AI Coding Assistants, and AI Vendor Lock-In in 2026. Or explore Swfte's workflow builder and start orchestrating your Claude Code spend the same way you orchestrate everything else.

