Thinking Machines Lab published a piece this month — Interaction Models — that quietly reframes the next chapter of AI. The argument is short and uncomfortable for most of the systems we have built so far: turn-based chat is a dead end. Intelligence has scaled. Interactivity has not. The result is a collaboration bottleneck, where the model is capable but the interface forces the human out of the loop.
If they are right — and the architecture they sketch suggests they are — the entire pattern of "send prompt, wait, read answer" is about to feel as dated as dial-up. What replaces it is not a faster chatbot. It is a new class of model that is native to real-time human collaboration, and it changes what an "agent" is allowed to be.
This post unpacks why that matters, what is new about the proposal, and how a combination of Swfte Connect and a dedicated Swfte Agent is the most practical path to building this kind of immersive experience today.
The Collaboration Bottleneck
Nearly every production LLM deployment today shares the same shape: a request goes in, the model thinks, an answer comes out, the user reads it, and another turn begins. Tool use, function calling, and even agent loops are all variations on the same turn-based contract.
That contract has consequences. The user can only intervene at turn boundaries. The model cannot see the user mid-generation. There is no shared sense of time. Voice systems paper over this with external scaffolding — voice activity detection, speech-to-text, text-to-speech, barge-in logic — but the model underneath is still living one turn at a time.
Thinking Machines' framing is that this is the wrong layer to solve at. Bolting interactivity onto a fundamentally turn-based core gives you a ventriloquist act, not collaboration. The fix is to push interactivity into the model itself.
What Is New in an Interaction Model
The piece introduces two terms worth learning, since both are likely to become standard vocabulary over the next year.
Time-aligned micro-turns. Instead of one giant turn per response, the model processes continuous 200-millisecond chunks of audio and video. Inputs and outputs interleave rather than alternate. The model can listen while it talks, pause when interrupted, and respond to a raised eyebrow before a sentence is finished. Turn boundaries stop being a thing the system enforces and start being a thing the conversation has, the way two people in a room have them.
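To make the micro-turn idea concrete, here is a minimal sketch in Python. It is not Thinking Machines' implementation: `MicroTurn`, `chunk`, and `interleave` are hypothetical names, strings stand in for audio and video, and the only detail taken from the description above is the 200-millisecond chunk size. The point it illustrates is that both streams share one clock, so input and output interleave instead of alternating.

```python
from dataclasses import dataclass
from typing import List

CHUNK_MS = 200  # chunk length taken from the description above


@dataclass(frozen=True)
class MicroTurn:
    start_ms: int  # offset of this chunk on the shared session clock
    source: str    # "user" or "model"
    payload: str   # stand-in for a real audio/video chunk


def chunk(source: str, pieces: List[str], start_ms: int = 0) -> List[MicroTurn]:
    """Place consecutive pieces on the shared clock, one per 200 ms slot."""
    return [
        MicroTurn(start_ms + i * CHUNK_MS, source, piece)
        for i, piece in enumerate(pieces)
    ]


def interleave(user: List[MicroTurn], model: List[MicroTurn]) -> List[MicroTurn]:
    """Merge both streams by time; within a slot, user input is listed first."""
    order = {"user": 0, "model": 1}
    return sorted(user + model, key=lambda t: (t.start_ms, order[t.source]))


if __name__ == "__main__":
    user = chunk("user", ["so", "what", "if", "we"])
    # The model starts backchanneling at 400 ms, while the user is still talking.
    model = chunk("model", ["mm-hm", "right"], start_ms=400)
    for turn in interleave(user, model):
        print(f"{turn.start_ms:>4} ms  {turn.source:<5}  {turn.payload}")
```

In a turn-based system the model's first chunk could not appear until the user's last one had ended. Here nothing enforces that ordering; the timeline simply records who was active in each slot.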
Encoder-free early fusion. Audio and video signals are integrated directly, without heavy preprocessing pipelines. There is no separate ASR module deciding when a sentence ended. The model itself develops a sense of timing, tone, and visual context — which is the only way you get capabilities like overlapping speech, mid-sentence interjection, or a model that waits because it can tell the user is thinking.
The architectural point underneath both is the most important sentence in the Thinking Machines piece: interactivity should scale alongside intelligence. If you scale up a turn-based model, you get a smarter monologue. If you scale up an interaction model, you get a smarter collaborator.
Why This Opens a New World
The current ceiling on AI experiences is not the model's IQ. It is the medium. A model that can only respond after you stop typing cannot be a tutor that watches you work. A model that has to wait for a full audio clip cannot be a copilot in a live customer call. A model that cannot interrupt cannot stop you from making a mistake before you make it.
Interaction models break that ceiling. They make several categories of product genuinely possible for the first time:
- Live operator copilots that listen to a sales or support call alongside the human, surface the right answer the moment it is needed, and stay silent the rest of the time.
- Embodied agents in robotics and avatars that can adjust mid-gesture because they noticed the user flinched. (We covered the hardware side of this in our physical AI robotics breakthrough piece.)
- Immersive tutors that watch you solve a problem on a shared canvas, catch the error at the moment it appears, and let you finish the right move on your own.
- Telepresence and meeting agents that participate the way a junior teammate would — taking notes, asking clarifying questions, picking the moment to speak.
- Operator-style browsing agents that can be steered mid-task without the user having to wait for the agent to "finish thinking" first.
The common thread is presence. The agent is with you, not after you.
The Connect + Dedicated Agent Pattern
Here is where the practical question shows up. Real-time interaction models are coming, but no single provider will own this category. Some will optimize for voice latency. Some will lead on vision. Some will be cheaper for specific verticals. And crucially, none of them work well as a generic chatbot endpoint — an interaction model needs to be dedicated to a session, with persistent context, persistent tools, and a stable identity for the user.
That is exactly the shape of the Swfte Connect + Swfte Agents stack.
Connect is the unified gateway layer. It abstracts 50+ model providers behind one endpoint, with routing, failover, cost tracking, and policy controls. As real-time interaction models ship from different labs, Connect is the seam that lets you adopt the best one for each surface — voice with one provider, embodied vision with another, low-cost background reasoning with a third — without rewriting your application. (Our multi-provider routing guide walks through the routing primitives.)
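As a rough illustration of per-surface routing with failover, consider the sketch below. It is not the Connect API: the `ROUTES` table, the provider names, and `pick_provider` are all hypothetical, and a real gateway would also weigh cost and policy. It only shows the idea of choosing a provider by surface and falling through to the next one when the first is unhealthy.

```python
from typing import Callable, Dict, List

# Hypothetical routing table: surface -> providers in priority order.
ROUTES: Dict[str, List[str]] = {
    "voice": ["provider-a", "provider-b"],
    "vision": ["provider-c", "provider-a"],
    "background": ["provider-d"],
}


def pick_provider(surface: str, is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy provider configured for this surface."""
    for provider in ROUTES.get(surface, []):
        if is_healthy(provider):
            return provider
    raise LookupError(f"no healthy provider for surface {surface!r}")


if __name__ == "__main__":
    # Pretend provider-a is down: voice fails over, vision is unaffected.
    def healthy(provider: str) -> bool:
        return provider != "provider-a"

    print(pick_provider("voice", healthy))
    print(pick_provider("vision", healthy))
```

Because the application only ever names a surface, swapping the provider behind that surface is a change to the table rather than to the calling code.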
A dedicated Swfte Agent is the other half. An interaction model in the abstract is not a product. What turns it into one is a stable agent identity: a persistent system prompt, a memory of prior sessions, a set of tools scoped to a workflow, and a clear surface (a meeting room, a phone line, a robot, a browser). The agent is what the user remembers. The model is what powers it.
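One way to picture a stable agent identity is as a small record that outlives any single model call. The sketch below is an assumption for illustration, not the Swfte Agents data model: the `AgentIdentity` class and its fields are hypothetical and simply mirror the four ingredients named above (system prompt, memory, scoped tools, surface).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentIdentity:
    """Hypothetical container for state that persists across sessions."""
    name: str
    system_prompt: str   # persistent instructions
    surface: str         # e.g. "meeting", "phone", "robot", "browser"
    tools: List[str] = field(default_factory=list)               # scoped to one workflow
    memory: Dict[str, List[str]] = field(default_factory=dict)   # session id -> notes

    def remember(self, session_id: str, note: str) -> None:
        """Store a note under the session it came from."""
        self.memory.setdefault(session_id, []).append(note)

    def context_for(self, session_id: str) -> str:
        """Build the context handed to whichever model powers this session."""
        prior = [
            note
            for sid, notes in self.memory.items()
            if sid != session_id
            for note in notes
        ]
        lines = [self.system_prompt]
        if prior:
            lines.append("Prior sessions: " + "; ".join(prior))
        return "\n".join(lines)


if __name__ == "__main__":
    tutor = AgentIdentity("Tutor", "You are a patient math tutor.", "canvas", ["whiteboard"])
    tutor.remember("monday", "struggles with fractions")
    print(tutor.context_for("tuesday"))
```

Nothing in the record names a model or provider, which is the design choice that matters: the model can change between sessions while the identity the user remembers stays put.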
Pair them and the immersive experience falls out naturally:
- The user joins a session — call, canvas, robot, browser — bound to a specific agent.
- The agent opens a live channel through Connect to whichever real-time interaction model is best for that surface.
- Audio, video, and tool events stream as time-aligned micro-turns. The agent can interject, listen through its own speech, and use tools concurrently — because the model underneath was built for it.
- A background model, also routed through Connect, handles slower reasoning, long-form memory, and async tools without blocking the foreground conversation. This interaction model / background model split is the same split Thinking Machines describes, except you assemble it from best-in-class providers rather than committing to one lab's stack.
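The foreground/background split in the steps above can be sketched with plain `asyncio`. This is a toy under stated assumptions: no real model or gateway is called, the function names are hypothetical, and the `sleep` calls stand in for micro-turn timing and slow reasoning. What it shows is the hand-off: the foreground loop acknowledges every chunk immediately and queues slow work without waiting for it.

```python
import asyncio
from typing import List


async def background_reasoner(jobs: asyncio.Queue, log: List[str]) -> None:
    """Slow path: work through queued jobs off the foreground loop."""
    while True:
        job = await jobs.get()
        if job is None:  # sentinel: the session is over
            break
        await asyncio.sleep(0.05)  # stand-in for slow reasoning or an async tool
        log.append(f"background finished: {job}")


async def foreground_session(chunks: List[str]) -> List[str]:
    """Fast path: acknowledge each chunk at once, queue anything slow."""
    log: List[str] = []
    jobs: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(background_reasoner(jobs, log))
    for piece in chunks:
        log.append(f"foreground ack: {piece}")
        if piece.endswith("?"):
            jobs.put_nowait(piece)  # non-blocking hand-off to the slow path
        await asyncio.sleep(0.01)   # stand-in for one micro-turn of real time
    jobs.put_nowait(None)
    await worker  # drain remaining background work before closing the session
    return log


if __name__ == "__main__":
    session = ["hi", "what changed last week?", "ok", "go on"]
    for line in asyncio.run(foreground_session(session)):
        print(line)
```

The foreground loop never awaits a background result directly, so a slow answer delays only itself, not the conversation.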
This is the part that matters: interaction models do not replace the need for a gateway and an agent layer. They make the gateway and the agent layer more valuable, because the surface area of "what an AI experience can be" just got much larger, and the cost of betting on a single provider just got much higher.
What This Means for What You Build Now
You do not need to wait for the first commercial interaction model to start building toward this. A few things you can do today:
- Move foreground and background work onto separate model slots. Even with turn-based models, splitting fast user-facing responses from slower background reasoning is the architecture an interaction-model world rewards. Connect's routing makes this a config change rather than a rewrite.
- Give every user-facing experience a dedicated agent identity. If your product still talks to a raw model endpoint per request, the moment real-time models ship you will be retrofitting state, memory, and tools across every surface. Dedicated agents — like the ones documented in our agent platforms buyer's guide — give you a place to hang those concerns now.
- Instrument for latency, not just quality. Time-to-first-token, interrupt latency, and turn-overlap tolerance are about to become first-class metrics. Your eval suite probably does not measure them yet.
- Avoid baking turn-based assumptions into your UI. A microphone button that toggles between "you talk" and "AI talks" is the UI of a turn-based world. The interaction-model UI is a single continuous channel, the way a phone call is.
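For the latency point in the list above, a minimal sketch of what such instrumentation could look like follows. The event kinds (`user_end`, `model_token`, `user_interrupt`, `model_stop`) and the helper names are assumptions chosen for illustration, not an established schema, and the trace is made-up example data. It computes time-to-first-token and interrupt latency from a timestamped event log; turn-overlap tolerance is left out because it needs a product-specific definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    t_ms: int   # timestamp on the session clock
    kind: str   # hypothetical kinds: user_end, model_token, user_interrupt, model_stop


def first_gap(events: List[Event], start_kind: str, end_kind: str) -> Optional[int]:
    """Milliseconds from the first start_kind event to the next end_kind event."""
    start: Optional[int] = None
    for event in sorted(events, key=lambda e: e.t_ms):
        if start is None and event.kind == start_kind:
            start = event.t_ms
        elif start is not None and event.kind == end_kind:
            return event.t_ms - start
    return None  # the pair never completed in this trace


def time_to_first_token(events: List[Event]) -> Optional[int]:
    return first_gap(events, "user_end", "model_token")


def interrupt_latency(events: List[Event]) -> Optional[int]:
    return first_gap(events, "user_interrupt", "model_stop")


if __name__ == "__main__":
    trace = [
        Event(1000, "user_end"),
        Event(1180, "model_token"),
        Event(1220, "model_token"),
        Event(2000, "user_interrupt"),
        Event(2050, "model_token"),  # the model is still talking over the user
        Event(2240, "model_stop"),
    ]
    print("time to first token (ms):", time_to_first_token(trace))  # 180
    print("interrupt latency (ms):", interrupt_latency(trace))      # 240
```

Logging these as raw timestamped events, instead of precomputed numbers, keeps the door open for new metrics later without re-instrumenting the client.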
The Bigger Pattern
The history of consumer software has a clear lesson: the medium that wins is the one that feels like presence. Asynchronous email lost to chat. Chat lost to video. Video calls still lose individual moments to walking into someone's office. Every step traded efficiency for being there.
AI has spent its first chapter being the most efficient possible version of asynchronous. Interaction models are the first credible attempt to make it present. The labs that ship them first will not own the category alone — the category will be defined by the products built on top, the agents users actually form relationships with, and the infrastructure that lets teams compose all of it without lock-in.
A unified gateway and a dedicated agent layer are the boring, durable bet underneath the exciting one. Connect is how you stay portable across the interaction-model providers that are coming. A Swfte Agent is how you turn one of those models into something a user wants to come back to.
The new world Thinking Machines is pointing at is real. The path into it is already buildable.