
In 2018, Google demoed Duplex booking a haircut over the phone. The AI said "um." It paused at the right moments. It handled an interruption from the stylist without missing a beat. The internet lost its mind — and then, mostly, moved on. Voice AI for the next several years went back to being a turn-taking, push-to-talk affair.

Eight years later, the architecture has caught up to the demo. In January 2026, NVIDIA released PersonaPlex, a 7-billion-parameter open-weight model that processes incoming audio and generates response audio simultaneously inside a single model. That design cuts voice latency from over a second to a reported 70 milliseconds, and the release ships under MIT and the NVIDIA Open Model License. And in May, Thinking Machines Lab published the theoretical framing that explains why this is not a one-off product release but the beginning of a category.

The shorthand for that category is full-duplex AI. It is the most important shift in how humans interact with models since chat itself, and it is happening faster than most product teams have noticed.

What Full-Duplex Actually Means

In telecommunications, "duplex" describes which directions a channel can carry signal at the same time. A walkie-talkie is half-duplex — one person talks, the other waits, you press a button to switch. A phone call is full-duplex — both ends are open simultaneously, which is why you can interrupt, laugh while someone is mid-sentence, or finish each other's thought.

Every chatbot, voice assistant, and "AI agent" shipped to date is a walkie-talkie. You speak. The model listens. The model generates. The model speaks. You wait. Repeat. The illusion of conversation is held together by external scaffolding — voice-activity detection, speech-to-text, text-to-speech, barge-in logic — wrapped around a fundamentally turn-based core.

Full-duplex AI is the phone call. The model listens through its own speech. It can pause when you take a breath, interrupt when you are about to make a mistake, and laugh at the right beat because it heard you laugh first. There is no "your turn / my turn." There is just the conversation.
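The walkie-talkie versus phone-call distinction can be sketched as a toy simulation. This is only an illustration under stated assumptions: `ToyAgent`, its three-tick reply, and the tick-based timeline are hypothetical and do not describe how PersonaPlex or any real system is built. The only difference between the two modes is whether the agent keeps listening while it speaks.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyAgent:
    """Hypothetical agent that replies for 3 ticks after hearing a word."""
    full_duplex: bool
    speaking_left: int = 0                      # ticks of reply still queued
    heard: List[str] = field(default_factory=list)

    def step(self, user_frame: Optional[str]) -> str:
        # A half-duplex agent is deaf while it is speaking.
        listening = self.full_duplex or self.speaking_left == 0
        if user_frame is not None and listening:
            self.heard.append(user_frame)
            if self.speaking_left > 0:
                self.speaking_left = 0          # barge-in: stop and yield the floor
                return "yield"
            self.speaking_left = 3              # queue a 3-tick reply
            return "listen"
        if self.speaking_left > 0:
            self.speaking_left -= 1
            return "speak"
        return "idle"


def run(full_duplex: bool, frames: List[Optional[str]]) -> Tuple[List[str], List[str]]:
    """Feed one frame per tick and return (action trace, words heard)."""
    agent = ToyAgent(full_duplex)
    trace = [agent.step(f) for f in frames]
    return trace, agent.heard


# The user says "book", then interrupts with "wait" while the agent is replying.
frames = ["book", None, "wait", None, None]
half_trace, half_heard = run(False, frames)
full_trace, full_heard = run(True, frames)
print(half_trace, half_heard)
print(full_trace, full_heard)
```

In this sketch the half-duplex agent talks straight over the interruption and never registers "wait", while the full-duplex agent hears it mid-reply and yields. Real pipelines approximate the second behavior with external barge-in logic; the argument of this article is that the listening has to live inside the model instead.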

The reason this took eight years to ship is that you cannot get there by making turn-based models faster. You have to rebuild the model's input/output contract from the ground up. Which is exactly what PersonaPlex and the broader interaction-model wave are doing.

What PersonaPlex Got Right

NVIDIA's announcement is worth reading on its own, but four choices stand out, and they are what other labs will copy.

Speech-to-speech in a single model. PersonaPlex does not transcribe audio to text, run a text model, then synthesize speech back. Audio goes in. Audio comes out. The model develops a native sense of timing, prosody, and overlap that you simply cannot recover after passing through a text bottleneck.

Persona and voice as first-class inputs. A text prompt defines the role ("you are a polite restaurant host"). An audio prompt defines the voice (a few seconds of reference speech). The same model becomes a different agent without retraining. This is the missing primitive for product teams: you can finally ship a thousand distinct AI personas on top of one model.

70-millisecond perceived latency. This is the number that matters. The gap between turns in human conversation sits in the 200-millisecond range. Once a response takes noticeably longer than that, the listener feels the lag; it is the difference between a real-time call and a satellite phone. PersonaPlex is the first openly available model to credibly clear that bar.

Open weights. The 7B model is downloadable. Code is on GitHub. This is for voice what Llama 2 was for text in 2023: the point at which a category stops belonging to a single vendor and starts being something every product team can build with.

Why Thinking Machines Is Thinking In the Same Direction

What makes the PersonaPlex release more than a clever product is that the theoretical case for it is being built in parallel. Thinking Machines Lab's Interaction Models post — published this month — argues that interactivity should scale alongside intelligence, and that the way to get there is to make real-time interaction native to the model architecture rather than bolted on through external pipelines.

The technical primitives they describe — time-aligned micro-turns (200ms chunks of audio and video processed concurrently) and encoder-free early fusion (raw signals integrated directly, without heavy preprocessors) — are the same primitives PersonaPlex implements for the voice channel specifically. The convergence is not an accident. Two of the most credible labs in the field are independently concluding that the next ceiling on AI experience is not the model's reasoning, but the medium through which the model meets the human.
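To make the "time-aligned micro-turn" idea concrete, here is a minimal sketch of slicing two audio streams into 200 ms windows and pairing them by timestamp. The function names, the 16 kHz sample rate, and the integer stand-in samples are assumptions for illustration; neither lab's actual implementation is described in the sources cited here.

```python
from typing import Iterator, Sequence, Tuple


def micro_turns(samples: Sequence[int], sample_rate: int,
                chunk_ms: int = 200) -> Iterator[Tuple[int, Sequence[int]]]:
    """Yield (start_ms, chunk) pairs covering the stream in fixed windows."""
    chunk = sample_rate * chunk_ms // 1000      # samples per micro-turn
    for start in range(0, len(samples), chunk):
        yield start * 1000 // sample_rate, samples[start:start + chunk]


def time_aligned(heard: Sequence[int], spoken: Sequence[int],
                 sample_rate: int, chunk_ms: int = 200):
    """Pair the incoming and outgoing streams window by window.

    Stops at the end of the shorter stream, because zip() truncates.
    """
    pairs = zip(micro_turns(heard, sample_rate, chunk_ms),
                micro_turns(spoken, sample_rate, chunk_ms))
    for (t, h), (_, s) in pairs:
        yield t, h, s


one_second = list(range(16_000))                # stand-in for 1 s of 16 kHz audio
starts = [t for t, _ in micro_turns(one_second, 16_000)]
aligned = list(time_aligned(one_second, one_second[:8_000], 16_000))
print(starts)
print(len(aligned))
```

The point of the pairing is that each window carries both what the model heard and what it was saying during the same 200 ms, so overlap is part of the input rather than something an external pipeline has to resolve.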

The framing Thinking Machines uses is the most useful sentence to memorize from the entire conversation: turn-based systems create a collaboration bottleneck where the user is forced out of the loop, not because the model cannot include them, but because the interface has no room for them. Full-duplex models — PersonaPlex today, video-aware successors tomorrow — are the architectural fix.

For a longer treatment of how this reshapes the agent layer, we covered the Thinking Machines piece directly in "Interaction Models: Why Real-Time AI Unlocks Truly Immersive Agents."

What Becomes Possible When the Walkie-Talkie Ends

The product categories that full-duplex unlocks are the ones that have felt almost-but-not-quite viable for years:

  • Phone-grade AI receptionists and outbound callers that can handle a real interruption ("hold on, let me check") without dropping the thread. Duplex hinted at this; PersonaPlex makes it shippable.
  • Live operator copilots that sit on a sales or support call and surface the right answer the moment it is needed, then stay quiet — because they can listen through the human's speech instead of waiting for a turn.
  • Voice-first tutors and coaches that interrupt before you make a mistake, not after. The 70ms latency is the difference between a coach and a transcript reviewer.
  • Embodied agents — robots, avatars, kiosks — whose responsiveness finally matches their physical presence. We covered the hardware side in our physical AI robotics breakthrough piece; full-duplex models are the missing software half.
  • Persona-rich consumer experiences where the same model voices thousands of distinct agents — a feature of PersonaPlex specifically, and a hint at what the next two years of consumer AI products will look like.

The shared thread across all of these is presence. The agent stops feeling like a system you query and starts feeling like a participant who is there.

What Product Teams Should Do Now

Full-duplex is not yet a drop-in replacement for every voice surface, but the planning horizon is short — months, not years. A few practical moves:

  • Stop hard-coding turn-based assumptions. Microphone toggles, "press to talk" buttons, and explicit end-of-turn detection are the UI of a half-duplex era. The next generation of surfaces is a single open channel.
  • Separate foreground from background reasoning. Full-duplex models are optimized for real-time conversational presence, not heavy thinking. Pair them with a slower background model for long-horizon reasoning, retrieval, and tool execution. Thinking Machines calls this the interaction model / background model split, and it is the architecture every voice product will converge on.
  • Plan for multiple providers. PersonaPlex is the first credible open full-duplex model, but it will not be the last. Different labs will optimize for different languages, latencies, and verticals. Routing voice traffic through a unified gateway like Swfte Connect — which already abstracts 50+ providers behind one API — is how you stay portable as the category matures. Our multi-provider routing guide walks through the primitives.
  • Treat the agent as the product, not the model. PersonaPlex's persona control is a feature, but persistent identity, memory, and tool access live one layer up. A dedicated Swfte Agent is where those concerns belong; the underlying model becomes a swappable voice for that agent rather than the agent itself.
  • Measure latency, overlap, and interrupt handling. Your eval suite probably tracks accuracy and tokens. It probably does not track time-to-first-audio, mid-turn interrupt latency, or backchannel tolerance. Those are about to become the metrics that distinguish a product from a demo.
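The measurement advice in the last bullet can be sketched as a small helper over a timestamped event log. The event names and the sample log are invented for illustration, and a real eval harness would derive the timestamps from audio rather than hand-written tuples. The helper assumes the log is sorted by time.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in ms, event kind); kinds are hypothetical


def latencies(events: List[Event], trigger: str, response: str) -> List[int]:
    """For each trigger event, ms until the first response at or after it."""
    out = []
    for t, kind in events:
        if kind != trigger:
            continue
        reply = next((r for r, k in events if k == response and r >= t), None)
        if reply is not None:                   # skip triggers that got no response
            out.append(reply - t)
    return out


# Made-up log of one exchange, sorted by time.
log = [
    (0, "user_speech_start"),
    (1200, "user_speech_end"),
    (1270, "agent_audio_start"),
    (2500, "user_interrupt"),
    (2590, "agent_audio_stop"),
]
time_to_first_audio = latencies(log, "user_speech_end", "agent_audio_start")
interrupt_latency = latencies(log, "user_interrupt", "agent_audio_stop")
print(time_to_first_audio, interrupt_latency)
```

Tracking these as distributions across many calls, not single values, is what would let a team compare providers or catch regressions in interrupt handling.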

The Pattern Underneath

Every shift in conversational UI has followed the same logic: the medium that wins is the one that feels most like presence. Letters lost to telegrams. Telegrams lost to phone calls. Phone calls lost individual moments to walking into someone's office. Email lost mindshare to chat. Chat is losing it to video.

AI has spent its first chapter being the most efficient possible version of asynchronous text. Duplex was the proof, eight years early, that the next chapter would be synchronous. PersonaPlex is the first open model that ships the architecture. Thinking Machines is the lab articulating why this is the trajectory rather than a niche.

For anyone building AI products, the implication is clear. Full-duplex is not a feature to add to a voice assistant. It is the new default for what an AI experience is allowed to feel like — and the products built on top of it will not look like better chatbots. They will look like presence.

The walkie-talkie era is closing. The phone call is beginning.
