
Models That Understand the Real World: A 2026 Field Guide

World models, omni multimodal, and wearables like Omi: three ways AI is reaching past text into the real world.

June 2, 2026


A chatbot predicts the next word. A self-driving car has to predict the next second of the road. Those are different problems, and 2026 is the year the second one stopped being a research footnote. Three families of model are pushing AI out of the text box and into the physical world: world models, omni multimodal models, and the wearables that feed them real life. Here is the field guide.

Why "predict the next word" runs out of road

A language model is a very good guesser of text. Ask it what comes after "the cat sat on the," and it has seen enough sentences to bet on "mat." That trick, scaled up, gives you everything from chat to code. But it has a ceiling that shows up the moment the task is about the physical world rather than about words describing it.
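The guessing trick is easy to see at toy scale. What follows is a minimal sketch, not how any production language model works: it only counts which word follows which in three made-up sentences, where real models learn far richer patterns. The names `train_bigrams`, `guess_next`, and the tiny corpus are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def guess_next(follows, word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical three-sentence corpus, standing in for "the entire internet".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat slept on the mat",
]
model = train_bigrams(corpus)

print(guess_next(model, "the"))    # the most common follower of "the"
print(guess_next(model, "glass"))  # a word the model has never seen
```

Here `guess_next(model, "the")` returns "mat" because "mat" follows "the" more often than anything else in the corpus, and `guess_next(model, "glass")` returns `None`. The counter knows which words sit next to each other and nothing about what a mat or a glass is, which is the ceiling in miniature.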

Drop a glass and you know, without thinking, that it falls, it shatters, the pieces scatter, and the water spreads. You did not read that. You built a rough simulator of the world as a child and you have been running it ever since. A text model never built that simulator. It can recite the sentence "the glass shattered," but it has no internal picture of glass, gravity, or the floor. For chat, that is fine. For a robot arm or a car, it is the whole game.

So the frontier has split. One branch keeps making the word-guesser better, and that is the Opus / GPT / Gemini race. The other branch is trying to build the missing simulator. That second branch is what people mean by world models.

Branch one: world models

A world model learns to predict what happens next in reality, not in a sentence. Feed it the current state of a scene and a possible action, and it returns the likely next state: the ball rolls here, the door swings open, the shadow moves. Stack those predictions and you can plan: imagine several futures, then choose the action that leads somewhere good. That is exactly what a driver or a robot has to do.
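The predict-then-plan loop can be sketched in a few lines. This is a toy under loud assumptions: the "world" is a position on a line from 0 to 10, and `toy_world_model` is a hand-written rule standing in for what a real system would learn from data. The function names and the scoring rule are hypothetical, chosen only to show the shape of the idea.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # move left, stay, move right

def toy_world_model(state, action):
    """Hypothetical stand-in for a learned model: predict the next state.

    The scene is a position on a line from 0 to 10; walls clamp movement.
    """
    return max(0, min(10, state + action))

def score(state, goal):
    """Higher is better: the negative distance to the goal."""
    return -abs(goal - state)

def plan(state, goal, horizon=3):
    """Imagine every action sequence of length `horizon`; return the best first action."""
    best_action, best_score = None, float("-inf")
    for sequence in product(ACTIONS, repeat=horizon):
        imagined = state
        total = 0
        for action in sequence:
            # Roll the imagined future forward one step at a time.
            imagined = toy_world_model(imagined, action)
            total += score(imagined, goal)
        if total > best_score:
            best_action, best_score = sequence[0], total
    return best_action

print(plan(state=2, goal=5))  # goal is to the right
print(plan(state=5, goal=5))  # already at the goal
```

With the goal to the right, `plan(2, 5)` returns `1`; already at the goal, `plan(5, 5)` returns `0`. Exhaustively trying every sequence only works because the toy has three actions and a short horizon. Real planners sample or search far more selectively, but the loop is the same: predict, score, pick.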

Three efforts define the space right now.

World Labs and Marble. Fei-Fei Li's company shipped Marble, the first world model sold as a commercial product. You give it a sentence, a photo, a video, or a rough 3D sketch, and it builds a persistent, navigable 3D place you can walk through and export as a mesh, a Gaussian splat, or a video. The key word is persistent: the room does not melt and re-form as you turn around, the way early video generators did. It stays put, which is what makes it usable for design, games, and training simulators. World Labs raised a billion dollars in February 2026, with AMD, NVIDIA, Autodesk, Fidelity, and Sea backing it, specifically to push this further.

Google DeepMind. DeepMind built a real-time interactive world model on top of video diffusion, the same lineage as its Genie line. It turns video into a space you can move through and act in. As of this writing it is still a limited research preview, so World Labs has the lead on actually shipping, but DeepMind's research depth makes it the one to watch.

AMI Labs. Yann LeCun left Meta to start AMI Labs in Paris and raised just over a billion dollars in seed money, the largest seed round Europe has recorded. His bet is a different shape from everyone else's. Instead of predicting raw pixels, his joint-embedding predictive architecture (JEPA) predicts in a compressed, abstract space, learning what matters about how a scene changes rather than redrawing every detail. The argument is that drawing pixels wastes effort on things a planner does not need (the exact texture of a wall) and that abstraction is the path to robots and industrial control. It is the most contrarian of the three, and the best-funded contrarian bet in the field.
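The difference between scoring a prediction in pixel space and in an abstract space shows up even in a toy. The sketch below is not AMI Labs' architecture or code. It is an illustration under heavy assumptions: a "frame" is a small dictionary, and the encoder is written by hand where a joint-embedding model would learn it. Every name here (`encode`, `predict_latent`, `latent_error`, `pixel_error`) is hypothetical.

```python
def encode(frame):
    """Hypothetical hand-written encoder: keep the ball position, drop wall texture.

    In a learned joint-embedding model this mapping is trained, not written by hand.
    """
    return (frame["ball_x"], frame["ball_y"])

def predict_latent(latent, velocity):
    """Predict the next embedding from the current one and the ball's velocity."""
    x, y = latent
    vx, vy = velocity
    return (x + vx, y + vy)

def latent_error(predicted, target):
    """Squared error measured in embedding space."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def pixel_error(frame_a, frame_b):
    """Squared error over every raw value, wall texture included."""
    err = sum((frame_a[k] - frame_b[k]) ** 2 for k in ("ball_x", "ball_y"))
    err += sum((a - b) ** 2 for a, b in zip(frame_a["texture"], frame_b["texture"]))
    return err

# Two made-up frames: the ball moves one step right, the wall texture flickers.
now = {"ball_x": 1, "ball_y": 2, "texture": [3, 7, 1]}
nxt = {"ball_x": 2, "ball_y": 2, "texture": [9, 0, 4]}

# Abstract-space prediction: only the ball's position is predicted and scored.
predicted = predict_latent(encode(now), velocity=(1, 0))
abstract_loss = latent_error(predicted, encode(nxt))

# Raw-space prediction: right about the ball, but it reuses the old texture.
raw_guess = {"ball_x": 2, "ball_y": 2, "texture": now["texture"]}
raw_loss = pixel_error(raw_guess, nxt)

print(abstract_loss, raw_loss)
```

The abstract-space loss comes out to 0 because the ball landed where predicted, while the raw-space loss is 94, all of it from wall texture that no planner cares about. That gap is the argument in miniature.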

Tesla has been quietly doing a version of this in production for years, running simulated futures of the road to decide whether to brake or turn before anything happens. The labs above are trying to generalize that idea past driving.

Branch two: omni multimodal models

While world models try to simulate reality, omni models try to perceive it, all of it, at once. An omni model takes text, images, audio, and sometimes video into a single network and answers in kind. The point is not that it can do each one; plenty of models bolt on a vision encoder. The point is that it does them in one pass, so it can hear a question, glance at what you are pointing the camera at, and reply by voice without three separate models handing off to each other.

NVIDIA's Nemotron Omni line is the clearest open example: one stack handling vision, audio, and text, small enough to self-host, and it tops several specialty boards. The big closed assistants are converging on the same shape, which is why GPT-5.5's real-time voice feels less like a feature and more like a different kind of model. You can see where these land against the rest on our model leaderboard.

Omni models are the bridge between the two branches. A world model needs eyes and ears to take in the current state of the room before it can predict the next one. Increasingly, the perception front-end of a physical-AI system is an omni model, and the planning back-end is a world model.

Branch three: the wearables that feed them

A model that understands the real world is only as good as its view of the real world, and a laptop has a terrible view. This is where devices like Omi come in.

Omi is an open-source AI wearable from Based Hardware, a small thing you wear that listens to your conversations and watches your screen activity, then turns the stream into transcripts, summaries, and tasks. Marketed as a "second brain," it is really a sensor: an always-on feed of one person's actual day, in audio and context, rather than the curated text people type into a chat box.

That feed is the missing fuel. The hard part of physical AI was never the math; it was data. Models learned language from the entire internet, but there is no internet-sized archive of "what an ordinary hour of human life looks like from the inside." Wearables generate exactly that, continuously, with consent, from the first-person point of view a world model actually needs to learn from. LeCun has called out wearables and healthcare as natural homes for his approach for this reason: they sit right on the stream of real-world experience.

It is worth being plain about the trade. An always-listening device is a privacy decision before it is a technical one. The open-source path Omi takes (you can read the code, run your own server, see where the audio goes) is part of how that decision gets made in the open rather than behind a vendor's wall. Anyone deploying this near customers or staff is making a consent and governance choice, not just a product one.

How the three fit together

Read top to bottom, the stack looks like this:

Layer	Job	Examples
Wearables / sensors	Capture real life, first-person	Omi, camera and audio rigs
Omni models	Perceive many signals at once	Nemotron Omni, GPT-5.5 voice
World models	Simulate and plan what comes next	Marble, DeepMind, AMI Labs

A complete physical-AI system uses all three: a sensor to see the world, an omni model to make sense of what it sees, and a world model to decide what to do about it. Today those pieces mostly live in different companies and different demos. The race of the next two years is to fuse them into one loop that can run a robot, a car, or an assistant that actually does things in your house rather than describing how.
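As a rough picture of that loop, here is a sketch with each layer reduced to a stub. None of this reflects a real product's API: `SensorFrame`, `perceive`, `imagine`, and `decide` are hypothetical names, the "omni model" is a couple of string checks, and the "world model" is a lookup of hand-written outcomes. The point is only where the seams between the three layers fall.

```python
from dataclasses import dataclass

@dataclass
class SensorFrame:
    """What a wearable or camera rig might hand over (hypothetical shape)."""
    audio_text: str
    objects_seen: list

@dataclass
class Percept:
    """The fused reading of one frame."""
    request: str
    obstacle_ahead: bool

def perceive(frame):
    """Stand-in for an omni model: fuse audio and vision into one percept."""
    return Percept(
        request=frame.audio_text.strip().lower(),
        obstacle_ahead="chair" in frame.objects_seen,
    )

def imagine(percept, action):
    """Stand-in for a world model: predict the outcome of taking an action."""
    if action == "move_forward":
        return "collision" if percept.obstacle_ahead else "arrived"
    if action == "steer_around":
        return "arrived_late"
    return "nothing_happens"

# Hand-picked values for each imagined outcome (assumption, for illustration).
OUTCOME_VALUE = {"arrived": 2, "arrived_late": 1, "nothing_happens": 0, "collision": -10}

def decide(percept):
    """Pick the action whose imagined outcome is worth the most."""
    if percept.request != "come here":
        return "wait"
    actions = ("move_forward", "steer_around", "wait")
    return max(actions, key=lambda a: OUTCOME_VALUE[imagine(percept, a)])

def run_loop(frames):
    """Sensor -> perception -> planning, once per frame."""
    return [decide(perceive(frame)) for frame in frames]

frames = [
    SensorFrame("Come here", ["chair"]),
    SensorFrame("come here", []),
    SensorFrame("hello", ["chair"]),
]
print(run_loop(frames))
```

The three frames produce `steer_around`, `move_forward`, and `wait` in that order. Keeping perception and planning behind separate functions mirrors how the pieces live in separate companies today; fusing them into one loop is the hard part this toy skips entirely.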

What this means if you build software

You probably do not need a world model this quarter. Almost every business problem in front of you is still a language or document problem, and the text-first models handle those. But two shifts are worth tracking now.

First, "multimodal" is becoming the default, not a premium tier. Plan for inputs that include audio and images, because the models you will buy in a year assume them.

Second, first-person data is becoming a strategic asset. If your product sits anywhere near a stream of real-world activity, that stream is training data for the next branch of AI, with all the privacy weight that carries. Decide how you will treat it before someone decides for you.

The word-guessers will keep getting better. But the more interesting story of 2026 is the machines learning, finally, that the glass falls.


Published in technology

World Models Multimodal AI Physical AI Omi Spatial Intelligence
