World models and multimodality

5.1 Predicting text is not enough: understanding the world

Hence an idea that has been stirring research since 2025: to reach a more general intelligence, we would need systems endowed with a genuine world model, that is, an internal representation of how reality works and evolves. Text, from this standpoint, is only an impoverished, incomplete shadow of reality. The researcher Yann LeCun has made it his rallying cry: in his view, we will not reach intelligence by swallowing ever more text, but by learning from richer signals, such as video and interaction with the world.

5.2 Multimodality: text, image, sound, video

Under the hood

Aligning text and image (contrastive learning and CLIP)

How do we get a word and an image to "live in the same space," to the point where a model links the caption "a ginger cat" to the matching photo? The key technique, popularized by the CLIP model (OpenAI, 2021), is contrastive learning. Two encoders are trained in parallel, one for text and one for image, on hundreds of millions of (image, caption) pairs gleaned from the web. The objective: to bring together in the representation space the pairs that belong together, and to push apart those that have nothing in common. The result is that an image and its description end up in the same place, which makes it possible to retrieve an image from a text (and vice versa), to classify images without labels, and above all to guide image generators: this is how a diffusion model (below) knows how to match a textual instruction to the visual it produces. Contrastive learning has become one of the foundational building blocks of multimodality.
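As a rough illustration of the objective described above, the following minimal sketch computes a symmetric contrastive loss on a toy batch. It is not CLIP's actual code: the function names, the two-dimensional embeddings and the temperature value are assumptions chosen for readability, and real encoders would produce the embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs.

    Pair i is the positive; every other combination in the batch is a negative.
    The temperature is an illustrative hyperparameter (assumption).
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = scaled cosine similarity between image i and caption j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Each image must pick out its own caption, and each caption its own image.
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings (hypothetical values): correctly paired versus shuffled captions.
images = [[1.0, 0.1], [0.1, 1.0]]
aligned = [[0.9, 0.2], [0.2, 0.9]]
shuffled = [aligned[1], aligned[0]]
loss_aligned = contrastive_loss(images, aligned)
loss_shuffled = contrastive_loss(images, shuffled)
```

The loss is lower when each image sits next to its own caption than when the captions are shuffled, which is the pressure that pulls matching pairs together and pushes mismatched ones apart during training.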

Under the hood

How AI generates an image (diffusion models)

How does a model build an image from a simple sentence? The dominant technique since 2022, diffusion, rests on a counterintuitive idea. During training, millions of images are taken and progressively degraded by adding noise, until they become a random mush; the model learns to do the reverse, that is, to remove the noise step by step in order to recover a clean image. Once trained, it is given a purely random starting point (noise) and a textual description, and it gradually "denoises" this chaos until an image conforming to the instruction emerges. This is the process that drives the image generators (and, extended over time, the video generators) of Chapter 16. It has supplanted the earlier approach, generative adversarial networks (GANs), in which two networks competed (one producing images, the other trying to unmask them): harder to train and more unstable, GANs have largely given way to diffusion, which is more stable and more controllable. The same logic of guided denoising inspires some of today's world models (next section).
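The noising half of this process is simple enough to sketch. The example below is a toy illustration, not any particular generator's code: the linear schedule constants, the four-value "image" and all function names are assumptions, and the true noise stands in for what a trained network would predict.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (illustrative constants)."""
    alpha_bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(x0, noise)]
    return noisy, noise

def estimate_clean(noisy, predicted_noise, alpha_bar):
    """Invert the blend, given a prediction of the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(noisy, predicted_noise)]

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
image = [0.2, -0.5, 0.9, 0.0]  # a four-"pixel" toy image (hypothetical values)
noisy, true_noise = add_noise(image, alpha_bars[500], rng)
# A trained network would predict the noise from `noisy`, the step index and
# the text prompt; here the true noise stands in for a perfect prediction.
recovered = estimate_clean(noisy, true_noise, alpha_bars[500])
```

Under these assumptions almost none of the original signal survives at the last step, which is why generation can start from pure noise. Training amounts to teaching a network to guess `true_noise` from `noisy`, and sampling repeats a smaller version of `estimate_clean` step by step.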

In context

Representing the world in 3D (NeRF and Gaussian splatting)

Beyond flat images, a family of techniques reconstructs three-dimensional scenes from ordinary photographs. Neural radiance fields (NeRF, 2020) train a small network to predict, for each point in space and each viewing angle, the color and density of the scene, so that it can later be "replayed" from any angle. A more recent and far faster approach, Gaussian splatting (2023), represents the scene with millions of small colored blobs, allowing real-time rendering. These methods feed into film, video games, mapping, and above all the training of robots in simulation (Chapter 13): faithfully reconstructing an environment in 3D means offering agents a realistic virtual world in which to practice. They thus join the quest for world models (following sections), which seek to endow AI with a manipulable representation of space and its dynamics.
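To make the role of color and density concrete, here is a minimal sketch of how samples along a single camera ray can be blended into one pixel, in the spirit of the volume rendering used by NeRF. The densities and colors are hand-picked toy values (assumptions); in a real system they would come from the trained network, and the function name is hypothetical.

```python
import math

def render_ray(densities, colors, step):
    """Composite samples along one camera ray, front to back."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by earlier samples
    for sigma, color in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this ray segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Empty space, then a dense red surface, then a green one hidden behind it.
densities = [0.0, 50.0, 50.0]
colors = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pixel, remaining = render_ray(densities, colors, step=0.5)
```

The resulting pixel is almost pure red: the empty sample contributes nothing, and the red surface blocks nearly all the light before the green one is reached. Repeating this for every pixel of a virtual camera is what lets the scene be "replayed" from a new angle.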

5.3 World models: definition and stakes

Be careful not to confuse a world model with a mere video generator. The distinction is subtle but crucial:

Diagram 5.1. Video generator versus world model. The first produces a fixed clip. The second is interactive: you can act within it, and it responds coherently, frame after frame. It is this action-then-consequence loop that turns it into a training ground for agents and robots.
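The difference between the two interfaces can be sketched in a few lines. This is a deliberately tiny illustration with hypothetical names: hand-written rules on a one-dimensional corridor stand in for what would be learned models in practice.

```python
def generate_clip(start, num_frames):
    """Video-generator style: the whole sequence is fixed once generation starts."""
    return [start + t for t in range(num_frames)]  # the object always drifts right

class ToyWorldModel:
    """World-model style: each next state depends on the action just taken."""

    def __init__(self, start, width):
        self.position = start
        self.width = width

    def step(self, action):
        move = {"left": -1, "right": 1, "stay": 0}[action]
        # Walls at both ends keep the state consistent with the toy "physics".
        self.position = min(self.width - 1, max(0, self.position + move))
        return self.position

clip = generate_clip(start=2, num_frames=4)
world = ToyWorldModel(start=2, width=5)
trajectory = [world.step(a) for a in ["right", "right", "right", "left"]]
```

The clip drifts straight through where the wall would be, since nothing constrains it, while the world model's trajectory stops at the wall and then reverses when the action changes. An agent can only learn from the second kind of system, because only there do its choices have consequences.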

5.4 Competing approaches (a mid-2026 panorama)

In plain terms

"World model" became, in 2026, one of the most contested terms in AI: everyone calls their project by that name. Four broad families can be distinguished, driven by American, European and Chinese players.

Approach	Predicts in...	Interactive?	Examples (2026)	Main use
Video generation	pixel space	No	Sora (OpenAI), Veo (Google), Kling (Kuaishou), Seedance (ByteDance), Runway	Content creation
Spatial / 3D	3D space	Partly	World Labs: Marble (Fei-Fei Li)	Navigable 3D worlds, games, visual effects
Interactive generative	pixels or tokens, conditioned on action	Yes	Genie 3 (DeepMind), Cosmos (NVIDIA), GAIA-2 (Wayve)	Training agents and robots in simulation
Latent (JEPA)	an abstract embedding space	Yes	V-JEPA 2 (Meta), AMI Labs (LeCun)	Efficient understanding and planning

Table 5.1. The four camps of "world models" in 2026.

A few concrete milestones from the 2025-2026 period:

Video generation "as world simulation." As early as 2024, with Sora, OpenAI championed the thesis that a model trained on enough video ends up implicitly learning physics, and that scale will fill in the rest. Google (Veo), China's Kuaishou (Kling) and ByteDance (Seedance, which topped several video-generation rankings in 2025), as well as Runway, follow a similar path. The debate remains open: Sora models certain interactions poorly (a glass shattering, food being bitten into), and several studies conclude that it shows "the beginnings of a world model" without quite being one.
Spatial intelligence. World Labs, founded by Fei-Fei Li (the "godmother" of AI, behind ImageNet), launched the Marble product in late 2025, capable of generating persistent, editable and exportable 3D worlds from a text or an image. For Li, "spatial intelligence is the next frontier of AI."
Interactive generative models. Google DeepMind unveiled Genie 3 (August 2025), a model able to generate explorable 3D worlds in real time at 24 frames per second, open since 2026 to certain subscribers via "Project Genie." NVIDIA, with its Cosmos platform (more than two million downloads in early 2026), provides "physics-aware" worlds to train robots and autonomous vehicles in simulation.
The latent approach (JEPA). This is Yann LeCun's bet: rather than predicting pixels, predicting in an abstract space what is going to happen, which would be far more efficient and closer to cognition. Convinced that LLMs are "plateauing," LeCun left Meta in late 2025 to found AMI Labs (Advanced Machine Intelligence) in Paris, raising more than a billion dollars around this idea. Meta, for its part, continues to develop its V-JEPA models.

The appeal here mirrors that of the following section, transposed to software: a real environment is slow, scarce, costly and risky, and you cannot inject into it at will the edge cases an agent will nonetheless have to handle. A faithful, controllable and replayable simulator, by contrast, makes it possible to generate unlimited quantities of training trajectories (a form of synthetic data, Chapter 4), to apply reinforcement learning cheaply and without danger, and to test an agent before unleashing it on real systems. Qwen even reports that agents trained this way in simulation can outperform those trained under real conditions, and that this knowledge of environments transfers to agent tasks without specific retraining.

Two caveats are nonetheless in order. First, the benchmark that measures these results (AgentWorldBench) was designed and published by the same team, so the reported margins warrant caution. Second, the same pitfall applies as with the gap between simulation and reality discussed later: an agent brilliant in the simulator may fail against the disorder of the real world, for a world model is never better than the data that fed it.

5.5 Why it is one of the great bets of 2026

In context: from virtual to real (sim-to-real). The main concrete use of world models is already here: training agents and robots in simulation. The benefit is obvious: in a virtual world, a robot can attempt millions of trials at high speed, without risk or wear, where learning in the real world would be slow and costly. There remains the central challenge, the gap between simulation and reality (the reality gap): a behavior learned in an overly perfect simulator often fails against the disorder of the real world (friction, changing light, imperfect sensors). The main countermeasure is called domain randomization: many parameters of the simulation (textures, lighting, masses, friction) are deliberately varied to force the learned strategy to become robust, so that the real world is, for it, just one more variant among those already encountered. To this is added the production of synthetic data (examples that are generated rather than collected), increasingly used to train models when real data is lacking. This is the most tangible bridge between this chapter and robotics (Chapter 13).
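Domain randomization is easy to picture in code. The sketch below only shows the sampling side, under stated assumptions: the parameter names and ranges are hypothetical, and the simulator and learning algorithm that would consume each configuration are left out.

```python
import random

# Hypothetical ranges; a real project would tune them to its robot and sensors.
PARAM_RANGES = {
    "friction": (0.4, 1.2),
    "mass_kg": (0.8, 1.5),
    "light_intensity": (0.3, 1.0),
}

def sample_sim_params(rng):
    """Draw one randomized simulator configuration."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in PARAM_RANGES.items()}

def training_configs(num_episodes, seed=0):
    """Yield a fresh configuration for every training episode."""
    rng = random.Random(seed)
    for _ in range(num_episodes):
        yield sample_sim_params(rng)

configs = list(training_configs(num_episodes=100))
# In a real pipeline each configuration would be applied to the simulator
# before the policy is rolled out and updated; that part is omitted here.
```

Because no two episodes share the same physics or lighting, a policy that succeeds across all of them cannot rely on any single setting, which is the property that makes the real world look like just one more variant.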

Key takeaways (Chapter 5)

The leading models are multimodal: text, image, sound and video, processed in a common representation space (everything becomes tokens).
A world model is an internal simulator that predicts the next state from an action; it is interactive, unlike a video generator, which produces a fixed clip.
Four camps are competing in 2026: video generation (Sora, Veo, Kling, Seedance), spatial/3D (World Labs, Marble), interactive generative (Genie 3, Cosmos) and latent/JEPA (V-JEPA, LeCun's AMI Labs).
A variant that appeared in 2026, the language world model (e.g. Qwen-AgentWorld), simulates not the physical world but the digital environments of agents (terminal, web, OS...), in order to train and test them without risk (Chapter 6).
Many see it as the bridge to embodied AI (robotics) and a possible step toward general intelligence.
The underlying debate: will scale alone suffice, or are new architectures grounded in physics needed?

In the next chapter, we move from perception and simulation to action: AI agents, those systems that no longer merely respond, but act.
