Part II · THE ERA OF LARGE MODELS

World models and multimodality

Chapter 5 · 12 min read · Updated: June 2026

5.1 Predicting text is not enough: understanding the world

Hence an idea that has been stirring research since 2025: to reach a more general intelligence, we would need systems endowed with a genuine world model, that is, an internal representation of how reality works and evolves. Text, from this standpoint, is only an impoverished, incomplete shadow of reality. The researcher Yann LeCun has made it his rallying cry: in his view, we will not reach intelligence by swallowing ever more text, but by learning from richer signals, such as video and interaction with the world.

5.2 Multimodality: text, image, sound, video

5.3 World models: definition and stakes

Be careful not to confuse it with a mere video generator. The distinction is subtle but crucial:

Diagram 5.1. Video generator versus world model. The first produces a fixed clip. The second is interactive: you can act within it, and it responds coherently, frame after frame. It is this action-then-consequence loop that turns it into a training ground for agents and robots.
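The contrast in the diagram can be sketched in a few lines of Python. This is only a toy illustration: the `State` type, the `MOVES` table and both functions are hypothetical names introduced here, and the dynamics are written by hand, whereas a real world model would learn them from data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    """Toy 'frame': the position of a single object on a grid."""
    x: int
    y: int


def generate_clip(n_frames: int) -> List[State]:
    """Video-generator style: the whole clip is decided up front.

    The viewer cannot influence what happens in it.
    """
    return [State(x=t, y=0) for t in range(n_frames)]


# Hypothetical action set for the interactive case.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def world_model_step(state: State, action: str) -> State:
    """World-model style: next state = f(current state, action)."""
    dx, dy = MOVES[action]
    return State(state.x + dx, state.y + dy)


def rollout(start: State, actions: List[str]) -> List[State]:
    """Action-then-consequence loop: each action yields a new frame."""
    states = [start]
    for action in actions:
        states.append(world_model_step(states[-1], action))
    return states
```

Calling `rollout(State(0, 0), ["right", "up", "up"])` ends at `State(1, 2)`, while `generate_clip(3)` returns the same three frames whatever the viewer would have liked to do.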

5.4 Competing approaches (a mid-2026 panorama)

Table 5.1. The four camps of "world models" in 2026.

A few concrete milestones from the 2025-2026 period:

  • Video generation "as world simulation." As early as 2024, with Sora, OpenAI championed the thesis that a model trained on enough video ends up implicitly learning physics, and that scale will fill in the rest. Google (Veo), China's Kuaishou (Kling) and ByteDance (Seedance, which topped several video-generation rankings in 2025), as well as Runway, follow a similar path. The debate remains open: Sora models certain interactions poorly (a glass shattering, food being bitten into), and several studies conclude that it shows "the beginnings of a world model" without quite being one.
  • Spatial intelligence. World Labs, founded by Fei-Fei Li (the "godmother" of AI, behind ImageNet), launched the Marble product in late 2025, capable of generating persistent, editable and exportable 3D worlds from a text prompt or an image. For Li, "spatial intelligence is the next frontier of AI."
  • Interactive generative models. Google DeepMind unveiled Genie 3 (August 2025), a model able to generate explorable 3D worlds in real time at 24 frames per second, open since 2026 to certain subscribers via "Project Genie." NVIDIA, with its Cosmos platform (more than two million downloads in early 2026), provides "physics-aware" worlds to train robots and autonomous vehicles in simulation.
  • The latent approach (JEPA). This is Yann LeCun's bet: rather than predicting pixels, predicting in an abstract space what is going to happen, which would be far more efficient and closer to cognition. Convinced that LLMs are "plateauing," LeCun left Meta in late 2025 to found AMI Labs (Advanced Machine Intelligence) in Paris, raising more than a billion dollars around this idea. Meta, for its part, continues to develop its V-JEPA models.
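To make the last bullet's contrast between predicting pixels and predicting in an abstract space more concrete, here is a minimal sketch. It is not the V-JEPA implementation: the `encode` function is a hypothetical, hand-written stand-in (it averages blocks of pixels), whereas in actual JEPA-style models the encoder and the predictor are learned. The point is only to show where the error is measured.

```python
from typing import List


def encode(frame: List[float], block: int = 4) -> List[float]:
    """Hypothetical encoder: keep the average of each block of pixels.

    It discards fine texture and keeps only coarse structure.
    """
    return [sum(frame[i:i + block]) / block for i in range(0, len(frame), block)]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pixel_loss(predicted_frame: List[float], target_frame: List[float]) -> float:
    """Pixel-space objective: every pixel of the future frame must match."""
    return mse(predicted_frame, target_frame)


def latent_loss(predicted_latent: List[float], target_frame: List[float]) -> float:
    """Latent-space objective: only the encoded future must match."""
    return mse(predicted_latent, encode(target_frame))


# A future frame with fine texture around two coarse levels (0 and 4).
target = [1.0, -1.0, 1.0, -1.0, 5.0, 3.0, 5.0, 3.0]

# A prediction that captures the coarse structure but not the texture.
coarse_frame = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0]
coarse_latent = [0.0, 4.0]

pixel_error = pixel_loss(coarse_frame, target)     # penalized for the texture
latent_error = latent_loss(coarse_latent, target)  # texture is ignored
```

In this toy case the pixel-space error is 1.0 while the latent-space error is 0.0: the latent objective does not spend capacity on detail that the encoder has abstracted away, which is the efficiency argument made for this approach.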

The appeal is the same as for the sim-to-real training discussed later in this chapter, but transposed to software: a real environment is slow, scarce, costly and risky, and you cannot inject into it at will the edge cases an agent will nonetheless have to handle. A faithful, controllable and replayable simulator, by contrast, makes it possible to generate unlimited quantities of training trajectories (a form of synthetic data, Chapter 4), to apply reinforcement learning cheaply and safely, and to test an agent before unleashing it on real systems. Qwen even reports that agents trained this way in simulation can outperform those trained under real conditions, and that this knowledge of environments transfers to agent tasks without specific retraining.
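The three properties named above (controllable, replayable, able to inject edge cases) can be illustrated with a toy simulated environment. This is a sketch under stated assumptions, not a description of how Qwen's system works: `SimulatedShell`, its two commands and the `fault_rate` parameter are hypothetical names invented for this example.

```python
import random
from typing import List, Tuple

Action = Tuple[str, str]  # (command, file name), e.g. ("touch", "a.txt")


class SimulatedShell:
    """Hypothetical toy environment: a fake file store with injectable faults."""

    def __init__(self, seed: int, fault_rate: float = 0.0) -> None:
        self.rng = random.Random(seed)  # seeded, so every run can be replayed
        self.fault_rate = fault_rate    # controllable rate of injected faults
        self.files = set()

    def step(self, action: Action) -> str:
        """Apply one action and return the observation the agent would see."""
        command, name = action
        if self.rng.random() < self.fault_rate:
            return "error: disk full"   # injected edge case
        if command == "touch":
            self.files.add(name)
            return "ok"
        if command == "rm":
            if name in self.files:
                self.files.remove(name)
                return "ok"
            return "error: no such file"
        return "error: unknown command"


def collect_trajectory(
    seed: int, actions: List[Action], fault_rate: float = 0.0
) -> List[Tuple[Action, str]]:
    """Run a fresh environment and record (action, observation) pairs."""
    env = SimulatedShell(seed, fault_rate)
    return [(action, env.step(action)) for action in actions]
```

With `fault_rate=0.0`, the sequence `touch a`, `rm a`, `rm a` yields `ok`, `ok`, `error: no such file`. Raising `fault_rate` produces as many failure cases as desired, and reusing the same seed reproduces exactly the same trajectory, which a real system rarely allows.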

Two caveats are nonetheless in order. First, the benchmark that measures these results (AgentWorldBench) was designed and published by the same team, so the reported margins warrant caution. Second, the gap between simulation and reality discussed later applies here too: an agent that is brilliant in the simulator may fail against the disorder of the real world, because a world model is never better than the data that fed it.

5.5 Why it is one of the great bets of 2026

In context: from virtual to real (sim-to-real). The main concrete use of world models is already here: training agents and robots in simulation. The benefit is obvious: in a virtual world, a robot can attempt millions of trials at high speed, without risk and without wear, where learning in the real world would be slow and costly. The central challenge remains the gap between simulation and reality (the "reality gap"): a behavior learned in an overly perfect simulator often fails against the disorder of the real world (friction, changing light, imperfect sensors). The main countermeasure is called domain randomization: many simulation parameters (textures, lighting, masses, friction) are deliberately varied to force the learned strategy to become robust, so that the real world is, for it, just one more variant among those already encountered. Added to this is the production of synthetic data (examples that are generated rather than collected), increasingly used to train models when real data is lacking. This is the most tangible bridge between this chapter and robotics (Chapter 13).
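Domain randomization, as described above, amounts to drawing a fresh set of simulator parameters for every training episode. The sketch below shows only that sampling step. The `SimParams` fields mirror the examples in the text (textures, lighting, masses, friction), but the numeric ranges and all names are illustrative assumptions, not values taken from any real simulator.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimParams:
    """One randomized configuration of the (hypothetical) simulator."""
    friction: float
    mass_kg: float
    light_intensity: float
    texture_id: int


# Illustrative ranges only; real projects tune these per robot and task.
FRICTION_RANGE = (0.2, 1.0)
MASS_RANGE_KG = (0.5, 2.0)
LIGHT_RANGE = (0.3, 1.5)
N_TEXTURES = 10


def sample_params(rng: random.Random) -> SimParams:
    """Draw every parameter independently and uniformly from its range."""
    return SimParams(
        friction=rng.uniform(*FRICTION_RANGE),
        mass_kg=rng.uniform(*MASS_RANGE_KG),
        light_intensity=rng.uniform(*LIGHT_RANGE),
        texture_id=rng.randrange(N_TEXTURES),
    )


def randomized_episodes(n: int, seed: int) -> List[SimParams]:
    """Return one parameter set per training episode, reproducibly."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]
```

A training loop would then reset the simulator with `randomized_episodes(n, seed)[i]` before episode `i`, so that the policy never sees the same world twice. Wider ranges make the policy more robust but the task harder to learn, which is the main design trade-off.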


Key takeaways (Chapter 5)

  • The leading models are multimodal: text, image, sound and video, processed in a common representation space (everything becomes tokens).
  • A world model is an internal simulator that predicts the next state from an action; it is interactive, unlike a video generator, which produces a fixed clip.
  • Four camps are competing in 2026: video generation (Sora, Veo, Kling, Seedance), spatial/3D (World Labs, Marble), interactive generative (Genie 3, Cosmos) and latent/JEPA (V-JEPA, LeCun's AMI Labs).
  • A variant that appeared in 2026, the language world model (e.g. Qwen-AgentWorld), simulates not the physical world but the digital environments of agents (terminal, web, OS...), in order to train and test them without risk (Chapter 6).
  • Many see it as the bridge to embodied AI (robotics) and a possible step toward general intelligence.
  • The underlying debate: will scale alone suffice, or are new architectures grounded in physics needed?

In the next chapter, we move from perception and simulation to action: AI agents, those systems that no longer merely respond, but act.