Part II · THE ERA OF LARGE MODELS

Large language models (LLMs)

Chapter 4 · 16 min read · Updated: June 2026

4.1 What is an LLM, really?

One striking way to put it: an LLM is a function that, having read a quantity of text no human could read in a thousand lifetimes, has compressed into its parameters an immense share of the regularities of language and, through it, of the world.

4.2 Tokens: the "currency" of AI

Why does this matter so much? For two very concrete reasons:

  • The context window is the maximum number of tokens the model can "hold in mind" at once. Sizes vary widely from model to model: in 2026 many fall between 128,000 and 256,000 tokens (the equivalent of a hefty book), and many frontier models now reach one million tokens, or even more. Beyond its window, the model no longer "sees" the start of the conversation or document; in practice, its ability to make use of a very long context often degrades well before that limit.
  • The price is counted in tokens. Using a model via an application programming interface (API) is billed per million tokens consumed, both for input (what you send it) and output (what it generates). The chip maker NVIDIA goes so far as to describe tokens as "the language and currency of AI": optimizing the cost per token has become a major industrial challenge (Chapters 8 and 9).
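Both constraints reduce to simple arithmetic, sketched below. The prices and the `Pricing` class are hypothetical, chosen only to make the numbers easy to follow, and the window check assumes that the prompt and the generated answer share the same context window, which is a common arrangement but not a universal one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one API call, billed separately for input and output tokens."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def fits_in_window(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window.

    Assumption: input and output tokens count against the same window.
    """
    return input_tokens + max_output_tokens <= context_window


# Illustrative numbers only, not any provider's real price list.
example_pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
cost = request_cost(input_tokens=50_000, output_tokens=2_000, pricing=example_pricing)
print(f"{cost:.2f}")  # 0.18
print(fits_in_window(127_000, 4_000, context_window=128_000))  # False
```

In this example the 50,000 input tokens account for $0.15 of the $0.18 total: sending long documents, not generating the answer, dominates the bill.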

4.3 The anatomy of a training run

4.4 Emergent capabilities and hallucinations

But these models suffer from a notorious flaw: hallucinations. The model states false information with the same calm assurance it brings to true statements: an invented quotation, a nonexistent legal reference, an erroneous fact. The reason is structural: an LLM is optimized to produce plausible text, not true text. By construction, it has no internal notion of "I don't know"; faced with a gap, it fills it with whatever most resembles a plausible answer.

The consequences can be serious (medical errors, false case law cited in court). Several countermeasures exist and are improving:

  • Retrieval-augmented generation (RAG, see Chapter 2): the model is supplied with reliable documents retrieved on the fly, on which it must rely.
  • Tool use: delegating computation to a calculator, recent facts to a search engine (Chapter 6).
  • Verifiable citations and the continual improvement of training.
  • Explicit reasoning (next section), which reduces certain errors.
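To make the first countermeasure concrete, here is a minimal sketch of the RAG pattern: retrieve the most relevant passages, then build a prompt that instructs the model to rely on them. It is a toy under stated assumptions. The word-overlap ranking stands in for the embedding search a real system would typically use, the function names and the prompt wording are hypothetical, and the call to the model itself is left out.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that tells the model to rely on the passages only."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


docs = [
    "The context window is the maximum number of tokens a model can process at once.",
    "API usage is billed per million tokens, for input and for output.",
    "Distillation produces smaller, cheaper versions of a large model.",
]
question = "How is API usage billed?"
prompt = build_prompt(question, retrieve(question, docs, k=1))
print(prompt)
```

The instruction to say so when the sources are silent is the important design choice: it gives the model an explicit way out instead of leaving it to fill the gap with a plausible invention.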

4.5 Reasoning: chain of thought and "thinking" models

The labs then trained reasoning models (or "thinking" models): models that produce a long internal deliberation before delivering their answer, devoting more compute at the moment of answering (this is known as test-time compute, that is, compute spent at inference). Rather than answering off the cuff, the model "takes the time to think," explores avenues, and corrects itself.

Diagram 4.1. Direct answer versus explicit reasoning. The reasoning model is slower and more costly, but markedly more reliable on complex problems.

This shift moved the performance frontier: gains no longer come only from scaling up pre-training, but also from letting the model think longer. The first models of this generation were OpenAI's o1, followed by o3 (late 2024 and 2025), and the open model DeepSeek-R1 (early 2025), which made a strong impression by reaching an excellent level of reasoning at very low cost. In 2026, the major families (Claude, Gemini, GPT, Grok) all offer a reasoning mode.
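The idea of trading extra inference compute for reliability can be illustrated with a much simpler technique than the trained internal deliberation described above: sample several answers and keep the most frequent one, an approach often called self-consistency. The sketch below is not how reasoning models work internally. The "model" is a hypothetical scripted sampler that replays fixed answers, so that the example stays runnable without any API.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most frequent one."""
    answers = [sample() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_scripted_sampler(answers: list[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a model call: replays a fixed list of answers."""
    it = iter(answers)
    return lambda: next(it)


# A single draw may be wrong; spending five draws and voting recovers "42".
scripted = ["41", "42", "42", "17", "42"]
print(majority_vote(make_scripted_sampler(scripted), n_samples=1))  # 41
print(majority_vote(make_scripted_sampler(scripted), n_samples=5))  # 42
```

The cost side is just as visible: five samples mean roughly five times the output tokens, which is the same trade-off, slower and more costly but more reliable, that reasoning models make.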

4.6 Evaluating a model: benchmarks

State of play as of mid-2026. The state of the art is fiercely contested, and the ranking changes almost every month; what follows is a snapshot. On the American side, Anthropic's Claude family, OpenAI's GPT-5 lineage, Google DeepMind's Gemini 3, and xAI's Grok are locked in close competition. On the Chinese side, models often released with open weights and at very low cost, such as DeepSeek and Qwen (Alibaba), reach a level close to the frontier. On the European side, France's Mistral carries the banner of sovereignty. A few clear trends stand out in 2026: on the human preference arenas, the Claude variants held the top spots for a good part of the year; on code (SWE-bench), the lead is disputed among Claude, Grok, and GPT; Gemini shines on several reasoning tests and on multimodal tasks; and the open-weights models now offer near-equivalent quality for a fraction of the price, which is upending the entire economics of the sector.

The great lesson of 2026, to which we will return in Chapter 7, can be put in a single sentence: there is no longer a "best model" in the absolute, but a best model for each task. The most advanced organizations practice "routing": handing each request to the best-suited model in terms of quality, speed, and cost. And every ranking must be read as a snapshot, valid at a given moment.
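A routing policy can be as simple as "the cheapest model that is good enough and fast enough." The sketch below assumes a hypothetical catalogue: the model names, quality scores, latencies, and prices are invented for illustration, and a real router would measure quality per task family rather than use a single score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical catalogue entry; all figures are illustrative."""
    name: str
    quality: float            # 0..1, score on the task family being routed
    latency_s: float          # typical seconds per answer
    cost_per_million: float   # blended dollars per million tokens


def route(models: list[ModelProfile], min_quality: float, max_latency_s: float) -> ModelProfile:
    """Pick the cheapest model that meets the quality and latency constraints."""
    eligible = [
        m for m in models
        if m.quality >= min_quality and m.latency_s <= max_latency_s
    ]
    if not eligible:
        raise ValueError("no model satisfies the constraints")
    return min(eligible, key=lambda m: m.cost_per_million)


catalogue = [
    ModelProfile("frontier-reasoning", quality=0.95, latency_s=30.0, cost_per_million=20.0),
    ModelProfile("frontier-fast", quality=0.85, latency_s=3.0, cost_per_million=5.0),
    ModelProfile("open-weights-small", quality=0.70, latency_s=1.0, cost_per_million=0.5),
]

print(route(catalogue, min_quality=0.6, max_latency_s=5.0).name)   # open-weights-small
print(route(catalogue, min_quality=0.9, max_latency_s=60.0).name)  # frontier-reasoning
```

Because rankings are snapshots, the catalogue is the part that ages: the routing rule can stay fixed while the figures are refreshed as new evaluations come in.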


Key takeaways (Chapter 4)

  • An LLM is a Transformer trained at large scale to predict the next token; from this objective emerge conversation, translation, code, and reasoning.
  • Models process text as tokens (fragments of words), the unit in which both the context window and the price (billed per million tokens) are measured.
  • Training (a one-off, massive investment) is distinct from inference (a recurring cost on each request); distillation produces light, cheap versions.
  • The raw material, quality human text, could be exhausted between 2026 and 2032 (the "data wall"), hence the recourse to licensing, multimodal data, and synthetic data.
  • Hallucinations are structural (the model aims for the plausible, not the true); they are mitigated by RAG, tool use, and reasoning, without being eliminated.
  • Reasoning models "think" longer at the moment of answering, shifting the performance frontier toward compute at inference.
  • Benchmarks measure progress but suffer from saturation, contamination, and the Goodhart effect. As of mid-2026, the frontier is disputed among American, Chinese, and European players, with no single winner.

We now have a complete picture of "how it works." Part II continues by broadening the view: beyond text, world models and multimodality (Chapter 5), then the move to action with agents (Chapter 6), before mapping out the players (Chapter 7).