Large language models (LLMs)

4.1 What is an LLM, really?

One striking way to put it: an LLM is a function that, having read a quantity of text no human could read in a thousand lifetimes, has compressed into its parameters an immense share of the regularities of language and, through it, of the world.

Debate

Do they really understand?

This is the intellectual controversy of the field. On one side, researchers such as Emily Bender have called these models "stochastic parrots" (2021): they merely spit back statistically plausible combinations of words, without any understanding. On the other, researchers observe that, in order to predict the next word so well across such varied subjects, a model must have built structured internal representations that look a great deal like concepts, or even the beginnings of a "world model" (Chapter 5). The honest position, in 2026, is somewhere in between: these systems manipulate regularities at such a scale that their behavior is often indistinguishable from understanding, without our being able to settle with certainty the philosophical question of whether they "understand." What is certain is that the metaphor of the mere parrot is no longer enough to account for their reasoning abilities.

Under the hood

How the model chooses the next word (decoding)

At each step, an LLM does not produce a word, but a probability for each of the possible words (tokens): for example, after "the sky is," it assigns a high probability to "blue," a lower one to "gray," a tiny one to "pie." One of them then has to be chosen: this is the decoding step. The simplest method, known as greedy decoding, always takes the most probable token, but it tends to produce flat, repetitive text. In practice, we sample: a token is drawn at random in accordance with the probabilities, which introduces variety. One setting, the temperature, adjusts this draw: near zero, the model becomes almost deterministic and cautious (useful for code or facts); higher, it ventures less likely choices and proves more creative (useful for writing a story), at the risk of going off the rails. This is why the same model, on the same question, can give different answers from one time to the next: not out of whim, but because the randomness of the draw is, by design, at the heart of generation.
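To make the decoding step concrete, here is a minimal sketch in Python. The scores given to "blue," "gray," and "pie" are invented for illustration, and the function names are ours, not those of any particular library. It shows greedy decoding next to temperature sampling, where the raw scores are divided by the temperature before being turned into probabilities.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {token: w / total for token, w in zip(logits, weights)}

def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(logits, key=logits.get)

def sample(logits, temperature, rng):
    """Sampling: draw one token at random according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical scores for the continuation of "the sky is"
logits = {"blue": 5.0, "gray": 3.0, "pie": -2.0}
rng = random.Random(0)  # fixed seed so the illustration is repeatable

print("greedy:", greedy(logits))
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    rounded = {tok: round(p, 3) for tok, p in probs.items()}
    print(f"temperature {t}: {rounded} -> drew {sample(logits, t, rng)!r}")
```

At a low temperature nearly all the probability lands on "blue"; at a higher one, "gray" gets a real chance. A temperature of exactly zero would divide by zero in this sketch; in practice it is usually treated as greedy decoding.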

4.2 Tokens: the "currency" of AI

Why does this matter so much? For two very concrete reasons:

The context window is the maximum number of tokens the model can "hold in mind" at once. Sizes vary widely from model to model: in 2026 many fall between 128,000 and 256,000 tokens (the equivalent of a hefty book), and many frontier models now reach one million tokens, or even more. Beyond its window, the model no longer "sees" the start of the conversation or document; in practice, its ability to make use of a very long context often degrades well before that limit.
The price is counted in tokens. Using a model via an application programming interface (API) is billed per million tokens consumed, both for input (what you send it) and output (what it generates). The chip maker NVIDIA goes so far as to describe tokens as "the language and currency of AI": optimizing the cost per token has become a major industrial challenge (Chapters 8 and 9).
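Both constraints come down to simple arithmetic, sketched below. The prices are invented for illustration (actual rates vary by provider and model), and the sketch assumes that the prompt and the generated answer share the same context window, which is common but should be checked for a given model.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one API call, with separate per-million-token rates for input and output."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000

def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """Assumption: prompt tokens and generated tokens count against the same window."""
    return prompt_tokens + max_output_tokens <= context_window

# Hypothetical prices, in dollars per million tokens
cost = request_cost(input_tokens=12_000, output_tokens=800,
                    input_price_per_million=3.0, output_price_per_million=15.0)
print(f"cost of one request: ${cost:.4f}")

print(fits_in_context(120_000, 4_000, context_window=128_000))  # room to spare
print(fits_in_context(126_000, 4_000, context_window=128_000))  # too long
```

Note that output tokens are often priced higher than input tokens, which is why the two rates are kept separate here.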

4.3 The anatomy of a training run

In plain terms

Let us revisit the pipeline from Chapter 3, spelling out what is specific to LLMs.

The data. A gigantic corpus is assembled: a large part of the public web, digitized books, enormous volumes of computer code, articles. This raw material is then cleaned (removing duplicates and very low-quality content, filtering). The quality of the data is now considered as decisive as its quantity, which raises legal questions (copyright) and ethical ones addressed in Chapters 21 and 25.
Pre-training. The computation is carried out on clusters of thousands of specialized processors (Chapter 8) over weeks or months. In line with the lessons of "Chinchilla" (Chapter 3), the aim is to strike the right balance between the number of parameters and the volume of data.
Post-training (supervised fine-tuning, then RLHF) turns the raw model into an assistant.
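The balance between parameters and data can be given an order of magnitude. The sketch below relies on two rules of thumb that are not stated in this chapter and should be read as assumptions: roughly 20 training tokens per parameter, a ratio often quoted from the Chinchilla results, and a training cost of about 6 floating-point operations per parameter per token.

```python
TOKENS_PER_PARAMETER = 20      # assumed rule of thumb, often quoted from the Chinchilla results
FLOPS_PER_PARAM_PER_TOKEN = 6  # assumed approximation: compute ~ 6 * parameters * tokens

def compute_optimal_tokens(parameters):
    """Approximate number of training tokens that balances a model of this size."""
    return TOKENS_PER_PARAMETER * parameters

def training_flops(parameters, tokens):
    """Approximate total training compute, in floating-point operations."""
    return FLOPS_PER_PARAM_PER_TOKEN * parameters * tokens

params = 70e9  # a 70-billion-parameter model, chosen for illustration
tokens = compute_optimal_tokens(params)
flops = training_flops(params, tokens)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

These figures are only a starting point: many recent models are deliberately trained on far more tokens than this ratio suggests, because a smaller model trained longer is cheaper to run afterward.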

Under the hood

Mixture of Experts (MoE) models

One architectural innovation partly explains how the models of 2024-2026 became both more powerful and more economical: the Mixture of Experts (MoE). Instead of a single dense network in which all the parameters activate for each word, the model is split into many specialized sub-networks, the "experts," and a small router calls upon, for each token, only the two or three most relevant experts. The result: a model can hold hundreds of billions, or even trillions, of parameters "in reserve" while activating only a fraction of them at each computation, hence at a far lower cost than an equivalent dense model. This is one of the drivers of the DeepSeek shock (Chapter 9) and of most recent large models. The downside: such models are more complex to train and to serve (the load must be balanced across experts), but the efficiency gain wins out by a wide margin.
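The routing idea can be sketched in a few lines. This is a toy illustration, not the implementation of any particular model: the "experts" are plain scalar functions (real experts are feed-forward networks), the router scores are invented (a real router computes them from the token's vector), and top-2 selection with softmax renormalization is one common choice among several.

```python
import math

def top_k_route(router_scores, k=2):
    """Keep the k highest-scoring experts and renormalize their weights with a softmax."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    chosen = order[:k]
    peak = max(router_scores[i] for i in chosen)
    exp_scores = {i: math.exp(router_scores[i] - peak) for i in chosen}
    total = sum(exp_scores.values())
    return {i: w / total for i, w in exp_scores.items()}

def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    gates = top_k_route(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in gates.items())
    return output, sorted(gates)

# Eight toy experts: expert number i simply multiplies its input by (i + 1).
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
# Hypothetical router scores for one token.
router_scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

output, used = moe_layer(3.0, experts, router_scores, k=2)
print(f"experts used: {used} out of {len(experts)}, output: {output}")
```

Only two of the eight experts do any work for this token, which is the source of the savings. The sketch leaves out load balancing across experts, the training-time difficulty mentioned above.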

In context

The data wall

This whole machinery rests on a finite raw material: text written by humans. Yet the largest models have already ingested most of what is publicly accessible, hence the fear of a "data wall." A landmark study (the Epoch AI institute, 2024) estimates, with a margin of uncertainty, that the stock of high-quality public human text could be exhausted between 2026 and 2032, or even sooner if models are "overtrained" (that is, trained on more data than is compute-optimal for their size, so that they are cheaper to run). In late 2024 and early 2025, several voices in the sector popularized the image of data as the "oil" of AI, a resource that runs out. Three responses are taking shape. First, buying data, hence the wave of licensing deals between labs and content holders (the press, forums, archives, Chapters 16 and 21). Next, changing the raw material, by drawing on image, video, and sound (the multimodal, Chapter 5), far more abundant than text. Finally and above all, manufacturing synthetic data, produced by the models themselves, in particular to train reasoning (next section). But this last avenue has a well-known drawback: training a model too much on its own output degrades its quality, the phenomenon of model collapse (Chapter 16). The underlying unknown remains: will this wall really slow progress, or will efficiency gains (learning better with less) push it back?

4.4 Emergent capabilities and hallucinations

But these models suffer from a notorious flaw: hallucinations. The model asserts, with the same calm assurance it brings to a truth, false information: an invented quotation, a nonexistent legal reference, an erroneous fact. The reason is structural: an LLM is optimized to produce plausible text, not true text. By construction, it has no internal notion of "I don't know"; faced with a gap, it fills it with whatever most resembles a plausible answer.

The consequences can be serious (medical errors, false case law cited in court). Several countermeasures exist and are improving:

Retrieval-augmented generation (RAG, see Chapter 2): the model is supplied with reliable documents retrieved on the fly, on which it must rely.
Tool use: delegating computation to a calculator, recent facts to a search engine (Chapter 6).
Verifiable citations and the continual improvement of training.
Explicit reasoning (next section), which reduces certain errors.
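The first of these countermeasures, retrieval-augmented generation, can be sketched as follows. This is a deliberately crude illustration: documents are ranked by the number of words they share with the question, whereas real systems typically use vector embeddings, and the function names and prompt wording are ours. The call to the model itself is left out; the sketch only builds the prompt that would be sent.

```python
import re

def words(text):
    """Lowercased set of words, used as a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_n=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda doc: len(q & words(doc)), reverse=True)
    return ranked[:top_n]

def build_grounded_prompt(question, documents, top_n=1):
    """Place the retrieved sources in the prompt and ask the model to rely on them."""
    sources = retrieve(question, documents, top_n)
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(sources, start=1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )

documents = [
    "The context window is the maximum number of tokens a model can process at once.",
    "Mixture of Experts models activate only a few experts per token.",
    "Perplexity measures how surprised a model is by unseen text.",
]
question = "What is the context window of a model?"
print(build_grounded_prompt(question, documents))
```

The instruction to say so when the sources are silent matters as much as the retrieval: it gives the model an explicit way out other than inventing an answer.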

In context

The art of the prompt (prompt and context engineering)

The quality of an answer depends enormously on how it is solicited. Prompt engineering is the art of phrasing requests to get the best out of a model: providing context, giving examples (the model learns "on the fly" from a few cases, what is called in-context learning), specifying the expected format, or asking the model to "think step by step" (which ties into the chain of thought of the next section). With the rise of agents (Chapter 6), the discipline has broadened into context engineering: it is no longer only about the question asked, but about everything placed in the model's context window at the right moment (instructions, memory, documents retrieved by RAG, tool results). Striking the right balance in this context, neither too little nor too much, has become a key skill for making models and agents more reliable.
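The "neither too little nor too much" balance can be illustrated with a small context builder. Everything here is a simplifying assumption: the token count is a rough proxy of one token per four characters (real counts depend on the tokenizer), and the priority order (instructions and question first, then examples, then documents) is one possible policy, not a rule.

```python
def estimate_tokens(text):
    """Very rough proxy: about one token per 4 characters."""
    return max(1, len(text) // 4)

def assemble_context(instructions, examples, retrieved, question, budget):
    """Always keep instructions and question; add optional blocks only while they fit."""
    question_block = f"Question: {question}"
    used = estimate_tokens(instructions) + estimate_tokens(question_block)
    optional = ([f"Example:\n{e}" for e in examples]
                + [f"Document:\n{d}" for d in retrieved])
    kept = []
    for block in optional:
        cost = estimate_tokens(block)
        if used + cost <= budget:  # skip any block that would exceed the budget
            kept.append(block)
            used += cost
    prompt = "\n\n".join([instructions, *kept, question_block])
    return prompt, used

prompt, used = assemble_context(
    instructions="Classify the sentiment as positive or negative. Think step by step.",
    examples=["Review: 'Great battery life.' -> positive",
              "Review: 'Broke after a week.' -> negative"],
    retrieved=["x" * 4000],  # an oversized document that will not fit
    question="Review: 'Does exactly what it says.'",
    budget=100,
)
print(prompt)
print("estimated tokens used:", used)
```

The two examples are what the text calls in-context learning: the model is shown the pattern rather than told it. The oversized document is dropped rather than truncated, a choice that keeps the sketch short; a real system might summarize it instead.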

4.5 Reasoning: chain of thought and "thinking" models

The labs then trained reasoning models (or "thinking" models): models that produce a long internal deliberation before delivering their answer, devoting more compute at the moment of answering (this is referred to as test-time compute, compute at inference). Rather than answering off the cuff, the model "takes the time to think," explores avenues, corrects itself.

Diagram 4.1. Direct answer versus explicit reasoning. The reasoning model is slower and more costly, but markedly more reliable on complex problems.

This shift moved the performance frontier: gains no longer come only from enlarging pre-training, but also from letting the model think longer. The first models of this generation were OpenAI's o1 and then o3 lineage (late 2024 and 2025) and the open model DeepSeek-R1 (early 2025), which made a strong impression by reaching an excellent level of reasoning at very low cost. In 2026, the major families (Claude, Gemini, GPT, Grok) all offer a reasoning mode.

4.6 Evaluating a model: benchmarks

In plain terms

How do we know whether one model is "better" than another? We use benchmarks: standardized exams. The most cited in 2026:

MMLU: a vast questionnaire covering general and academic knowledge.
GPQA: doctorate-level science questions, designed to resist simple lookup.
SWE-bench: solving real software engineering problems drawn from code repositories, now the reference for measuring genuine usefulness in programming.
Humanity's Last Exam: a deliberately extreme exam, at the edge of human knowledge.
FrontierMath: research-level mathematics problems, validated by experts, on which even the best models were still largely stumbling as of mid-2026.
ARC-AGI: a test of abstract reasoning, designed to measure the ability to generalize rather than to memorize.
The human preference arenas (such as LMArena, formerly Chatbot Arena), where humans vote blind for the better answer between two models. This is one of the hardest indicators to game, because it measures actual user satisfaction.
The independent aggregators (such as Artificial Analysis), which compile performance across many tests and add measures of speed and cost, useful for comparing models from a practical angle.
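How do blind votes between two models become a ranking? One classic way is an Elo-style rating, borrowed from chess, sketched below. Whether a given arena uses exactly this scheme is not stated here, so it should be read as an illustration of the principle; the model names, the votes, and the constant k are invented.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings, winner, loser, k=32):
    """Shift both ratings after one blind vote; an unexpected win moves them more."""
    expected = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected)
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical models, all starting from the same rating
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# Each vote is (preferred model, other model)
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

for winner, loser in votes:
    update(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
print({name: round(score, 1) for name, score in ratings.items()})
```

Because every point gained by one model is lost by another, the ratings only have meaning relative to each other, which is one more reason to read any leaderboard as a snapshot.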

In context

Perplexity, the model's "surprise."

Even before the big dashboards (above), the historical measure of a language model's quality is perplexity. The idea: the model is presented with a text it has never seen, and we look at how "surprised" it is by each word, that is, what probability it had assigned to it. The lower the perplexity, the better the model anticipated the text, hence the better it captured its regularities. It is a direct measure of the training objective (predicting the next word, Chapter 3), valuable for tracking progress during training and comparing models on the same corpus. Its limit: it evaluates prediction, not usefulness. A model can show excellent perplexity without being good at reasoning, instruction-following, or safety, hence the recourse, as a complement, to the task-based tests described above.
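In formula terms, perplexity is the exponential of the average negative log-probability that the model assigned to each token actually present in the text. A minimal sketch, with invented probabilities:

```python
import math

def perplexity(token_probabilities):
    """exp of the average negative log-probability given to the tokens that actually occurred."""
    avg_neg_log = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(avg_neg_log)

# Hypothetical probabilities that two models gave to the same four tokens of an unseen text
confident = [0.9, 0.8, 0.95, 0.7]
surprised = [0.2, 0.1, 0.3, 0.05]

print(f"well-fitted model:  {perplexity(confident):.2f}")
print(f"poorly fitted model: {perplexity(surprised):.2f}")
print(f"uniform guess among 4 options: {perplexity([0.25] * 4):.2f}")
```

The last line gives an intuition for the scale: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing evenly among four options at each step. A perfect prediction would give a perplexity of 1.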

State of play as of mid-2026. The state of the art is fiercely contested, and the ranking changes almost every month; what follows is a snapshot. On the American side, Anthropic's Claude family, OpenAI's GPT-5 lineage, Google DeepMind's Gemini 3, and xAI's Grok are locked in close competition. On the Chinese side, models often released with open weights and at very low cost, such as DeepSeek and Qwen (Alibaba), reach a level close to the frontier. On the European side, France's Mistral carries the banner of sovereignty. A few clear trends stand out in 2026: on the human preference arenas, the Claude variants held the top spots for a good part of the year; on code (SWE-bench), the lead is disputed among Claude, Grok, and GPT; Gemini shines on several reasoning tests and on the multimodal; and the open-weights models now offer near-equivalent quality for a fraction of the price, which is upending the entire economics of the sector.

In context

The main products (a mid-2026 snapshot)

A few concrete reference points, bearing in mind that versions change almost every month. Anthropic offers Claude in tiers: Opus (the most powerful), Sonnet (balanced), and Haiku (fast and economical), around the 4.x generation; to these is added a "frontier" family that is more capable still and surrounded by reinforced safeguards (the Mythos / Fable range), whose most advanced access has been temporarily restricted for reasons of export control (Chapters 20 and 25). OpenAI evolves GPT-5 in closely spaced increments (up to the GPT-5.5 versions by mid-2026), with Codex variants specialized in code (Chapter 6). Google offers Gemini 3 in Pro (advanced reasoning) and Flash (fast and economical) versions, ranging up to the 3.5 generation. xAI develops Grok, integrated into the social network X. On the Chinese side, DeepSeek and Qwen (Alibaba), often with open weights, remain close to the frontier, alongside Kimi (Moonshot) and MiniMax. In Europe, Mistral offers both open and proprietary models.

The great lesson of 2026, to which we will return in Chapter 7, can be put in a single sentence: there is no longer a "best model" in the absolute, but a best model for each task. The most advanced organizations practice "routing": handing each request to the best-suited model in terms of quality, speed, and cost. And every ranking must be read as a snapshot, valid at a given moment.
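The routing practice described here can be sketched as a simple selection rule. The model names, quality scores, prices, and speeds below are all invented, and the policy (the cheapest model that clears a quality bar for the task, with speed as a tie-breaker) is one reasonable choice among many.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    quality: dict            # task -> score between 0 and 1 (hypothetical internal evaluations)
    cost_per_million: float  # dollars per million tokens (hypothetical)
    tokens_per_second: float

def route(task, models, min_quality, max_cost=None):
    """Pick the cheapest model that meets the quality bar for this task; break ties on speed."""
    eligible = [
        m for m in models
        if m.quality.get(task, 0.0) >= min_quality
        and (max_cost is None or m.cost_per_million <= max_cost)
    ]
    if not eligible:
        return None  # no model qualifies; the caller decides what to do
    return min(eligible, key=lambda m: (m.cost_per_million, -m.tokens_per_second))

models = [
    ModelProfile("large", {"code": 0.90, "summary": 0.92}, 15.0, 40),
    ModelProfile("medium", {"code": 0.75, "summary": 0.90}, 3.0, 90),
    ModelProfile("small", {"code": 0.50, "summary": 0.80}, 0.5, 200),
]

for task, bar in [("code", 0.85), ("summary", 0.85), ("summary", 0.75)]:
    choice = route(task, models, min_quality=bar)
    print(f"{task} (quality >= {bar}): {choice.name}")
```

The same request type can land on different models depending on the bar that is set, which is the point: the "best" model is defined by the task and its constraints, not in the absolute.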

Key takeaways (Chapter 4)

An LLM is a Transformer trained at large scale to predict the next token; from this objective emerge conversation, translation, code, and reasoning.
Models reason in tokens (fragments of words), which defines the context window and the price (billed per million tokens).
Training (a one-off, massive investment) is distinct from inference (a recurring cost on each request); distillation produces light, cheap versions.
The raw material, quality human text, could be exhausted around 2026-2032 (the "data wall"), hence the recourse to licensing, the multimodal, and synthetic data.
Hallucinations are structural (the model aims for the plausible, not the true); they are mitigated by RAG, tool use, and reasoning, without being eliminated.
Reasoning models "think" longer at the moment of answering, shifting the performance frontier toward compute at inference.
Benchmarks measure progress but suffer from saturation, contamination, and the Goodhart effect. As of mid-2026, the frontier is disputed among American, Chinese, and European players, with no single winner.

We now have a complete picture of "how it works." Part II continues by broadening the view: beyond text, world models and the multimodal (Chapter 5), then the move to action with agents (Chapter 6), before mapping out the players (Chapter 7).
