The Transformer revolution: "attention is all you need"

3.1 The wall of sequential architectures

To break through the wall, an architecture was needed that could pull off two feats at once: process the entire sequence in one go (to be fast) and connect any word directly to any other, however far apart they might be (so as to forget nothing). This is exactly what the attention mechanism delivers.

3.2 The intuition behind attention

Another image: picture a meeting where, to make sense of a remark, you automatically weight what each participant said earlier according to its relevance. Attention is that weighting system, applied at scale and learned automatically.

Diagram 3.1. The attention mechanism in pictures. To interpret the word "it," the model assigns a high weight to "animal" and a low weight to "street." These weights are not written by a human: they are learned from billions of sentences.

Under the hood

Technically, each word (more precisely each token, see Chapter 4) emits three vectors: a query ("what I'm looking for"), a key ("what I offer"), and a value ("the information I carry"). The attention weight between two words is computed by comparing the query of one with the key of the other: a dot product gives a raw score, and a softmax function normalizes each word's scores over the whole sequence into percentages that sum to 100%. The output for each word is then a weighted sum of the values of all the words. Because this operation reduces to large matrix multiplications, it parallelizes massively on GPUs, which breaks the slowness lock; and because each word can "look at" all the others directly, the amnesia lock disappears as well. We speak of self-attention when the words of a single sequence observe one another in this way.
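
The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the three toy vectors per token are made-up numbers (in a real model they come from learned projections), and the division by the square root of the key dimension is a detail taken from the 2017 paper rather than from the text above.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Return (outputs, weights) for a single attention head.

    queries, keys, values: one vector per token, all lists of floats.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query with every key (its own included).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 3 tokens, 2-dimensional vectors (hypothetical numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = self_attention(Q, K, V)
```

Each row of `weights` sums to 1, and the first token gives more weight to the keys that align with its query than to the one that does not. The explicit loop is for readability only; in practice the same arithmetic is written as matrix products, which is what makes it parallelize so well.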

3.3 The Transformer architecture (2017)

Two ingredients are worth remembering:

Multi-head attention. Rather than a single weighting system, the Transformer runs several in parallel, like so many different "gazes" cast over the sentence. One head may track grammar (subject-verb agreement), another meaning, another references ("it" refers to "animal"). By combining these gazes, the model captures very rich relationships.
Positional encoding. Attention, as it stands, is blind to word order: to it, "the dog bites the man" and "the man bites the dog" would be identical. So information about each word's position in the sentence is injected into its representation, in order to preserve order.
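
To make the second ingredient concrete, here is a small sketch of one possible positional encoding. It assumes the sinusoidal scheme of the 2017 paper (other schemes exist, and the text above does not commit to one), and the 4-dimensional word vectors in `EMB` are hypothetical values chosen for illustration. The articles are dropped from the example sentences to keep it short.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position as a d_model-dimensional vector."""
    pe = []
    for i in range(d_model):
        # Dimensions come in pairs (2k, 2k+1) that share one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_positions(embeddings):
    """Add to each word vector the encoding of its position in the sentence."""
    d_model = len(embeddings[0])
    return [[x + p for x, p in zip(vec, positional_encoding(pos, d_model))]
            for pos, vec in enumerate(embeddings)]

# Hypothetical embeddings, one per word.
EMB = {
    "dog":   [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 0.0],
}

plain1 = [EMB[w] for w in ["dog", "bites", "man"]]
plain2 = [EMB[w] for w in ["man", "bites", "dog"]]

# Without positions, both sentences are the same bag of vectors.
same_without_positions = sorted(plain1) == sorted(plain2)

# With positions added, "dog" in first place differs from "dog" in last place.
s1 = add_positions(plain1)
s2 = add_positions(plain2)
same_with_positions = sorted(s1) == sorted(s2)
```

Here `same_without_positions` is `True` and `same_with_positions` is `False`: once positions are mixed in, the two sentences no longer look alike to the layers that follow.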

Diagram 3.2. A very simplified view of a Transformer. A block combines a multi-head attention layer and a computation layer; dozens, even hundreds, of these blocks are stacked. It is the depth and size of this stack that give the model its power.

Under the hood

What makes deep stacks trainable (residuals and normalization)

Stacking dozens, even hundreds, of Transformer blocks raises a practical problem: the deeper a network, the more the learning signal tends to degrade as it travels back up through the layers (gradients that vanish or explode). Two tricks, unobtrusive but decisive, solve this. Residual connections add each layer's input to its output, creating a "shortcut": information and gradient thus pass through the whole network without dying out, and each layer need only learn a small correction rather than redo everything. Layer normalization rescales, at each step, the magnitude of the values flowing through, which stabilizes and accelerates training. Neither of these mechanisms is spectacular, but without them today's very large models would simply be impossible to train. It is a good example of a recurring truth in the field: much of the progress comes from inconspicuous engineering details, as much as from ideas of principle.
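
The two tricks fit in a few lines. The sketch below is illustrative only: `small_correction` is a hypothetical stand-in for a real attention or computation layer, the normalization omits the learned gain and bias that real implementations carry, and the order chosen here (add the shortcut, then normalize) is one common arrangement among several.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Apply a sublayer, add its input back (the shortcut), then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def small_correction(x):
    # Hypothetical sublayer: it only nudges its input.
    return [0.1 * v for v in x]

h = [1.0, 2.0, 3.0, 4.0]
for _ in range(100):  # a stack of 100 blocks
    h = residual_block(h, small_correction)
```

After 100 blocks the values in `h` are still on the same scale as at the start. Without the normalization, the repeated `x + 0.1 * x` would multiply the magnitude by 1.1 at every block and the numbers would grow out of control.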

In context

Tokenization, or how the model chops up text

Before any computation, a language model must turn text into numbers. It reads neither letters nor whole words, but tokens: fragments of words, obtained by statistical chunking (an algorithm such as Byte Pair Encoding starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols). A common word often fits in a single token; a rare or complex word is split into several. Each token is then converted into a vector (an embedding, Section 2.7) that the network can manipulate. This technical detail has very concrete consequences. It explains why models count letters poorly (how many "r"s in "strawberry"?) or stumble over arithmetic: they do not see characters or digits one by one, but blocks. It also explains why cost and context length are measured in tokens, and why certain languages, poorly represented in the data, are split into far more tokens than English, and therefore cost more to process (an angle on inequality taken up in Chapter 21).
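
The merging idea behind Byte Pair Encoding can be sketched as follows. This is a toy version operating on a hypothetical four-word corpus with made-up frequencies; real tokenizers work on bytes, handle word boundaries, and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties go to the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` merges."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical corpus: word -> number of occurrences.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(corpus, num_merges=4)
```

On this corpus the first merges build the fragment "est" and then the whole word "low": the frequent word ends up as a single token, while "lower" remains split into "low", "e", "r".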

In context

The hidden cost of attention (quadratic complexity)

The attention mechanism comes at a price: for a sequence of n words, each word must compare itself to all the others, that is, on the order of n times n comparisons. This is called quadratic complexity: doubling the length of the text does not double the cost, it quadruples it. This is the technical reason why processing very long documents (an extended context window, Chapter 4) is costly in computation and memory, and why context is not infinite. A whole branch of research therefore aims to loosen this lock: more economical attention variants (so-called sparse, or approximate, attention), implementations that optimize memory usage (such as FlashAttention), or alternative architectures seeking to recover the linear efficiency of the old sequential models without paying for it in performance. Lengthening the context while keeping this quadratic cost under control is one of the permanent engineering challenges behind the progress of large models.
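
A quick back-of-the-envelope calculation shows how fast this grows. The sketch below simply counts comparisons; the memory figure assumes, for illustration only, that every score is stored on 2 bytes for a single head in a single layer, which real implementations may avoid.

```python
def attention_comparisons(n):
    """Number of query-key comparisons for a sequence of n tokens."""
    return n * n

def score_matrix_mib(n, bytes_per_score=2):
    """Size in MiB of the full n-by-n score matrix (assumed 2 bytes per score)."""
    return n * n * bytes_per_score / 2**20

for n in (1_024, 2_048, 4_096, 131_072):
    print(f"{n:>7} tokens: {attention_comparisons(n):>14,} comparisons, "
          f"{score_matrix_mib(n):>10,.0f} MiB of scores")
```

Going from 1,024 to 2,048 tokens multiplies the number of comparisons by four, and each further doubling does the same.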

In context

The alternatives to the Transformer (Mamba and state space models)

The Transformer reigns, but the quadratic cost of attention (seen above) has revived the quest for more economical architectures for very long sequences. The most prominent line is that of State Space Models (SSMs), of which Mamba (2023) is the best-known representative. The idea draws on the old recurrent networks: process the sequence while maintaining a compact state that summarizes the past, which gives a cost that is linear (rather than quadratic) in length, and very fast inference. Where a Transformer must, for each word, look at all the others, an SSM updates its state on the fly. The difficulty is to recover, through mathematical tricks, the Transformer's ability to select the relevant information across long distances, something the old RNNs could not do. In 2026, these models (often hybridized with a few attention layers) remain a minority compared to Transformers, but are promising wherever very long context and efficiency take priority. They are a reminder of one lesson: no architecture is final, and the one that dominates today could be complemented, or even surpassed, tomorrow.
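
The "compact state updated on the fly" can be illustrated with the simplest possible recurrence. This sketch is not Mamba: it uses a single number as state and two fixed, hypothetical coefficients `a` and `b`, whereas actual state space models use larger states and, in Mamba's case, coefficients that depend on the input.

```python
def linear_recurrence(inputs, a=0.9, b=0.1):
    """Scan a sequence with a one-number state: h_t = a * h_(t-1) + b * x_t."""
    h = 0.0
    states = []
    for x in inputs:
        # One constant-cost update per token, so the total cost is linear in length.
        h = a * h + b * x
        states.append(h)
    return states

# A single "event" at the start, followed by silence.
states = linear_recurrence([1.0, 0.0, 0.0])
```

The state carries a fading trace of the first input through the later steps without ever looking back at it. With fixed coefficients, everything fades at the same rate, which is precisely the limitation that the selection mechanisms mentioned above try to lift.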

3.4 Scaling laws

The researcher Richard Sutton summed up the philosophical lesson under the name "the bitter lesson": in the long run, general methods that take advantage of ever-growing computing power end up winning out over clever methods painstakingly hand-crafted by experts. Frustrating for human ingenuity, but remarkably effective.

3.5 Pre-training, fine-tuning, and RLHF

Diagram 3.3. The pipeline that builds an assistant. This is essentially the process that turned GPT-3 into ChatGPT at the end of 2022.

Pre-training. The model is made to ingest a colossal fraction of the available text (web pages, books, code, articles). From it, it learns grammar, facts, reasoning, and style, simply by trying to predict what comes next. This is the most expensive step: weeks of computation on thousands of processors, for tens, even hundreds, of millions of dollars.
Supervised fine-tuning (SFT), also called instruction tuning. The model is shown thousands of examples of the form "a user's question, an assistant's ideal answer." It thereby learns to behave like an assistant: to answer, to follow instructions, to adopt the right register.
RLHF (Reinforcement Learning from Human Feedback). Humans compare several of the model's answers and indicate which they prefer. A second model, called the "reward model," learns these preferences, then serves to train the main model to produce answers judged better: more helpful, more honest, less toxic. This is the step that makes the assistant pleasant and relatively safe.
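
How can a reward model "learn preferences" from simple comparisons? One common formulation, assumed here for illustration since the text above does not specify one, penalizes the reward model whenever it scores the answer humans rejected above the one they preferred. The function name and the scores below are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Small when the preferred answer gets the higher score, large otherwise.
    """
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# The reward model agrees with the human: low loss.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)

# The reward model disagrees with the human: high loss.
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Training the reward model amounts to lowering this loss over many human comparisons; the resulting scores can then guide the reinforcement learning stage.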

Key takeaways (Chapter 3)

The Transformer (2017, the paper "Attention Is All You Need") replaces sequential reading with the attention mechanism, which connects all words directly to one another and parallelizes on GPUs.
Attention learns, for each word, which other words matter and how much. Multi-head attention multiplies these "gazes"; positional encoding preserves word order.
The decoder-only lineage (the GPT family) prevailed for text generation.
Scaling laws show that performance grows predictably with size, data, and compute, hence the race for resources. But their returns may plateau.
An assistant is built in three stages: pre-training, supervised fine-tuning, then RLHF. This last stage is also the cradle of the alignment problem.

In the next chapter, we throw open the hood of the flagship object of this new era: the large language model itself.
