Part I · FOUNDATIONS: UNDERSTANDING AI BEFORE THE LLMS

The Transformer revolution: "attention is all you need"

Chapter 3 · 12 min read · Updated: June 2026

3.1 The wall of sequential architectures

To break through the wall, an architecture was needed that could pull off two feats at once: process the entire sequence in one go (to be fast) and connect any word directly to any other, however far apart they might be (so as to forget nothing). This is exactly what the attention mechanism delivers.

3.2 The intuition behind attention

Another image: picture a meeting where, to make sense of a remark, you automatically weight what each participant said earlier according to its relevance. Attention is that weighting system, applied at scale and learned automatically.

Diagram 3.1. The attention mechanism in pictures. To interpret the word "it," the model assigns a high weight to "animal" and a low weight to "street." These weights are not written by a human: they are learned from billions of sentences.
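
To make the weighting concrete, here is a minimal sketch of scaled dot-product attention, the formula used in the Transformer paper, written with NumPy. The two-dimensional vectors for "animal," "street," and "it" are hypothetical and hand-picked for illustration. In a real model they are learned, and they have hundreds or thousands of dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v).
    Returns (output, weights), where weights has shape (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # relevance of each key to each query
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ values, weights

# Hypothetical, hand-picked vectors (a trained model would learn these).
words = ["animal", "street", "it"]
keys = np.array([[2.0, 0.0],   # "animal"
                 [0.0, 2.0],   # "street"
                 [1.0, 1.0]])  # "it"
values = keys.copy()
query_it = np.array([[3.0, 0.0]])  # query for "it", chosen to point toward "animal"

output, weights = attention(query_it, keys, values)
for word, w in zip(words, weights[0]):
    print(f"{word:>7}: {w:.3f}")
```

The weights are nothing more than a softmax over dot products: the closer a word's key lies to the query, the larger its share of the result. Training consists of adjusting the vectors so that these shares become useful.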

3.3 The Transformer architecture (2017)

Two ingredients are worth remembering:

  • Multi-head attention. Rather than a single weighting system, the Transformer runs several in parallel, like so many different "gazes" cast over the sentence. One head may track grammar (subject-verb agreement), another meaning, another references ("it" refers to "animal"). By combining these gazes, the model captures very rich relationships.
  • Positional encoding. Attention, as it stands, is blind to word order: to it, "the dog bites the man" and "the man bites the dog" are the same bag of words and would be treated identically. So information about each word's position in the sentence is injected into its representation, in order to preserve order.

Diagram 3.2. A very simplified view of a Transformer. A block combines a multi-head attention layer and a computation layer; dozens, even hundreds, of these blocks are stacked. It is the depth and size of this stack that give the model its power.
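
The two ingredients can be sketched together in a few lines of NumPy. This is an illustration under stated assumptions, not a faithful Transformer block: the weights are random rather than learned, the sizes (`d_model = 8`, two heads) are arbitrary, and the helper names are hypothetical. The positional encoding is the sinusoidal scheme from the original paper; many later models use other schemes. The sketch is meant to show that, without positions, the word "bites" receives the same output in both sentences, and that adding positions breaks the tie.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """One attention head: every position attends to every position."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def multi_head_attention(x, heads, w_o):
    """Run every head on the same input, concatenate, then mix with w_o."""
    outputs = [self_attention(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    return np.concatenate(outputs, axis=-1) @ w_o

def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal positional encoding (d_model is assumed to be even)."""
    pos = np.arange(n_positions)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Random stand-ins for learned parameters (assumption: illustrative only).
rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
scale = 1.0 / np.sqrt(d_model)
vocab = {w: rng.normal(size=d_model) for w in ["the", "dog", "bites", "man"]}
heads = [tuple(rng.normal(size=(d_model, d_head)) * scale for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, d_model)) * scale

def embed(sentence):
    return np.stack([vocab[w] for w in sentence.split()])

a = embed("the dog bites the man")
b = embed("the man bites the dog")
pe = sinusoidal_positions(len(a), d_model)

# "bites" sits at index 2 in both sentences.
same_without = np.allclose(multi_head_attention(a, heads, w_o)[2],
                           multi_head_attention(b, heads, w_o)[2])
same_with = np.allclose(multi_head_attention(a + pe, heads, w_o)[2],
                        multi_head_attention(b + pe, heads, w_o)[2])
print("identical without positions:", same_without)
print("identical with positions:   ", same_with)
```

Each head has its own projection matrices, which is what allows the heads to specialize. The final matrix `w_o` merges their outputs back into a single representation per word.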

3.4 Scaling laws

The researcher Richard Sutton summed up the philosophical lesson in a 2019 essay titled "The Bitter Lesson": in the long run, general methods that take advantage of ever-growing computing power end up winning out over clever methods painstakingly hand-crafted by experts. Frustrating for human ingenuity, but remarkably effective.

3.5 Pre-training, fine-tuning, and RLHF

Diagram 3.3. The pipeline that builds an assistant. This is essentially the process that turned the GPT-3 family of models into ChatGPT at the end of 2022.

  1. Pre-training. The model is made to ingest a colossal fraction of the available text (web pages, books, code, articles). From it, it learns grammar, facts, reasoning, and style, simply by trying to predict what comes next. This is the most expensive step: weeks of computation on thousands of processors, for tens, even hundreds, of millions of dollars.
  2. Supervised fine-tuning (SFT), also called instruction tuning. The model is shown thousands of examples of the form "a user's question, an assistant's ideal answer." It thereby learns to behave like an assistant: to answer, to follow instructions, to adopt the right register.
  3. RLHF (Reinforcement Learning from Human Feedback). Humans compare several of the model's answers and indicate which they prefer. A second model, called the "reward model," learns these preferences, then serves to train the main model to produce answers judged better: more helpful, more honest, less toxic. This is the step that makes the assistant pleasant and relatively safe.
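
The objectives behind these stages fit in a few lines. The sketch below, in NumPy, shows two of them: the next-token prediction loss used in pre-training (supervised fine-tuning typically reuses it on curated question-answer pairs), and a pairwise preference loss of the kind commonly used to train a reward model. The function names and toy numbers are hypothetical, and the reinforcement learning step itself is not shown.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, token_ids):
    """Average cross-entropy for predicting token t+1 at position t.

    logits: (seq_len, vocab_size) scores the model produced at each position.
    token_ids: (seq_len,) the text that was actually observed.
    """
    predictions = log_softmax(logits[:-1])  # the last position has no next token
    targets = token_ids[1:]
    return -predictions[np.arange(len(targets)), targets].mean()

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = np.asarray(reward_chosen) - np.asarray(reward_rejected)
    return np.mean(np.logaddexp(0.0, -margin))  # stable form of log(1 + e^-margin)

# Toy example: a 5-word vocabulary and a model that knows nothing yet.
token_ids = np.array([0, 3, 1, 4])
uniform_logits = np.zeros((4, 5))
print("untrained next-token loss:", next_token_loss(uniform_logits, token_ids))

# The reward model is penalized less when it scores the preferred answer higher.
print("agrees with the human:   ", preference_loss([2.0], [0.0]))
print("disagrees with the human:", preference_loss([0.0], [2.0]))
```

A model that spreads its bets evenly over the vocabulary pays a loss of log(vocabulary size); training pushes that number down. The reward model's loss, in turn, shrinks as it ranks the answer humans preferred above the one they rejected.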

Key takeaways (Chapter 3)

  • The Transformer (2017, the paper "Attention Is All You Need") replaces sequential reading with the attention mechanism, which connects all words directly to one another and parallelizes on GPUs.
  • Attention learns, for each word, which other words matter and how much. Multi-head attention multiplies these "gazes"; positional encoding preserves word order.
  • The decoder-only lineage (the GPT family) prevailed for text generation.
  • Scaling laws show that performance grows predictably with size, data, and compute, hence the race for resources. But their returns may plateau.
  • An assistant is built in three stages: pre-training, supervised fine-tuning, then RLHF. This last stage is also the cradle of the alignment problem.

In the next chapter, we throw open the hood of the flagship object of this new era: the large language model itself.