Learning from data: machine learning & deep learning

2.1 The paradigm shift: programming or learning

Machine learning (in French, apprentissage automatique) inverts the logic of classical programming. We no longer supply the rules: we supply examples (thousands of emails already labeled "spam" or "not spam"), and the machine discovers for itself the rules that tell them apart. We no longer program what to do; we program how to learn.

Diagram 2.1. The fundamental reversal. The machine no longer receives the rules: it learns them from examples. The product of this learning is called a model.
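To make the reversal concrete, here is a minimal sketch contrasting a rule written by hand with a rule derived from labeled examples. Everything in it is an illustrative assumption: the four example emails, the word-counting "model", and the function names `rule_based_is_spam`, `train`, and `predict` are invented for this sketch, and real spam filters are far more elaborate.

```python
from collections import Counter

# Classical programming: the programmer writes the rule.
def rule_based_is_spam(text):
    return "free money" in text.lower()

# Machine learning: the rule is derived from labeled examples.
def train(examples):
    """Return a 'model': for each word, (spam occurrences - non-spam occurrences)."""
    scores = Counter()
    for text, is_spam in examples:
        for word in set(text.lower().split()):
            scores[word] += 1 if is_spam else -1
    return scores

def predict(model, text):
    """Classify as spam when the words seen mostly in spam dominate."""
    total = sum(model[word] for word in set(text.lower().split()))
    return total > 0

# Hypothetical training examples (text, is_spam).
examples = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
model = train(examples)

new_email = "claim your free prize"
print(rule_based_is_spam(new_email))  # the hand-written rule misses this wording
print(predict(model, new_email))      # the learned scores flag it
```

The hand-written rule only catches the exact phrase its author thought of, whereas the learned scores generalize to a new message built from words that appeared in spam examples.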

2.2 Three ways to learn

Machine learning comes in three broad families, which must be carefully distinguished because they recur everywhere in what follows.

Diagram 2.2. The three broad families of learning.

In plain terms, with analogies:

Supervised learning: learning with a teacher who corrects. You show the student thousands of exercises along with their solutions. The student infers a general method, which it then applies to exercises it has never seen. This is the most widespread form: image recognition, translation, price prediction.
Unsupervised learning: exploring without an answer key. You give the student a pile of documents with no indication whatsoever, and ask it to bring order to them: group together what looks alike, spot what stands out. This is how a customer base is segmented, or how banking fraud is detected (by flagging an "abnormal" transaction).
Reinforcement learning (RL): learning by trial and error. The student acts in an environment, receives a reward when it succeeds and a penalty when it fails, and gradually adjusts its behavior to maximize its rewards. This is how an AI is trained to play, to pilot a robot, and, as we will see, to make LLMs helpful and polite.
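As an illustration of the unsupervised case ("group together what looks alike"), here is a minimal sketch of clustering with no labels at all. It uses a bare-bones version of the k-means algorithm on plain numbers; the data, the function name `kmeans_1d`, and the choice of starting centroids are assumptions made for the example.

```python
def kmeans_1d(points, k=2, rounds=10):
    """Tiny k-means on numbers (requires k >= 2): returns (centroids, clusters)."""
    # Assumption: start with k centroids evenly spaced between min and max.
    lo, hi = min(points), max(points)
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    for _ in range(rounds):
        # Step 1: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of its group.
        centroids = [
            (sum(c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: number of purchases per customer, with no labels.
purchases = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(purchases)
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

Nobody told the program that there were "occasional" and "frequent" customers; the two groups emerge from the data alone.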

In context

Beyond the three main modes (transfer, Bayesian, AutoML, active learning)

The three families above do not tell the whole story; several cross-cutting ideas round out the toolbox. Transfer learning consists of reusing a model already trained on a large task as the starting point for a related task, rather than starting from scratch: this is precisely the principle of pre-training then fine-tuning large models (chapter 3), and the reason we no longer need millions of examples for every new problem. Bayesian methods reason in probabilities: instead of a single answer, they estimate an uncertainty ("70% chance that..."), invaluable when getting it wrong is costly (medicine, finance). AutoML (and automatic architecture search) automates the very design of models, letting the machine search for the best settings. Finally, active learning lets the model choose the examples it wants labeled, so as to learn quickly with a minimum of costly human annotation. So many variations on one and the same question: how to learn better, with less data and effort?

Under the hood

The major reinforcement-learning algorithms

Reinforcement learning (above) comes in several families of algorithms whose names come up again and again. Q-learning learns, for each situation, the expected value of every possible action (the "Q"), then picks the one that promises the most; coupled with a neural network, it produced the Deep Q-Network that learned to play Atari games from the raw pixels alone. Policy gradient methods (the most widely used being PPO, Proximal Policy Optimization) optimize the agent's strategy directly, in small cautious steps to avoid abrupt swings; this is precisely the algorithm at the heart of RLHF for large models (chapter 3). Finally, Monte Carlo tree search (MCTS) explores a tree of possible moves by simulating many games to estimate the best branches; combined with neural networks, it is the key to the success of AlphaGo (chapter 1). Behind the general idea of "learning by trial and error," then, lie precise mathematical tools, found alike in games, in robotics (chapter 13), and in model alignment.
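The Q-learning idea described above can be sketched on a toy problem. This is a minimal tabular version under stated assumptions: the five-state corridor, the reward of 1 at the goal, and the values of `ALPHA` and `GAMMA` are all invented for the example, and the agent explores purely at random rather than with a tuned exploration strategy.

```python
import random

N_STATES = 5            # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)        # 0 = step left, 1 = step right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumed values)

def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]

for _ in range(500):                       # 500 episodes of trial and error
    state = 0
    while state != N_STATES - 1:
        action = rng.choice(ACTIONS)       # explore at random
        nxt, reward = step(state, action)
        # Q-learning update: nudge Q toward reward + discounted best next value.
        target = reward + GAMMA * max(Q[nxt])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = nxt

# Acting greedily: in each state, pick the action with the highest Q.
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # [1, 1, 1, 1]: always step right, toward the goal
```

The table `Q` is exactly the "expected value of every possible action" in each situation; a Deep Q-Network replaces this table with a neural network when the situations are too numerous to enumerate.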

2.3 The artificial neuron and networks

Diagram 2.3. A "deep" neural network. Information flows from left to right, layer by layer. "Deep" simply means: having many hidden layers. That is where the term deep learning comes from.

Under the hood

Mathematically, a neuron computes a weighted sum of its inputs, adds a tuning term to it (the "bias"), then passes the result through a nonlinear activation function (for example the ReLU function, which replaces any negative number with zero). This nonlinearity is crucial: without it, stacking layers would be pointless (the composition of linear functions remains linear). The weights and biases are the parameters of the network: they are what learning will adjust. When we say a model has "70 billion parameters," we are talking about the number of these internal settings. A theoretical result, the universal approximation theorem, guarantees that a sufficiently large network can approximate any continuous function on a bounded domain as closely as desired (it says that such a network exists, not that training will find it): it is the mathematical promise that underpins the whole enterprise.
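The computation of a single neuron, and the reason the nonlinearity matters, can be sketched in a few lines. The weights, biases, and inputs below are arbitrary values chosen for the example, and `neuron`, `stacked_linear`, and `collapsed_linear` are hypothetical helper names.

```python
def relu(z):
    """ReLU activation: negative numbers become zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, plus the bias, passed through ReLU."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

out_pos = neuron([1.0, 2.0], [0.5, -1.0], 2.0)  # z = 0.5 - 2.0 + 2.0 = 0.5
out_neg = neuron([1.0, 2.0], [0.5, -1.0], 0.0)  # z = -1.5, clipped to 0.0
print(out_pos, out_neg)  # 0.5 0.0

# Without an activation, two stacked layers collapse into one:
# 3 * (2x + 1) - 4 is the same function as 6x - 1.
def stacked_linear(x):
    hidden = 2.0 * x + 1.0      # first "layer"
    return 3.0 * hidden - 4.0   # second "layer"

def collapsed_linear(x):
    return 6.0 * x - 1.0        # one layer does the same job

print(stacked_linear(0.5), collapsed_linear(0.5))  # 2.0 2.0
```

In this sketch the network's parameters are the numbers `0.5`, `-1.0`, and the bias: a model with 70 billion parameters has 70 billion such numbers.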

2.4 How a machine learns: cost and backpropagation

Through repetition, the error decreases, and the network becomes competent. The most telling image is that of a hike through fog to descend into a valley: you cannot see the bottom, but you feel the slope underfoot, and you take a step downward. By repeating, you eventually reach a low point. That "slope," in mathematics, is called the gradient, and the method is called gradient descent.

Diagram 2.4. The learning loop. Repeated billions of times over enormous data sets, it transforms a random network into a competent model.

Under the hood

Descending the slope (gradient and learning rate)

How is the "adjustment" of the weights actually carried out? Through gradient descent, whose image is telling: picture the model's error as a landscape of hills and valleys, where you are looking for the lowest point (the minimum error). At each step, the gradient indicates the direction of steepest slope; you then take a small step downward, and start again. The size of that step is a decisive setting, the learning rate: too large, and you bounce from one slope to another without ever settling; too small, and the descent is endless. In practice, you do not compute the error over all the data at once (too costly), but over small batches drawn at random (mini-batch), hence the name stochastic gradient descent; a full pass over the data is called an epoch. Sophisticated optimizers (such as Adam) automatically adapt the step for each parameter, speeding up and stabilizing the descent. It is this process, repeated billions of times, that gradually sculpts a random network into a competent model.
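The effect of the learning rate, and the mini-batch loop, can be sketched on deliberately tiny problems. The assumptions are stated in the comments: a one-parameter "landscape" whose minimum sits at w = 3, then a one-weight model fitted to made-up data that follow y = 2x. The step sizes and the batch size of 2 are arbitrary illustrative choices.

```python
import random

def descend(lr, steps=50, w=0.0):
    """Gradient descent on error(w) = (w - 3)**2, whose minimum is at w = 3."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # slope of the error at the current position
        w -= lr * grad       # take a step downhill, scaled by the learning rate
    return w

w_good = descend(0.1)         # settles very close to 3
w_too_large = descend(1.1)    # overshoots further at every step and diverges
w_too_small = descend(0.001)  # still far from 3 after 50 steps

# Stochastic (mini-batch) gradient descent: fit y = w * x to toy data.
data = [(x, 2.0 * x) for x in range(1, 9)]   # assumed data, true w is 2
rng = random.Random(0)
w = 0.0
for epoch in range(50):                      # one epoch = one full pass
    rng.shuffle(data)                        # draw the batches at random
    for start in range(0, len(data), 2):
        batch = data[start:start + 2]        # mini-batch of 2 examples
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= 0.01 * grad

print(round(w_good, 3), round(w, 3))  # 3.0 2.0
```

Optimizers such as Adam refine the `w -= lr * grad` line by adapting the step for each parameter; the surrounding loop keeps the same shape.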

In context

Overfitting, or the art of not reciting by rote

A trap lies in wait for all machine learning: overfitting. A model too closely fitted to its training data ends up "reciting it by rote," noise and errors included, instead of extracting useful regularities from it; it then excels on the examples it has seen, but fails on new cases. This is the opposite of the goal: generalization, that is, the ability to perform well on data never encountered before. To measure it, a portion of the data is systematically set aside (a test set) that the model does not see during training. Conversely, a model that is too simple underfits: it misses regularities that are nonetheless present. Finding the right balance is the central art of the discipline (this is known as the bias-variance trade-off), and to that end we have regularization techniques that rein in the model's complexity to keep it from sticking too closely to the data. This concern with generalization will take on particular significance for large models, which raise the question of whether they understand or memorize (chapters 4 and 23).
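The gap between reciting and generalizing can be made visible with a held-out test set. In this minimal sketch, all of which is assumed for illustration, the true rule is "label 1 when x is at least 50", four training labels are deliberately wrong (the noise), and two models are compared: a memorizer that copies the label of the nearest training point, and a simple rule with a single learned threshold.

```python
def true_label(x):
    return 1 if x >= 50 else 0

NOISY = {10, 30, 70, 90}   # training points whose labels are deliberately wrong

train = [(x, (1 - true_label(x)) if x in NOISY else true_label(x))
         for x in range(0, 100, 2)]                   # even numbers: seen in training
test = [(x, true_label(x)) for x in range(1, 100, 2)]  # odd numbers: never seen

def memorizer(x):
    """Recites: returns the label of the closest training point, noise included."""
    _, label = min(train, key=lambda p: abs(p[0] - x))
    return label

def fit_threshold(data):
    """Simple model: the cut-off t for which 'x >= t' makes the fewest errors."""
    def errors(t):
        return sum((1 if x >= t else 0) != y for x, y in data)
    return min((x for x, _ in data), key=errors)

threshold = fit_threshold(train)

def simple_rule(x):
    return 1 if x >= threshold else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train), accuracy(memorizer, test))      # 1.0 0.92
print(accuracy(simple_rule, train), accuracy(simple_rule, test))  # 0.92 1.0
```

The memorizer is perfect on what it has seen and worse on new cases; the simple rule cannot fit the noisy labels, and for that very reason recovers the underlying regularity. Only the test set reveals which model generalizes.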

In context

Catastrophic forgetting and continual learning

A deep limitation of neural networks sheds light on a quirk of today's AIs: their knowledge is frozen at a certain date. When you train a network on a new task, the adjustment of the weights (above) tends to overwrite what it had learned before: this is catastrophic forgetting. A human takes in new information without erasing the rest; a network, by contrast, risks writing the new over the old. Practical consequence: you cannot simply "add" recent events to an already-trained large model on the fly; you would have to retrain it, a costly operation, hence the knowledge cutoff date observed in assistants. Getting an AI to learn continuously without forgetting everything is precisely the subject of continual learning, an active area of research that remains unsolved. In the meantime, the obstacle is circumvented another way: by supplying the model with fresh information at the moment of answering (retrieval-augmented generation, chapter 6) rather than etching it into its weights.
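Catastrophic forgetting can be reproduced in miniature. The sketch below is a deliberately extreme simplification: the "network" has a single weight, and the two tasks (y = 2x, then y = -x) are invented so that they conflict directly. Real networks have many weights and forget more gradually, but the mechanism, new gradients overwriting old settings, is the same.

```python
def loss(w, data):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.05, steps=100):
    """Plain gradient descent on the given task only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

task_a = [(x, 2 * x) for x in (1, 2, 3)]   # task A: y = 2x
task_b = [(x, -x) for x in (1, 2, 3)]      # task B: y = -x

w = train(0.0, task_a)
loss_a_before = loss(w, task_a)   # close to 0: task A is mastered

w = train(w, task_b)              # now train on task B alone
loss_a_after = loss(w, task_a)    # large: task A has been overwritten

print(round(loss_a_before, 6), round(loss_a_after, 2))  # 0.0 42.0
```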

2.5 2012: the big bang of deep learning

Why 2012 and not before? Because the three missing fuels (chapter 1) were finally brought together:

Data: ImageNet provided the gigantic set of labeled images that had been missing.
Compute: AlexNet was trained on GPUs from the company NVIDIA. These chips, designed to compute the pixels of video games in parallel, turned out to be ideal for the massive multiplications of neural networks. This technical detail would have colossal geopolitical consequences: it would make NVIDIA one of the most valuable companies in the world (chapter 8).
Algorithms: refinements (the ReLU activation function, the dropout regularization technique) made it possible to train deeper networks without their going off the rails.

2.6 Seeing and reading: CNNs and RNNs

In context

Graph neural networks (GNNs)

Alongside CNNs (for images) and RNNs (for sequences), a third family handles data shaped like a network: graph neural networks (GNNs). Many objects in the world are naturally graphs, entities linked to one another: a molecule (atoms joined by bonds), a social network (people linked by friendships), a road network, the web itself. A GNN learns by circulating information between neighbors: each node updates its representation by aggregating those of its neighbors, step by step. This makes it possible to predict properties (will a molecule make a good drug? chapter 14), to recommend (products, contacts), or to detect fraud in a network of transactions. It is the architecture of choice wherever relational structure matters as much as the data itself, where a CNN or a classical Transformer would be ill-suited.
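The "circulating information between neighbors" step can be sketched as follows. This is a simplification under stated assumptions: the four-node graph and the initial features are invented, each node carries a single number, and the aggregation is a plain average, whereas a real GNN applies learned weights and a nonlinearity at each round.

```python
# Hypothetical graph: a chain A - B - C - D.
neighbors = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
# Initial representation of each node: only A carries a signal.
features = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

def message_passing_round(h):
    """Each node averages its own value with those of its neighbors."""
    new_h = {}
    for node, nbrs in neighbors.items():
        values = [h[node]] + [h[n] for n in nbrs]
        new_h[node] = sum(values) / len(values)
    return new_h

h1 = message_passing_round(features)
h2 = message_passing_round(h1)

print(h1["C"], h2["C"] > 0, h2["D"])  # 0.0 True 0.0
```

After one round, A's signal has reached only its direct neighbor B; after two rounds it has reached C but not yet D. Each round extends a node's view by one hop, which is what "step by step" means here.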

2.7 Representing meaning: embeddings

The brilliant trick: these numbers are learned in such a way that words close in meaning occupy nearby positions in the space. "Cat" and "dog" end up neighbors; "king" and "banana" are far apart. Meaning becomes geometry.

Better still: the directions of the space capture relationships. The example that has become famous (from the word2vec model, 2013) is almost magical:

king − man + woman ≈ queen

In other words, the vector linking "man" to "king" is roughly the same as the one linking "woman" to "queen." The machine discovered, all on its own and without being told, the abstract concept of royalty and that of gender, simply by observing how words are used across billions of sentences.
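The arithmetic on vectors can be tried out by hand. The two-dimensional vectors below are invented for the illustration (one axis loosely standing for "royalty", the other for "gender"); they are not word2vec outputs, and real embeddings have hundreds of learned dimensions with no such tidy labels.

```python
import math

# Hand-made toy embeddings: (royalty, gender). Assumed values, not learned.
vectors = {
    "king":   (1.0, 1.0),
    "queen":  (1.0, -1.0),
    "man":    (0.0, 1.0),
    "woman":  (0.0, -1.0),
    "banana": (-1.0, 0.1),
    "apple":  (-0.9, -0.1),
}

def cosine(u, v):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)

# Look for the closest word, excluding the three words used in the query.
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```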

Under the hood

The underlying principle is the distributional hypothesis, summed up by the linguist J.R. Firth in 1957: "you shall know a word by the company it keeps." By training a model to predict the context of a word (or a word from its context), you force it to place in neighboring regions the words that appear in similar contexts. Modern LLMs generalize this idea massively: they no longer embed only isolated words, but word fragments as a function of their entire context, which lets them distinguish the multiple senses of a single word ("the pound sterling" vs. "a pound of butter"). Embeddings are also the fuel of technologies omnipresent in 2026: semantic search engines, recommendation systems, and the famous retrieval-augmented generation (RAG) that lets an LLM draw on a document base (we will return to this in chapters 6 and 9).
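The distributional hypothesis can be illustrated without any neural network, by simply counting which words keep company with which. Note the substitution: this sketch counts co-occurrences rather than training a predictive model as word2vec does, and the six-sentence corpus is invented for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical miniature corpus.
corpus = [
    "the cat eats food",
    "the dog eats food",
    "the cat sleeps",
    "the dog sleeps",
    "the dog barks",
    "a banana is yellow",
]

# For each word, count the other words appearing in the same sentence.
contexts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j, other in enumerate(words):
            if i != j:
                contexts[word][other] += 1

def cosine(a, b):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine(contexts["cat"], contexts["dog"])
sim_cat_banana = cosine(contexts["cat"], contexts["banana"])
print(round(sim_cat_dog, 2), sim_cat_banana)  # 0.94 0.0
```

"Cat" and "dog" never appear in the same sentence here, yet they end up close because they share the same company ("the", "eats", "food", "sleeps"); "banana" shares none of it.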

Diagram 2.5. A fragment of a knowledge graph. Knowledge here is explicit and verifiable: each fact is a named relation between two entities, readable by a machine as well as by a human.

This is the modern form of the symbolic representation of knowledge (chapter 1), and it is what structures many search engines behind the scenes (their answer panels). Its strength is precision and traceability (you know where each fact comes from); its weakness, that it must be built and maintained by hand. Hence the growing interest in neuro-symbolic approaches, which marry the flexibility of neural networks with the rigor of graphs: an LLM can query a knowledge graph to anchor its answers in verified facts (a structured variant of retrieval-augmented generation, chapter 6), and thereby reduce its hallucinations.
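A knowledge graph of this kind is often stored as a list of (subject, relation, object) triples. The sketch below is a minimal illustration: the four facts and the relation names are chosen for the example and are not taken from the diagram, and `query` is a hypothetical helper, not the API of any particular graph database.

```python
# Each fact is a named relation between two entities.
triples = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "European Union"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]

def query(subject=None, relation=None, obj=None):
    """Return every triple matching the given fields (None matches anything)."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    ]

# One-hop question: what is Paris the capital of?
print(query("Paris", "capital_of"))  # [('Paris', 'capital_of', 'France')]

# Two-hop question: Marie Curie was born in the capital of which country?
city = query("Marie Curie", "born_in")[0][2]
country = query(city, "capital_of")[0][2]
print(country)  # Poland
```

Every answer can be traced back to the exact triples that produced it, which is the traceability that a neural network's weights do not offer.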

2.8 The three ingredients of modern AI

Diagram 2.6. The fundamental triad. None of the three suffices on its own. It is their conjunction, from the 2010s onward, that made modern AI possible, and it is the race for these three resources that today structures the economics and geopolitics of the sector.

This triad illuminates the rest of the course:

The quest for data raises questions of intellectual property and privacy (chapters 21 and 25).
The quest for compute explains NVIDIA's valuation, the chip war, and the energy bill (chapters 8 and 10).
The quest for algorithms is the focus of the fierce competition between labs (chapter 7), and its next great leap, the Transformer, is the subject of the following chapter.

2.9 The brain and the machine: a fruitful and misleading analogy

Under the hood

The contrast is first a matter of scale and nature. The human brain has roughly 86 billion neurons and on the order of a hundred trillion connections (synapses), all of it fitting in a small volume and consuming only about 20 watts, less than a light bulb. A large model, for its part, can line up hundreds of billions of parameters, but its training and operation demand megawatts (chapters 8 and 10): for a given task, living organisms remain unrivaled in energy efficiency. Above all, the resemblance stops at the surface. Several differences run deep:

The signal. A biological neuron communicates through brief electrical impulses (the "action potentials"), discrete and asynchronous, modulated by a complex chemistry (dozens of neurotransmitters). The artificial neuron, by contrast, exchanges synchronized continuous numbers, with no chemistry at all. The family of spiking networks (neuromorphic computing, chapter 8) seeks precisely to move closer to the biological model, but remains marginal.
Learning. Artificial networks learn through gradient backpropagation (section 2.4), a global mechanism that requires propagating an error backward through the entire network. Yet nothing of the sort has been clearly observed in the brain: biological learning seems above all local (synapses strengthen according to the joint activity of the neurons they connect, a principle summed up in the formula "neurons that fire together wire together"), and involves sleep, emotion, and reward. How the brain achieves such efficient learning without backpropagation remains an open question.
Plasticity and time. The brain is plastic: it constantly rewires itself, forgets, consolidates, and often learns from a single example. A model, once trained, is largely frozen; it requires countless examples and suffers from catastrophic forgetting (it erases the old when you teach it the new). The brain is also recurrent and embodied (forever in a loop with a body and an environment), whereas most networks process information in a single sweep, from input to output.

Key takeaways (chapter 2)

Machine learning reverses classical programming: we no longer supply the rules, we supply examples, and the machine learns the rules. The result is called a model.
Three families: supervised learning (with an answer key), unsupervised (without an answer key), reinforcement (trial and error).
A neural network stacks artificial neurons in layers; "deep" means "having many layers" (deep learning).
Learning happens through gradient descent and backpropagation: we measure the error, then correct each weight by a small step to reduce it.
2012 (AlexNet/ImageNet) marks the big bang of deep learning, made possible by the conjunction of data + GPUs + algorithms.
Embeddings transform meaning into geometry: this is the conceptual bridge to large language models.
Every modern AI rests on a triad: data, compute, algorithms.

We are now ready to cross the threshold. In chapter 3, we tell the story of the 2017 innovation that broke open the locks of language and gave birth to the era of large models: the Transformer.
