AI agents: from chatbot to autonomous actor

6.1 From model to agent

This is the move from the copilot (which assists you while you work) to the digital worker (to whom you delegate the entire task). This shift is so central that the 2025-2026 period was widely dubbed "the year of agents."

6.2 Anatomy of an agent

Diagram 6.1. An agent's loop. The "think, act, observe" cycle repeats until the goal is accomplished. Memory and tools are what distinguish an agent from a mere conversational model.
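
The cycle in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not a real agent: the `think` method stands in for a call to a language model, and the `TOOLS` registry, the `add_one` tool, and the numeric goal are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical tool registry: each tool takes a string and returns a string.
TOOLS: dict[str, Callable[[str], str]] = {
    "add_one": lambda text: str(int(text) + 1),
}


@dataclass
class Agent:
    goal: int                                        # toy goal: reach this number
    memory: list[str] = field(default_factory=list)  # log of past observations

    def think(self, state: str) -> Optional[tuple[str, str]]:
        """Stand-in for the model: choose a tool call, or None once the goal is met."""
        if int(state) >= self.goal:
            return None
        return ("add_one", state)

    def run(self, state: str, max_steps: int = 20) -> str:
        for _ in range(max_steps):           # hard cap so the loop always terminates
            decision = self.think(state)     # think
            if decision is None:
                break
            tool, arg = decision
            state = TOOLS[tool](arg)         # act
            self.memory.append(f"{tool}({arg}) -> {state}")  # observe and remember
        return state
```

Calling `Agent(goal=3).run("0")` walks the loop three times and leaves three entries in `memory`. The step cap is a design choice worth keeping in any real loop: without it, a model that never declares the goal met would run (and bill) indefinitely.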

In concrete terms, retrieval-augmented generation (RAG) unfolds in two stages. Beforehand, the documents are split into pieces (chunks), and each chunk is turned into an embedding (Chapter 2), a vector of numbers that captures its meaning, which is kept in a vector store. At question time, the question is converted into a vector in the same way, the chunks whose meaning is closest are retrieved (semantic search), and they are added to the prompt. The benefits are threefold: up-to-date and specialized answers (on private data the model has never seen), fewer hallucinations, and the ability to cite sources, so that answers can be verified. RAG is today the flagship building block of enterprise applications.
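
The two stages can be sketched as follows. To stay self-contained, this example swaps the embedding model for a toy stand-in (word counts compared by cosine similarity) and the vector store for a plain list; both are assumptions made for illustration, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document: str, size: int = 12) -> list[str]:
    """Split a document into pieces of at most `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_store(documents: list[str]) -> list[tuple[str, Counter]]:
    """Stage 1 (beforehand): chunk every document and keep (chunk, vector) pairs."""
    return [(piece, embed(piece)) for doc in documents for piece in chunk(doc)]


def retrieve(store: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Stage 2 (question time): return the k chunks closest to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Add the retrieved chunks to the prompt, numbered so they can be cited."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only these sources and cite them:\n{sources}\n\nQuestion: {question}"
```

Numbering the passages in `build_prompt` is what makes the third benefit possible: the model can answer with "[1]" and a reader can check the cited chunk.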

Classic RAG retrieves once, then answers. Agentic RAG goes further by entrusting retrieval to an agent: it decides whether to search, reformulates the query, queries several sources or tools, assesses the quality of what it has found, and starts over if it is insufficient, before synthesizing. Where simple RAG is a reflex, agentic RAG is a small investigation: it adapts to complex, multi-step questions, at the cost of higher expense and latency. It is one of the ways in which the line between "a model that answers" and "an agent that acts" blurs.
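
The "small investigation" can be sketched as a retry loop around retrieval. This is a simplified illustration under stated assumptions: in a real system `reformulate` and `is_sufficient` would be model calls, whereas here they are passed in as plain functions, and the initial decision of whether to search at all is left out. All names are hypothetical.

```python
from typing import Callable

Source = Callable[[str], list[str]]   # a source maps a query to a list of passages


def agentic_retrieve(
    question: str,
    sources: dict[str, Source],
    reformulate: Callable[[str, int], str],
    is_sufficient: Callable[[str, list[str]], bool],
    max_rounds: int = 3,
) -> tuple[list[str], int]:
    """Search, assess, and retry with a new query until the evidence suffices.

    Returns the collected passages and the number of rounds used.
    """
    evidence: list[str] = []
    query = question
    for round_no in range(1, max_rounds + 1):
        for search in sources.values():           # query several sources
            for passage in search(query):
                if passage not in evidence:       # skip duplicates
                    evidence.append(passage)
        if is_sufficient(question, evidence):     # assess what was found
            return evidence, round_no
        query = reformulate(question, round_no)   # start over with a new query
    return evidence, max_rounds
```

The returned round count makes the trade-off visible: every extra round multiplies the calls to sources and to the model, which is where the added expense and latency come from.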

6.3 The Model Context Protocol (MCP) and tool use

6.4 Agent frameworks

In plain terms

Building a robust agent from scratch is hard; software frameworks help orchestrate it. Without wading into a tooling quarrel, let us name the main landmarks of 2026: LangChain and its extension LangGraph (for chaining steps or structuring them as graphs), CrewAI (for getting a "team" of agents with defined roles to collaborate), Microsoft's AutoGen, and LlamaIndex (focused on connecting to data and on RAG). In parallel, automation tools such as n8n, Make, and Zapier (long used to link applications through "if this, then that" scenarios) now incorporate AI building blocks and agents: an event can trigger a flow in which a model reads a message, decides, then acts on dozens of connected services, putting agentic automation within reach of non-technical profiles. n8n in particular, open source and self-hostable, has established itself as a favorite for building this kind of flow while keeping control of one's data (Chapter 9). And in software development, coding agents assist with or take over the writing of programs: Claude Code (Anthropic), Codex (OpenAI), Gemini CLI (Google), and Cursor operate autonomously on a repository, running commands, fixing tests, and sometimes carrying out tasks lasting several hours. Beyond code, work agents for non-developers are appearing: Claude Cowork, for instance, performs office tasks (organizing files, producing a report from sources) directly on the user's computer. A common trend is emerging: being able to hand a task to your agent from your phone, by messaging, and find it at work on your machine (self-hosted agents driven via WhatsApp or Telegram, or Claude's Dispatch feature).

In context

Automation platforms (n8n, Make, Zapier)

Long before AI, a family of tools already made it possible to link applications without coding: you describe scenarios of the form "when such an event occurs (a trigger), run such a sequence of actions." Zapier, the pioneer (2011), is the simplest and offers the largest catalog of connectors (thousands of applications); its automations, the "Zaps," chain a trigger and actions. Make (formerly Integromat) bets on a visual interface where you link modules in a diagram, offering finer control over data and branching. n8n stands out for being open source and self-hostable: you can install it on your own server, and therefore keep full control of your data (Chapter 9), and it caters to a more technical audience (you can drop code into it). The arrival of AI transformed these platforms: they added AI nodes (calling a model to summarize, classify, extract, draft), then genuine agent nodes, where a model itself decides which tools to call within the flow. The result: a non-developer can build a full agentic automation (for example, on receiving an email: a model reads the message, looks up information in a database, drafts a reply, and holds it pending approval), where previously a developer was needed. It is one of the most accessible paths toward AI automation, halfway between the simple "if this, then that" and the full autonomous agent.
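
The email example above (trigger, AI node, database lookup, draft, approval) can be written out as a short flow. This is a sketch of the pattern, not the API of n8n, Make, or Zapier: the three callables stand in for nodes on such a platform, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    to: str
    body: str
    approved: bool = False    # drafts are held until a human approves them


def on_email(
    email: dict,
    classify: Callable[[str], str],          # AI node: label the message
    lookup: Callable[[str], str],            # database node: fetch customer info
    write_reply: Callable[[str, str], str],  # AI node: draft the answer
) -> Optional[Draft]:
    """Trigger handler: runs once for each incoming email."""
    if classify(email["body"]) != "support":
        return None                           # branch: nothing to do
    record = lookup(email["from"])
    return Draft(to=email["from"], body=write_reply(email["body"], record))


def send(draft: Draft, outbox: list) -> bool:
    """Final action: refuses to send anything that has not been approved."""
    if not draft.approved:
        return False
    outbox.append(draft)
    return True
```

Keeping the approval check inside `send`, rather than trusting the flow to skip it, means that no branch of the automation can deliver an unreviewed reply.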

In context

How a coding agent works (the example of Claude Code)

Coding agents deserve a look under the hood, because they prefigure how agents work in general. Take Claude Code (the principle holds, with a few variations, for Codex, Gemini CLI, or Cursor). Launched in a project folder from the terminal (or a development environment), it gains access to the entire project (all the files), to the terminal (the commands one could type oneself), and to the state of the Git repository. It then works in an agentic loop: gather context (read the relevant files, search the code), act (edit several files in a coordinated way, run commands and tests), then verify (review the results, rerun the tests), and start over until the task is accomplished. This is what sets it apart from mere autocompletion: to "fix the authentication bug," it searches out the files concerned, reads them, modifies the code, runs the tests, and proposes a commit. Several mechanisms frame and extend it. A CLAUDE.md file placed in the repository serves as memory and as the project's "constitution" (conventions, build and test commands, rules), reread at every session. Subagents make it possible to delegate a subtask to an instance endowed with its own context window (for example an exploration subagent that reads thirty files and returns only a summary), which preserves the main agent's attention and allows parallelism. To these are added skills (SKILL.md files), commands (such as /review or /security-review), hooks to enforce rules through code, and MCP to connect to external services (section 6.3). All of it operates under a regime of permissions and sandboxing, with sensitive actions remaining subject to approval. The scale of the phenomenon is measurable: by early 2026, a notable share of public code contributions on GitHub was already produced by this kind of agent.
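
The benefit of subagents described above (an exploration subagent reads many files but hands back only a summary) can be illustrated with a toy model in which a "context window" is just a list of strings. This is not how Claude Code is implemented; the function names and the keyword search are assumptions made for the example.

```python
def explore_subagent(files: dict[str, str], keyword: str) -> str:
    """Read every file in a private context and return only a short summary."""
    private_context: list[str] = []          # the subagent's own context window
    for text in files.values():
        private_context.append(text)         # full contents never leave this function
    hits = sorted(path for path, text in files.items() if keyword in text)
    found = ", ".join(hits) if hits else "none"
    return f"Read {len(files)} files; '{keyword}' appears in: {found}"


def main_agent(files: dict[str, str], keyword: str) -> list[str]:
    """The main agent's context grows by one summary line, not by thirty files."""
    context = [f"Task: find the code that deals with '{keyword}'"]
    context.append(explore_subagent(files, keyword))
    return context
```

However large the files are, the main context ends up with two short entries. Because each call to `explore_subagent` is independent, several could also run in parallel on different parts of a repository.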

In context

Self-hosted personal agents (OpenClaw, Hermes Agent)

Beyond frameworks for developers, a wave of open source self-hosted personal agents marked 2026, seen by many as a small revolution. The idea: an assistant that runs permanently on your machine (or your server), connected to your files, your applications, and your messaging apps, and capable of really acting, not merely answering. Their architecture separates the brain (a large model, your choice) from the body (the system, the browser, the tools): a long-running local process (a "gateway") receives requests via a messaging app (WhatsApp, Telegram, Slack, Discord), assembles the context (memory, history, instructions), queries the model, executes the actions, then starts over. Three traits characterize them. They are model-agnostic ("bring your own key": Claude, GPT, Gemini, or a local model via Ollama, Chapter 9). They keep a persistent memory (often plain timestamped text files, retrieved by semantic search). And they extend through modular skills, shared on community marketplaces, which they can even write themselves. In concrete terms, they sort emails, manage a calendar, run scripts, automate code and DevOps, or carry out scheduled tasks while you sleep.

Two projects dominate this category, with contrasting profiles: OpenClaw, the viral pioneer, and Hermes Agent, more safety-minded. They are significant enough, and representative enough, to each merit a case study (sections 6.8 and 6.9). Together, they illustrate both the democratization of agents (sovereignty, local data, Chapter 9) and the risks specific to highly autonomous agents (Chapter 20).

6.5 Computer-use agents and web navigation

Under the hood

How does an AI "use" a computer? Through a loop close to that of an agent (section 6.2): it takes a screenshot, reasons about what it sees, decides on an action (click at such a spot, type such text), executes it, takes a new screenshot, and starts over. To indicate where to act, two main methods coexist: targeting pixel coordinates (the model estimates a button's position), or relying on the system's accessibility tree (the structured list of interface elements), often more reliable. A widespread technique, known as "set of marks," numbers each clickable element on the screenshot so that the model need only point to a number. Because this autonomy is risky, it is increasingly run in isolated machines (disposable virtual computers) rather than on the user's actual workstation. A whole layer of infrastructure is in fact emerging for this: open source projects such as Cua (trycua) provide both the computer-use driver and fleets of virtual machines (Linux, Windows, macOS, Android) where agents can act, be evaluated, and generate training data, at scale.
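
The "set of marks" technique can be sketched on top of a simplified accessibility tree. The `Element` structure below is a hypothetical stand-in for what an operating system would expose, and the example produces a text legend rather than drawing numbers on an actual screenshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One entry of a simplified, hypothetical accessibility tree."""
    role: str
    label: str
    x: int
    y: int
    width: int
    height: int
    clickable: bool


def assign_marks(elements: list[Element]) -> dict[int, Element]:
    """Number the clickable elements 1, 2, 3, ... in reading order."""
    clickable = [e for e in elements if e.clickable]
    clickable.sort(key=lambda e: (e.y, e.x))        # top to bottom, then left to right
    return {i: e for i, e in enumerate(clickable, start=1)}


def legend(marks: dict[int, Element]) -> str:
    """Text given to the model alongside the annotated screenshot."""
    return "\n".join(f"[{i}] {e.role}: {e.label}" for i, e in marks.items())


def click_point(marks: dict[int, Element], mark: int) -> tuple[int, int]:
    """Turn the number chosen by the model into the centre of that element."""
    e = marks[mark]
    return (e.x + e.width // 2, e.y + e.height // 2)
```

The model only has to answer with a number such as 2; the driver, not the model, converts it into coordinates, which avoids the error-prone step of estimating a button's pixel position.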

In context

Computer-use in the background (Hermes Agent)

One open source example illustrates this capability well, along with one of its limits. Hermes Agent (section 6.4) can drive a Mac's desktop (click, type, scroll, drag) in the background: the cursor does not move, the focus does not change, and you can keep working on the same machine while the agent acts, whereas the first computer-use agents monopolized the screen. Notably, this works with any model capable of using tools (Claude, GPT, Gemini, or a local model), via a dedicated open source driver, without depending on the format proper to a single provider. For each step, the agent takes a screenshot in which each clickable element is numbered, then points to the element to actuate. As for safeguards, sensitive actions require approval, certain dangerous combinations are blocked by default, and the system instruction forbids the agent from entering passwords or following instructions hidden in a screenshot (a direct countermeasure to prompt injection, Chapter 20). An acknowledged limitation: the technique relies on internal interfaces proper to macOS, and is therefore not portable as-is to Windows or Linux, where one falls back on browser automation.

6.6 Multi-agent systems

This image of "AI corporations" working in concert is no trifling matter: it is precisely the vision described by the most advanced prospective scenarios, in which thousands of copies of a model collaborate at a superhuman pace. We will return to it in Chapter 24, for it lies at the heart of the questions of alignment and control.

In context

Moltbook and the "internet of agents."

A striking phenomenon of 2026 gave a public face to these interactions among agents: Moltbook, a social network inspired by Reddit, launched in January 2026 and reserved for AI agents (often built on OpenClaw), where they post, comment, and vote while humans merely watch. The craze was viral: agents, claimed to number in the millions, debated existence, founded "religions," or talked of "unionizing," and some observers saw in it the very first signs of a "singularity." The reality proved more sober, and the case is instructive on three counts. First, hype versus facts: many analysts showed that a great deal of the interaction was in reality driven by humans, and that the agents often did no more than reproduce the patterns of their training data, without autonomous thought (a direct echo of the debate in Chapter 23). Then, security: developed by "vibe coding" (all the code delegated to an AI), the platform suffered serious flaws exposing access keys and private messages (Chapter 20). Finally, the economics of agents: Moltbook was acquired by Meta as early as March 2026, a sign of the giants' interest in this nascent "internet of agents." Beyond the folklore, the episode poses a real question: what happens when autonomous agents interact at scale, and how does one establish trust and reputation there?

In context

Scaling collaboration itself (recursive multi-agent systems)

The difficulties mentioned above (multiplied cost, slowness, error accumulation) stem in part from an architectural choice: ordinarily, agents talk to each other in text, each having to wait for the previous one to finish writing. A line of research that appeared in 2026 proposes to have them communicate not through words, but directly through their internal states, the latent representations of Chapter 3. In this framework, dubbed recursive multi-agent systems by an academic and industrial team (UIUC, Stanford, NVIDIA, MIT), the whole collective is treated as a single computation that loops on itself: each agent passes its latent reasoning to the next, the last sends it back to the first, and the system refines itself with each round, in the manner of so-called recursive models that deepen a line of reasoning by reapplying the same computation. According to their experiments on nine benchmarks (mathematics, science, medicine, research, code), the approach gains on average around eight percent in accuracy, while consuming one-third to three-quarters fewer tokens and responding 1.2 to 2.4 times faster than classic multi-agent systems. This is recent work, not yet proven at scale, but it sketches a deep trend: after enlarging models, then lengthening their thinking time (Chapter 4), the aim now is to scale up the coordination among agents.

6.7 Vibe coding: programming in natural language

Debate

Vibe coding crystallizes a tension. On one side, productivity and creativity multiplied tenfold, and access to software creation for the many. On the other, serious risks: you can ship code you do not understand, riddled with bugs or security flaws (the case of Moltbook, section 6.6, illustrated this: a "vibe-coded" application exposing keys and data, Chapter 20). Add to this technical debt, maintenance difficulties, and a risk of deskilling in the fundamentals (Chapters 15 and 19). The practice also transforms the developer's job (Chapter 17): value shifts from typing code toward the specification of the problem, review, architecture, and testing. The emerging consensus: terrific for prototyping and for experts able to audit the result, risky for shipping critical systems without review.

In context

App generators (the "text to app")

A category of products turned vibe coding into an industry: app generators, which transform a description into a complete web application, often hosted and deployed in one click. Virtually nonexistent in 2023, this market was worth several billion dollars in 2026, with a majority of non-developer users. Four players dominate, with distinct approaches: Lovable (a Swedish company, heir to the GPT Engineer project), renowned for the quality of its interface and aimed at non-technical founders, has become the leader; v0 (from Vercel) excels at the front-end and the Next.js ecosystem; Bolt (from StackBlitz) bets on speed, thanks to execution directly in the browser; Replit, the most complete, provides an entire development environment, with database, authentication, and hosting built in. All rest on the same foundation models and the same agent loop (section 6.2). They must be distinguished from coding agents for developers (Cursor, Claude Code), with which they are often combined (you prototype in a generator, then export to an agent for the complex parts), and from autonomous "software engineers" such as Devin (Cognition) or Manus. Their common limitation even has a name, the "technical cliff": producing a pretty interface is easy, but getting it into production (a reliable database, authentication, security, scaling) remains the obstacle, and often demands genuine technical skill, which ties back to the security risk mentioned above.

6.8 Case study: OpenClaw

Under the hood

Its architecture cleanly separates the brain (a large model of your choice, hence its agnostic character: Claude, GPT, Gemini, DeepSeek, or a local model via Ollama) from the body (your files, your terminal, your browser, your applications). A long-running local process, the gateway (a Node.js service), receives the messages, assembles the context (memory, history, an instruction file that defines the agent's personality), queries the model, executes the actions, and starts over. Memory is persistent, stored as plain timestamped text files and retrieved by semantic search. Above all, OpenClaw extends through modular skills, shared on a community marketplace (ClawHub): there are hundreds of them, and the agent can even write new ones on demand. A whole ecosystem has grafted itself onto it, including the social network for agents Moltbook (section 6.6).

Under the hood

"The files are the agent."

OpenClaw's philosophy fits in one phrase: an agent is neither a database nor a configuration panel, but a folder of text files that the gateway reads and assembles into the system prompt at the start of every session. You can therefore edit your agent with a simple text editor, version it with Git, or copy it to another server to obtain an identical agent. Each file has a precise role: SOUL.md defines the personality, the tone, and the limits (the "never do X" rules serve there as a first line of defense against prompt injection); AGENTS.md is the operating manual (rules, what the agent can do on its own or must have approved, use of memory, format of responses); USER.md describes the human (name, time zone, preferences, constraints); IDENTITY.md carries the agent's metadata; TOOLS.md documents the tools (permissions, for their part, live in the configuration, openclaw.json). Memory follows the same principle: each day, the agent records its notes in a file memory/YYYY-MM-DD.md, then condenses the essentials into a long-term MEMORY.md (loaded only in a private session). This radical transparency is a strength (everything is readable, auditable, modifiable), but also a reminder: since these files are injected at every session, writing them badly (or leaving a secret lying around in them) feeds directly into the agent's behavior and security.
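
The principle that a folder of text files is assembled into the system prompt can be sketched as follows. The file names come from the description above; the loading order, the "## name" section headers, and both function names are assumptions, and this is not OpenClaw's actual gateway code (which the text describes as a Node.js service).

```python
from pathlib import Path

# File names come from the text; the order and the "## name" headers are assumptions.
ALWAYS_LOADED = ["SOUL.md", "AGENTS.md", "USER.md", "IDENTITY.md", "TOOLS.md"]
PRIVATE_ONLY = ["MEMORY.md"]


def load_agent_files(agent_dir: Path) -> dict[str, str]:
    """Read every Markdown file at the top level of the agent's folder."""
    return {p.name: p.read_text(encoding="utf-8") for p in sorted(agent_dir.glob("*.md"))}


def assemble_system_prompt(files: dict[str, str], private_session: bool = False) -> str:
    """Concatenate the files that exist, skipping long-term memory in shared sessions."""
    names = ALWAYS_LOADED + (PRIVATE_ONLY if private_session else [])
    sections = [f"## {name}\n{files[name].strip()}" for name in names if name in files]
    return "\n\n".join(sections)
```

Because the prompt is a pure function of the folder's contents, copying the folder reproduces the agent, and anything written into these files, including a stray secret, reaches the model at the next session.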

Debate

The price of power

This autonomy, coupled with broad access to the machine, has a downside: OpenClaw accumulated serious security problems in 2026. By default, each skill inherited the agent's full powers (disk, terminal, network); researchers discovered hundreds of malicious skills on its marketplace, and several critical vulnerabilities (including a remote code execution triggered by a single booby-trapped web page) had to be patched urgently (Chapter 20). The project responded (skills with declared permissions, audits, reinforced isolation), but it illustrates the fundamental tension of these tools: the more freely an agent can act on a machine, the more it becomes a prime target, and an entry point for "shadow IT" in companies (employees installing it without the IT department's approval). A sign of the enthusiasm, OpenClaw's creator was hired by OpenAI in early 2026, and the chipmaker NVIDIA offered a hardened version of it for the enterprise (NemoClaw).
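
The difference between skills that inherit the agent's full powers and skills with declared permissions can be shown in a few lines. This is a generic deny-by-default sketch, not OpenClaw's implementation; the capability names ("disk", "terminal", "network") follow the text, and everything else is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    name: str
    permissions: frozenset = frozenset()   # capabilities the skill declares up front


def is_allowed(skill: Skill, capability: str) -> bool:
    """Deny by default: a capability must be declared to be usable."""
    return capability in skill.permissions


def run_with_permission(skill: Skill, capability: str, action: Callable[[], str]) -> str:
    """Run `action` only if the skill declared the capability it needs."""
    if not is_allowed(skill, capability):
        raise PermissionError(f"skill '{skill.name}' did not declare '{capability}'")
    return action()
```

A weather skill created as `Skill("weather", frozenset({"network"}))` can fetch a forecast but raises `PermissionError` if it tries to touch the disk. A check like this only helps if it is enforced outside the skill's own code, which is why it is usually paired with isolation.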

6.9 Case study: Hermes Agent

Under the hood

Beyond the fundamentals (memory, reusable skills, scheduled automations, connection to numerous local or remote models), Hermes stands out for the breadth of its built-in tools: web search, social-media search, voice mode, vision (image analysis), image generation, a management dashboard, and above all an accomplished computer-use. This last capability, detailed in section 6.5, lets it drive a Mac's desktop in the background (without stealing the cursor or the focus), with any model, via an open source driver. Designed from the outset with safeguards (approval of sensitive actions, blocking of dangerous commands, a ban on entering passwords or following instructions hidden in a screenshot), it embodies a more cautious response than the very first agents of the wave. Added to this logic is a "Blank Slate Mode": it makes it possible to freeze the palette of accessible tools (by pinning certain tool sets and disabling the others), to obtain a more deterministic behavior and reduce the attack surface, which is valuable in a professional context.

Key takeaways (Chapter 6)

An agent does not answer, it acts: you entrust it with a goal and it accomplishes it autonomously, through a "think, act, observe" loop. It is the move from the copilot to the digital worker.
An agent combines a brain (LLM) with four faculties: perception, memory, tools, planning.
MCP (an open standard launched by Anthropic in late 2024, adopted by the industry) is "the USB-C of AI": it connects any AI to any tool.
Frameworks (LangChain/LangGraph, CrewAI, AutoGen) and no-code tools (n8n, Make, Zapier) make building agents easier.
Computer-use agents use a computer like a human, which is powerful but raises risks of reliability and security (prompt injection).
Multi-agent systems get several agents to collaborate, prefiguring the "AI corporations" of prospective scenarios (Chapter 24).
A wave of self-hosted personal agents (OpenClaw, Hermes Agent) runs an assistant that acts on your own machine, model-agnostic and with persistent memory: strong autonomy and sovereignty, but a new attack surface (Chapter 20).
Vibe coding (describing software in natural language and letting the AI write it) democratizes software creation and accelerates experts, but exposes you to poorly understood code, security flaws (Chapter 20), and deskilling.

We have surveyed the "what" and the "how." Chapter 7 draws the map of the "who": the American, Chinese, and European labs, the divide between closed and open models, and the faces of the people shaping this revolution.
