Agentic AI Architecture Patterns: The Four Shapes That Stabilised in 2026
The vendor demo I sat through in April conflated two pattern shapes on a single slide. The diagram showed a “supervisor agent” dispatching to three “worker agents,” which were then drawn talking to one another in a peer-to-peer loop. The supervisor pattern and the reactive multi-agent pattern are not the same shape, and their failure modes differ enough that mixing them produces a system in which neither pattern’s strengths apply and both patterns’ failure modes do. The engineering lead in the room asked the right question (“is the supervisor in control, or are the workers in control?”) and the vendor answered “both,” which is the answer that means the team has not decided. The deal slipped a quarter. The team eventually bought a different product after I helped them write a one-page document that committed to a single pattern shape per workload.
Naming the patterns precisely is doing real engineering work, not pedantry. Much of the agentic AI literature in 2026 uses the terms loosely — “multi-agent,” “agentic workflow,” “orchestrated agents” all get applied to architectures that have fundamentally different control flow, failure modes, and observability requirements. The four patterns below are the shapes that have stabilised through 2025 and 2026. If your system does not match one of them cleanly, you have either invented a fifth pattern (rare and probably worth interrogating) or you have a confused architecture that needs to commit. Most of the time, it is the second.
This page is the engineering complement to the architecture-level reasoning on the orchestration architecture page. That page covers what an orchestration architecture has to do; this one covers the concrete pattern shapes you choose from, with the trade-offs that determine which one fits your workload.
Pattern one: tool-using single-agent
One model. A defined set of tools. A loop: the model receives the user’s input, decides whether to call a tool or respond, calls the tool if it chose to, observes the result, decides again, and either calls another tool or responds. Termination is either explicit (the model emits a final response) or bounded (a maximum number of tool calls).
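A minimal sketch of that loop, assuming a `decide` callable that stands in for the model call. The names `ToolCall`, `FinalResponse`, `run_agent`, and the `order_status` tool are hypothetical and chosen for illustration; a real implementation would put a vendor SDK call where the scripted stub sits.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class FinalResponse:
    text: str


Decision = Union[ToolCall, FinalResponse]


def run_agent(decide: Callable[[list], Decision], tools: dict,
              user_input: str, max_tool_calls: int = 5):
    """Run the loop; return (response_text, termination_reason)."""
    history = [("user", user_input)]
    for _ in range(max_tool_calls):
        decision = decide(history)  # the model decides: call a tool or respond
        if isinstance(decision, FinalResponse):
            return decision.text, "final_response"  # explicit termination
        result = tools[decision.name](**decision.args)
        history.append(("tool", f"{decision.name} -> {result}"))  # observe
    return "Stopped: tool-call limit reached.", "tool_call_limit"  # bounded


def scripted_decide(history):
    # Hypothetical stand-in for a model call: one lookup, then answer.
    if len(history) == 1:
        return ToolCall("order_status", {"order_id": "A1"})
    return FinalResponse(f"Latest result: {history[-1][1]}")


TOOLS = {"order_status": lambda order_id: f"{order_id} shipped"}
text, reason = run_agent(scripted_decide, TOOLS, "Where is my order?")
```

Returning the termination reason alongside the text is a design choice in this sketch: it makes the two termination paths distinguishable to whatever consumes the result.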
When to use. Most of the time, honestly. The single-agent pattern with a well-designed tool surface solves a much larger share of production workloads than the agentic-framework marketing implies. A customer-support workflow that needs to look up an account, check an order status, and either resolve or escalate is a single-agent workload. An internal research-assistant that searches a knowledge base, fetches documents, and synthesises an answer is a single-agent workload. A code-review assistant that reads a diff, queries the codebase, and produces a review is a single-agent workload. The pattern is under-credited and over-replaced.
What fails. The pattern fails when the tool list grows beyond what a single model can route reliably. In 2026 the empirical ceiling I see in production is around 12–15 tools — beyond that, the model’s attention budget is split too thinly across tool descriptions, semantic collisions between similar function definitions cause selection drift, and prompt length grows enough that latency and cost become uncomfortable. The pattern also fails when the workload has distinct stages that genuinely need specialised context. A workflow that requires deep research, then careful synthesis, then strict formatting is three different cognitive tasks; stuffing them all into one agent loop produces a system that does each of them adequately and none well.
Observability required. Per-turn trace of the model’s tool-selection decision, the arguments passed, the tool’s response, and the model’s reasoning about whether to continue. The trace has to capture the loop’s termination condition (final response, tool-call limit hit, error). Token counts per turn for cost analysis. Latency per turn for SLA tracking. The observability surface is the same as for any model call, with the loop dimension added.
Unique failure mode. Tool-selection drift. The model starts confidently calling the wrong tool when the user input is ambiguous, particularly under prompt-template changes that altered the tool descriptions without engineers noticing the change had moved the selection boundary. The fix is eval-as-CI on the tool-selection step specifically, with a labelled dataset that covers the ambiguous cases. Teams that skip this evaluation discover the drift through customer reports, which is the most expensive way to discover it.
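As a rough illustration of what an eval on the tool-selection step can look like, the sketch below scores a selector against a labelled dataset and exposes a pass/fail gate for CI. Everything here is assumed for the example: the cases, the tool names, the keyword stub standing in for the model, and the 0.95 threshold.

```python
# Hypothetical labelled cases: (user input, expected tool name).
LABELLED_CASES = [
    ("where is my parcel", "order_status"),
    ("I was charged twice", "billing_lookup"),
    ("cancel it", "order_status"),  # deliberately ambiguous
]


def select_tool(user_input: str) -> str:
    """Stand-in for the model's tool-selection step (keyword stub)."""
    text = user_input.lower()
    if "parcel" in text or "order" in text:
        return "order_status"
    return "billing_lookup"


def evaluate_selection(select, cases):
    """Return (accuracy, misses); each miss is (input, expected, got)."""
    misses = []
    for user_input, expected in cases:
        got = select(user_input)
        if got != expected:
            misses.append((user_input, expected, got))
    return 1 - len(misses) / len(cases), misses


def passes_gate(accuracy: float, threshold: float = 0.95) -> bool:
    """CI gate: the threshold is an assumed value, tune it per workload."""
    return accuracy >= threshold


accuracy, misses = evaluate_selection(select_tool, LABELLED_CASES)
```

With this stub the ambiguous “cancel it” case is the one that misses, which is the kind of case the labelled dataset exists to surface before a customer does.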
Pattern two: supervisor-worker
One orchestrating agent (the supervisor) decides which specialised agent (a worker) to dispatch to, passes the task with the necessary context, receives the worker’s response, and decides what to do next — dispatch another worker, return a final response, or iterate. The supervisor holds the overall context; the workers hold specialised context relevant to their narrow tasks.
When to use. When the workload has distinct cognitive stages that benefit from specialised prompts, specialised tool surfaces, or specialised models. A research-and-write workflow where one agent is optimised for search and synthesis, another for fact-checking, another for formatting, is a clean supervisor-worker case. A multi-step data-analysis workflow where one agent is optimised for query generation, another for statistical reasoning, another for narrative explanation, is another. The pattern shines when the workers’ specialisations are genuinely distinct and the supervisor’s routing decision is non-trivial.
What fails. The pattern fails when the supervisor’s decision logic is so simple that it could be a deterministic router. If the supervisor’s job is “if the user asked a code question, dispatch to the code worker; otherwise, dispatch to the general worker,” you have not built a supervisor-worker pattern; you have built an over-decorated rules-based dispatcher with the latency and cost of a model call inserted into your routing layer. The pattern also fails when the workers need to share state in ways the supervisor cannot easily mediate — the supervisor becomes a bottleneck and a synchronisation point, and the architecture starts to want a different shape.
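For contrast, the code-versus-general dispatch described above can be sketched as a plain function with no model call. The keyword list and worker names are hypothetical; the point is only that routing this simple does not need a supervisor.

```python
# Hypothetical hints that a request is about code.
CODE_HINTS = ("traceback", "compile", "function", "exception")


def route(user_input: str) -> str:
    """Deterministic router: no model call, no added latency or token cost."""
    text = user_input.lower()
    if any(hint in text for hint in CODE_HINTS):
        return "code_worker"
    return "general_worker"


code_route = route("Why does this function raise an exception?")
general_route = route("What is our refund policy?")
```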
Observability required. All of the single-agent observability for each agent, plus the supervisor’s routing decisions, the context passed to each worker, and the cross-agent state mutations. The dimension that matters most for debugging is which worker was dispatched and why; the supervisor’s reasoning trace is the load-bearing telemetry. Without it, an engineer debugging a wrong-answer incident cannot tell whether the supervisor mis-routed or the worker mis-executed, and the two failures need different fixes.
Unique failure mode. Context loss at the supervisor-worker boundary. The supervisor summarises context for the worker, the worker acts on the summary, and the summary has dropped a critical detail. The failure is silent and hard to detect after the fact because the worker’s output looks plausible. The fix is to pass the supervisor’s hand-off as structured data rather than as a natural-language summary, and to evaluate the hand-offs specifically as part of the eval harness, a practice covered under eval-as-CI on the orchestration architecture page.
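One way to picture a structured hand-off, and an eval check on it, is sketched below. `FactCheckHandoff`, its fields, and `dropped_details` are hypothetical names for illustration; the check assumes each eval case comes with a labelled list of details the hand-off must carry.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FactCheckHandoff:
    """Hypothetical typed hand-off from the supervisor to a fact-checking worker."""
    claim: str
    source_ids: tuple

    def __post_init__(self):
        # Reject hand-offs that are structurally incomplete before dispatch.
        if not self.claim.strip():
            raise ValueError("claim must not be empty")
        if not self.source_ids:
            raise ValueError("at least one source is required")


def dropped_details(handoff: FactCheckHandoff, critical_details: list) -> list:
    """Eval helper: which labelled critical details are absent from the payload?"""
    payload = json.dumps(asdict(handoff))
    return [detail for detail in critical_details if detail not in payload]


handoff = FactCheckHandoff(
    claim="Refund of 40 EUR was issued on 2026-03-02",
    source_ids=("ticket-881",),
)
missing = dropped_details(handoff, ["40 EUR", "2026-03-02", "order A1"])
```

Here `missing` contains the order reference, which the hand-off never carried. A substring check is crude, but it turns a silent loss into a failing eval case.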
Cost asymmetry to budget for. Supervisor-worker patterns carry a token tax. Each hand-off costs the supervisor’s summarising tokens plus the worker’s full context-rebuild on the receiving side; high-context workflows can find the summarise-and-re-inject cycle running two to four times the cost of the actual task. Run the unit economics before committing to the pattern, not after.
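The unit economics can be run on the back of an envelope. The sketch below uses made-up token counts purely to show the arithmetic; substitute measured values from your own traces.

```python
def handoff_overhead(task_tokens: int, handoffs: int,
                     summary_tokens: int, rebuild_tokens: int):
    """Return (overhead_tokens, overhead as a multiple of the task's tokens)."""
    overhead = handoffs * (summary_tokens + rebuild_tokens)
    return overhead, overhead / task_tokens


# Illustrative numbers only: 3 hand-offs, each costing an 800-token summary
# plus a 3,000-token context rebuild, around a task of 4,000 tokens.
overhead, multiple = handoff_overhead(
    task_tokens=4_000, handoffs=3, summary_tokens=800, rebuild_tokens=3_000
)
# overhead is 11,400 tokens, which is 2.85 times the task itself
```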
Pattern three: pipeline
A deterministic sequence of model-driven stages with structured hand-offs between them. Stage one transforms input A into intermediate representation B; stage two transforms B into C; stage three transforms C into the final output. Each stage is a model call, often with its own prompt and tool surface. The sequence is fixed at design time; runtime control flow does not branch except for retries and error handling.
When to use. When the workflow has a stable, well-understood structure that does not benefit from runtime decision-making about which stage to run next. Document-processing workflows are the canonical case (extract, classify, summarise, route), and they tend to be cleanly modelled as pipelines. Compliance-checking workflows often fit. Many ETL-shaped data-enrichment workloads fit. The pattern’s strength is determinism: a pipeline is the easiest pattern to test, evaluate, monitor, and reason about, because the control flow is known.
What fails. The pattern fails when the workflow’s actual structure is conditional in ways the pipeline cannot express cleanly. Forcing a conditional workflow into a pipeline produces stages that contain “if the previous stage said X, do this; otherwise do that” logic, and the pattern’s clarity advantage evaporates. The pattern also fails when the stages share so much context that the structured hand-off becomes either lossy (drops important detail) or bloated (every stage receives every previous stage’s full output, which scales poorly).
Observability required. Per-stage trace, with input, output, and any side-effects logged. Inter-stage latency and cost. Failure attribution — when the final output is wrong, which stage produced the bad output. The observability requirement is lighter than supervisor-worker because the control flow is known, which is one of the reasons the pattern is so under-rated. A pipeline is much easier to operate than a multi-agent pattern, and “easier to operate” is the architectural property that compounds across years.
Unique failure mode. Schema drift between stages. Stage one’s output schema changes, stage two has not been updated, and the pipeline produces structured-but-wrong output. The failure is often introduced by a prompt change rather than a code change, which means standard schema-validation catches it only if the eval harness includes the schema as a tested artefact. Treating prompts as code and prompts’ output schemas as contracts is the fix; this is also the failure mode that most argues for eval-as-CI being the load-bearing capability of the orchestration layer.
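A sketch of treating a stage’s output schema as a contract, assuming a three-stage document pipeline. The stage functions, label set, and queue names are hypothetical, and `classify_stage` is a stub where a model call returning parsed JSON would sit.

```python
from dataclasses import dataclass

ALLOWED_LABELS = {"invoice", "contract", "other"}


@dataclass(frozen=True)
class Extracted:
    doc_id: str
    text: str


@dataclass(frozen=True)
class Classified:
    doc_id: str
    label: str


def classify_stage(extracted: Extracted) -> dict:
    # Stand-in for a model call that returns parsed JSON.
    label = "invoice" if "invoice" in extracted.text.lower() else "other"
    return {"doc_id": extracted.doc_id, "label": label}


def validate_classified(raw: dict) -> Classified:
    """Enforce the contract between the classify and route stages."""
    missing = {"doc_id", "label"} - raw.keys()
    if missing:
        raise ValueError(f"stage output missing keys: {sorted(missing)}")
    if raw["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {raw['label']!r}")
    return Classified(doc_id=raw["doc_id"], label=raw["label"])


def route_stage(classified: Classified) -> str:
    queues = {"invoice": "finance-queue", "contract": "legal-queue"}
    return queues.get(classified.label, "manual-review")


def run_pipeline(doc_id: str, text: str) -> str:
    extracted = Extracted(doc_id, text.strip())                  # stage one
    classified = validate_classified(classify_stage(extracted))  # stage two
    return route_stage(classified)                               # stage three


destination = run_pipeline("d1", "Invoice #12 for consulting services")
```

If a prompt change makes the classify stage emit `category` instead of `label`, `validate_classified` raises at the stage boundary instead of letting structured-but-wrong output flow downstream. Running that same validator over eval fixtures in CI is what makes the schema a tested artefact.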
Pattern four: reactive multi-agent
Multiple agents operating concurrently against shared state, with each agent listening to events on that state and reacting independently. There is no central supervisor; coordination emerges from each agent’s reactions to the shared state. The pattern is event-driven, asynchronous, and bears more resemblance to actor-model systems than to the request-response patterns above.
When to use. Rarely, in 2026, and almost never as a first architecture. The legitimate cases are genuinely concurrent workloads — multi-agent simulation, swarm-style optimisation, certain research-assistant systems where different agents are pursuing independent threads of investigation that occasionally cross-pollinate. If your workload is not genuinely concurrent — if a sequential execution would produce the same result more cheaply — you do not need this pattern. The cases where it genuinely earns the complexity are narrow.
What fails. Almost everything that does not need this pattern. The reactive multi-agent pattern is the architecture most often deployed prematurely because it is the most impressive in a demo. Failures include emergent loops (two agents reacting to one another indefinitely), shared-state contention (agents fighting over the same state mutation), context fragmentation (no agent has the full picture of what is going on), and cost explosions (concurrent agents calling models in parallel multiply token spend faster than the value generated). The pattern’s failure modes are the failure modes of distributed systems, and most engineering teams underestimate the difficulty of operating distributed systems even before adding LLM non-determinism on top.
Observability required. Distributed-systems-grade. Per-agent traces, shared-state change events, inter-agent message logs, causality tracking, deadlock and livelock detection. The observability surface is fundamentally more expensive than the other three patterns; teams that adopt the pattern without budgeting for the observability work find themselves debugging emergent behaviour with no telemetry, which is roughly the worst position to be in.
Unique failure mode. Emergent infinite loops between agents reacting to one another’s reactions. The failure mode is unique to this pattern because the other patterns have explicit termination conditions. Reactive multi-agent systems require explicit loop-detection logic, budget caps, and circuit breakers. The second-order risk is a Goodhart-style failure: when agents optimise for the shared-state signal that was supposed to coordinate them, they can produce coordination that looks correct in the telemetry and is actively wrong in outcome. Red-team work on these systems, covered on the red-teaming page, has to include emergent-behaviour testing specifically.
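A toy sketch of loop detection and a budget cap, assuming events are hashable values and agents are plain callables. It runs sequentially for clarity, so it shows the guard logic only, not real concurrency; `run_reactive` and the ping/pong agents are hypothetical.

```python
from collections import Counter, deque


def run_reactive(agents, initial_event, max_events=50, max_repeats=3):
    """Process events until quiescence, a repeat limit, or the budget cap.

    agents: callables taking an event and returning a list of new events.
    Returns (status, events_processed).
    """
    queue = deque([initial_event])
    seen = Counter()
    processed = 0
    while queue:
        if processed >= max_events:
            return "budget_cap", processed      # hard ceiling on work done
        event = queue.popleft()
        processed += 1
        seen[event] += 1
        if seen[event] > max_repeats:
            return "loop_detected", processed   # circuit breaker trips
        for react in agents:
            queue.extend(react(event))
    return "quiescent", processed


def ping_agent(event):
    return ["pong"] if event == "ping" else []


def pong_agent(event):
    return ["ping"] if event == "pong" else []


status, processed = run_reactive([ping_agent, pong_agent], "ping")
```

Counting repeats of identical events is a crude detector: it catches the ping-pong case here but not loops whose events differ slightly each round, which is why the budget cap sits alongside it as a backstop.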
Choosing the pattern
The decision flow I recommend, in roughly this order.
Start with single-agent. Map the workflow, define the tool surface, build it, evaluate it. About 60% of enterprise workloads in 2026 are solved adequately at this step and never need to move to anything more elaborate. The teams that skip this step and start with supervisor-worker or multi-agent tend to over-engineer and ship slowly; the teams that start single-agent and graduate when measured failure modes demand it tend to ship faster overall.
If single-agent fails on tool-list size or on distinct cognitive stages, move to pipeline if the structure is stable, or supervisor-worker if the structure requires runtime decisions about which stage to run. The pipeline is the lighter-weight move and should be preferred when applicable; supervisor-worker is the heavier move and should be chosen deliberately.
Reactive multi-agent should be considered only when the workload is genuinely concurrent and the simpler patterns have been ruled out by measured failure, not by ambition. The pattern is the right answer for a narrow set of problems and the wrong answer for everything else.
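The decision flow above can be written down as a small function. The parameter names are my own shorthand for the conditions in the text, and each input is meant to come from measured failure, not from a guess.

```python
def choose_pattern(*, single_agent_failed: bool, structure_stable: bool,
                   genuinely_concurrent: bool,
                   simpler_patterns_ruled_out: bool) -> str:
    """Encode the decision order: simplest pattern first, reactive last."""
    if not single_agent_failed:
        return "single-agent"
    if genuinely_concurrent and simpler_patterns_ruled_out:
        return "reactive multi-agent"
    if structure_stable:
        return "pipeline"           # the lighter-weight move
    return "supervisor-worker"      # runtime decisions about the next stage


choice = choose_pattern(
    single_agent_failed=True,
    structure_stable=True,
    genuinely_concurrent=False,
    simpler_patterns_ruled_out=False,
)
```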
The framework choice follows the pattern. LangGraph suits graph-shaped workflows with explicit state, and fits pipeline and supervisor-worker cleanly. AutoGen suits conversational multi-agent patterns. CrewAI suits role-based multi-agent patterns where the agents have stable specialisations. The model-vendor-native surfaces (Anthropic’s tool use, OpenAI’s Responses API, Google’s Gemini agentic features) suit single-agent workloads where a framework would be overbuilding. Never pick a framework first; the framework choice that survives is the one that matches the pattern your workload actually has.
What I would build, by pattern
For single-agent: model-vendor SDK plus a thin in-house loop and tool-calling layer. Eval-as-CI on the tool-selection step. About a week of engineering work for a competent platform engineer. The cleanest architecture in the cluster, and the one most teams should be running.
For pipeline: LangGraph if the stages need explicit state, plain function composition if they do not. Schema validation between stages enforced in CI. Per-stage eval coverage. Two to four weeks of work depending on stage complexity.
For supervisor-worker: LangGraph for the supervisor’s state machine, with structured hand-offs to workers implemented as typed function calls. Eval coverage on the hand-off boundary specifically. Four to eight weeks of work, including the observability layer that makes the supervisor’s routing decisions debuggable.
For reactive multi-agent: an actor framework (the model-vendor offerings are immature here; the open-source AutoGen and CrewAI are the closest analogues but still maturing). Substantial observability investment. Explicit loop detection and budget caps. Three to six months minimum, and the architecture justifies that investment only when the workload genuinely demands concurrency.
The diagram-level differences between these patterns are real and consequential. A team that has not committed to one pattern per workload has not yet made the architectural decision, because choosing the pattern is the decision. Naming which pattern you are building is the load-bearing first step, and the rest of the engineering work flows from it.
Sources
- Anthropic, Building effective agents: the canonical 2024 article on minimal agent design and the trap of premature multi-agent architecture
- LangChain, LangGraph documentation: reference implementation of graph-shaped agent and pipeline patterns
- Microsoft Research, AutoGen framework documentation: reference implementation of conversational multi-agent patterns
- Related: orchestration architecture, capabilities hub, AI-SRE tooling, red teaming
Methodology: pattern definitions and failure modes drawn from fractional CTO engagements (2024–2026) where I have either approved an agentic architecture at review or recommended its replacement. The pattern names are the ones the engineering teams I respect have converged on; the literature is still catching up to the precise vocabulary.
