Agents as Teammates: Hierarchy, Roles, and What 2025 Taught Us

Multi-agent systems fail for the same reasons human teams fail. The fixes look familiar too.

Key Takeaways
  • The empirical year is over: Between Anthropic's orchestrator-worker research system, Cognition's Devin reality check, Cursor 2.0's parallel background agents, and the Cemri et al. failure-mode paper, 2025-2026 produced the first real evidence base for multi-agent design.
  • Peer-to-peer agents lost: Letting agents "figure it out among themselves" amplified every known failure mode. Orchestrator-worker, agent-flow, and bounded-collaboration patterns are what shipped to production.
  • The cost is real: Anthropic's multi-agent research system beat single-agent Claude Opus 4 by 90.2% on an internal research eval, but multi-agent runs use roughly 15x the tokens of a chat interaction. You're buying capability with money.
  • Devin became a cautionary tale: A launch score of 13.86% on SWE-bench was state of the art at the time, yet independent real-world reviews reported a failure rate of roughly 85%. Benchmarks measured one thing; production measured another.
  • Multi-agent failure is an org-design problem: Cemri et al.'s 14 failure modes map almost one-to-one onto reasons human teams fail. Vague roles, no shared state, no accountability, coordination overhead.
  • The playbook is boring on purpose: Pick the right autonomy level per task, define one role at a time, use orchestrator-worker not P2P, instrument everything, keep a human in the loop on consequential actions.

The Year Multi-Agent Got Real

For most of 2024 the multi-agent conversation was vibes. People posted demos of "agent swarms" planning weddings or running fake companies, and almost none of it survived contact with production. The promise was huge. The evidence was thin.

That changed in June 2025, when Anthropic published the engineering write-up of its multi-agent research system. The architecture was simple in shape: a lead Claude Opus 4 agent acts as the orchestrator, breaks a research question into subtasks, spins up Claude Sonnet 4 subagents to investigate in parallel, and then assembles their outputs. On Anthropic's internal research evaluation, this configuration beat a single-agent Opus 4 baseline by 90.2 percent. (Anthropic engineering, "How we built our multi-agent research system")

That's the headline. The fine print matters more. The same write-up notes that multi-agent systems use about 15 times more tokens than a normal chat interaction, and that token usage alone explained roughly 80 percent of the performance variance on the BrowseComp evaluation. You're not getting that 90.2 percent for free; you're buying it.

So now the question for any founder, CTO, or staff engineer designing an AI feature is no longer "is multi-agent a real pattern?" It's "is multi-agent the right pattern for this task, and at what cost?" 2025 and the first half of 2026 produced enough data to answer that with more than gut feel. The patterns that won, the patterns that lost, and the failure modes that recur look a lot like classical org design. That's the thread of this piece.


The Failure-Mode Taxonomy: What Cemri et al. Documented

In March 2025, Mert Cemri and collaborators at Berkeley published "Why Do Multi-Agent LLM Systems Fail?" (arXiv 2503.13657). The paper analyzed five popular multi-agent frameworks across more than 150 conversational traces and produced something the field badly needed: a taxonomy.

Fourteen distinct failure modes, grouped into three families.

Specification and system design failures. Agents disobey their role spec. They disobey the task spec. They lose conversation history. Steps get repeated. Termination conditions never fire. The system "knows" the task but the individual agent in front of you doesn't behave as if it does.

Inter-agent misalignment. Agents reset progress and start over. They withhold information from each other. They derail into off-topic exchanges. They make assumptions about what another agent has done and act on those assumptions without checking. This is the classic "I thought you had it" failure of any team.

Task verification and termination failures. Verification is incomplete or absent. The system declares success when the output is wrong. No agent owns the final acceptance check. Humans only catch the problem downstream, when the bad output has already propagated.

Read the list and the shape becomes obvious. These aren't model failures in the narrow sense. They're coordination failures. They're the same failures a brand-new five-person team makes in their first month, before anyone has written down who owns what.

Group the 14 modes a different way and the org-design parallel sharpens:

  • Role ambiguity. Vague task ownership. Two agents both attempt the same step, or neither does.
  • State ambiguity. No single source of truth for what's been done, what's pending, what's blocked.
  • Accountability gaps. Who's responsible for the bad output? In peer-to-peer systems, often nobody.
  • Coordination overhead. The "meeting problem." Agents spend tokens negotiating instead of producing.
  • Specification drift. The same instruction gets interpreted differently across agents and across turns.

If you've ever rebuilt a team that wasn't shipping, you've seen all five. The fix in human teams is the same as in agent systems: pick an operating model, write the roles down, define handoffs, instrument the work.


Why Peer-to-Peer Agents Don't Work

The most common architectural mistake of 2024 and early 2025 was peer-to-peer collaboration: spawn a "team" of agents with different role prompts (CEO, researcher, coder, reviewer), let them talk to each other in a group chat, hope coordination emerges. AutoGen-style group chats, early CrewAI demos, and a wave of "AI startup in a box" projects all leaned on this pattern.

It failed in production, consistently. Every failure mode in the Cemri taxonomy amplifies in a P2P configuration. With no orchestrator, role boundaries blur because every agent can be asked to do anything. State scatters across the conversation history because no agent owns the canonical record. Accountability vanishes because the system as a whole produces the output, and the system as a whole has no name on it.

Coordination overhead is the killer. In a group of five peer agents, every meaningful action generates four observers who feel obligated to comment. Token cost balloons. The signal-to-noise ratio collapses. Cemri et al. found that conversational reset, where an agent restarts a finished thread because it can't tell the thread is finished, was one of the most common and most expensive failures in their corpus.
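
The observer arithmetic can be made concrete. This is a rough sketch, assuming every peer can address every other peer, while each worker in an orchestrator setup exchanges only a brief and a result with the orchestrator. The function names are illustrative and not taken from any framework.

```python
def p2p_channels(n_agents: int) -> int:
    """Directed channels when every agent can address every other agent."""
    return n_agents * (n_agents - 1)


def orchestrator_channels(n_workers: int) -> int:
    """Directed channels when workers only talk to one orchestrator (brief down, result up)."""
    return 2 * n_workers


for n in (3, 5, 8):
    # Same head count in both layouts: n peers versus one orchestrator plus (n - 1) workers.
    print(n, p2p_channels(n), orchestrator_channels(n - 1))
```

With five agents, the peer layout has 20 directed channels to keep consistent; the orchestrator layout has 8. The gap widens quadratically as the team grows.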

A concrete example. A research team I spoke with in late 2025 built a P2P agent system to draft a competitive analysis. Five agents: market analyst, product analyst, financial analyst, writer, editor. The first run took 90 minutes and produced a 14,000-word document. About 11,000 of those words were the agents discussing what the document should contain. The remaining 3,000 were the document itself, and its sections contradicted one another. The team rebuilt it as an orchestrator-worker setup the following week. Same task, 22 minutes, 4,200 words, internally consistent. Roughly half the token spend.

P2P didn't fail because the agents weren't smart enough. It failed because it asked them to do the hardest thing a group can do: organize themselves without a chair.


Orchestrator-Worker: The Pattern That Won

Anthropic's research system is a textbook orchestrator-worker setup. One coordinator agent owns the high-level task. It decomposes the task into subtasks, hands each subtask to a worker agent with a specific brief, collects results, and decides what to do next. Workers don't talk to each other. They talk to the orchestrator.

This maps cleanly onto human org design, which is exactly why it works. A small startup with one founder and four contractors operates this way. The founder holds the spec, the budget, the timeline, and the shared context. Contractors execute scoped tasks against a brief. Information flows up through the founder; tasks flow down from the founder. Contractors aren't expected to coordinate among themselves except in carefully scoped pair-ups.

The pattern has four properties that matter for reliability.

One owner of shared state. The orchestrator holds the canonical record. There's no ambiguity about what's been done.

Scoped worker contexts. Each worker gets only what it needs. This keeps context windows manageable and reduces the chance of cross-task contamination.

Defined handoffs. Worker outputs come back in a structured format the orchestrator can verify. No free-form negotiation.

Single accountability surface. When the output is wrong, the orchestrator is responsible. You debug one place.

Anthropic's write-up is explicit about how much of their reliability work happened inside the orchestrator: the lead agent's prompt is the longest and most carefully tuned part of the system, because that's where the role definitions, decomposition heuristics, and termination logic live. (Anthropic engineering)
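
To make the four properties tangible, here is a minimal sketch of the shape, not Anthropic's implementation. The `Orchestrator`, `WorkerResult`, and `stub_worker` names are hypothetical, and the workers are stand-ins for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkerResult:
    subtask: str
    findings: str
    ok: bool


@dataclass
class Orchestrator:
    # Hypothetical registry: worker name -> callable that takes a brief and returns a result.
    workers: dict[str, Callable[[str], WorkerResult]]
    # The canonical record. Only the orchestrator writes to it.
    state: dict[str, WorkerResult] = field(default_factory=dict)

    def run(self, plan: dict[str, str]) -> str:
        """plan maps worker name -> scoped brief. Workers never see each other's output."""
        for name, brief in plan.items():
            result = self.workers[name](brief)
            if not result.ok:
                # Single accountability surface: failures surface here, with a name attached.
                raise RuntimeError(f"worker {name!r} failed on {result.subtask!r}")
            self.state[name] = result
        return "\n".join(r.findings for r in self.state.values())


def stub_worker(label: str) -> Callable[[str], WorkerResult]:
    # Stand-in for a model call; a real worker would send the brief to an LLM.
    return lambda brief: WorkerResult(subtask=brief, findings=f"[{label}] {brief}", ok=True)


orch = Orchestrator(workers={"market": stub_worker("market"), "pricing": stub_worker("pricing")})
report = orch.run({"market": "size the segment", "pricing": "list competitor tiers"})
print(report)
```

The sketch runs workers sequentially for clarity. A production version would dispatch them in parallel, but the ownership rules stay the same.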

Bounded collaboration is a useful variant. Two workers might be allowed to confer on a specific handoff, but only through a structured channel and only on a defined topic. Think of it as a scheduled standup, not a Slack channel. The boundary is the point.

| Pattern | Failure resilience | Cost | Complexity | Where it fits |
| --- | --- | --- | --- | --- |
| Peer-to-peer (group chat) | Low. Every failure mode amplifies. | High. Lots of negotiation tokens. | Misleading. Looks simple, isn't. | Demos, brainstorming sketches. |
| Orchestrator-worker | High. One owner, one debug surface. | Moderate to high. ~10-15x single-agent. | Moderate. Most logic lives in the orchestrator. | Research, decomposition, parallelizable work. |
| Bounded collaboration | Medium-high. Risk lives at the seam. | Moderate. Cheaper than full P2P. | Higher. Designed handoffs are work. | Specialist pair-ups under an orchestrator. |
| Agent-flow (DAG) | High. Static structure pre-empts drift. | Low to moderate. Reuses cached steps. | Moderate at design time, low at runtime. | Repetitive pipelines, batch processing. |

The 5-Level Autonomy Framework

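The other piece of 2025 scaffolding worth knowing is the autonomy framework from "Levels of Autonomy for AI Agents" (arXiv 2506.12469, with a companion governance write-up at the Knight First Amendment Institute at Columbia University). The framing is loosely analogous to SAE driving automation, but for AI agents. As summarized here, it runs from an assistive baseline (level 0) up through five levels of increasing agent autonomy.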

Level 0: Assistive. The model suggests; the human acts. Autocomplete, inline code suggestions, draft email composition.

Level 1: Operator. The human still issues each action, but the agent assembles tool calls and composite steps under direct instruction.

Level 2: Reviewer. The agent proposes a plan and executes it under review. The human approves at major checkpoints.

Level 3: Collaborator. The agent owns whole tasks autonomously inside a scoped boundary. The human reviews outputs, not steps.

Level 4: Expert. The agent operates independently on complex multi-step work, with the human stepping in only on flagged exceptions.

Level 5: Agent. Full autonomy across a domain. The agent sets goals, plans, executes, and self-corrects with minimal oversight.

Anthropic's complementary "Measuring AI agent autonomy in practice" work makes a related point: in real deployments, the operating level is rarely uniform across a system. A "level 3" system usually contains level 1 subcomponents (high-stakes actions) and level 4 subcomponents (low-stakes background work). What matters is matching the level to the task, not raising the level globally.

The level you target shapes every role-design decision downstream. At level 2, your worker agents need clear plan-review affordances. At level 4, they need exception flagging and rich tracing. At level 5, they need formal verification of acceptance criteria, because nothing else catches a wrong answer. Builders who skip this step pick architecture first and then discover, in production, that the level the architecture implies isn't the level the task can tolerate.
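
One way to keep the level non-uniform is to assign it per action instead of per system. The sketch below uses the level names as listed above; the action names and the `ACTION_LEVEL` table are hypothetical, and the review cutoff is an assumption drawn from the level descriptions, not from the paper.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    ASSISTIVE = 0
    OPERATOR = 1
    REVIEWER = 2
    COLLABORATOR = 3
    EXPERT = 4
    AGENT = 5


# Hypothetical overrides: high-stakes actions run lower, low-stakes background work runs higher.
ACTION_LEVEL = {
    "send_customer_email": Autonomy.OPERATOR,
    "reindex_search": Autonomy.EXPERT,
}


def level_for(action: str, system_default: Autonomy = Autonomy.COLLABORATOR) -> Autonomy:
    """Actions without an override run at the system's default level."""
    return ACTION_LEVEL.get(action, system_default)


def human_reviews_steps(level: Autonomy) -> bool:
    # Assumption: at levels 0-2 a human issues or approves actions and checkpoints;
    # from level 3 up, the human reviews outputs or exceptions only.
    return level <= Autonomy.REVIEWER


for action in ("send_customer_email", "draft_release_notes", "reindex_search"):
    level = level_for(action)
    print(action, level.name, human_reviews_steps(level))
```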


Levels in Practice: The Devin Case

Cognition's Devin became the most quoted cautionary tale of 2025. Launched in March 2024 as "the first AI software engineer," Devin scored 13.86 percent on SWE-bench, which at the time was state of the art. The marketing implied level 4 or level 5 autonomy: hand it a ticket, get a working PR back.

By mid-2025, multiple independent reviews put Devin's real-world success rate at roughly 15 percent, meaning an effective failure rate around 85 percent on tasks that weren't curated benchmark instances. A widely cited Answer.AI review walked through 20 real attempts and reported 14 failures, 3 inconclusive results, and 3 successes, with several failures producing confidently wrong output that took longer to debug than writing the code from scratch.

What happened is the benchmark-vs-production gap, sharpened. SWE-bench tasks are clean: one repo, one well-described issue, a clear test signal, a constrained surface area. Real engineering tickets are messy: ambiguous specs, cross-cutting concerns, undocumented assumptions, decisions that depend on tribal knowledge. The same agent that placed level 3 on the benchmark dropped to a wobbly level 2 in the wild, sometimes worse.

Devin isn't a story about a bad agent. It's a story about a level mismatch. The architecture aimed at level 5 reliability; the underlying capability supported level 2 at best on non-curated work. The marketing encouraged users to operate the system at the advertised level, where it failed. Cognition's later pivots (more scoped use cases, more human-in-the-loop affordances, more honest framing) are an attempt to bring the operating level back into line with the capability.

The lesson is concrete. Pick the autonomy level your system can sustain on your hardest real task, not your easiest benchmark. Design the roles, the supervision, and the escalation paths for that level. If you want level 5 marketing, build a level 5 system; if you have a level 3 system, market it as one.


Cursor 2.0 and the Hardware-Backed Workflow

Cursor 2.0 shipped in October 2025 and quietly resolved one of the most persistent issues with multi-agent coding: workspace conflict. Cursor's background agents now run on cloud VMs, each in its own git worktree, with the IDE able to coordinate up to eight in parallel.

The architectural detail that matters: each agent has its own filesystem. No shared working tree means no merge conflicts mid-edit, no "agent A overwrote agent B's changes," no flaky test runs because two agents were touching the same files. When an agent finishes, its branch can be reviewed and merged like any other PR. When it goes wrong, you throw the worktree away.

This is hardware-backed isolation in the same sense that virtual machines gave process isolation in the 2000s. The agent fences are no longer "we promise not to step on each other"; they're "the operating system literally won't let us."

Why this matters for the Cemri taxonomy: hardware isolation removes a whole class of inter-agent misalignment failures. State stays inside the worktree. Side effects stay inside the VM. The orchestrator (Cursor itself, or the user) sees only the diff each agent produces. The acceptance check is structural (does the PR pass CI?) instead of conversational (does this agent claim the work is done?).
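
You can get the worktree half of this isolation with plain git. The sketch below only builds the commands; it is not how Cursor implements it. The `agent/<id>` branch naming and the sibling-directory layout are assumptions for illustration.

```python
from pathlib import PurePosixPath


def worktree_path(repo: PurePosixPath, agent_id: str) -> PurePosixPath:
    # Assumed layout: each agent gets a sibling directory next to the main checkout.
    return repo.parent / f"{repo.name}-agent-{agent_id}"


def worktree_add_cmd(repo: PurePosixPath, agent_id: str, base: str = "main") -> list[str]:
    """git worktree add -b <branch> <path> <base>: a fresh branch in its own working tree."""
    return [
        "git", "-C", str(repo), "worktree", "add",
        "-b", f"agent/{agent_id}", str(worktree_path(repo, agent_id)), base,
    ]


def worktree_remove_cmd(repo: PurePosixPath, agent_id: str) -> list[str]:
    """Throw the working tree away, including uncommitted changes (--force)."""
    return [
        "git", "-C", str(repo), "worktree", "remove", "--force",
        str(worktree_path(repo, agent_id)),
    ]


repo = PurePosixPath("/srv/app")
print(" ".join(worktree_add_cmd(repo, "a1")))
print(" ".join(worktree_remove_cmd(repo, "a1")))
```

Passing either list to `subprocess.run(cmd, check=True)` would execute it. A worktree isolates the filesystem view only; isolating side effects such as network calls or database writes still needs a VM or container around it.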

Practitioners running Cursor 2.0 alongside Anthropic's Claude Code, OpenAI's Codex CLI, and other parallel-agent tools have settled into a pattern: spawn three to eight agents on independent tasks, monitor their progress through a unified dashboard, merge the wins, discard the losses. The cost model is closer to "rent a junior contractor for an hour and a half" than to "ask a chatbot a question." The output model is closer to GitHub PR review than to chat completion.

Anysphere (Cursor), Bolt, Lovable, and v0 inside Vercel are now all running variants of this loop internally. The companies that ship the most agent-driven code are the companies that built the workspace isolation first.


Designing Roles for AI Teammates

Once you accept that multi-agent failure is an org-design problem, the design moves you make start to look like classical management. Every agent role needs four artifacts.

A scoped responsibility. One sentence: what this agent owns and what it doesn't. A "researcher" agent that also writes prose is two agents in a trench coat. Split them.

A structured input brief. Not "go research X." A template that fills in: the question, the prior context the agent should assume, the format of the expected output, the constraints (time, tokens, tools), the source preferences. This is the agent equivalent of a project brief.

Defined acceptance criteria. What does "done" look like? Often this is a schema (the output must validate against this Zod type) or a deterministic check (the PR must pass these three test files). Where deterministic checks aren't available, a separate reviewer agent runs against an explicit rubric.

An escalation path. When the agent gets stuck, what does it do? Reasonable defaults: ask the orchestrator a structured clarifying question, surface a flagged exception to a human, abort with a typed error. The wrong default is "keep going and improvise." That's where hallucinated success comes from.

Apply Cemri's failure modes one at a time and these four artifacts cover most of them. Role ambiguity dies on the scoped responsibility. State ambiguity dies on the structured brief. Accountability gaps die on the acceptance criteria. Coordination overhead drops because the agent doesn't need to negotiate; it has a brief and a rubric. Specification drift drops because the spec is captured in the schema, not in vibes.
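
Here is one way the four artifacts might look in code. It is a sketch under stated assumptions: the `Brief` fields, the `Escalation` error, and the required-field check are all hypothetical, and the check is a stand-in for a real schema validator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    responsibility: str               # one sentence: what the agent owns
    question: str                     # the structured input
    required_fields: tuple[str, ...]  # acceptance criteria: fields that must be non-empty
    max_tokens: int                   # a constraint the worker is expected to respect


class Escalation(Exception):
    """Typed abort: the output missed the brief, and the error says how."""

    def __init__(self, reason: str, missing: tuple[str, ...]):
        super().__init__(reason)
        self.missing = missing


def accept(brief: Brief, output: dict) -> dict:
    """Deterministic acceptance check, run by the orchestrator and not by the worker."""
    missing = tuple(name for name in brief.required_fields if not output.get(name))
    if missing:
        raise Escalation("output failed acceptance check", missing)
    return output


brief = Brief(
    responsibility="Find pricing data. Does not write prose.",
    question="What do the three largest competitors charge per seat?",
    required_fields=("answer", "sources"),
    max_tokens=4000,
)

try:
    accept(brief, {"answer": "Between $20 and $40 per seat", "sources": []})
except Escalation as exc:
    print("escalate to orchestrator, missing:", exc.missing)
```

The point of the typed error is that "keep going and improvise" is no longer the default path. A missing field stops the run and names what was missing.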

The non-obvious part: write the role like you'd write a job description for a contractor, not a prompt for a chatbot. Contractors are the right mental model. They have a brief, they deliver an output, they don't need to know your company's whole context, and you fire them if they keep missing the spec.


The Founder's Multi-Agent Playbook

Here's the practical version, distilled from the Anthropic write-up, the Cemri paper, the Devin postmortems, and the Cursor 2.0 workflow data.

Start at level 2 or 3, not level 5. Capability is fragile under distribution shift. Even if your benchmark says level 4, your hardest real task is usually a level lower. Target the level your hardest task can sustain.

Use orchestrator-worker, not peer-to-peer. One agent owns shared state. Workers get scoped briefs. Bounded collaboration only at carefully designed seams. No group chats.

Define one agent role at a time. Resist the urge to design a five-agent system on a whiteboard before any of it ships. Ship one orchestrator and one worker. Watch it for a week. Add the next worker when you've seen the first one fail and fixed it.

Write the brief like a job description. Responsibility, inputs, acceptance criteria, escalation. If you can't describe the role in four short sections, the role isn't ready to ship.

Instrument with full traces. Every agent action, every tool call, every intermediate output. You cannot debug multi-agent systems by reading the final output. The bug is almost always upstream.
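
A trace does not need a vendor to get started. This is a minimal sketch that records every call into an in-memory list; `traced` and `TRACE` are hypothetical names, and a real system would ship these events to a tracing backend.

```python
import functools
import time

TRACE: list[dict] = []  # in-memory sink, for illustration only


def traced(agent: str):
    """Record inputs, output or error, and duration for every call made by `agent`."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            event = {"agent": agent, "call": fn.__name__, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                event["output"] = fn(*args, **kwargs)
                return event["output"]
            except Exception as exc:
                event["error"] = repr(exc)  # failed calls are traced too
                raise
            finally:
                event["seconds"] = time.perf_counter() - start
                TRACE.append(event)
        return inner
    return wrap


@traced("researcher")
def search(query: str) -> list[str]:
    # Stand-in for a real tool call.
    return [f"result for {query}"]


search("agent tracing")
print(TRACE[0]["agent"], TRACE[0]["call"], TRACE[0]["output"])
```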

Budget for 15x token cost. Anthropic reports that multi-agent runs use roughly 15 times the tokens of a chat interaction. Plan accordingly. If your unit economics break at 15x, multi-agent is the wrong pattern for that feature.

Keep a human in the loop on consequential actions. "Consequential" usually means: writes to a customer-facing system, sends an external communication, spends money, deletes data, modifies a security-relevant resource. The human review costs seconds; the absence of it can cost months.
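
The gate itself can be a few lines. In this sketch the action kinds mirror the list above, while `guarded` and the `approve` callback are hypothetical; in practice `approve` would block on a real review step.

```python
from typing import Callable

# Action kinds taken from the list above; the identifiers are illustrative.
CONSEQUENTIAL = frozenset({
    "write_customer_system",
    "send_external_message",
    "spend_money",
    "delete_data",
    "modify_security_resource",
})


def guarded(kind: str, run: Callable[[], str], approve: Callable[[str], bool]) -> str:
    """Run the action, unless it is consequential and the human declines."""
    if kind in CONSEQUENTIAL and not approve(kind):
        return f"blocked: {kind} needs human approval"
    return run()


def deny_all(kind: str) -> bool:
    # Stand-in reviewer that rejects everything it is asked about.
    return False


print(guarded("delete_data", lambda: "rows deleted", deny_all))
print(guarded("summarize_doc", lambda: "summary ready", deny_all))
```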

Build the workspace isolation before the agent fleet. Cursor's lesson generalizes. Git worktrees for code, scoped database transactions for data, dedicated VMs for environment-touching work. Isolation is cheaper than coordination.

Run a postmortem on every failure for the first 90 days. Tag each failure against the Cemri taxonomy. Patterns emerge fast. The third "agent re-did finished work" failure is the signal that your termination conditions need tightening.
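
Counting tags is enough to surface the "third time" signal. The tag names below are my own shorthand for a few failure modes, not the paper's official labels, and the threshold of three is the rule of thumb from the paragraph above.

```python
from collections import Counter


def recurring(tags: list[str], threshold: int = 3) -> list[str]:
    """Return failure tags seen at least `threshold` times, in alphabetical order."""
    counts = Counter(tags)
    return sorted(tag for tag, seen in counts.items() if seen >= threshold)


# Hypothetical postmortem log: one tag per failure.
failures = [
    "reset_finished_work",
    "missing_verification",
    "reset_finished_work",
    "role_violation",
    "reset_finished_work",
]

print(recurring(failures))
```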

None of this is exotic. It's the same playbook a competent engineering leader uses to onboard a new team of contractors. The reason it transfers to multi-agent systems is that it's the same problem with a tighter feedback loop.


Where This Is Headed

The next 12 to 18 months are about turning these patterns into infrastructure. A few threads to watch.

Agent-to-agent protocols. Google's A2A protocol, the AGENTS.md convention emerging across IDE and CLI tooling, and various interop drafts in working groups are attempts to standardize how agents discover each other, exchange capability descriptions, and authenticate. (Anthropic's overview of building agents gestures at this direction.) The point is to make orchestrator-worker patterns composable across vendors, instead of locked to one provider's SDK.

Capability-token authorization. Today, most agents inherit the full credentials of the user who launched them. That's a bad idea for level 3 and above. Capability tokens (narrow, time-bounded, and scoped to specific tool calls and resources) are how agents will get the permissions they actually need without the ones they don't. Expect this to land in production SDKs in 2026.
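
The shape of such a token is simple even if the production details are not. This is a sketch of the check only, with hypothetical field names; it has no signing or revocation, both of which a real scheme would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityToken:
    tools: frozenset[str]   # the only tool calls this token permits
    resource_prefix: str    # the only resources it covers
    expires_at: float       # unix seconds; the token is useless afterwards


def allows(token: CapabilityToken, tool: str, resource: str, now: float) -> bool:
    """All three scopes must hold: time, tool, and resource."""
    return (
        now < token.expires_at
        and tool in token.tools
        and resource.startswith(token.resource_prefix)
    )


token = CapabilityToken(tools=frozenset({"read_file"}), resource_prefix="repo/docs/", expires_at=1000.0)
print(allows(token, "read_file", "repo/docs/intro.md", now=999.0))
print(allows(token, "write_file", "repo/docs/intro.md", now=999.0))
```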

Verified agent identities. When agents start calling other agents across organizational boundaries, the receiving side needs to know what's calling. Signed agent identities, attestation of training and configuration, and cross-vendor identity formats are being prototyped now. The model is closer to certificate transparency than to OAuth.

Better evaluation. SWE-bench, MLE-bench, GAIA, and similar suites have stretched far enough that frontier models saturate them. The next generation of evals will measure things benchmarks have historically ducked: long-horizon task completion, sustained policy compliance, recovery from failure, cost-per-success. Expect "agent reliability" to become a measurable property the way "uptime" did for services.

Standardized failure tagging. Cemri et al. gave the field a taxonomy. The natural next step is tooling that tags traces against that taxonomy automatically, the way Sentry tags exceptions. Founders who set this up early will debug their agent systems an order of magnitude faster than ones who don't.

The work is still mid-flight. None of these threads is settled. But the shape of the next phase is visible: less prompt craft, more systems engineering; less "what can the model do?" and more "what role do I need filled, and what does done look like?" The teams that thrive will be the ones that treat agents as teammates, with the same rigor they'd apply to any new hire. Roles, briefs, acceptance criteria, escalation paths. The boring artifacts of competent organizations.


Frequently Asked Questions

Should every team build multi-agent systems?

No. Most production AI features still work best as single-agent setups with good tools. Multi-agent earns its keep when the task is genuinely decomposable (independent subtasks that can run in parallel), the per-task value justifies the 15x token cost, and reliability has been proven at the single-agent layer first. A failing single-agent system does not become a working multi-agent system. It usually becomes a more expensive failing system.

What's the simplest production-ready multi-agent pattern?

Orchestrator with two workers, both with deterministic acceptance criteria. The orchestrator owns shared state. Workers get scoped briefs. Outputs validate against a schema before the orchestrator accepts them. This is enough to capture most of the benefit of decomposition without the coordination overhead of larger teams. Scale up worker count only after this baseline is reliable.

Is the 15x token cost worth it?

It depends on the value of the output. Anthropic's research system makes economic sense because the alternative is a human researcher spending hours on the same task. For low-value, high-volume work (intent classification, simple summarization, routine extraction), 15x token cost is almost never worth it; use a single-agent setup or a smaller model. For high-value, low-volume work (deep research, complex coding tasks, multi-source analysis), the math often works.
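
A back-of-envelope comparison makes the trade explicit. Everything here is an assumption for illustration: the dollar values, the success rates, and the simplification that a multi-agent run costs 15 times the cheaper baseline run.

```python
def expected_net(value_per_success: float, success_rate: float, cost_per_run: float) -> float:
    """Expected value of one run minus what the run costs."""
    return value_per_success * success_rate - cost_per_run


def prefer_multi_agent(
    value_per_success: float,
    base_cost: float,
    single_rate: float,
    multi_rate: float,
    multiplier: float = 15.0,
) -> bool:
    single = expected_net(value_per_success, single_rate, base_cost)
    multi = expected_net(value_per_success, multi_rate, base_cost * multiplier)
    return multi > single


# Hypothetical deep-research task: high value per success, large reliability gain.
print(prefer_multi_agent(200.0, base_cost=0.50, single_rate=0.40, multi_rate=0.70))

# Hypothetical intent classification: tiny value per call, almost no reliability gain.
print(prefer_multi_agent(0.05, base_cost=0.002, single_rate=0.95, multi_rate=0.97))
```

The first case favors multi-agent because the reliability gain is worth far more than the extra tokens. The second does not, because the extra tokens cost more than the task is worth.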

How do I know when to spawn a subagent versus use a longer prompt?

Spawn a subagent when the work has a different context requirement from the parent task. If the subtask needs different tools, a different system prompt, or context that would pollute the parent's window, give it its own agent. If the subtask is "just do this next thing" and shares the parent's context, a longer prompt or a tool call is cheaper and more reliable. The deciding question is usually about context: would I want this context in my orchestrator's memory after the subtask finishes?

What's the difference between an agent and an AI tool?

A tool is a single deterministic capability the model can call (a web search, a database query, a code formatter). An agent is a loop: it observes, plans, calls tools, evaluates the result, and decides what to do next, often across many turns. Tools are nouns; agents are verbs. A well-built agent uses many tools. A "tool" that calls a model internally and runs its own loop is, in practice, an agent.
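
The distinction is easiest to see in code. In this sketch the tools are plain functions and the agent is the loop around them. The `scripted_policy` stands in for the model's planning step and is hypothetical; a real policy would be an LLM call that reads the observation.

```python
from typing import Callable, Optional

# Tools: single deterministic capabilities.
TOOLS: dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "reverse": lambda text: text[::-1],
}


def scripted_policy(goal: str, observation: str, step: int) -> Optional[str]:
    # Stand-in for the model: follows a fixed plan, then signals that it is done.
    plan = ["upper", "reverse"]
    return plan[step] if step < len(plan) else None


def run_agent(goal: str, policy: Callable[[str, str, int], Optional[str]], max_steps: int = 5) -> str:
    """The agent: observe, pick a tool, call it, look at the result, repeat."""
    observation = goal
    for step in range(max_steps):  # hard cap so the loop always terminates
        tool = policy(goal, observation, step)
        if tool is None:           # the policy decided the task is finished
            break
        observation = TOOLS[tool](observation)
    return observation


print(run_agent("abc", scripted_policy))
```

The `max_steps` cap is worth keeping even with a smarter policy, since termination conditions that never fire are one of the failure modes discussed earlier.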


Closing Thoughts

The most useful framing I've heard for the current moment came from an engineering leader running a multi-agent coding rollout: "We stopped thinking of agents as smart and started thinking of them as juniors with infinite stamina." Juniors need clear roles, written briefs, defined acceptance criteria, and an escalation path. They make predictable mistakes. They get better with feedback. They fail when nobody owns their work.

That's the org-design shift hiding inside the multi-agent conversation. The hard problems of 2025-2026 are not capability problems anymore. The models are good enough. The hard problems are the same ones that have always made teams of humans hard to run: who owns what, who holds the state, who checks the work, who's accountable when it goes wrong. Cemri's taxonomy, the autonomy framework, the orchestrator-worker pattern, the Devin postmortem, the Cursor isolation model: they're all answers to versions of the same question.

If you're a founder shipping an agent-driven product in 2026, the boring artifacts are the leverage. Write the roles. Write the briefs. Define done. Build the workspace isolation. Instrument the traces. Keep the human in the loop where it counts. The teams that do this will look slower for two months and faster for two years. The teams that skip it will spend the rest of the year tagging failure modes against a taxonomy somebody else already wrote.

The agents are teammates now. Manage them like teammates.
