A single agent loses focus around 60–80 turns. Splitting work across agents — planner, executor, reviewer — restores clean context windows and unlocks parallelism. Three design questions decide everything else: who blocks on whom, what context the child sees, and how the result flows back. The default-and-best answers (parent waits, fresh context, summary back) cover 90% of cases. The remaining 10% are where the architecture earns its name.
Multi-agent coordination
A single LLM agent forgets things. Around the 60-turn mark, even with a 200K context, attention frays — earlier reasoning gets diluted, the agent revisits already-failed paths, output quality drifts. The fix isn’t a smarter model. It’s fewer turns per agent, achieved by delegating subtasks to fresh agents that finish quickly and report back.
That’s all “multi-agent” really is: a way to spend many short, sharp context windows instead of one long, blunt one.
The dominant pattern: a tool that delegates
Every project in the corpus implements multi-agent the same way at heart: the parent has a delegate (or agent, or subagent) tool. Calling it pauses the parent. The child runs to completion in its own loop. The child’s final answer comes back as a tool_result and the parent resumes one turn later.
```mermaid
sequenceDiagram
    participant P as Parent
    participant T as Delegate tool
    participant C as Child agent
    P->>T: delegate(task='research X', context=summary)
    T->>C: spawn(messages=[system, user_task])
    Note over C: child runs its own loop
    C-->>T: final answer / artifact
    T-->>P: tool_result(child_summary)
    Note over P: parent resumes
```
Why this is the default: it slots into the existing loop without inventing new infrastructure. The parent already knows how to dispatch tools. A child agent is just an unusually long-running tool.
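A minimal sketch of that shape, with hypothetical names (`run_agent`, `delegate`; no particular project's API) and the child loop stubbed out:

```python
def run_agent(system_prompt: str, task: str) -> str:
    """Child loop: a brand-new conversation that runs to completion.
    The real loop (LLM turns, tool calls) is stubbed here."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    return f"summary of: {task}"  # stand-in for the child's final answer

def delegate(task: str) -> dict:
    """Parent-side tool handler: blocks until the child finishes, then
    hands the child's final answer back as an ordinary tool_result."""
    answer = run_agent("You are a focused sub-agent.", task)
    return {"type": "tool_result", "content": answer}
```

The parent never sees the child's intermediate turns; only the returned `tool_result` enters its context.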
Question 1 — what context does the child see?
The child gets a brand-new conversation: just the system prompt plus the task description the parent wrote. No history. No prior turns.
This is what you want by default. The whole point of delegating is that the child doesn’t carry the parent’s accumulated cruft. The parent compresses what the child returns, not what the child produces along the way.
If you find yourself thinking “but the child needs to know X” — write X into the task description. That’s the discipline. The discipline is the win.
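One way to enforce that discipline in code: a single function that assembles everything the child will ever see. A hypothetical sketch (names and facts are illustrative):

```python
def make_child_messages(system_prompt: str, task: str, facts: list[str]) -> list[dict]:
    """The child sees ONLY what the parent writes down here: if the
    child 'needs to know X', X must appear as an explicit fact."""
    briefing = task
    if facts:
        briefing += "\n\nRelevant context:\n" + "\n".join(f"- {f}" for f in facts)
    return [{"role": "system", "content": system_prompt},
            {"role": "user", "content": briefing}]

msgs = make_child_messages(
    "You are a focused sub-agent.",
    "Audit auth.py for injection bugs.",
    facts=["The app uses SQLAlchemy 2.x", "Auth routes live under /api/v2"],
)
```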
The alternative is forking: the child inherits a copy of the parent’s context at the moment of delegation.
You almost never want this. The two cases where you do:
- The parent has built up a detailed plan that’s awkward to summarize without losing fidelity, and the child needs to execute it verbatim.
- The parent has loaded a large reference document into context and the child needs to query against the same document.
Both cases are warning signs. The first is usually solved by putting the plan in a file the child reads as a tool call. The second is usually solved by giving the child a search tool against the document.
Reach for forking only when the alternatives are clearly worse.
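The plan-in-a-file alternative takes only a few lines. A sketch with a hypothetical `hand_off_plan` helper: the parent persists the plan, and the child receives a path instead of a forked context:

```python
import pathlib
import tempfile

def hand_off_plan(plan: str) -> str:
    """Parent persists the plan verbatim; the child receives only the
    path and reads it back with its own file tool."""
    path = pathlib.Path(tempfile.mkdtemp()) / "plan.md"
    path.write_text(plan)
    return str(path)

plan_path = hand_off_plan("1. refactor auth\n2. add tests")
task = f"Execute the plan in {plan_path} verbatim."
```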
A third option is the persistent worker: a long-running child that takes successive tasks. It saves spin-up cost when the child has expensive setup (e.g. building a code-search index).
It’s rarely worth it. Lifecycle becomes a question (when does the pool drain? what happens if the child crashes?), and the children are no longer cleanly independent. Most teams just pay the spin-up cost and keep agents stateless.
Question 2 — does the parent block?
| Situation | Call |
|---|---|
| Children read independently (no side effects) | Run in parallel |
| Each child writes to a shared file or DB | Serialize, or scope each child to a subdir |
| Children depend on each other's output | Serialize |
| You don't know yet | Block by default; loosen later |
Recommended default: parallelism is the headline win of multi-agent. Run children in parallel where you can, but verify first that their tools don’t share write paths.
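When children really are independent, the blocking delegate call generalizes to a fan-out. A sketch using the standard library, with the child loop stubbed by `run_child`:

```python
from concurrent.futures import ThreadPoolExecutor

def run_child(task: str) -> str:
    # stand-in for a full child agent loop
    return f"done: {task}"

def delegate_parallel(tasks: list[str]) -> list[str]:
    """Fan out independent children, block until all finish,
    and return results in task order."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        return list(pool.map(run_child, tasks))

results = delegate_parallel(["scan /api", "scan /admin", "scan /auth"])
```

`pool.map` preserves task order, so the parent can zip results back onto the tasks it dispatched.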
Question 3 — what flows back?
The child’s outcome comes back as a tool_result. There are three flavors of “outcome”:
1. Just the final answer — a string or markdown summary. Cheapest; the parent loses everything else. Right when the child’s job is “answer this question” and the work was disposable.
2. Final answer + key artifacts — the summary plus paths/IDs the child produced (file paths, ticket IDs, commit hashes). Right when the parent will keep working with what the child made.
3. Full sub-trajectory — the entire child transcript. Almost never useful: it defeats the point of delegation and blows up the parent’s context. The exception: when an auditor or human needs to inspect the child’s reasoning later, store the trajectory but don’t return it. Return a pointer.
The Strix mailbox — when delegation isn’t enough
Strix takes the unusual step of letting agents post messages to peer agents, not just return up to a parent. The infrastructure is module-level dictionaries: a graph dict tracks parent-children, an instances dict holds live agents, a messages dict is per-recipient mailboxes. Any agent can drop a note into another’s mailbox.
```python
Msg = dict[str, str]  # {"from": sender_id, "body": text}

class BaseAgent:
    # class-level registries, shared by every agent in the process
    _agent_graph: dict[str, list[str]] = {}        # parent id -> child ids
    _agent_instances: dict[str, "BaseAgent"] = {}  # agent id -> live agent
    _agent_messages: dict[str, list[Msg]] = {}     # recipient id -> mailbox
```
The cleverness isn’t the dicts — it’s the recognition that for a single-process pentest, you don’t need Redis or a broker. Python’s GIL serializes individual dict operations. You just need to declare “one scan per process” and you’re done.
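A sketch of how posting and draining such a mailbox might look (function names are illustrative, not Strix’s actual API):

```python
_agent_messages: dict[str, list[dict]] = {}  # recipient id -> mailbox

def post_message(sender: str, recipient: str, body: str) -> None:
    # one setdefault/append per call; atomic enough under the GIL
    _agent_messages.setdefault(recipient, []).append(
        {"from": sender, "body": body})

def drain_mailbox(agent_id: str) -> list[dict]:
    """Take everything addressed to agent_id and empty the box."""
    inbox = _agent_messages.get(agent_id, [])
    _agent_messages[agent_id] = []
    return inbox

post_message("scanner", "reporter", "found SQLi at /login")
```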
Coordination pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Children that delegate to children that… | runaway cost, rate limits | depth cap (most teams: 2 or 3) |
| Children that share writeable files | flaky races | filesystem lock, or scope each child to a subdir |
| Parent waits sequentially on slow children | wall-clock dominated by tail | parallelize where independent |
| Child crashes, parent hangs | half-completed task, stuck loop | timeouts + structured error returns |
| Parent and child instructions conflict | child does the wrong thing | child re-states scope in its own system prompt |
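The “child crashes, parent hangs” row is the one worth hard-coding against. A sketch of a timeout wrapper that never raises into the parent loop, with `run_child` standing in for a real child:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def run_child(task: str) -> str:
    # stand-in for a real child agent loop
    return f"done: {task}"

def delegate_safely(task: str, timeout_s: float = 300.0) -> dict:
    """Never hang and never raise: the child's outcome always comes
    back as a structured ok/error result the parent can reason about."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(run_child, task)
        try:
            return {"ok": True, "result": future.result(timeout=timeout_s)}
        except FuturesTimeout:
            future.cancel()  # best effort; a truly stuck thread needs a process kill
            return {"ok": False, "error": f"child timed out after {timeout_s}s"}
        except Exception as exc:
            return {"ok": False, "error": f"child crashed: {exc!r}"}
```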
When not to delegate
- The task is small. A 5-turn delegation is more overhead than just doing it inline.
- The state is too tangled to summarize. If you can’t crisply describe what the child should do, the child won’t do it crisply either.
- One critical edit. Don’t delegate “now write the fix” if the parent has all the context — fragmenting understanding loses the fix.
Cross-project comparison
| Project | Pattern | Fresh by default? | Parallel children? | Notable |
|---|---|---|---|---|
| Claude Code | Parent-child delegate tool | yes | sequential | Fork flag exists, rarely used |
| OpenHands | Delegate as an action / event | yes | sequential | Registry-based agent lookup |
| Strix | Delegate + module-level mailbox | yes | yes (threading) | Single-process limitation |
| Hermes | Delegate tool | yes | yes (threading) | Registry of named specialist agents |
| Multica | Graph nodes | n/a | yes | Edges are coordination |
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- OpenHands (v0) — All-hands AI v0 — autonomous software engineer agent. Event-sourced state, microagents, controller-level guardrails.
- Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
- OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
- Hermes Agent — 40+ tool, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.
- Open Design — Open-source design / UI-generation agent. LLM-driven design intent → code, with a design-system-aware tool surface.
- Multica — Multi-cloud / multi-agent orchestration. Architecture patterns for spanning providers and clouds in one agent.
Related insights
Two pentest reports describing the same SQL injection with different payloads aren't textually similar — but they should dedupe. Hashing fails; LLM reasoning works.
For one-process multi-agent coordination, plain Python dicts are the right answer. No Redis, no broker, no race conditions you need locks for.
A pentest agent that can be talked out of scope is dangerous. Putting scope in the locked system prompt — not the message log — defeats prompt injection.