After 30–80 turns, even a 200K context starts to fray, and you don’t want to keep paying input cost for the contents of a file you read once. Four strategies are alive in the corpus: LLM summarization (the production default), event sourcing (never compresses; derives views), sliding window (the simplest), and hybrid (verbatim recent + summarized old). The strategy choice matters less than what your summarizer is told to preserve. That list is your domain model.
Memory compression
Token budgets are finite. Conversations grow. After enough turns, the agent’s context is full of itself — old tool results, old reasoning, old file contents that have since changed. Most of it is noise. Compression is the act of throwing away the noise without throwing away the signal.
```mermaid
flowchart LR
    T1[Turn 1] --> T2[Turn 2] --> T3[Turn 3] --> T4[Turn N-2] --> T5[Turn N-1] --> T6[Turn N]
    T1 -. compress .-> S[(Summary)]
    T2 -. compress .-> S
    T3 -. compress .-> S
    S --> Ctx[Context for next call]
    T4 --> Ctx
    T5 --> Ctx
    T6 --> Ctx
    class S sum
```
The four strategies
**LLM summarization.** The standard for production agents. When token usage crosses a threshold, slice off old turns and call a separate LLM with a structured “preserve these things, summarize the rest” prompt. Prepend the summary to the remaining context.
```python
def compress(turns, target_tokens):
    # Under budget: nothing to do.
    if total_tokens(turns) < target_tokens:
        return turns
    keep_recent = turns[-RECENT_N:]      # recent turns stay verbatim
    to_summarize = turns[:-RECENT_N]     # older turns get condensed
    summary = aux_llm.complete(
        SUMMARIZE_PROMPT.format(content=to_summarize),
        # the prompt enumerates what must survive
    )
    return [SystemMessage(summary), *keep_recent]
```

The cost is one extra LLM call per compaction event — a few thousand input tokens, hundreds out. Cheap relative to what you save.
The quality is all in the prompt. See the next section.
**Event sourcing.** Don’t compress. The append-only event log is the source of truth, forever. What hits the LLM is a derived view — the last N events, filtered by type, possibly summarized at view time but never written back to the log.
```mermaid
flowchart LR
    EL[(Event Log<br/>append-only)] --> CV1[View · full]
    EL --> CV2[View · summarized]
    EL --> CV3[View · type-filtered]
    CV2 --> LLM
```
The wins are substantial: deterministic replay, a free audit trail, microagent triggers that can subscribe by event type, and you never lose anything. The cost is infrastructure — schema evolution must be planned, because old events live forever.
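To make the derived-view idea concrete, here is a minimal TypeScript sketch. The `AgentEvent` shape, the `EventLog` class, and every name in it are illustrative assumptions, not code from OpenHands or any other project:

```typescript
// Hypothetical event-sourced memory: writes only ever append; reads derive views.
interface AgentEvent {
  kind: 'user' | 'assistant' | 'tool_call' | 'tool_result';
  timestamp: number;
  payload: string;
}

class EventLog {
  private events: AgentEvent[] = [];

  append(event: AgentEvent): void {
    this.events.push(event); // the log itself is never mutated or truncated
  }

  // View: the last N events, regardless of type.
  lastN(n: number): AgentEvent[] {
    return this.events.slice(-n);
  }

  // View: only events of certain kinds — what a microagent might subscribe to.
  byKind(...kinds: Array<AgentEvent['kind']>): AgentEvent[] {
    return this.events.filter((e) => kinds.includes(e.kind));
  }

  // Deterministic replay: fold the full log into any state you like.
  replay<S>(reduce: (state: S, event: AgentEvent) => S, initial: S): S {
    return this.events.reduce(reduce, initial);
  }
}
```

Summarization, when it happens, lives inside a view (e.g. `lastN` plus a condensed prefix) and is never written back to the log.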
**Sliding window.** Drop the oldest turns. Cheap, predictable, terrible at preserving facts.
```js
if (tokenCount(messages) > BUDGET) {
  messages = messages.slice(-RECENT_N); // keep only the newest N turns
}
```

This works for short-lived agents with no long-arc dependencies (a code-review bot on one diff, a triage bot on one ticket). It is spectacularly unsuitable for agents that need to remember what they discovered three hours ago.
**Hybrid.** Recent turns verbatim, deep past summarized. The pragmatic middle.
```js
const RECENT = 10;
const recent = messages.slice(-RECENT);   // last 10 turns, verbatim
const old = messages.slice(0, -RECENT);   // everything older
const summary = await llmSummarize(old, { preserve: ['ids', 'paths'] });
return [systemMsg, summary, ...recent];
```

The recent N turns stay full-fidelity (so the model has crisp tool-result context); the older arc is condensed. Two knobs: how many recent turns to keep, and how aggressively to compress the old.
The preservation list — your domain model
The summarizer prompt has two halves. The first half says “compress aggressively.” The second half says “but never lose these specific things.” The second half is what separates a useful summary from useless mush.
```text
You are summarizing an agent's session for memory compression.

PRESERVE EXPLICITLY (do not paraphrase, do not drop):
- Task tracker IDs and their statuses
- File paths the agent has read or written
- Exact text of error messages (byte-for-byte)
- Tool arguments used (not just tool names)
- In-progress actions and their identifiers

COMPRESS LOOSELY:
- The agent's narration
- Repeated tool failures
- Casual reasoning
```
What goes in the first list is the agent’s domain knowledge. Across the corpus:
| Project | Domain | Preserved verbatim |
|---|---|---|
| OpenHands | software engineering | task IDs, file paths, in-progress action IDs, error text |
| Strix | pentesting | vulnerabilities, credentials, payloads, scope tokens |
| Claude Code | code editing | file metadata graph (paths read/written, hashes), tool args |
| Hermes | general | first-turn charter, last 3–5 turns verbatim |
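One way to wire that table up is to share the prompt skeleton and swap in each domain's lists. A hedged sketch — `buildSummarizePrompt` and the example lists are assumptions for illustration, not any project's actual API:

```typescript
// Hypothetical: one skeleton, per-domain preservation and compression lists.
function buildSummarizePrompt(preserve: string[], compress: string[]): string {
  return [
    "You are summarizing an agent's session for memory compression.",
    '',
    'PRESERVE EXPLICITLY (do not paraphrase, do not drop):',
    ...preserve.map((item) => `- ${item}`),
    '',
    'COMPRESS LOOSELY:',
    ...compress.map((item) => `- ${item}`),
  ].join('\n');
}

// A pentesting agent's list looks nothing like a coding agent's:
const pentestPrompt = buildSummarizePrompt(
  ['Vulnerabilities and their exact payloads', 'Credentials discovered', 'Scope tokens'],
  ["The agent's narration", 'Repeated tool failures'],
);
```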
When does compression fire?
- **Token threshold.** The most common: compress when the input token count exceeds a fraction (e.g. 75%) of the model's context limit. Easy to compute, predictable. Beware: each compaction busts your prompt cache.
- **Turn count.** “Every 20 turns” or “every 50 turns.” Simpler than tokens, but it ignores variable turn size — one tool result might be 50K tokens by itself.
- **Cost ceiling.** Track per-session spend; compress when it crosses a configured cap. Pragmatic for SaaS — it converts a UX problem into a billing one.
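All three triggers reduce to one predicate checked before each model call. A sketch — the constants and the `Session` shape are made up for illustration:

```typescript
const CONTEXT_LIMIT = 200_000; // model context window, in tokens
const TOKEN_FRACTION = 0.75;   // compress at 75% of the window
const MAX_TURNS = 50;          // turn-count fallback
const COST_CAP_USD = 5.0;      // per-session spend ceiling

interface Session {
  inputTokens: number;
  turnCount: number;
  spendUsd: number;
}

function shouldCompress(s: Session): boolean {
  return (
    s.inputTokens > CONTEXT_LIMIT * TOKEN_FRACTION || // token threshold
    s.turnCount >= MAX_TURNS ||                       // turn count
    s.spendUsd >= COST_CAP_USD                        // cost ceiling
  );
}
```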
Pick a strategy
- **Single-task, < 30 turns:** sliding window. Keep it simple.
- **Long-running, single domain:** LLM-summarize with a domain-specific preservation list.
- **Audit / replay required:** event-source. Compression is a view concern.
- **Long-running mixed workload:** hybrid (verbatim recent + summarized old).
**Recommended default:** LLM-summarize. Spend your design time on the preservation list, not the algorithm.
Cross-project comparison
| Project | Strategy | Trigger | Preservation rules |
|---|---|---|---|
| Claude Code | LLM-summarize | token threshold | file metadata graph + prose summary |
| OpenHands | event-source + summarizing condenser | iteration / token | task IDs, file paths, errors |
| Strix | LLM-summarize | turn count | vulns, credentials, payloads, errors |
| Hermes | hybrid | token threshold | first + last turns; summary middle |
| Mistral Vibe | sliding window | token threshold | none |
| Kimi Code | sliding window | turn count | none |
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- OpenHands (v0) — All Hands AI's v0 autonomous software-engineer agent. Event-sourced state, microagents, controller-level guardrails.
- Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
- OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
- Hermes Agent — 40+ tool, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.
Related insights
Generic "summarize this conversation" loses the bits the agent needs to keep working. Mature systems enumerate preservation rules.
Two pentest reports describing the same SQL injection with different payloads aren't textually similar — but they should dedupe. Hashing fails; LLM reasoning works.
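A minimal sketch of that idea, asking an auxiliary model for a semantic verdict instead of comparing hashes — the function and prompt here are hypothetical:

```typescript
// Hypothetical LLM-based dedupe: two findings are duplicates if the model
// judges them to describe the same underlying issue, payloads aside.
async function isDuplicate(
  a: string,
  b: string,
  llm: (prompt: string) => Promise<string>,
): Promise<boolean> {
  const verdict = await llm(
    'Do these two findings describe the same underlying vulnerability, ' +
      'even if payloads or wording differ? Answer YES or NO.\n\n' +
      `Finding A:\n${a}\n\nFinding B:\n${b}`,
  );
  return verdict.trim().toUpperCase().startsWith('YES');
}
```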