
Memory compression

Long sessions overflow the context window. The good implementations don't just summarize; they enumerate what to keep.

TL;DR

After 30–80 turns, even a 200K context starts to fray, and you don’t want to keep paying input cost for the contents of a file you read once. Four strategies are alive in the corpus: LLM summarization (the production default), event sourcing (never compresses; derives views), sliding window (the simplest), and hybrid (verbatim recent + summarized old). The strategy choice matters less than what your summarizer is told to preserve. That list is your domain model.


Token budgets are finite. Conversations grow. After enough turns, the agent’s context is full of itself — old tool results, old reasoning, old file contents that have since changed. Most of it is noise. Compression is the act of throwing away the noise without throwing away the signal.

flowchart LR
T1[Turn 1] --> T2[Turn 2] --> T3[Turn 3] --> T4[Turn N-2] --> T5[Turn N-1] --> T6[Turn N]
T1 -. compress .-> S[(Summary)]
T2 -. compress .-> S
T3 -. compress .-> S
S --> Ctx[Context for next call]
T4 --> Ctx
T5 --> Ctx
T6 --> Ctx
class S sum
Old turns roll up into a structured summary; recent turns stay verbatim.

The four strategies

LLM summarization

The standard for production agents. When token usage crosses a threshold, slice off old turns and call a separate LLM with a structured “preserve these things, summarize the rest” prompt, then prepend the summary to the remaining context.

def compress(turns, target_tokens):
    # Assumes total_tokens, aux_llm, SUMMARIZE_PROMPT, RECENT_N, and
    # SystemMessage are defined elsewhere in the codebase.
    if total_tokens(turns) < target_tokens:
        return turns
    keep_recent = turns[-RECENT_N:]    # recent turns stay verbatim
    to_summarize = turns[:-RECENT_N]   # older turns roll up into a summary
    summary = aux_llm.complete(
        SUMMARIZE_PROMPT.format(content=to_summarize),
        # the prompt enumerates what must survive
    )
    return [SystemMessage(summary), *keep_recent]

The cost is one extra LLM call per compaction event — a few thousand input tokens, hundreds out. Cheap relative to what you save.

The quality is all in the prompt. See the next section.

Event sourcing

Don’t compress. The append-only event log is the source of truth, forever. What hits the LLM is a derived view: last N events, filtered by type, possibly summarized at view-time but never written back to the log.

flowchart LR
EL[(Event Log<br/>append-only)] --> CV1[View · full]
EL --> CV2[View · summarized]
EL --> CV3[View · type-filtered]
CV2 --> LLM
The log is the source of truth; what reaches the LLM is one derived view.

The wins are substantial: deterministic replay, a free audit trail, microagent triggers that can subscribe by event type, and you never lose anything. The cost is infrastructure: schema evolution must be planned, because old events live forever.
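A minimal sketch of the pattern, assuming a simple in-memory log. The event types and view names here are illustrative, not taken from any of the projects above:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    type: str      # e.g. "user_message", "tool_result", "agent_action"
    payload: str

@dataclass
class EventLog:
    """Append-only log; views derive from it, never mutate it."""
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)

    # Derived views: computed at read time, never written back to the log.
    def last_n(self, n: int) -> list[Event]:
        return self.events[-n:]

    def by_type(self, event_type: str) -> list[Event]:
        return [e for e in self.events if e.type == event_type]

log = EventLog()
log.append(Event("user_message", "fix the failing test"))
log.append(Event("tool_result", "pytest: 1 failed"))
log.append(Event("tool_result", "pytest: all passed"))

# The LLM sees a view; the log itself is untouched.
context = log.by_type("tool_result")
```

Because views are pure functions of the log, replaying the same events always reproduces the same context, which is where the deterministic-replay win comes from.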

Sliding window

Drop the oldest turns. Cheap, predictable, terrible at preserving facts.

if (tokenCount(messages) > BUDGET) {
  messages = messages.slice(-RECENT_N);
}

This works for short-lived agents where there are no long-arc dependencies (a code-review bot on one diff, a triage bot on one ticket). It is spectacularly unsuitable for agents that need to remember what they discovered three hours ago.
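A token-aware variant avoids the blind spot where one turn is far larger than the others: instead of keeping a fixed number of turns, drop the oldest until the rest fit the budget. A sketch; `token_count` here is a stand-in estimate, not a real tokenizer:

```python
BUDGET = 8_000  # illustrative budget, in tokens

def token_count(message: str) -> int:
    # Rough stand-in: ~4 characters per token. Swap in a real tokenizer.
    return max(1, len(message) // 4)

def slide(messages: list[str], budget: int = BUDGET) -> list[str]:
    """Drop the oldest messages until the remainder fits the budget."""
    kept = list(messages)
    while len(kept) > 1 and sum(token_count(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest turn first
    return kept
```

This keeps the same failure mode as the fixed-N version (old facts vanish), but at least the window never blows the budget because of one oversized tool result.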

Hybrid

Recent turns verbatim, deep past summarized. The pragmatic middle.

const RECENT = 10;
const recent = messages.slice(-RECENT);
const old = messages.slice(0, -RECENT);
// llmSummarize is assumed: an aux-model call that honors a preserve list
const summary = await llmSummarize(old, { preserve: ['ids', 'paths'] });
return [systemMsg, summary, ...recent];

The recent N turns stay full-fidelity (so the model has crisp tool-result context); the older arc is condensed. Two knobs: how many recent turns, and how aggressively to compress the old.

The preservation list — your domain model

The summarizer prompt has two halves. The first half says “compress aggressively.” The second half says “but never lose these specific things.” The second half is what separates a useful summary from useless mush.

You are summarizing an agent's session for memory compression.

PRESERVE EXPLICITLY (do not paraphrase, do not drop):
- Task tracker IDs and their statuses
- File paths the agent has read or written
- Exact text of error messages (byte-for-byte)
- Tool arguments used (not just tool names)
- In-progress actions and their identifiers

COMPRESS LOOSELY:
- The agent's narration
- Repeated tool failures
- Casual reasoning

What goes in the first list is the agent’s domain knowledge. Across the corpus:

Project | Domain | Preserved verbatim
OpenHands | software engineering | task IDs, file paths, in-progress action IDs, error text
Strix | pentesting | vulnerabilities, credentials, payloads, scope tokens
Claude Code | code editing | file metadata graph (paths read/written, hashes), tool args
Hermes | general | first-turn charter, last 3–5 turns verbatim
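One way to make the preservation list enforceable rather than aspirational is to extract the critical items from the raw turns and check that each one survives the summary, retrying or falling back to verbatim text when something is lost. A sketch, under the assumption that IDs and paths follow recognizable patterns (the regexes below are illustrative):

```python
import re

# Illustrative "must survive" patterns; a real agent would derive these
# from its own domain (tracker ID shapes, repo path roots, etc.).
PATTERNS = [
    re.compile(r"\b[A-Z]+-\d+\b"),   # tracker IDs like PROJ-123
    re.compile(r"(?:/[\w.\-]+)+"),   # POSIX-style file paths
]

def extract_critical(text: str) -> set[str]:
    """Collect every substring that the preservation list says must survive."""
    return {m for p in PATTERNS for m in p.findall(text)}

def summary_is_lossless(raw: str, summary: str) -> bool:
    """True if every critical item from the raw turns appears in the summary."""
    return extract_critical(raw) <= extract_critical(summary)
```

The check is cheap relative to the summarization call itself, and it turns a silent quality failure into a detectable one.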

When does compression fire?

  1. Token threshold

    The most common: when input token count exceeds a fraction (e.g. 75%) of the model’s context limit, compress. Easy to compute, predictable. Beware: each compaction busts your prompt cache.

  2. Turn count

    “Every 20 turns” or “every 50 turns.” Simpler than tokens but ignores variable turn size — one tool result might be 50K tokens by itself.

  3. Cost ceiling

    Track per-session spend; compress when it crosses a configured cap. Pragmatic for SaaS — converts a UX problem into a billing one.
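The three triggers can coexist; a common arrangement is to compress when whichever fires first. A sketch with illustrative thresholds:

```python
from dataclasses import dataclass

@dataclass
class Session:
    input_tokens: int
    turns: int
    spend_usd: float

# Illustrative values; tune per model and product.
CONTEXT_LIMIT = 200_000
TOKEN_FRACTION = 0.75   # compress at 75% of the context limit
TURN_CAP = 50
COST_CAP_USD = 5.00

def should_compress(s: Session) -> bool:
    return (
        s.input_tokens > TOKEN_FRACTION * CONTEXT_LIMIT  # token threshold
        or s.turns >= TURN_CAP                           # turn count
        or s.spend_usd >= COST_CAP_USD                   # cost ceiling
    )
```

Combining them this way means the turn cap catches chatty-but-small sessions while the token threshold catches a single 50K-token tool result.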

Pick a strategy

How long does your agent run, and what does it need to remember?
  • Single-task, < 30 turns: sliding window. Keep it simple.
  • Long-running, single domain: LLM-summarize with a domain-specific preservation list.
  • Audit / replay required: event-source. Compression is a view concern.
  • Long-running mixed workload: hybrid, verbatim recent + summarized old.

Recommended default: LLM-summarize. Spend your design time on the preservation list, not the algorithm.

Cross-project comparison

Project | Strategy | Trigger | Preservation rules
Claude Code | LLM-summarize | token threshold | file metadata graph + prose summary
OpenHands | event-source + summarizing condenser | iteration / token | task IDs, file paths, errors
Strix | LLM-summarize | turn count | vulns, credentials, payloads, errors
Hermes | hybrid | token threshold | first + last turns; summary middle
Mistral Vibe | sliding window | token threshold | none
Kimi Code | sliding window | turn count | none

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • OpenHands (v0) — All Hands AI's autonomous software-engineer agent. Event-sourced state, microagents, controller-level guardrails.
  • Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
  • OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
  • Hermes Agent — 40+ tool, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.