
Memory compression

Long sessions overflow the context window. The good implementations don't just summarize; they enumerate what to keep.

TL;DR

After 30–80 turns, even a 200K context starts to fray, and you don’t want to keep paying input cost for the contents of a file you read once. Four strategies are alive in the corpus: LLM summarization (the production default), event sourcing (never compresses; derives views), sliding window (the simplest), and hybrid (verbatim recent + summarized old). The strategy choice matters less than what your summarizer is told to preserve. That list is your domain model.


Token budgets are finite. Conversations grow. After enough turns, the agent’s context is full of itself — old tool results, old reasoning, old file contents that have since changed. Most of it is noise. Compression is the act of throwing away the noise without throwing away the signal.

flowchart LR
T1[Turn 1] --> T2[Turn 2] --> T3[Turn 3] --> T4[Turn N-2] --> T5[Turn N-1] --> T6[Turn N]
T1 -. compress .-> S[(Summary)]
T2 -. compress .-> S
T3 -. compress .-> S
S --> Ctx[Context for next call]
T4 --> Ctx
T5 --> Ctx
T6 --> Ctx
class S sum
Old turns roll up into a structured summary; recent turns stay verbatim.

The four strategies

LLM summarization

The standard for production agents. When token usage crosses a threshold, slice off old turns and call a separate LLM with a structured “preserve these things, summarize the rest” prompt, then prepend the summary to the remaining context.

def compress(turns, target_tokens):
    # Assumes total_tokens, aux_llm, SUMMARIZE_PROMPT, RECENT_N, and
    # SystemMessage are defined elsewhere in the codebase.
    if total_tokens(turns) < target_tokens:
        return turns
    keep_recent = turns[-RECENT_N:]    # recent turns stay verbatim
    to_summarize = turns[:-RECENT_N]   # older turns roll up into a summary
    summary = aux_llm.complete(
        SUMMARIZE_PROMPT.format(content=to_summarize),
        # the prompt enumerates what must survive
    )
    return [SystemMessage(summary), *keep_recent]

The cost is one extra LLM call per compaction event — a few thousand input tokens, hundreds out. Cheap relative to what you save.

The quality is all in the prompt. See the next section.

Event sourcing

Don’t compress. The append-only event log is the source of truth, forever. What hits the LLM is a derived view: last N events, filtered by type, possibly summarized at view-time but never written back to the log.

flowchart LR
EL[(Event Log<br/>append-only)] --> CV1[View · full]
EL --> CV2[View · summarized]
EL --> CV3[View · type-filtered]
CV2 --> LLM
The log is the source of truth; what reaches the LLM is one derived view.

The wins are substantial: deterministic replay, a free audit trail, microagent triggers that can subscribe by event type, and you never lose anything. The cost is infrastructure: schema evolution must be planned, because old events live forever.
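A minimal sketch of the pattern, assuming a simple in-memory log. The event types and view names here are illustrative, not taken from any of the projects above:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    type: str      # e.g. "user_message", "tool_result", "agent_action"
    payload: str

@dataclass
class EventLog:
    """Append-only log; views derive from it, never mutate it."""
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)

    # Derived views: computed at read time, never written back to the log.
    def last_n(self, n: int) -> list[Event]:
        return self.events[-n:]

    def by_type(self, event_type: str) -> list[Event]:
        return [e for e in self.events if e.type == event_type]

log = EventLog()
log.append(Event("user_message", "fix the failing test"))
log.append(Event("tool_result", "pytest: 1 failed"))
log.append(Event("tool_result", "pytest: all passed"))

# The LLM sees a view; the log itself is untouched.
context = log.by_type("tool_result")
```

Because views are pure functions of the log, replaying the same events always reproduces the same context, which is where the deterministic-replay win comes from.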

Sliding window

Drop the oldest turns. Cheap, predictable, terrible at preserving facts.

if (tokenCount(messages) > BUDGET) {
  messages = messages.slice(-RECENT_N);
}

This works for short-lived agents where there are no long-arc dependencies (a code-review bot on one diff, a triage bot on one ticket). It is spectacularly unsuitable for agents that need to remember what they discovered three hours ago.
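A token-aware variant avoids the blind spot where one turn is far larger than the others: instead of keeping a fixed number of turns, drop the oldest until the rest fit the budget. A sketch; `token_count` here is a stand-in estimate, not a real tokenizer:

```python
BUDGET = 8_000  # illustrative budget, in tokens

def token_count(message: str) -> int:
    # Rough stand-in: ~4 characters per token. Swap in a real tokenizer.
    return max(1, len(message) // 4)

def slide(messages: list[str], budget: int = BUDGET) -> list[str]:
    """Drop the oldest messages until the remainder fits the budget."""
    kept = list(messages)
    while len(kept) > 1 and sum(token_count(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest turn first
    return kept
```

This keeps the same failure mode as the fixed-N version (old facts vanish), but at least the window never blows the budget because of one oversized tool result.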

Hybrid

Recent turns verbatim, deep past summarized. The pragmatic middle.

const RECENT = 10;
const recent = messages.slice(-RECENT);
const old = messages.slice(0, -RECENT);
// llmSummarize is assumed: an aux-model call that honors a preserve list
const summary = await llmSummarize(old, { preserve: ['ids', 'paths'] });
return [systemMsg, summary, ...recent];

The recent N turns stay full-fidelity (so the model has crisp tool-result context); the older arc is condensed. Two knobs: how many recent turns, and how aggressively to compress the old.

The preservation list — your domain model

The summarizer prompt has two halves. The first half says “compress aggressively.” The second half says “but never lose these specific things.” The second half is what separates a useful summary from useless mush.

You are summarizing an agent's session for memory compression.

PRESERVE EXPLICITLY (do not paraphrase, do not drop):
- Task tracker IDs and their statuses
- File paths the agent has read or written
- Exact text of error messages (byte-for-byte)
- Tool arguments used (not just tool names)
- In-progress actions and their identifiers

COMPRESS LOOSELY:
- The agent's narration
- Repeated tool failures
- Casual reasoning

What goes in the first list is the agent’s domain knowledge. Across the corpus:

Project | Domain | Preserved verbatim
OpenHands | software engineering | task IDs, file paths, in-progress action IDs, error text
Strix | pentesting | vulnerabilities, credentials, payloads, scope tokens
Claude Code | code editing | file metadata graph (paths read/written, hashes), tool args
Hermes | general | first-turn charter, last 3–5 turns verbatim
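One way to make the preservation list enforceable rather than aspirational is to extract the critical items from the raw turns and check that each one survives the summary, retrying or falling back to verbatim text when something is lost. A sketch, under the assumption that IDs and paths follow recognizable patterns (the regexes below are illustrative):

```python
import re

# Illustrative "must survive" patterns; a real agent would derive these
# from its own domain (tracker ID shapes, repo path roots, etc.).
PATTERNS = [
    re.compile(r"\b[A-Z]+-\d+\b"),   # tracker IDs like PROJ-123
    re.compile(r"(?:/[\w.\-]+)+"),   # POSIX-style file paths
]

def extract_critical(text: str) -> set[str]:
    """Collect every substring that the preservation list says must survive."""
    return {m for p in PATTERNS for m in p.findall(text)}

def summary_is_lossless(raw: str, summary: str) -> bool:
    """True if every critical item from the raw turns appears in the summary."""
    return extract_critical(raw) <= extract_critical(summary)
```

The check is cheap relative to the summarization call itself, and it turns a silent quality failure into a detectable one.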

When does compression fire?

  1. Token threshold

    The most common: when input token count exceeds a fraction (e.g. 75%) of the model’s context limit, compress. Easy to compute, predictable. Beware: each compaction busts your prompt cache.

  2. Turn count

    “Every 20 turns” or “every 50 turns.” Simpler than tokens but ignores variable turn size — one tool result might be 50K tokens by itself.

  3. Cost ceiling

    Track per-session spend; compress when it crosses a configured cap. Pragmatic for SaaS — converts a UX problem into a billing one.
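The three triggers can coexist; a common arrangement is to compress when whichever fires first. A sketch with illustrative thresholds:

```python
from dataclasses import dataclass

@dataclass
class Session:
    input_tokens: int
    turns: int
    spend_usd: float

# Illustrative values; tune per model and product.
CONTEXT_LIMIT = 200_000
TOKEN_FRACTION = 0.75   # compress at 75% of the context limit
TURN_CAP = 50
COST_CAP_USD = 5.00

def should_compress(s: Session) -> bool:
    return (
        s.input_tokens > TOKEN_FRACTION * CONTEXT_LIMIT  # token threshold
        or s.turns >= TURN_CAP                           # turn count
        or s.spend_usd >= COST_CAP_USD                   # cost ceiling
    )
```

Combining them this way means the turn cap catches chatty-but-small sessions while the token threshold catches a single 50K-token tool result.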

Pick a strategy

How long does your agent run, and what does it need to remember?
  • Single-task, < 30 turns: sliding window. Keep it simple.
  • Long-running, single domain: LLM-summarize with a domain-specific preservation list.
  • Audit / replay required: event-source. Compression is a view concern.
  • Long-running mixed workload: hybrid, verbatim recent + summarized old.

Recommended default: LLM-summarize. Spend your design time on the preservation list, not the algorithm.

Cross-project comparison

Project | Strategy | Trigger | Preservation rules
Claude Code | LLM-summarize | token threshold | file metadata graph + prose summary
OpenHands | event-source + summarizing condenser | iteration / token | task IDs, file paths, errors
Strix | LLM-summarize | turn count | vulns, credentials, payloads, errors
Hermes | hybrid | token threshold | first + last turns; summary middle
Mistral Vibe | sliding window | token threshold | none
Kimi Code | sliding window | turn count | none

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • OpenHands (v0) — All Hands AI's autonomous software-engineer agent. Event-sourced state, microagents, controller-level guardrails.
  • Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
  • OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
  • Hermes Agent — 40+ tool, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.