
LLM-based deduplication that reasons about root cause

Two pentest reports describing the same SQL injection with different payloads aren't textually similar — but they should dedupe. Hashing fails; LLM reasoning works.

Strix · difficulty 2/3 · security · dedupe · novel · memory-compression · multi-agent-coordination

Strix runs many sub-agents that each test for vulnerabilities. They report findings independently. Two findings might describe the same underlying bug with different payloads, line numbers, and prose — a textual hash would never match them.

Instead, Strix asks an LLM: “Are these two findings the same root cause? Here’s both. Reason about it.” Same-root-cause findings collapse into one report.

def is_same_root_cause(a: Finding, b: Finding) -> bool:
    # Ask the LLM to compare the two findings' root causes;
    # the prompt instructs it to lead with a YES/NO verdict.
    response = llm.complete(
        DEDUPE_PROMPT.format(a=a.full_text(), b=b.full_text())
    )
    return response.strip().upper().startswith("YES")

The prompt is essentially: “two findings, both might describe the same vulnerability, decide if the root cause is identical.”
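The source paraphrases the prompt rather than quoting it, so the exact wording below is an assumption; a minimal sketch of what such a `DEDUPE_PROMPT` could look like:

```python
# Hypothetical reconstruction -- the real Strix prompt is only paraphrased
# in the source, not quoted.
DEDUPE_PROMPT = (
    "Below are two security findings. They may describe the same underlying\n"
    "vulnerability using different payloads, line numbers, and prose.\n\n"
    "Finding A:\n{a}\n\n"
    "Finding B:\n{b}\n\n"
    "Is the root cause identical? Answer YES or NO on the first line,\n"
    "then briefly justify your answer."
)

prompt = DEDUPE_PROMPT.format(
    a="SQLi in /login via username field",
    b="SQLi in /login via password field",
)
```

Leading with a fixed-format YES/NO verdict is what lets the calling code branch on `response.startswith("YES")` without parsing free-form prose.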

Why this is non-obvious

Most deduplication is fast and cheap (hash, embedding cosine). LLM-based dedupe is slow and expensive — orders of magnitude more cost per pair. You’d never use this for tweets or log lines.

But for high-stakes, low-volume domains (security findings, customer support tickets, legal contracts), the false-merge cost dwarfs the LLM-call cost. Spending tokens on dedupe is correct.
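The volume argument is just combinatorics: naive pairwise dedupe costs one LLM call per pair, which is quadratic in the number of items. A quick back-of-envelope check (illustrative volumes, not measured figures):

```python
def pairwise_calls(n: int) -> int:
    # Number of LLM calls if every item is compared against every other.
    return n * (n - 1) // 2

# ~50 findings from one pentest run: trivially affordable.
print(pairwise_calls(50))         # 1225 pairs

# A million log lines: hundreds of billions of calls -- never do this.
print(pairwise_calls(1_000_000))  # 499999500000 pairs
```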

Pattern beyond Strix

This generalizes anywhere two reports might be the same despite surface differences:

  • Customer complaints (different language, same root cause).
  • Bug reports (different stack traces, same broken function).
  • Search results in citation-heavy domains (different sources, same underlying claim).

When NOT to use it

  • Volume is high (millions of items): too expensive.
  • Surface similarity is a strong signal (textual matches): hash first, LLM only on near-misses.
  • Latency-sensitive flow: do it offline as a batch job.

Hybrid pattern

A practical optimization: cheap-first. Embedding cosine to find candidates above some threshold; LLM only for the candidate pairs. Most pairs prune cheaply; expensive reasoning only where it matters.
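A sketch of that cheap-first pipeline, using a toy bag-of-words cosine in place of a real embedding model and an injected `judge` callback standing in for the LLM call (all names here are illustrative, not Strix's API):

```python
from itertools import combinations

def bow(text: str) -> dict:
    # Toy bag-of-words vector; stands in for a real embedding model.
    counts: dict = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(u: dict, v: dict) -> float:
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    nu = sum(c * c for c in u.values()) ** 0.5
    nv = sum(c * c for c in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def dedupe_pairs(findings, embed, judge, threshold=0.5):
    # Cheap-first: cosine prunes most pairs; `judge` (the expensive
    # LLM root-cause check) runs only on pairs above the threshold.
    vecs = [embed(f) for f in findings]
    confirmed = []
    for i, j in combinations(range(len(findings)), 2):
        if cosine(vecs[i], vecs[j]) < threshold:
            continue  # cheap prune: too dissimilar to bother the LLM
        if judge(findings[i], findings[j]):  # expensive reasoning step
            confirmed.append((i, j))
    return confirmed
```

In a real system `bow` would be replaced by an embedding model and `judge` by something like the `is_same_root_cause` call above; the structure (prune quadratically many pairs cheaply, reason over the survivors) is the point.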

Sources

  • strix/05_skills_and_prompts.md:20 (unverified)