Detecting Semantic Conflicts Between Documents: A Pragmatic Pipeline

A four-stage pipeline for finding where two documents contradict each other, not just where they overlap: chunking and embedding, cosine pre-filtering, section deduplication and neighbor expansion, then NLI or LLM inversion detection. Includes a CPU-only deployment path and a sub-ten-cent cost model on a realistic workload.

May 15, 2026 · 12 min read · Rémi Viau

Note by: Rémi Viau (Anatoly maintainer), with Claude (Anthropic Opus 4.7) as analytical partner.

Comparing two documents to find what they say in common is a solved problem. Comparing two documents to find where they contradict each other — same topic, opposite claim — is not. The naive instinct is to reach for cosine similarity over embeddings, and it will quietly betray you on day one.

This article walks through a pipeline that actually works, why each stage exists, and what to do when you have neither a GPU nor an unlimited LLM budget. The architecture below is what we ship when a client asks us to flag conflicts between, say, a contract draft and its previous version, or between two competing specification documents.

1. The problem nobody warns you about

You have two documents — call them A and B. They cover overlapping ground. You want a system that surfaces the passages where they disagree, not the passages where they merely talk about the same thing.

A first cut looks deceptively simple:

Embed both documents.
For every chunk in A, find the most similar chunks in B via cosine similarity.
Surface the pairs.

Run this on a real corpus and you immediately hit two failure modes.

The first is that cosine similarity is blind to negation. The two sentences below have a cosine similarity well above 0.95 with any modern embedding model:

A: The contract expires on December 31, 2025.
B: The contract does not expire on December 31, 2025.

They are direct contradictions. Cosine similarity flags them as near-identical because the embedding space is dominated by topic and vocabulary, not by truth conditions. So your high-similarity bucket — the one you want to inspect for conflicts — will be packed with paraphrases, near-duplicates, and a small minority of real contradictions, all indistinguishable by score alone.

The second failure mode is the opposite. Two passages can contradict each other without any lexical overlap:

A: All employees are entitled to four weeks of paid leave.
B: Contractors are excluded from the paid leave program.

Whether this is a contradiction depends on whether contractors are considered employees in the surrounding text. Embeddings won't help you here. Neither will cosine. You need context, and you need actual inference.

So the pipeline has to do three things: find the right pairs of passages to compare, decide whether each pair actually contradicts, and do both cheaply enough to run on real documents.

2. Pipeline overview

The architecture we converged on has four stages, ordered so that each cheap stage shrinks the workload of the costlier one that follows:

┌──────────────┐    ┌─────────────┐    ┌────────────┐    ┌──────────────┐
│  Chunk +     │───▶│   Cosine    │───▶│  Section   │───▶│  Inversion   │
│  embed both  │    │ similarity  │    │  expansion │    │  detection   │
│  documents   │    │  (filter)   │    │            │    │  (NLI / LLM) │
└──────────────┘    └─────────────┘    └────────────┘    └──────────────┘
   ~free, batch       ~free, vector       free, set         the only
   once per doc       math at query       arithmetic        expensive call

Each stage exists to reduce the candidate space for the next. Cosine similarity is a permissive filter, not a decision-maker. By the time we get to the expensive inversion step, we should be passing in tens of pairs per document comparison, not thousands.

3. Stage 1: Chunking and embedding

Chunking matters more than the embedding model. We chunk both documents into overlapping windows of roughly 200 to 400 tokens, with ~50 tokens of overlap. Smaller chunks make cosine similarity sharper but lose context; larger chunks dilute the signal.

Three rules that have held up across projects:

Chunk on semantic boundaries, not character counts. Paragraphs, list items, and section headers should align with chunk edges whenever possible. Splitting mid-sentence breaks downstream NLI inference.
Store the parent section ID with every chunk. You will need it in stage 3 to merge results back into something a human can read.
Embed once per document version, cache aggressively. The embedding step is the highest-throughput stage but only runs when a document changes.
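The three rules above can be sketched roughly as follows. This is a minimal illustration, not the shipped chunker: the `Chunk` dataclass, `chunk_section`, and the whitespace-based token count are all assumptions made for the example, and overlap is expressed in whole paragraphs (rather than ~50 tokens) so that no chunk starts or ends mid-sentence.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    section_id: str    # parent section, needed again in stage 3
    index: int         # position of the chunk inside its section
    text: str
    content_hash: str  # stable key for an embedding cache


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines so chunk edges follow paragraph boundaries."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_section(section_id: str, text: str,
                  max_tokens: int = 300,
                  overlap_paragraphs: int = 1) -> list[Chunk]:
    """Greedily pack whole paragraphs into chunks of roughly max_tokens.

    Token counts are approximated by whitespace-separated words
    (an assumption; a real tokenizer would give different numbers).
    """
    chunks: list[Chunk] = []
    current: list[str] = []
    fresh = 0  # paragraphs in `current` not yet covered by an emitted chunk

    def flush() -> None:
        body = "\n\n".join(current)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        chunks.append(Chunk(section_id, len(chunks), body, digest))

    for para in split_paragraphs(text):
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and fresh and size > max_tokens:
            flush()
            # Carry the trailing paragraph(s) over as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            fresh = 0
        current.append(para)
        fresh += 1
    if fresh:
        flush()
    return chunks
```

A single paragraph longer than `max_tokens` is kept whole here rather than split mid-sentence; whether that trade-off is acceptable depends on the corpus.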

For the embedding model itself, any of the standard options work — OpenAI's text-embedding-3-small, Voyage AI's voyage-3, or a local bge-small-en-v1.5 if you need full sovereignty. The differences at this stage are smaller than people pretend; the filter is coarse on purpose.

4. Stage 2: Cosine similarity as a coarse filter

This is where most teams stop, and where most teams get it wrong. The point of cosine similarity here is not to decide whether two passages contradict. The point is to throw out the 99% of pairs that are obviously unrelated, so the expensive downstream model only sees plausible candidates.

A reasonable threshold sits between 0.75 and 0.85, depending on your embedding model and corpus. Below 0.75, you keep too much noise; above 0.85, you start losing real contradictions whose lexical surface diverges. Calibrate against an evaluation set or you'll be guessing forever.
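As a rough sketch of the filter, assuming the embeddings are already available as plain lists of floats: the function names are illustrative, and the pure-Python loop is chosen only to keep the example dependency-free (in practice a vectorized matrix product would be the usual choice).

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(emb_a: Sequence[Vector], emb_b: Sequence[Vector],
                    threshold: float = 0.80) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every A/B chunk pair at or above threshold.

    The threshold is a permissive filter, not a verdict: pairs that pass
    are only candidates for the later inversion stage.
    """
    pairs = [
        (i, j, score)
        for i, u in enumerate(emb_a)
        for j, v in enumerate(emb_b)
        if (score := cosine(u, v)) >= threshold
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

The default of 0.80 is simply the midpoint of the range quoted above; it should be replaced by whatever the evaluation set supports.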

Two upgrades worth considering before you ship:

Hybrid retrieval. Cosine alone misses contradictions phrased in different vocabulary. Adding a lexical retriever (BM25) and merging the result sets — or running a reranker over the union — recovers cases like the "employees vs contractors" example above. The cost is modest; the recall gain is real.

A reranker pass. A cross-encoder reranker (bge-reranker-v2-m3 is the current free workhorse) scores pairs together rather than independently, and catches semantic relationships that bi-encoder similarity misses. If you can afford the latency, drop it between cosine and the inversion stage. We weigh the same reranker as a retrieval baseline in the PageIndex vs Anatoly RAG note.
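The hybrid-retrieval idea can be illustrated with a small, self-contained BM25 scorer whose top hits are unioned with the cosine candidates. This is a sketch under assumptions: the `BM25` class is a minimal Okapi BM25 written for the example (not a particular library's API), the tokenizer is a naive lowercase split, and plain set union is only one of several possible merge strategies.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive lowercase tokenizer (assumption; no stemming or stopwords)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 over a small in-memory corpus."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = (sum(self.lengths) / len(docs) if docs else 0.0) or 1.0
        doc_freq = Counter(t for tf in self.term_freqs for t in tf)
        n = len(docs)
        self.idf = {t: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                    for t, df in doc_freq.items()}

    def score(self, query: str, doc_index: int) -> float:
        tf = self.term_freqs[doc_index]
        length_ratio = self.lengths[doc_index] / self.avg_len
        norm = self.k1 * (1 - self.b + self.b * length_ratio)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in set(tokenize(query)) if t in tf)

    def top_k(self, query: str, k: int = 5) -> list[int]:
        scored = [(self.score(query, j), j) for j in range(len(self.term_freqs))]
        return [j for s, j in sorted(scored, reverse=True)[:k] if s > 0]


def hybrid_candidates(cosine_pairs: set[tuple[int, int]],
                      chunks_a: list[str], chunks_b: list[str],
                      k: int = 3) -> list[tuple[int, int]]:
    """Union of dense candidates and the BM25 top-k in B for each chunk of A."""
    bm25 = BM25(chunks_b)
    merged = set(cosine_pairs)
    for i, text in enumerate(chunks_a):
        merged.update((i, j) for j in bm25.top_k(text, k))
    return sorted(merged)
```

In the leave-policy example, the lexical side pairs the two passages through the shared words "paid" and "leave", even if their embeddings fall below the cosine threshold.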

5. Stage 3: Section deduplication and neighbor expansion

After cosine filtering you have a list of chunk pairs. Two problems with feeding them directly to the inversion model.

First, chunks that came from the same parent section appear repeatedly across pairs. A 5000-word section split into 15 chunks can generate dozens of similar pairs that all point at the same underlying disagreement. Deduplicate by merging chunks back into their parent sections.

Second, individual chunks rarely carry enough context to judge contradiction on their own. The negation in A might live in a sentence two paragraphs above. The exception in B might be defined three list items down. We expand each retained passage to include its immediate neighbors: the section before and after, or one paragraph on each side, depending on document structure.

The order matters: expand first, then deduplicate. If you dedupe first, you risk merging chunks whose neighbor expansions point in different directions, conflating two unrelated discussions into a single noisy pair. Expand, then collapse on parent-section identity.

The output of this stage is a set of section-pair candidates, each carrying enough context to be judged independently.
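A minimal sketch of "expand, then collapse on parent-section identity", assuming each chunk is identified by its index in the document and that `section_of_a[i]` gives the parent section of chunk `i`. The `SectionPair` type and the one-chunk radius are illustrative choices, not the production data model.

```python
from dataclasses import dataclass, field


@dataclass
class SectionPair:
    section_a: int
    section_b: int
    best_score: float
    context_a: set[int] = field(default_factory=set)  # chunk indices in A
    context_b: set[int] = field(default_factory=set)  # chunk indices in B


def window(index: int, n_chunks: int, radius: int = 1) -> set[int]:
    """Chunk indices within `radius` of `index`, clipped to the document."""
    return set(range(max(0, index - radius), min(n_chunks, index + radius + 1)))


def expand_then_dedupe(pairs, section_of_a, section_of_b, radius=1):
    """pairs: iterable of (chunk_i, chunk_j, cosine_score).

    Each chunk pair is first expanded to its neighbor window, and only then
    merged with other pairs that share the same (section_a, section_b) key.
    """
    merged: dict[tuple[int, int], SectionPair] = {}
    for i, j, score in pairs:
        # Step 1: expand this specific pair to its own neighborhood.
        ctx_a = window(i, len(section_of_a), radius)
        ctx_b = window(j, len(section_of_b), radius)
        # Step 2: collapse on parent-section identity.
        key = (section_of_a[i], section_of_b[j])
        entry = merged.setdefault(key, SectionPair(key[0], key[1], score))
        entry.best_score = max(entry.best_score, score)
        entry.context_a |= ctx_a
        entry.context_b |= ctx_b
    return sorted(merged.values(), key=lambda p: p.best_score, reverse=True)
```

Each returned `SectionPair` carries the union of the neighbor windows of the chunk pairs that were merged into it, which is the context that would be handed to the inversion stage.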

6. Stage 4: Inversion detection

Here is where the real decision happens. Two options dominate in practice.

6.1 Natural Language Inference (NLI) models

NLI is exactly the task you want: given a premise and a hypothesis, classify the pair as entailment, contradiction, or neutral. Models like MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli are purpose-built for this and outperform general-purpose LLMs on the narrow task, at a fraction of the cost.

Pros:

Free to run if you have a GPU, near-free on CPU with quantized variants.
Deterministic outputs with calibrated probabilities.
No prompt engineering, no parsing failures.
No data leaves your infrastructure.

Cons:

Context window is small (~512 tokens). Long sections get truncated.
No natural-language explanation of why a pair was flagged.
Weaker on multi-step reasoning ("are contractors employees?") than a frontier LLM.
Slightly more sensitive to surface phrasing than an LLM.
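To make the three-way classification concrete, here is a small sketch of how raw NLI logits might be turned into a verdict with a confidence floor. The label order is an assumption for illustration only: it differs between checkpoints and should be read from the model's own configuration (for Hugging Face models, the `id2label` mapping) rather than hard-coded. The `min_confidence` value and the "uncertain" bucket are likewise hypothetical.

```python
import math
from typing import Sequence

# ASSUMPTION: this order is illustrative; check the checkpoint's id2label.
DEFAULT_LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nli_verdict(logits: Sequence[float],
                labels: Sequence[str] = DEFAULT_LABELS,
                min_confidence: float = 0.7) -> dict:
    """Map logits to a label, falling back to 'uncertain' below the floor."""
    probs = dict(zip(labels, softmax(logits)))
    best = max(probs, key=probs.get)
    label = best if probs[best] >= min_confidence else "uncertain"
    return {"label": label, "probs": probs}
```

Keeping the full probability dictionary alongside the label makes it possible to show a score to reviewers instead of a bare verdict.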

6.2 LLMs in NLI mode

You can prompt a small LLM to produce the same three-way classification. The pattern is straightforward:

Given two passages, answer ONLY with one word:
- ENTAILMENT if passage B follows from A
- CONTRADICTION if B contradicts A
- NEUTRAL otherwise
 
Passage A: {section_a}
Passage B: {section_b}
Answer:

With max_tokens: 3 and temperature: 0, this works well with any modern small LLM — Claude Haiku, Gemini Flash, Llama 3.1 8B via Groq, DeepSeek. The cost per pair is fractions of a cent. The latency is higher than NLI but the reasoning is more robust.
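A sketch of the wrapper around that prompt, kept provider-agnostic on purpose. The `complete` callable is hypothetical: it stands for whatever function sends a prompt to the chosen model with `max_tokens` and `temperature` set as described and returns the raw completion text. Because a three-token budget can cut a label short, the parser below accepts an unambiguous prefix; that tolerance is an assumption of this example, not something the prompt guarantees.

```python
import re
from typing import Callable

PROMPT = """Given two passages, answer ONLY with one word:
- ENTAILMENT if passage B follows from A
- CONTRADICTION if B contradicts A
- NEUTRAL otherwise

Passage A: {section_a}
Passage B: {section_b}
Answer:"""

LABELS = ("ENTAILMENT", "CONTRADICTION", "NEUTRAL")


def parse_label(raw: str) -> str:
    """Map a possibly truncated completion to one label, else 'UNPARSEABLE'."""
    match = re.search(r"[A-Za-z]+", raw)
    if not match:
        return "UNPARSEABLE"
    word = match.group(0).upper()
    hits = [label for label in LABELS
            if label.startswith(word) or word.startswith(label)]
    return hits[0] if len(hits) == 1 else "UNPARSEABLE"


def classify_pair(section_a: str, section_b: str,
                  complete: Callable[[str], str]) -> str:
    """Build the prompt, call the (hypothetical) completion function, parse."""
    prompt = PROMPT.format(section_a=section_a, section_b=section_b)
    return parse_label(complete(prompt))
```

Returning an explicit `UNPARSEABLE` value rather than guessing lets the caller count parsing failures and retry or escalate those pairs.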

6.3 When to pick which

| Constraint | Pick |
| --- | --- |
| GPU available, high volume, no need for explanations | NLI model, hosted locally |
| No GPU, modest volume, want to ship quickly | Small LLM via API (Haiku, Flash, Groq) |
| Need natural-language explanations for end users | LLM, or NLI + LLM in cascade |
| Confidential documents, no external API allowed | NLI on CPU with ONNX + INT8 quantization |
| Highest accuracy, cost insensitive | NLI for filtering, then frontier LLM for confirmation |

The cascade pattern in that last row is what we recommend for production: NLI on every candidate pair as a cheap filter, then a frontier LLM (Sonnet, GPT-4-class) on the survivors to confirm and generate the explanation a human will actually read. You get NLI's recall and the LLM's precision and verbosity, paying frontier prices only on the handful of pairs that matter.
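The cascade can be sketched with both models injected as plain callables, so the control flow is visible without committing to any SDK. Both `nli_contradiction_prob` and `llm_confirm` are hypothetical stand-ins, and the 0.5 threshold is an illustrative default that would need calibration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Finding:
    pair_id: int
    nli_contradiction: float    # score from the cheap filter
    confirmed: bool             # verdict from the frontier model
    explanation: Optional[str]  # text a reviewer will actually read


def run_cascade(
    pairs: Sequence[tuple[str, str]],
    nli_contradiction_prob: Callable[[str, str], float],
    llm_confirm: Callable[[str, str], tuple[bool, Optional[str]]],
    nli_threshold: float = 0.5,
) -> list[Finding]:
    """Run NLI on every pair; call the frontier LLM only on the survivors."""
    findings: list[Finding] = []
    for pair_id, (text_a, text_b) in enumerate(pairs):
        prob = nli_contradiction_prob(text_a, text_b)
        if prob < nli_threshold:
            continue  # filtered out cheaply; the LLM never sees this pair
        confirmed, explanation = llm_confirm(text_a, text_b)
        findings.append(Finding(pair_id, prob, confirmed, explanation))
    return findings
```

Pairs the LLM rejects are kept in the output with `confirmed=False`, which makes it easier to audit how often the two stages disagree.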

7. Deploying without a GPU

The most common reality check: production lives on a CPU-only box. Three things change.

Use quantized NLI models. A DeBERTa-v3-base exported to ONNX and quantized to INT8 runs at 50–150ms per pair on a decent CPU, versus 500ms+ for the unquantized PyTorch version. Use the optimum library to convert; the accuracy loss on NLI tasks is typically under 1%.

Batch aggressively. Sending 32 pairs through the model in a single forward pass is dramatically faster than 32 sequential calls. If your pipeline can buffer candidates and process them in bursts, do so.
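A minimal sketch of the buffering pattern, assuming a hypothetical `score_batch` callable that runs one forward pass over a list of (premise, hypothesis) pairs and returns one score per pair. The batch size of 32 mirrors the figure above and is not a tuned value.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def score_in_batches(
    pairs: Iterable[tuple[str, str]],
    score_batch: Callable[[Sequence[tuple[str, str]]], Sequence[float]],
    batch_size: int = 32,
) -> list[float]:
    """Call the model once per batch instead of once per pair."""
    scores: list[float] = []
    for batch in batched(pairs, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

Scores come back in the same order as the input pairs, so they can be zipped with the candidate list afterwards.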

Consider hosted inference. If you don't want to manage an inference server at all, Hugging Face Inference Endpoints will host an NLI model on a small GPU instance for a few dollars a day, which is cheaper than the engineer-hours you'll spend on CPU optimization if your traffic is intermittent. Replicate and Modal serve the same need with serverless billing.

OpenRouter is not the answer for true NLI. It aggregates LLMs, not encoders. You can use it to call a small LLM in NLI-mode, but you cannot get DeBERTa-class NLI through it. If your decision is "OpenRouter or Hugging Face Endpoints", that decision is really "small LLM or actual NLI", and the right answer depends on the trade-offs in §6.3.

8. Cost analysis on a realistic workload

Assume a comparison between two 50-page documents, ~50 chunks each, ~2500 candidate pairs after cartesian product, ~150 pairs surviving cosine filtering at threshold 0.80, ~40 unique section pairs after dedup and expansion.

| Stage | Volume | Tool | Cost per comparison |
| --- | --- | --- | --- |
| Embedding | 100 chunks once | `text-embedding-3-small` | ~$0.001 |
| Cosine + filter | 2500 pairs | local vector math | ~$0 |
| Dedup + expand | 150 → 40 pairs | application code | ~$0 |
| Inversion (NLI) | 40 pairs | local DeBERTa-base | ~$0 |
| Inversion (LLM) | 40 pairs | Haiku at ~500 tokens/pair | ~$0.02 |
| Explanation (LLM) | ~5 confirmed pairs | Sonnet at ~2000 tokens/pair | ~$0.06 |

Total per document comparison: under ten cents with the cascade pattern, near-zero with NLI alone. The point of the pipeline is precisely to spend nothing on the 99% of pairs that don't matter, and pay frontier prices only on the few that do.
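The arithmetic behind the table can be reproduced with a few lines. The volumes come from the workload described above; the per-million-token prices are assumptions, chosen as blended input/output rates that happen to reproduce the table's figures, and they will not match any provider's current price list exactly. The 300 tokens per chunk is the midpoint of the 200 to 400 range from stage 1.

```python
def token_cost(calls: int, tokens_per_call: int,
               usd_per_million_tokens: float) -> float:
    """Cost in USD of `calls` requests of `tokens_per_call` tokens each."""
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# ASSUMED blended prices in USD per million tokens (illustrative only).
ASSUMED_PRICES = {"embedding": 0.02, "small_llm": 1.00, "frontier_llm": 6.00}


def comparison_cost(chunks: int = 100, tokens_per_chunk: int = 300,
                    llm_pairs: int = 40, llm_tokens_per_pair: int = 500,
                    confirmed_pairs: int = 5,
                    explain_tokens_per_pair: int = 2000,
                    prices: dict = ASSUMED_PRICES) -> dict:
    """Per-stage and total cost of one document comparison (paid stages only)."""
    stages = {
        "embedding": token_cost(chunks, tokens_per_chunk, prices["embedding"]),
        "inversion_llm": token_cost(llm_pairs, llm_tokens_per_pair,
                                    prices["small_llm"]),
        "explanation_llm": token_cost(confirmed_pairs, explain_tokens_per_pair,
                                      prices["frontier_llm"]),
    }
    stages["total"] = sum(stages.values())
    return stages


if __name__ == "__main__":
    for stage, usd in comparison_cost().items():
        print(f"{stage:16s} ${usd:.4f}")
```

Under these assumptions the explanation step dominates the bill, so the number of pairs that reach the frontier model is the parameter most worth watching.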

9. What we learned the hard way

A few things that aren't obvious until you've shipped this:

Calibrate thresholds against a labeled set early. "Reasonable" cosine and NLI thresholds drift with document type. Build a 50-pair evaluation set the day you start; revisit it whenever you change models. The same empirical instinct produced the concision-discipline result.
Surface scores, not verdicts, in the UI. A binary "conflict detected" hides the model's uncertainty. Showing a confidence score lets reviewers triage and lets you tune thresholds based on real usage.
Cache embeddings by content hash, not document ID. Documents get renamed, re-uploaded, and edited. Hashing the chunk content directly avoids invalidating an entire embedding cache on a typo fix.
Test with adversarial pairs. Build a regression set that includes paraphrases, near-duplicates with single-word negations, and the contradictions that need multi-hop reasoning. If those three categories don't all score correctly, your pipeline isn't ready.
Plan for the explanation step from day one. Reviewers will not trust a black-box "these contradict" output. The LLM at the end of the cascade isn't optional UX — it's the difference between a tool people use and a tool they ignore.
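The content-hash caching advice above might look roughly like this. The `EmbeddingCache` class and the injected `embed` callable are hypothetical, the store is an in-memory dictionary for brevity, and including the model identifier in the key is an assumption of this sketch (it keeps vectors from different embedding models from colliding after a model change).

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """Cache embeddings by chunk content rather than by document ID."""

    def __init__(self, embed: Callable[[str], Sequence[float]], model_id: str):
        self._embed = embed          # hypothetical embedding function
        self._model_id = model_id
        self._store: dict[str, Sequence[float]] = {}
        self.misses = 0              # number of real embedding calls made

    def _key(self, text: str) -> str:
        payload = f"{self._model_id}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> Sequence[float]:
        """Return the cached vector, embedding the text only on a miss."""
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed(text)
            self.misses += 1
        return self._store[key]
```

With this keying, renaming or re-uploading a document costs nothing, and a typo fix re-embeds only the chunks whose text actually changed.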

10. Where to go from here

The pipeline above ships. It's been validated on contract pairs, specification revisions, and policy documents in three languages. The next frontier — and the one we're actively exploring — is implicit contradiction detection: cases where the conflict only emerges after the system reasons over both documents jointly, rather than over individual section pairs.

That problem looks more like a planning task than a retrieval task, and the techniques start to overlap with the work we've documented in the research index, which is a good place to start if you're working on the same problem.

For implementation questions, open a discussion on the GitHub repo or reach the author directly. The broader market context for code-audit and code-review tooling is in AI Code Audit vs AI Code Review in 2026.
