Anatoly: a multi-LLM AI agent that audits your codebase

Research

Articles & deep-dives

Technical research from the Anatoly team: benchmarks, comparisons, and findings reports on AI-assisted code auditing.

All articles

Benchmark · 12 min read

Behavioral invariants: benchmarking a code auditor on bugs that only appear over time

train-dispatch is Anatoly's behavioral-invariant benchmark fixture: a deterministic train dispatcher whose six planted defects surface only as violations of liveness, mutual exclusion, ordering, and conservation over an execution. The article reports the first four benchmark runs and shows how the fixture exposed a retrieval hole in Anatoly's RAG: data-only files were invisible to the indexer, so documented invariants never reached the constants that broke them.

Deep Dive · 43 min read

Routing the Claude Agent SDK to Local LLMs: A Dual-Tier Qwen Stack with TurboQuant 4-bit KV

An Anatoly exploration: running a multi-step pipeline on local llama.cpp instead of Anthropic. Single Qwen3.6-35B-A3B GGUF in two thinking modes (haiku no-think, sonnet thinking), four SDK bugs including a thinking-disable trick worth a 12-fold speedup, TurboQuant 4-bit KV cache on a 24 GB RTX 3090 Ti, 100-call benchmark with Opus-as-judge. The local stack is 5 to 9 times faster than Anthropic and reaches the Opus ceiling on verify-rag.

Deep Dive · 12 min read

Detecting Semantic Conflicts Between Documents: A Pragmatic Pipeline

A four-stage pipeline for finding where two documents contradict each other, not just where they overlap: chunking and embedding, cosine pre-filtering, section deduplication and neighbor expansion, then NLI or LLM inversion detection. Includes a CPU-only deployment path and a sub-ten-cent cost model on a realistic workload.

Research Note · 7 min read

Should we swap Anatoly's RAG for PageIndex? An open question with a measurable answer

Framing a tooling question instead of decreeing it. The two systems are shaped differently: a function-level vector lookup on one side, an LLM-driven TOC walk on the other. Whether one should replace the other depends on workload economics and test conditions we have not yet measured. This note states the question honestly, lays out our prior, and describes the bounded experiment that would settle it.