Research
Articles & deep-dives
Technical research from the Anatoly team: benchmarks, comparisons, and findings reports on AI-assisted code auditing.
Featured
AI Code Audit vs AI Code Review in 2026: The 14 Tools That Matter, Sorted by What They Actually Do
A taxonomy and curated comparison of 14 AI code audit and AI code review tools available in 2026, sorted by whether they run at PR time (review) or scan existing codebases (audit), with pricing, self-hosting, local-model support, and honest tradeoffs for each.
Concision discipline: a Pareto-improving prompt strategy for code-audit agents
An empirical study showing that a 12-line anti-filler instruction added to a code-audit agent's system prompt cut output tokens by 24.7%, cost by 20.7%, and wall-clock duration by 27.9%, while improving F1 recall by 9.1 points on the slot-engine fixture.
All articles
Behavioral invariants: benchmarking a code auditor on bugs that only appear over time
train-dispatch is Anatoly's behavioral-invariant benchmark fixture: a deterministic train dispatcher whose six planted defects surface only as violations of liveness, mutual exclusion, ordering, and conservation over an execution. The article reports the first four bench runs and shows how the fixture exposed a retrieval hole in Anatoly's RAG: data-only files were invisible to the indexer, so documented invariants never reached the constants that broke them.
Routing the Claude Agent SDK to Local LLMs: A Dual-Tier Qwen Stack with TurboQuant 4-bit KV
An Anatoly exploration: running a multi-step pipeline on local llama.cpp instead of Anthropic. Single Qwen3.6-35B-A3B GGUF in two thinking modes (haiku no-think, sonnet thinking), four SDK bugs including a thinking-disable trick worth a 12-fold speedup, TurboQuant 4-bit KV cache on a 24 GB RTX 3090 Ti, 100-call benchmark with Opus-as-judge. Local is 5 to 9 times faster than Anthropic and at the Opus ceiling on verify-rag.
Detecting Semantic Conflicts Between Documents: A Pragmatic Pipeline
A four-stage pipeline for finding where two documents contradict each other, not just where they overlap: chunking and embedding, cosine pre-filtering, section deduplication and neighbor expansion, then NLI or LLM inversion detection. Includes a CPU-only deployment path and a sub-ten-cent cost model on a realistic workload.
Should we swap Anatoly's RAG for PageIndex? An open question with a measurable answer
Framing a tooling question instead of decreeing it. The two systems are shaped differently: a function-level vector lookup on one side, an LLM-driven TOC walk on the other. Whether one should replace the other depends on workload economics and test conditions we have not yet measured. This note states the question honestly, lays out our prior, and describes the bounded experiment that would settle it.