Research

Articles & deep-dives

Technical research from the Anatoly team: benchmarks, comparisons, and findings reports on AI-assisted code auditing.

Featured

Comparison · May 13, 2026 · 23 min read

AI Code Audit vs AI Code Review in 2026: The 14 Tools That Matter, Sorted by What They Actually Do

A taxonomy and curated comparison of 14 AI code audit and AI code review tools available in 2026. Sorted by whether they ship at PR time (review) or scan existing codebases (audit), with pricing, self-hosting, local-model support, and honest tradeoffs for each.

Findings Report · May 6, 2026 · 13 min read

Concision discipline: a Pareto-improving prompt strategy for code-audit agents

An empirical study showing that a 12-line anti-filler instruction added to a code-audit agent's system prompt cut output tokens by 24.7%, cost by 20.7%, and wall-clock duration by 27.9%, while also improving F1 recall by 9.1 points on the slot-engine fixture.

All articles

Benchmark · June 15, 2026 · 12 min read

Behavioral invariants: benchmarking a code auditor on bugs that only appear over time

train-dispatch is Anatoly's behavioral-invariant benchmark fixture: a deterministic train dispatcher whose six planted defects surface only as violations of liveness, mutual exclusion, ordering, and conservation over an execution. The article reports the first four bench runs and shows how the fixture exposed a retrieval hole in Anatoly's RAG: data-only files were invisible to the indexer, so documented invariants never reached the constants that broke them.

Deep Dive · May 27, 2026 · 43 min read

Routing the Claude Agent SDK to Local LLMs: A Dual-Tier Qwen Stack with TurboQuant 4-bit KV

An Anatoly exploration: running a multi-step pipeline on local llama.cpp instead of Anthropic. Single Qwen3.6-35B-A3B GGUF in two thinking modes (haiku no-think, sonnet thinking), four SDK bugs including a thinking-disable trick worth a 12-fold speedup, TurboQuant 4-bit KV cache on a 24 GB RTX 3090 Ti, 100-call benchmark with Opus-as-judge. Local is 5 to 9 times faster than Anthropic and at the Opus ceiling on verify-rag.

Deep Dive · May 15, 2026 · 12 min read

Detecting Semantic Conflicts Between Documents: A Pragmatic Pipeline

A four-stage pipeline for finding where two documents contradict each other, not just where they overlap: chunking and embedding, cosine pre-filtering, section deduplication and neighbor expansion, then NLI or LLM inversion detection. Includes a CPU-only deployment path and a sub-ten-cent cost model on a realistic workload.

Research Note · May 14, 2026 · 7 min read

Should we swap Anatoly's RAG for PageIndex? An open question with a measurable answer

Framing a tooling question instead of decreeing an answer. The two systems are shaped differently: a function-level vector lookup on one side, an LLM-driven TOC walk on the other. Whether one should replace the other depends on workload economics and test conditions we have not yet measured. This note states the question honestly, lays out our prior, and describes the bounded experiment that would settle it.