Benchmark

Behavioral invariants: benchmarking a code auditor on bugs that only appear over time

train-dispatch is Anatoly's behavioral-invariant benchmark fixture: a deterministic train dispatcher whose six planted defects surface only as violations of liveness, mutual exclusion, ordering, and conservation over an execution. It reports the first four bench runs and shows how the fixture exposed a retrieval hole in Anatoly's RAG: data-only files were invisible to the indexer, so documented invariants never reached the constants that broke them.

June 15, 202612 min readRémi Viau

Behavioral invariants: benchmarking a code auditor on bugs that only appear over time#

Note by: Rémi Viau (Anatoly maintainer), with Claude (Anthropic Opus 4.8) as analytical partner. Repositories referenced: Anatoly, anatoly-bench, Evolve.

The defects a single line cannot show you#

Most code-audit benchmarks test defects you can see in one place. A constant set to the wrong value. A missing null check. A block of logic duplicated two files over. Anatoly's first fixture, slot-engine, is built that way: 27 planted defects across five scored axes, each anchored to a span of code you can point at.

A large class of real bugs does not work like that. They are violations of properties that only hold over an execution:

Liveness: every task eventually finishes.
Mutual exclusion: a shared resource holds at most one user at any instant.
Ordering: higher-priority work pre-empts lower-priority work.
Conservation: every item is processed exactly once, never zero times, never twice.
Temporal separation: two events stay at least a minimum distance apart.

You cannot point at the buggy line for any of these. The line that acquires a lock looks fine. The line that compares two items looks fine. The defect lives in what the program does over time, across several actors, which no single line reveals. These are also the bugs that hurt most in production: deadlocks, races, double-processing, lost updates.

That is the gap train-dispatch was built to measure. Can an LLM code auditor catch a behavioral-invariant violation from static review, when there is nothing to pattern-match and the only route to the bug is to reason about behavior? slot-engine could not answer that question, because every one of its defects is local by construction. We needed a second fixture whose defects are, by construction, not local.

The fixture: a deterministic train dispatcher#

train-dispatch is a small TypeScript project: the scheduling core of a toy railway. It exposes one pure function, runSchedule(), that advances a discrete-event simulation over integer ticks across a fixed network and a fixed timetable, and returns the movement trace. There is no random number generator anywhere in the engine. The same network and timetable always produce the same trace.

That determinism is the design choice the whole fixture rests on. slot-engine ships a stochastic slot machine, so measuring its behavior takes a 100,000-spin Monte-Carlo run. train-dispatch is deterministic, so a single execution is the ground truth. One run, one trace, no statistics.

The dispatcher advertises four operating guarantees, written into the seeded README.md and treated as the product contract:

Punctuality: every train arrives within a slack window of its scheduled time.
Exclusivity: a track block holds at most one train per tick.
Liveness: every dispatched train eventually reaches its destination.
Conservation: every dispatched train arrives exactly once.

The shipped implementation violates all four, in six specific, cataloged ways.

The six planted defects#

ID	Difficulty	Invariant violated
INV-DWELL	trivial	`DWELL_TICKS = 6` contradicts the README's documented 2-tick dwell
INV-PRIORITY	medium	the comparator orders by train id, not priority class, so an express never pre-empts a local
INV-HEADWAY	medium	a strict `>` under-enforces the minimum headway, so a follower enters a block too soon
INV-CONSERVATION	medium	a dispatched train is not removed from the ready queue, so it arrives twice
INV-MUTEX	hard	the occupancy guard checks the wrong block, so two trains share one block on a tick
INV-DEADLOCK	hard	section blocks are acquired in route order, not a fixed global order, so opposing trains circular-wait

Two of the three hardest defects, INV-MUTEX and INV-DEADLOCK, live in the same file (interlocking.ts), which forces the auditor to separate two distinct behavioral failures inside one module.

The fixture is generated by Evolve from a specification, then frozen. The specification is the single source of truth, both for Evolve (which converges the code until a --check script confirms every guarantee is measurably broken) and for the bench scorer (which reads the same catalog to grade Anatoly's findings). A hard rule of the spec is that the defects must look like ordinary, slightly careless production code. No comment may say "intentional", "bug", "race", "deadlock", or anything equivalent. A fixture that annotates its own bugs measures nothing. The auditor has to reason about the dispatcher's behavior to find them, which is the entire point.

One scored axis, on purpose#

train-dispatch scores a single axis: correction. The project ships no tests, almost no documentation, and a couple of deliberately clean "noise" modules whose only job is to create opportunities for false positives. Because there is only one scored axis, the global F1 equals the correction F1. There is no cross-axis averaging to dampen a single catch or miss.

That makes the signal sharp and a little brutal. With six defects on one axis, a single caught-or-missed finding moves global F1 by roughly 10 to 12 points. The benefit is that nothing hides: the score is a direct readout of how many behavioral invariants the auditor recovered, with the noise modules keeping precision honest.

Four runs in one day#

All four runs were recorded on 2026-06-15. The fixture was brand new, so unlike the long slot-engine history there is no frozen-surface era yet: the code surface itself moved between v1 and v3 as the Evolve output converged on its final shape. v3 and v4 share the same frozen project state.

Run	Global / correction F1	Precision	Recall	TP / FP / FN
v1	72.7%	80.0%	66.7%	4 / 1 / 2
v2	50.0%	50.0%	50.0%	3 / 3 / 3
v3	66.7%	66.7%	66.7%	4 / 2 / 2
v4	72.7%	80.0%	66.7%	4 / 1 / 2

Three defects are caught on every run: INV-PRIORITY, INV-MUTEX, and INV-CONSERVATION. That already says something worth saying plainly: an LLM auditor recovers mutual exclusion and conservation violations from static review more reliably than I expected going in. INV-HEADWAY is caught in three runs out of four. The dip to 50.0% at v2 is the single-axis design showing its teeth: one extra miss plus two extra false positives, on a fixture state that was still settling, costs more than 20 points at once.

The interesting comparison is v3 against v4. The project is byte-for-byte identical between them. The only thing that changed is Anatoly itself.

The finding: a benchmark that found a bug in our own retriever#

This is the part that justifies building the fixture, and it is not the score.

v4 carries one Anatoly-side change relative to v3, on the branch rag-index-declarations. It fixes a retrieval hole that train-dispatch was unusually good at exposing, because of a property slot-engine never had: a defect that lives in a file with no functions at all.

Data-only files were invisible to the indexer#

Anatoly's RAG indexer turns a file's symbols into searchable cards. The relevant filter, from the AGPL-3.0 Anatoly source, is one line:

// src/rag/indexer.ts (AGPL-3.0)
const functionSymbols = task.symbols.filter(
  (s) => s.kind === 'function' || s.kind === 'method' || s.kind === 'hook',
);
// A card is built for each functionSymbol, and only for those.

timetable.ts is a data-only module: top-level const, type, and enum, no functions. So it produced zero cards, and therefore stored no NLP vector. Downstream, the per-file documentation retriever asks the vector store for that file's average NLP vector:

// src/core/docs-resolver.ts (AGPL-3.0)
let queryEmbedding = await vectorStore.getAverageNlpVectorByFile(filePath);

For timetable.ts that call returned null, because there were no cards to average. The only query the retriever had left was the file name itself (Module: timetable), which is far too thin to pull the README's timing section to the constant that needed it.

Now connect that to INV-DWELL. INV-DWELL is the trivial defect: a constant, DWELL_TICKS = 6, that contradicts the README's documented dwell of 2 ticks. To even suspect it, the auditor needs the README sentence "A station dwell is 2 ticks" sitting next to the constant. But the file holding the constant was invisible to documentation retrieval, so the contradiction was unreachable. The easiest defect in the fixture was unreachable for a reason that had nothing to do with reasoning ability, and everything to do with what the indexer chose to index.

The fix#

The rag-index-declarations change indexes non-function declarations as a new card kind: 'declaration'. The summary is synthesized from the AST, with no extra LLM call, so the cost is negligible. The declaration card is included in the per-file NLP average, which gives a data-only file a real semantic fingerprint, but it is excluded from code-similarity and duplication search, so a bare constant never registers as a "duplicate" of a function.

The bench run is the proof that the hole is closed. After the fix, the timetable.ts correction review carries reference_source: README.md#Timing and the excerpt "A station dwell is 2 ticks...". Before the fix, that documentation could not physically reach the file. The retrieval foundation INV-DWELL needs is now in place, and it shows up as a tangible artifact in the report, not just as a number on a chart.

This is the argument for the fixture in one example. A benchmark is not only a scoreboard. A fixture deliberately built around a data-only file with a documented invariant surfaced an architectural retrieval gap that no amount of rereading audit output would have revealed, because the missing document never appeared in the output to begin with. The single-axis, behavior-first design made the gap legible.

What is still missed, and why it is the interesting part#

Two defects resist every run, v1 through v4, and they fail for two very different reasons.

INV-DWELL: detection works, classification does not. This is the subtle one. After the declaration-indexing fix, the README excerpt arrives, and the contradiction is detected: it surfaces as a doc_divergence plus a documentation: UNDOCUMENTED finding. But the correction-axis verdict on the constant is OK, with the rationale that there is no code-level defect, only a value that diverges from the README. The catalog files INV-DWELL as a correction defect, so the divergence classification scores no credit. The retrieval gap is closed; what remains is a policy question. Should the correction axis treat a constant that contradicts a documented invariant as a correctness defect in its own right, or accept the divergence classification? That is an owner decision, not a detection improvement. It is the same recurring theme as the slot-engine arbitration loop: a contradiction can be detected and surfaced without ever folding into the scored axis.

INV-DEADLOCK: genuine multi-actor reasoning. This is the hardest defect in the fixture, and no run has cracked it. The circular wait only manifests when an eastbound train and a westbound train approach the same single-track section from opposite directions and are considered together, over time. The line that reserves a block looks correct. The bug is the order in which two different trains acquire two shared blocks, visible only across the execution trace of both. Surfacing it from static review is a real frontier for an LLM auditor, and an honest one to leave open.

The two persistent misses cluster the open work cleanly. One is cross-axis classification policy. The other is multi-actor execution-trace reasoning. Neither is a retrieval problem anymore, which is exactly the kind of clarity a focused fixture is supposed to buy.

What this says about LLM code auditors#

Three takeaways I am comfortable defending from four runs, while being clear that four runs on a single-axis fixture is a small sample.

Several behavioral invariants are within reach of static review today. Mutual exclusion, conservation, and priority ordering are caught reliably here, which is more than the "LLMs only pattern-match syntax" prior would predict.
The two hard gaps are not the ones I would have guessed. They are a classification-policy decision and multi-actor trace reasoning, not raw detection. Knowing which gap you are looking at is most of the value.
A benchmark earns its keep when it is adversarial to your architecture, not just your score. The declaration-indexing hole is the proof. It was invisible from the report and obvious from the fixture, which is the whole reason to build fixtures that probe shapes your real codebases happen to contain but your existing tests do not isolate.

For two adjacent results from the same bench, see the concision-discipline finding, where a 12-line prompt change improved F1 and cut tokens at once on slot-engine, and our open question about replacing the function-level vector RAG, which sits directly upstream of the retrieval hole described here.

Reproducing this#

The fixture, the catalog, and the per-run baselines live in anatoly-bench under catalog/train-dispatch/. Each run is a single anatoly run against catalog/train-dispatch/project/, scored with anatoly-bench score. The audit agent itself is Anatoly, and the fixture is generated from its specification by Evolve. Because the bench consumes only Anatoly's public report artifacts and never imports its types, a change to the report format breaks the bench loudly, which is the correct signal that a contract moved.