Findings Report

Concision discipline: a Pareto-improving prompt strategy for code-audit agents

An empirical study showing that a 12-line anti-filler instruction added to a code-audit agent's system prompt simultaneously cut output tokens by 24.7%, cost by 20.7%, and wall-clock duration by 27.9%, and improved F1 by 9.1 points on the slot-engine fixture.

Rémi Viau · 13 min read

Methodology and analysis: Rémi Viau (Anatoly maintainer), with Claude (Anthropic Sonnet 4.6) as analytical partner. Repositories referenced: anatoly, anatoly-bench.

Abstract

We report a counter-intuitive empirical result: adding a 12-line "concision discipline" instruction (anti-filler directives) to the system prompts of a multi-axis code-audit agent simultaneously reduced output token consumption (-24.7%), cost (-20.7%), and wall-clock duration (-27.9%) while improving recall (+9.1 absolute F1 points). The result contradicts the intuition that prose concision and analysis quality trade off against each other. We document the methodology, discuss the mechanistic hypotheses, and state the measurement limits honestly (n=1 per condition, single fixture).

1. Context

Anatoly is a source-code audit agent that evaluates each file along 7 axes: utility, duplication, correction, overengineering, tests, best_practices, documentation. Each axis is a separate LLM call returning structured JSON findings. The detail and note fields (free-form prose) summed across thousands of findings dominate the output-token cost of a typical run.

Prior work (ADR-04 Token Compression) proposed replacing prose with a JSON evidence + terse note format. This pattern was prototyped on the utility axis with mixed results in practice: on axes that are already telegraphic (overengineering, utility), the JSON-key overhead (runtime_importers, type_importers, etc.) cancelled the prose savings. Measured in structured-compression mode: overengineering saw its tokens increase by 25% and its F1 drop by 10.7 points.

A simpler alternative, instructing the model directly to remove fillers without touching the schema, was considered in ADR-04 §"Alternatives Rejected" under the name "Caveman prose" and rejected for two reasons:

  1. Theoretical compression lower than the structured approach (~58% vs ~65%)
  2. Quality risk rated "Medium": unpredictable phrasing, potential reasoning degradation

The present work tests that intuition empirically.

2. Hypotheses

H0 (null): the discipline has no measurable effect on output token volume or finding recall.

H1: the discipline reduces output token volume (≥10%) without recall regression.

H2 (counter to ADR-04 intuition): the discipline can improve recall, because suppressing hedging forces commitment to actionable findings.

3. Methodology

3.1 Fixture

The primary fixture is slot-engine, a TypeScript slot-machine engine project (13 .ts files under src/) deliberately seeded with 14 catalogued defects spread across 5 audited axes (correction, utility, duplication, overengineering, best_practices). The formal catalog (SPEC.md) lists for each defect: file, affected symbol, expected verdict (NEEDS_FIX, DEAD, DUPLICATE, OVER, etc.), and technical category.

The anatoly-bench tool consumes an Anatoly run folder and a SPEC.md, then computes per axis and globally:

  • TP (true positive): a finding Anatoly emits that matches the catalog
  • FP (false positive): a finding emitted without a match
  • FN (false negative): a catalogued defect that Anatoly missed
  • F1 = 2·precision·recall / (precision+recall)
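
The scoring definitions above can be sketched in a few lines. This is a minimal illustration, not the anatoly-bench implementation: the function name and the zero-division convention (return 0.0 when a denominator is empty) are assumptions, and the counts in the example are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw match counts."""
    # Assumed convention: an empty denominator yields 0.0 rather than an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one axis (not taken from any run in this report).
p, r, f1 = precision_recall_f1(tp=2, fp=1, fn=1)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it moves with both missed defects (FN) and unmatched findings (FP).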

3.2 Treatment

The "concision discipline" is a 12-line block added to the system prompts. Exact text:

## Output concision
 
Cut verbosity from every output, free-text and structured alike:
 
- No preambles or self-introductions ("Looking at this code…", "Let me analyze…", "I'll now…").
- No hedging without information ("appears to", "seems to be", "might possibly", "perhaps could").
- No filler phrases ("It is important to note", "basically", "essentially", "in order to" → "to").
- No restating the question or echoing context the reader already has.
- No meta-commentary, apologies, or thanks.
- Prefer direct verbs and concrete nouns over qualifiers and abstractions.
 
"X imports Y from Z" beats "It looks like X seems to be importing Y from Z".
 
This rule applies to every free-text field (`detail`, `note`, `reasoning`, `description`, etc.). Specificity comes from precise content, not verbose phrasing.

Coverage:

  • Composed via composeAxisSystemPrompt inside _shared/guard-rails.system.md: applies automatically to the 7 axes plus their per-language variants (TypeScript, Python, Rust, Go, Java, etc.)
  • Inlined directly in: correction.verification.system.md (Opus verification pass), refinement/tier3-investigation.system.md (deliberation), doc-generation/{writer, updater, coherence-review}.system.md

Excluded from:

  • rag/nlp-summarizer.system.md and rag/section-refiner.system.md (outputs already capped at 400 chars, no filler to cut)
  • doc-generation/doc-internal-writer.{api-reference, architecture}.system.md (additional content rules composed with the main writer that already carries the discipline)

3.3 Comparison conditions

| | Run M (control) | Run R (treatment) |
|---|---|---|
| Branch | main HEAD 9063168 | compression-rollout HEAD af2a23f |
| Discipline | absent | present |
| Compression mode | legacy | legacy (--no-compress default) |
| Cache | --no-cache (forced) | --no-cache (forced) |
| Fixture | slot-engine (identical state) | slot-engine (identical state) |
| RAG | pre-indexed (no rebuild) | pre-indexed (no rebuild) |
| Models | sonnet-4-6 / haiku-4-5 / opus-4-6 | identical |
| Temperature | 0 (pinned in the SDK transport) | 0 |

The compression-rollout branch also contains Epic 51 commits (dual-mode infrastructure for utility, overengineering, best_practices), but in --no-compress mode (default) those paths are inactive: schema selection falls back to legacy and the LLM receives exactly the same prompts as on main, modulo the discipline.

3.4 Metrics

Captured via llm-calls.ndjson (per-phase, per-axis instrumentation), run-metrics.json (duration, aggregate cost), and the anatoly-bench scorer.

  • Output tokens per axis: outputTokens aggregated by phase + axis
  • Cost in USD: totalCostUsd per axis and total
  • Duration: durationMs / 1000
  • F1 per axis and global: output of the scorer
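
The per-axis aggregation can be sketched as follows. The field names outputTokens, totalCostUsd and durationMs come from the list above; the `phase` and `axis` keys, the idea that every record carries all three metrics, and the sample records are assumptions made for illustration, not the actual llm-calls.ndjson schema.

```python
import io
import json
from collections import defaultdict


def aggregate_calls(lines):
    """Sum tokens, cost and duration per (phase, axis) from NDJSON records."""
    totals = defaultdict(
        lambda: {"outputTokens": 0, "totalCostUsd": 0.0, "durationS": 0.0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        key = (rec.get("phase", "unknown"), rec.get("axis", "unknown"))
        totals[key]["outputTokens"] += rec.get("outputTokens", 0)
        totals[key]["totalCostUsd"] += rec.get("totalCostUsd", 0.0)
        totals[key]["durationS"] += rec.get("durationMs", 0) / 1000
    return dict(totals)


# Invented sample records standing in for llm-calls.ndjson.
sample = io.StringIO(
    '{"phase": "review", "axis": "utility", "outputTokens": 120, "totalCostUsd": 0.004, "durationMs": 1500}\n'
    '{"phase": "review", "axis": "utility", "outputTokens": 80, "totalCostUsd": 0.003, "durationMs": 500}\n'
    '{"phase": "review", "axis": "correction", "outputTokens": 300, "totalCostUsd": 0.010, "durationMs": 4000}\n'
)
totals = aggregate_calls(sample)
for (phase, axis), t in sorted(totals.items()):
    print(phase, axis, t["outputTokens"], round(t["totalCostUsd"], 4), t["durationS"])
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` sample.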

3.5 Reproduction

Artifacts are preserved locally under anatoly-bench/catalog/slot-engine/project/.anatoly/runs/:

  • 2026-04-28_111200/: pre-Epic 51 cached reference (F1 67.8%)
  • 2026-05-06_113334/: Run M (control)
  • 2026-05-06_114407/: Run R (treatment)

Scoring command:

node anatoly-bench/dist/cli.js score \
  --spec catalog/slot-engine/SPEC.md \
  --report catalog/slot-engine/project/.anatoly/runs/<runId>

4. Results

4.1 Aggregate

| Metric | Run M (control) | Run R (treatment) | Δ absolute | Δ relative |
|---|---|---|---|---|
| Output tokens | 191 427 | 144 238 | -47 189 | -24.7% |
| Cost USD | $6.45 | $5.11 | -$1.34 | -20.7% |
| Duration (s) | 531 | 383 | -148 | -27.9% |
| Global F1 | 63.8% | 72.9% | +9.1 pts | +14.3% relative |
| Findings emitted | 68 | 68 | 0 | 0 |
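
As a worked check, the relative deltas can be recomputed from the aggregate figures, along with a simple test of H1 from section 2. This is a sketch: the helper name is hypothetical, F1 stands in for recall because it is the only quality figure reported, and cost is left out because only rounded dollar amounts are published.

```python
def rel_delta_pct(control: float, treatment: float) -> float:
    """Relative change from control to treatment, in percent."""
    return (treatment - control) / control * 100


# Values copied from the aggregate results (Run M = control, Run R = treatment).
tokens_m, tokens_r = 191_427, 144_238
duration_m, duration_r = 531, 383
f1_m, f1_r = 63.8, 72.9

token_delta = rel_delta_pct(tokens_m, tokens_r)
duration_delta = rel_delta_pct(duration_m, duration_r)
f1_delta_abs = f1_r - f1_m
f1_delta_rel = rel_delta_pct(f1_m, f1_r)

# H1: at least a 10% output-token reduction with no quality regression
# (F1 used here as the stand-in for recall).
h1_holds = token_delta <= -10 and f1_delta_abs >= 0

print(f"tokens {token_delta:.1f}%  duration {duration_delta:.1f}%")
print(f"F1 {f1_delta_abs:+.1f} pts ({f1_delta_rel:+.1f}% relative)  H1 holds: {h1_holds}")
```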

4.2 Per-axis breakdown (scored axes)

| Axis | Tokens M → R | Δ tokens % | F1 M → R | Δ F1 abs |
|---|---|---|---|---|
| best_practices | 91 321 → 65 969 | -27.7% | 66.7% → 83.3% | +16.6 |
| correction | 49 449 → 39 184 | -20.8% | 42.9% → 71.4% | +28.5 |
| duplication | 14 534 → 12 878 | -11.4% | 66.7% → 66.7% | 0 |
| overengineering | 14 201 → 9 111 | -35.8% | 57.1% → 57.1% | 0 |
| utility | 7 926 → 7 975 | +0.6% | 85.7% → 85.7% | 0 |

4.3 Per-axis breakdown (axes not scored on this fixture)

| Axis | Tokens M → R | Δ tokens % |
|---|---|---|
| documentation | 10 302 → 6 054 | -41.2% |
| tests | 3 694 → 3 067 | -17.0% |
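
The seven per-axis token counts from 4.2 and 4.3 can be cross-checked against the aggregate in 4.1, and the same data shows where the saving concentrates. The snippet below is an illustrative consistency check, not part of the anatoly-bench tooling.

```python
# (tokens in Run M, tokens in Run R), copied from the per-axis breakdowns.
per_axis = {
    "best_practices": (91_321, 65_969),
    "correction": (49_449, 39_184),
    "duplication": (14_534, 12_878),
    "overengineering": (14_201, 9_111),
    "utility": (7_926, 7_975),
    "documentation": (10_302, 6_054),
    "tests": (3_694, 3_067),
}

total_m = sum(m for m, _ in per_axis.values())
total_r = sum(r for _, r in per_axis.values())
print(total_m, total_r)  # should equal the aggregate output-token row

# Share of the total token saving contributed by each axis, largest first.
saved = total_m - total_r
for axis, (m, r) in sorted(per_axis.items(), key=lambda kv: kv[1][1] - kv[1][0]):
    print(f"{axis:16s} {m - r:>7d} tokens saved ({(m - r) / saved:.1%} of total)")
```

The sums match the aggregate row (191 427 and 144 238), and best_practices alone accounts for a little over half of the tokens saved.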

4.4 Triangulation with the April cached baseline

| Run | Global F1 | Conditions |
|---|---|---|
| April 28 reference (cached) | 67.8% | cached run, pre-Epic 51 code |
| Run M (main --no-cache) | 63.8% | tokens regenerated, no discipline |
| Run R (compression-rollout --no-cache + discipline) | 72.9% | tokens regenerated, with discipline |

Run M underperforms the cached reference (-4 points), suggesting that --no-cache introduces real LLM variance despite temperature=0. The effect is consistent with the hypothesis that regenerating findings from scratch produces non-identical outputs even at zero temperature (batching effects, token order, etc.). Run R beats both the cached reference and the no-cache control simultaneously, suggesting that the discipline's benefit is robust to this variance.

4.5 Partial replication on python-dotenv

To verify generalisation outside TypeScript, a cross-comparison was run on python-dotenv (21 .py files, a well-known OSS Python project):

| Metric | Pre-discipline | Post-discipline | Δ |
|---|---|---|---|
| Output tokens (axes) | 376 849 | 295 824 | -21.5% |
| Cost (axes only) | $11.40 | $10.78 | -5.4% |
| Duration | 1 804 s | 637 s | -65% |

The token reduction (-21.5%) is consistent with slot-engine. Cost drops less (-5.4% vs -20.7%) because on python-dotenv (larger files) input dominates cost: the discipline only touches the output. F1 is not measurable here (no catalog available for python-dotenv).

5. Discussion

5.1 Why does recall improve?

Three plausible mechanisms, not mutually exclusive:

(a) Forced commitment via hedging suppression. Hedging phrases ("appears to", "seems to be", "might possibly") are linguistic escape valves that let the model signal uncertainty without committing to a verdict. When forbidden, the model is forced into a binary stance: flag or no-flag. That constraint can eliminate false negatives where the model would otherwise have dodged via hedging.

(b) Lexical precision. Direct verbs and concrete nouns leave less room for vagueness. The sentence "X imports Y from Z" is testable against source code; the sentence "It looks like X seems to be importing Y" is by construction outside the verifiable register. The register constraint forces the LLM to produce verifiable claims.

(c) Reduced hallucination surface. Buffer phrases ("basically", "essentially", "It is important to note") are text that does not reference the code. Non-referencing text is the natural place where hallucination can settle (the model fills space with plausible unverified assertions). Cutting the buffer reduces that surface.

5.2 Why is it more effective than structured compression on most axes?

ADR-04 structured compression replaces detail: string with evidence: { ... } + note: string. The JSON keys of the evidence (runtime_importers, type_importers, local_refs, transitive, exported) consume ~10-15 tokens each. On a terse axis (utility ~90 tokens/symbol), 5 keys × 12 tokens = 60 tokens of overhead wipe out the prose savings. On a verbose axis (best_practices ~7 000 tokens/call), the overhead is negligible and the savings dominate.
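
The break-even arithmetic can be made explicit. The sketch below assumes the prose shrinks by ADR-04's theoretical ~65% and that the key overhead is paid once per unit of output; both are simplifications, and the function name is hypothetical.

```python
def net_tokens_saved(
    prose_tokens: float,
    compression: float,
    n_keys: int = 5,
    tokens_per_key: float = 12,
) -> float:
    """Prose tokens removed minus the fixed cost of the evidence keys."""
    return prose_tokens * compression - n_keys * tokens_per_key


# 0.65 is ADR-04's theoretical compression figure, applied here as an assumption.
terse = net_tokens_saved(prose_tokens=90, compression=0.65)
verbose = net_tokens_saved(prose_tokens=7_000, compression=0.65)

print(f"terse axis (utility-like): {terse:+.1f} tokens")
print(f"verbose axis (best_practices-like): {verbose:+.1f} tokens")
```

Under these assumptions the terse case nets roughly zero or slightly negative, while the verbose case keeps almost all of its savings. This matches the qualitative pattern described above.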

The soft discipline acts on the prose volume in every field and pays no structural cost. It scales with prose density: axes with more prose to cut gain more (best_practices -27.7%, overengineering -35.8%), while the already terse utility axis is flat (+0.6%). Terse axes show no regression, unlike under the structured approach, which penalises them.

5.3 Cost-gain asymmetry between fixtures

Slot-engine shows -20.7% cost, python-dotenv only -5.4%. The gap is explained by the input/output ratio:

  • Slot-engine: small project, limited RAG context, proportionally large output. The discipline (acting on output) hits where cost concentrates.
  • Python-dotenv: larger project, substantial RAG injection, bigger source files. Input dominates total cost (~80%) and the discipline does not touch input.

Implication: the discipline's ROI is highest on projects where output dominates. On large projects the same discipline yields similar quality gain but a more modest monetary saving.
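
A back-of-the-envelope model makes the asymmetry concrete. It assumes input cost is unchanged by the discipline and that output cost scales linearly with output tokens (same model mix across runs); neither assumption was verified in this study, and the function names are hypothetical.

```python
def cost_saving(output_cost_share: float, output_reduction: float) -> float:
    """Fractional total-cost saving when only the output side shrinks."""
    return output_cost_share * output_reduction


def implied_output_share(total_saving: float, output_reduction: float) -> float:
    """Output cost share that would explain an observed total saving."""
    return total_saving / output_reduction


# python-dotenv: input is ~80% of cost, so output is ~20%; output tokens fell 21.5%.
dotenv_predicted = cost_saving(0.20, 0.215)
print(f"python-dotenv predicted saving: {dotenv_predicted:.1%}")

# slot-engine: a 20.7% cost saving alongside a 24.7% output-token cut.
slot_share = implied_output_share(0.207, 0.247)
print(f"slot-engine implied output cost share: {slot_share:.0%}")
```

The model predicts a saving of a few percent for python-dotenv, the same order as the measured -5.4%. For slot-engine it implies that output would account for most of the cost. This is an inference from the model, not a measured figure.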

5.4 Comparison to the ADR-04 assessment

ADR-04 §"Alternatives Rejected" rated "Caveman prose" at:

  • Theoretical compression: ~58%
  • Quality risk: Medium
  • Compliance predictability: Low

Our measurements contradict the quality-risk assessment: we observed +9.1 absolute F1 points, an improvement rather than a regression, although a single run per condition cannot establish statistical significance. Compliance predictability (does the model drift toward defensive prose on successive responses?) was not measured systematically in this study. It remains an open question for long-running batches.

6. Threats to validity

6.1 Internal

  • N=1 per condition. Each comparison is a single run vs a single run, with no replication to estimate variance. The F1 gap of about 29 points on the correction axis between the cached baseline (62.5%) and Run M (33.3%) on the same code suggests that run-to-run variance at the axis level is non-trivial. The +9.1 aggregate F1 could include a favorable sampling bias.
  • Single fixture for F1 scoring. Slot-engine has 14 catalogued defects across 5 axes ≈ 3 defects per axis. A single finding flipping between match and miss on one axis already represents ~14 F1 points.
  • Treatment confound. Run R is on compression-rollout, which carries Epic 51 commits (dual-mode infrastructure). We argued those paths are inactive in --no-compress mode, but rigorous instrumentation of the prompts actually sent to the LLM would better exclude any non-discipline differential.
  • Temperature ≠ determinism. At temperature=0, the production model can produce different outputs depending on batching, hardware, and token arrival order. Part of the F1 gap is irreducible noise.
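
The treatment-confound check could be instrumented along these lines: capture the system prompt sent in each run, remove the discipline block from the treatment prompt, and compare fingerprints. This is a sketch under assumptions: it presumes the block is inserted verbatim and contiguously, all function names are hypothetical, and the prompts shown are invented stand-ins.

```python
import hashlib


def normalize(prompt: str) -> str:
    """Strip trailing spaces and collapse blank-line runs so cosmetic diffs don't count."""
    out: list[str] = []
    for ln in (raw.rstrip() for raw in prompt.strip().splitlines()):
        if ln == "" and out and out[-1] == "":
            continue
        out.append(ln)
    return "\n".join(out)


def fingerprint(prompt: str) -> str:
    """SHA-256 of the normalized prompt text."""
    return hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()


def same_modulo_discipline(control: str, treatment: str, block: str) -> bool:
    """True if the treatment prompt equals the control once the block is removed."""
    return fingerprint(control) == fingerprint(treatment.replace(block, ""))


# Invented prompts for illustration only.
block = "## Output concision\n\nCut verbosity from every output.\n"
control = "You are an auditor.\n\nReturn JSON."
treatment = "You are an auditor.\n\n" + block + "\nReturn JSON."

print(same_modulo_discipline(control, treatment, block))
```

Running such a check per axis on the prompts actually sent would flag any difference between the two branches other than the discipline itself.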

6.2 External

  • Single language family tested for F1. Slot-engine is pure TypeScript. On python-dotenv we measured tokens, not F1.
  • Narrow audit domain. Slot-engine is concise business logic (a slot machine). The effect on infrastructure code, vast libraries, or monorepos is untested.
  • Single LLM family. Everything runs on Claude (Sonnet, Haiku, Opus). Transfer to GPT-4/5 or Gemini is untested. Different families respond differently to style instructions.

6.3 Construct

  • F1 against catalog is a coarse proxy. The catalog is hand-annotated: a finding counted as FP could be a real defect not catalogued. An F1 improvement may reflect better alignment with the catalog rather than better code understanding.

7. Conclusion

A 12-line prompt instruction (anti-filler, anti-hedging, anti-preamble, anti-meta) added to a code-audit agent's system prompts produced, on the slot-engine fixture, the following result simultaneously:

  • -24.7% output tokens
  • -20.7% USD cost
  • -27.9% wall-clock duration
  • +9.1 F1 points (from 63.8% to 72.9%): improvement, not regression

The result is consistent with the hypothesis that hedging and filler phrases consume tokens AND degrade analysis quality by letting the model dodge commitment. Cutting them produces a Pareto-improving prompt strategy: cheaper, faster, more accurate.

Compared to ADR-04 structured compression, the soft discipline is:

  • Simpler: no schema change, no Zod migration, no per-axis feature flag
  • More uniform: works on every axis, while the structured approach penalises terse ones
  • Safer: no Zod fallback, no compliance metric, no degraded mode
  • Equivalent or superior on measured quality

Following these results we promoted the discipline to the default on main (commit 89945d4). The structured-compression effort (Epic 51, parked on the compression-rollout branch) remains available but is no longer the priority path: the simple solution beats the complex one on every measured criterion.

8. Future work

  1. N=3-per-condition replication with median comparison to confirm +9.1 F1 is not within the noise.
  2. Cross-language F1 benchmarks: produce Python, Go, and Rust catalogs to test generalisation.
  3. Cross-LLM replication: run GPT-4/5 and Gemini Pro with the same discipline, measure whether the effect transfers.
  4. Long-run drift study: 100+ file audits to measure whether the model's adherence to the discipline degrades over successive calls.
  5. Per-directive ablation: remove one line at a time (preamble, hedging, filler, etc.) and measure individual contribution. Identify which directive(s) carry most of the gain.
  6. Compliance measurement: instrument outputs to compute a quantitative adherence indicator (% of phrases without hedging, average note length, etc.) and correlate it with per-run F1.
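
One possible shape for the adherence indicator in item 6 is sketched below. The hedging phrases are the ones quoted in the discipline text; the metric names, the word-count proxy for note length, and the sample notes are assumptions for illustration.

```python
import re
from statistics import mean

# Hedging phrases quoted in the discipline text itself.
HEDGES = ("appears to", "seems to be", "might possibly", "perhaps could")
HEDGE_RE = re.compile("|".join(re.escape(h) for h in HEDGES), re.IGNORECASE)


def adherence(notes: list[str]) -> dict[str, float]:
    """Share of notes free of listed hedges, and mean note length in words."""
    if not notes:
        return {"hedge_free_share": 1.0, "mean_words": 0.0}
    clean = sum(1 for n in notes if not HEDGE_RE.search(n))
    return {
        "hedge_free_share": clean / len(notes),
        "mean_words": mean(len(n.split()) for n in notes),
    }


# Invented notes for illustration; the symbol names are hypothetical.
notes = [
    "X imports Y from Z",
    "It looks like X seems to be importing Y from Z",
    "parseReel is never called",
    "This helper appears to duplicate spin()",
]
result = adherence(notes)
print(result)
```

Computed per run, these two numbers could then be correlated with per-run F1 once replicated runs are available.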

Appendix A: Anatoly commits from the session

| Commit | Short message |
|---|---|
| 89945d4 | feat(prompts): add output concision discipline to all axes + non-RAG services |
| 60e115b | fix(estimator): drop tasks pointing to deleted/missing source files |
| bc75110 | refactor(scan): remove auto_detect, make include/exclude strictly authoritative |
| 81e2ee4 | refactor(schema): drop TypeScript-specific defaults from scan config |
| 2a9f3d0 | refactor(cli): remove anatoly scan, fold new/modified/cached into estimate |
| e4a9374 | refactor(language-detect): delegate language detection to linguist-js |
| 114820f | fix(estimate): exclude cached files from token + cost forecast |

Appendix B: tools

  • anatoly (this repo, branch main HEAD 114820f post-session)
  • anatoly-bench (sibling repo, dist/cli.js score)
  • linguist-js v2.9.2 (delegated language detection post-refactor)
  • Anthropic provider via @anthropic-ai/claude-agent-sdk in subscription mode (Claude Code integrated)

Tags: prompts, llm, code-audit, anatoly-bench, tokens