Anatoly evaluates every file through seven independent axis evaluators, each focused on a single dimension of code quality. The axes run in parallel for each file, and their results are merged into a unified ReviewFile.
Axes at a Glance
| Axis | ID | Recommended Model | Verdicts | Purpose |
|---|---|---|---|---|
| Utility | `utility` | Haiku | `USED`, `DEAD`, `LOW_VALUE` | Detect dead or low-value code |
| Duplication | `duplication` | Haiku | `UNIQUE`, `DUPLICATE` | Find semantically duplicated functions |
| Correction | `correction` | Sonnet | `OK`, `NEEDS_FIX`, `ERROR` | Identify bugs and logic errors |
| Overengineering | `overengineering` | Sonnet | `LEAN`, `OVER`, `ACCEPTABLE` | Flag excessive complexity |
| Tests | `tests` | Sonnet | `GOOD`, `WEAK`, `NONE` | Assess test coverage quality |
| Best Practices | `best_practices` | Sonnet | Score 0-10, 17 rules | Evaluate adherence to language-specific best practices |
| Documentation | `documentation` | Sonnet | `DOCUMENTED`, `PARTIAL`, `UNDOCUMENTED` | Detect JSDoc gaps and /docs/ desynchronization |
Per-Axis Details
Utility
Determines whether each exported symbol is actually consumed by other files in the project.
- Input: file source, symbol list, pre-computed usage graph data (importers per symbol, type-only vs runtime)
- Output: per-symbol verdict (`USED`/`DEAD`/`LOW_VALUE`), confidence 0-100, detail string
- Key feature: the usage graph provides ground-truth import data directly in the prompt, so the LLM validates rather than guesses
Duplication
Detects semantically similar functions across the codebase using RAG-powered similarity search.
- Input: file source, symbol list, pre-resolved RAG similarity results (top candidates with scores, signatures, source snippets up to 50 lines)
- Output: per-symbol verdict (`UNIQUE`/`DUPLICATE`), confidence 0-100, detail string, optional `duplicate_target` (file, symbol, similarity description)
- Key feature: candidate source code is read from disk and included in the prompt for code-to-code comparison. When RAG is unavailable, all symbols default to `UNIQUE` with 90% confidence.
Correction
Identifies bugs, logic errors, and correctness issues. This is the only axis that uses a two-pass approach.
- Input: file source, symbol list, project dependency metadata (package names + versions), known false positives from correction memory
- Output: per-symbol verdict (`OK`/`NEEDS_FIX`/`ERROR`), confidence 0-100, detail string, plus `actions` (with severity: CRITICAL/MAJOR/MINOR)
- Two-pass verification: when Pass 1 flags any symbol as `NEEDS_FIX` or `ERROR` and the file imports external dependencies, Pass 2 re-evaluates those findings against the actual README documentation of the implicated libraries. Findings contradicted by library docs are reclassified to `OK`. False positives are recorded in correction memory (`.anatoly/correction-memory.json`) for future runs.
- Deliberation feedback: when the Opus deliberation pass reclassifies a correction finding (`NEEDS_FIX`/`ERROR` → `OK`), the false positive is also recorded in correction memory with the deliberation reasoning, preventing recurrence in future runs.
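The two-pass flow above can be sketched roughly as follows. This is an illustration only, not Anatoly's actual code: `CorrectionFinding`, `FalsePositive`, `needsSecondPass`, and `applySecondPass` are hypothetical names, and the `contradicted` map stands in for whatever the Pass 2 model response contains. Persisting the false positives to `.anatoly/correction-memory.json` is left out because its format is not described here.

```typescript
type CorrectionVerdict = "OK" | "NEEDS_FIX" | "ERROR";

// Hypothetical shape of a per-symbol correction result.
interface CorrectionFinding {
  symbol: string;
  verdict: CorrectionVerdict;
  confidence: number; // integer 0-100
  detail: string;
}

// Hypothetical record kept for future runs.
interface FalsePositive {
  symbol: string;
  originalVerdict: CorrectionVerdict;
  reason: string;
}

// Pass 2 only runs when Pass 1 flagged something AND the file
// imports at least one external dependency.
function needsSecondPass(findings: CorrectionFinding[], externalDeps: string[]): boolean {
  const flagged = findings.some((f) => f.verdict === "NEEDS_FIX" || f.verdict === "ERROR");
  return flagged && externalDeps.length > 0;
}

// `contradicted` maps a symbol name to the explanation drawn from library docs.
// Flagged findings present in the map are reclassified to OK and collected
// as false positives; everything else is returned unchanged.
function applySecondPass(findings: CorrectionFinding[], contradicted: Map<string, string>) {
  const falsePositives: FalsePositive[] = [];
  const revised = findings.map((f): CorrectionFinding => {
    const reason = contradicted.get(f.symbol);
    if (f.verdict === "OK" || reason === undefined) return f;
    falsePositives.push({ symbol: f.symbol, originalVerdict: f.verdict, reason });
    return { ...f, verdict: "OK", detail: `Reclassified: ${reason}` };
  });
  return { revised, falsePositives };
}
```

Keeping the reclassification a pure function makes it easy to feed the same false-positive records into both the Pass 2 path and the deliberation feedback path.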
Overengineering
Flags symbols that exhibit excessive structural complexity relative to their purpose.
- Input: file source, symbol list, project tree structure (directory layout)
- Output: per-symbol verdict (`LEAN`/`OVER`/`ACCEPTABLE`), confidence 0-100, detail string
- Key feature: the project tree enables detection of structural fragmentation (single-file directories, excessive nesting, factory/adapter directories with few files)
Tests
Assesses the quality and coverage of tests for each symbol.
- Input: file source, symbol list, coverage data (if available: statements, branches, functions, lines with covered/total counts)
- Output: per-symbol verdict (`GOOD`/`WEAK`/`NONE`), confidence 0-100, detail string
Best Practices
A file-level (not symbol-level) evaluation against 17 TypeScript best-practice rules. Each rule is scored PASS/WARN/FAIL with a severity tier (CRITICAL/HIGH/MEDIUM).
- Input: file source, file context (auto-detected: react-component, api-handler, utility, test, config, general), file stats (line count, symbol count), dependency metadata, project tree
- Output: overall score 0-10, per-rule status array, code suggestions (with optional before/after snippets)
- Key feature: file context detection adjusts which rules are most relevant (e.g., React-specific patterns are only flagged for `.tsx` files)
Documentation
Detects JSDoc documentation gaps on exported symbols and evaluates concept coverage against /docs/ pages.
- Input: file source, symbol list, docs directory tree (ASCII, built once per run), relevant documentation pages (resolved via config mapping or filename convention, max 3 pages x 300 lines)
- Output: per-symbol verdict (`DOCUMENTED`/`PARTIAL`/`UNDOCUMENTED`), confidence 0-100, detail string, optional `docs_coverage` with per-concept status (COVERED/PARTIAL/MISSING/OUTDATED)
- Key feature: two-level evaluation. (1) Inline JSDoc per symbol: checks for description, params, return type. (2) `/docs/` concept coverage: matches the source module to documentation pages via the `documentation.module_mapping` config or directory name convention. Gracefully degrades when no `/docs/` directory exists (evaluates JSDoc only).
Note: The documentation axis (Sonnet) evaluates existing documentation quality. It is distinct from the doc-generation pipeline (`anatoly docs`, Sonnet), which writes `.anatoly/docs/` pages from source code context. The axis reads docs; the pipeline creates them.
Scoring Model
Each symbol-level axis produces three key fields:
- Verdict -- the axis-specific enum value (e.g., `USED`, `NEEDS_FIX`, `LEAN`)
- Confidence -- integer 0-100 representing the evaluator's certainty
- Detail -- human-readable explanation (minimum 10 characters, enforced by Zod)
Confidence drives verdict computation:
- Symbols with confidence below 60 are excluded from the global verdict calculation
- Symbols with confidence below 30 are considered too unreliable and discarded
- The merged symbol confidence is the minimum across all contributing axes
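As a rough illustration of these thresholds, the sketch below classifies a confidence value and merges per-axis confidences by taking the minimum. The constant and function names (`VERDICT_THRESHOLD`, `DISCARD_THRESHOLD`, `classifyConfidence`, `mergedConfidence`) are hypothetical and chosen for this example; only the numbers 60 and 30 and the minimum rule come from the text above.

```typescript
const VERDICT_THRESHOLD = 60; // below this, a symbol is excluded from the global verdict
const DISCARD_THRESHOLD = 30; // below this, a symbol is discarded as unreliable

type ConfidenceTier = "counts" | "excluded" | "discarded";

// Map a confidence score (0-100) to how the symbol is treated.
function classifyConfidence(confidence: number): ConfidenceTier {
  if (confidence < DISCARD_THRESHOLD) return "discarded";
  if (confidence < VERDICT_THRESHOLD) return "excluded";
  return "counts";
}

// The merged confidence is the minimum across all contributing axes.
// Assumption for this sketch: at least one axis contributed a value.
function mergedConfidence(perAxis: number[]): number {
  if (perAxis.length === 0) {
    throw new RangeError("at least one axis confidence is required");
  }
  return Math.min(...perAxis);
}
```

Taking the minimum means one uncertain axis is enough to pull a symbol below the verdict threshold, which errs on the side of not raising a file-level verdict from weak evidence.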
Axis Merger Logic
After all axes complete, the `mergeAxisResults` function in `axis-merger.ts` combines them:
- Build axis map: index each axis's results by symbol name for O(1) lookup
- Merge per symbol: for each symbol in the task, look up its result from each axis. Missing results (the axis did not return data for that symbol) fall back to safe defaults:
  - utility: `USED`, duplication: `UNIQUE`, correction: `OK`, overengineering: `LEAN`, tests: `NONE`, documentation: `DOCUMENTED`
- Apply coherence rules:
  - If utility = `DEAD`, force tests = `NONE` (no point testing dead code)
  - If utility = `DEAD`, force documentation = `UNDOCUMENTED` (no point documenting dead code)
  - If correction = `ERROR`, force overengineering = `ACCEPTABLE` (complexity is secondary to correctness)
- Detect contradictions: cross-reference correction findings against best_practices results. If best_practices Rule 12 (Async/Promises/Error handling) passes but correction flags an async-related `NEEDS_FIX`, the correction confidence is capped at 55 (below the 60 verdict threshold).
- Merge actions: collect actions from all axes, tag each with its source axis, and re-assign sequential IDs.
- Compute verdict:
  - Any symbol with correction = `ERROR` (above confidence threshold) produces `CRITICAL`
  - Any symbol with `NEEDS_FIX`, `DEAD`, `DUPLICATE`, `OVER`, or `UNDOCUMENTED` (exported) produces `NEEDS_REFACTOR`
  - 3 or more symbols with `PARTIAL` documentation produce `NEEDS_REFACTOR`
  - Otherwise: `CLEAN`
- Build axis metadata: record model, cost, and duration per axis in the `axis_meta` field
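The defaults, coherence rules, and verdict computation above can be sketched as below. This is not the actual `mergeAxisResults` implementation: the type and function names are hypothetical, and two readings are assumed for the sketch: that the 60 confidence threshold applies to every rule, and that `UNDOCUMENTED` only counts for exported symbols. Contradiction detection, action merging, and axis metadata are omitted.

```typescript
// Illustrative types; field names are assumptions, not Anatoly's schema.
interface AxisVerdicts {
  utility: "USED" | "DEAD" | "LOW_VALUE";
  duplication: "UNIQUE" | "DUPLICATE";
  correction: "OK" | "NEEDS_FIX" | "ERROR";
  overengineering: "LEAN" | "OVER" | "ACCEPTABLE";
  tests: "GOOD" | "WEAK" | "NONE";
  documentation: "DOCUMENTED" | "PARTIAL" | "UNDOCUMENTED";
}

interface MergedSymbol extends AxisVerdicts {
  name: string;
  exported: boolean;
  confidence: number; // 0-100, minimum across contributing axes
}

type FileVerdict = "CLEAN" | "NEEDS_REFACTOR" | "CRITICAL";

// Safe defaults used when an axis returned nothing for a symbol.
const SAFE_DEFAULTS: AxisVerdicts = {
  utility: "USED",
  duplication: "UNIQUE",
  correction: "OK",
  overengineering: "LEAN",
  tests: "NONE",
  documentation: "DOCUMENTED",
};

// Keys absent from `partial` keep their default value.
function withDefaults(partial: Partial<AxisVerdicts>): AxisVerdicts {
  return { ...SAFE_DEFAULTS, ...partial };
}

// Coherence rules: returns a corrected copy, leaving the input untouched.
function applyCoherence(symbol: MergedSymbol): MergedSymbol {
  const out = { ...symbol };
  if (out.utility === "DEAD") {
    out.tests = "NONE";
    out.documentation = "UNDOCUMENTED";
  }
  if (out.correction === "ERROR") out.overengineering = "ACCEPTABLE";
  return out;
}

// File-level verdict; symbols below the threshold are ignored (assumed to
// apply to every rule, not only to ERROR).
function computeVerdict(symbols: MergedSymbol[], threshold = 60): FileVerdict {
  const counted = symbols.filter((s) => s.confidence >= threshold);
  if (counted.some((s) => s.correction === "ERROR")) return "CRITICAL";
  const needsRefactor = counted.some(
    (s) =>
      s.correction === "NEEDS_FIX" ||
      s.utility === "DEAD" ||
      s.duplication === "DUPLICATE" ||
      s.overengineering === "OVER" ||
      (s.exported && s.documentation === "UNDOCUMENTED"),
  );
  if (needsRefactor) return "NEEDS_REFACTOR";
  const partialDocs = counted.filter((s) => s.documentation === "PARTIAL").length;
  return partialDocs >= 3 ? "NEEDS_REFACTOR" : "CLEAN";
}
```

Ordering matters in `computeVerdict`: the `CRITICAL` check runs first so that a single high-confidence `ERROR` is never masked by the refactor rules.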
Crash Isolation
Each axis evaluator runs inside `Promise.allSettled`, so a failure in one axis does not prevent the others from completing. When an axis crashes:
- The failure is logged with full error details
- The failed axis ID is recorded in the `failedAxes` array
- For symbols evaluated by the crashed axis, the merger applies safe defaults
- The symbol detail includes a sentinel: `*(axis crashed -- see transcript)*`
- The file's review is marked as "degraded" in run metrics (`degradedReviews` counter)
- The transcript records the failure alongside the successful axes' transcripts
This design ensures that a transient API error or model timeout on one axis (e.g., best_practices) does not block the remaining six axes from producing useful results.
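A minimal sketch of this isolation pattern follows. `Promise.allSettled` and the `failedAxes` name come from the text above; `AxisOutcome`, `collectSettled`, `runAxes`, and the `errors` map are hypothetical names used for illustration, and the sketch stores failure reasons where a real run would log them and write the transcript.

```typescript
type AxisId = string;

interface AxisOutcome<T> {
  results: Map<AxisId, T>;       // successful axes only
  failedAxes: AxisId[];          // IDs of axes that crashed
  errors: Map<AxisId, unknown>;  // failure reasons, kept for logging
}

// Split settled results into successes and failures.
// `settled[i]` is assumed to correspond to `axisIds[i]`.
function collectSettled<T>(axisIds: AxisId[], settled: PromiseSettledResult<T>[]): AxisOutcome<T> {
  const results = new Map<AxisId, T>();
  const failedAxes: AxisId[] = [];
  const errors = new Map<AxisId, unknown>();
  settled.forEach((outcome, i) => {
    const id = axisIds[i];
    if (outcome.status === "fulfilled") {
      results.set(id, outcome.value);
    } else {
      failedAxes.push(id);
      errors.set(id, outcome.reason);
    }
  });
  return { results, failedAxes, errors };
}

// Run every evaluator concurrently; one rejection never cancels the rest.
// The async wrapper also turns a synchronous throw into a rejection.
async function runAxes<T>(evaluators: Record<AxisId, () => Promise<T>>) {
  const axisIds = Object.keys(evaluators);
  const settled = await Promise.allSettled(axisIds.map(async (id) => evaluators[id]()));
  return collectSettled(axisIds, settled);
}
```

A caller would then apply safe defaults for every symbol belonging to an ID in `failedAxes` and mark the review as degraded when that array is non-empty.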
Model Configuration
Each axis has a `defaultModel` property (`'haiku'` or `'sonnet'`). The effective model is resolved as:
- Per-axis config override: `llm.axes.<axis>.model` (highest priority)
- Fast model pool: if `defaultModel` is `'haiku'`, use `llm.fast_model` (falls back to `llm.index_model`)
- Standard model: if `defaultModel` is `'sonnet'`, use `llm.model`
This allows cost optimization by routing lightweight axes (utility and duplication by default) to cheaper models while reserving more capable models for the remaining axes, such as correction and best_practices. A per-axis override can move any other axis to a cheaper model.
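The resolution order can be sketched as a small function. The config keys (`llm.axes.<axis>.model`, `llm.fast_model`, `llm.index_model`, `llm.model`) are taken from the list above; the `LlmConfig` interface, the `resolveAxisModel` name, and the assumption that `fast_model` and `axes` are optional are illustrative guesses rather than Anatoly's actual types.

```typescript
type DefaultModel = "haiku" | "sonnet";

// Hypothetical view of the `llm` config section.
interface LlmConfig {
  model: string;                               // llm.model
  index_model: string;                         // llm.index_model
  fast_model?: string;                         // llm.fast_model
  axes?: Record<string, { model?: string }>;   // llm.axes.<axis>.model
}

function resolveAxisModel(axisId: string, defaultModel: DefaultModel, llm: LlmConfig): string {
  // 1. A per-axis override always wins.
  const override = llm.axes?.[axisId]?.model;
  if (override) return override;
  // 2. Haiku-class axes use the fast model, falling back to the index model.
  if (defaultModel === "haiku") return llm.fast_model ?? llm.index_model;
  // 3. Sonnet-class axes use the standard model.
  return llm.model;
}
```

For example, with a config that sets only `llm.axes.tests.model`, the tests axis resolves to that override while every other axis follows its `defaultModel`.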