Research Note

Should we swap Anatoly's RAG for PageIndex? An open question with a measurable answer

Framing a tooling question instead of decreeing it. The two systems are shaped differently: a function-level vector lookup on one side, an LLM-driven TOC walk on the other. Whether one should replace the other depends on workload economics and test conditions we have not yet measured. This note states the question honestly, lays out our prior, and describes the bounded experiment that would settle it.

May 14, 2026 · 7 min read · Rémi Viau

Note by: Rémi Viau (Anatoly maintainer), with Claude (Anthropic Opus 4.7) as analytical partner. Repositories referenced: anatoly, anatoly-bench, PageIndex.

The question

PageIndex from VectifyAI is a "vectorless" retrieval system: a document is parsed once into a hierarchical table-of-contents tree, and at query time an LLM reasons through that tree to pick the relevant section. Their headline number is 98.7% on FinanceBench, a QA benchmark over financial filings.

The question this note tries to answer: should we swap Anatoly's current function-level vector RAG for PageIndex, in part or in whole, to improve duplicate detection and reduce dependencies?

We do not have measurements yet. This is a note about how we are framing the question, what our current prior is, and what experiment would actually settle it.

What we know about the two shapes

The two systems are structurally different, in a way that matters more than any single cost figure.

Dimension	Anatoly's RAG (current)	PageIndex
Indexed unit	Function (AST extract)	Document section (TOC node)
Query primitive	"k nearest neighbours of this vector"	"LLM, pick the relevant section"
LLM in the loop	At index time (Haiku, for NLP summaries that feed 40% of the hybrid score)	At query time (any LLM, GPT-4o by default but Haiku also workable)
Storage	LanceDB local (768-d code, 384-d NLP)	JSON tree, no vectors
Stack	TypeScript / Node	Python

The granularity row is the one that matters most. Anatoly's duplication axis runs a "find me functions semantically similar to this one" query, once per function, hundreds to thousands of times per audit. PageIndex does not natively expose that primitive: its idiom is "given a question, walk a TOC and quote a section." Reformulating per-function duplicate detection as TOC walks is not just expensive; it is a category change.
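To make the per-function primitive concrete, here is a minimal brute-force sketch of a "functions similar to this one" query. It is an illustration under stated assumptions, not Anatoly's implementation: the `FunctionCard` fields, `hybridScore`, and `nearestNeighbours` are hypothetical names, the weighted sum of two cosine similarities is an assumed form of the 60/40 code/NLP hybrid score, and a real deployment would delegate the search to the vector store instead of scanning in memory.

```typescript
// Hypothetical record shape; field names are illustrative, not Anatoly's schema.
interface FunctionCard {
  id: string;
  codeVec: number[]; // code embedding (768-d in the note; any length works here)
  nlpVec: number[];  // embedding of the LLM-written summary (384-d in the note)
}

interface Neighbour {
  id: string;
  score: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Assumed form of the hybrid score: a weighted sum of the two cosine similarities.
function hybridScore(query: FunctionCard, candidate: FunctionCard, codeWeight = 0.6): number {
  return (
    codeWeight * cosine(query.codeVec, candidate.codeVec) +
    (1 - codeWeight) * cosine(query.nlpVec, candidate.nlpVec)
  );
}

// Brute-force k nearest neighbours of one function, excluding the function itself.
function nearestNeighbours(
  queryId: string,
  cards: FunctionCard[],
  k: number,
  floor = 0,
): Neighbour[] {
  const query = cards.find((c) => c.id === queryId);
  if (!query) throw new Error(`unknown function id: ${queryId}`);
  return cards
    .filter((c) => c.id !== queryId)
    .map((c) => ({ id: c.id, score: hybridScore(query, c) }))
    .filter((n) => n.score >= floor)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The point of the sketch is the shape of the call: one cheap lookup per function, with no LLM involved at query time. A TOC walk has no equivalent of `nearestNeighbours(id, ...)`; each such lookup would become a prompted traversal.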

That is the load-bearing reason this swap is not a drop-in. Cost and stack arguments come second.

The cost picture, honestly

A previous draft of this note claimed a 100× to 1000× cost gap. That figure was inflated: it assumed GPT-4o on the PageIndex side and "free" retrieval on the Anatoly side. Neither assumption is accurate.

Anatoly already calls an LLM in its RAG pipeline. The 384-d NLP embedding is generated from a Haiku-written summary of each function. Those summaries are not optional: they account for 40% of the hybrid similarity score in searchByIdHybrid(). Anatoly's "free retrieval" is really "LLM cost paid at index time, then cached aggressively."
PageIndex does not require GPT-4o. Running the same retrieval loop with Haiku would cut the per-query cost by roughly an order of magnitude versus the figure quoted in vendor demos.

The honest comparison is between two cost profiles:

Anatoly: N Haiku calls at index time, content-addressable cache, near-zero per-query cost.
PageIndex: ~0 at index time, M LLM calls per audit at query time.

Which one wins depends on the workload. Stable codebases re-audited often (high cache hit rate, many queries per indexed function) favour Anatoly by a wide margin. Codebases that move fast (low cache hit rate, few repeated queries per function) narrow the gap. We have not measured the crossover point. The claim "100× to 1000×" was rhetoric, not data; our current estimate, itself unmeasured, is that the realistic factor on a typical audit is likely 2× to 5×, still in Anatoly's favour for the workloads we ship today.
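The two cost profiles can be written down as a small model, which makes the dependence on cache hit rate and audit count explicit. This is a sketch under simplifying assumptions, not a measurement: every field of `Workload` is a placeholder to be filled in with observed values, the cache hit rate is treated as the fraction of functions that need no re-summarising between audits, and per-query cost on the index-time side is rounded to zero.

```typescript
// All inputs are hypothetical placeholders; none of them come from measurements.
interface Workload {
  indexedFunctions: number;    // N: functions summarised at index time
  cacheHitRate: number;        // fraction in [0, 1] of functions unchanged between audits
  costPerIndexCallUsd: number; // cost of one index-time summary call
  queriesPerAudit: number;     // M: query-time LLM calls per audit
  costPerQueryCallUsd: number; // cost of one query-time LLM call
  audits: number;              // number of audits over the period considered (>= 1)
}

interface CostComparison {
  indexTimeProfileUsd: number; // "pay at index time, then cache" profile
  queryTimeProfileUsd: number; // "pay at query time" profile
  ratio: number;               // query-time profile divided by index-time profile
}

function compareCosts(w: Workload): CostComparison {
  if (w.audits < 1) throw new Error("audits must be at least 1");
  // First audit indexes everything; later audits only pay for cache misses.
  const firstIndex = w.indexedFunctions * w.costPerIndexCallUsd;
  const reindexPerAudit = w.indexedFunctions * (1 - w.cacheHitRate) * w.costPerIndexCallUsd;
  const indexTimeProfileUsd = firstIndex + (w.audits - 1) * reindexPerAudit;
  // The query-time profile pays on every audit, with nothing amortised.
  const queryTimeProfileUsd = w.audits * w.queriesPerAudit * w.costPerQueryCallUsd;
  const ratio = indexTimeProfileUsd === 0 ? Infinity : queryTimeProfileUsd / indexTimeProfileUsd;
  return { indexTimeProfileUsd, queryTimeProfileUsd, ratio };
}
```

Under this model the ratio is 1 on a single audit when N equals M and the per-call costs match, and it grows with the number of audits as long as the cache hit rate stays high. A lower hit rate flattens that growth, which is the crossover the note says has not been measured.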

What we don't know

Three things this note cannot resolve without measurements.

Hard duplicate cases. Vector cosine works well on easy duplicates (similar lexical surface, similar embeddings). On hard cases (two functions that compute the same thing via very different code paths and naming), LLM reasoning over both bodies could plausibly beat cosine similarity. That is precisely the "similarity is not relevance" thesis of PageIndex. We have a prior that this matters less in code than in prose, but we have not tested it.
F1 ceiling. Anatoly's RAG moved from 40.7% to 65.5% F1 on the slot-engine fixture between v2 and v17 of anatoly-bench. The trajectory is positive, but 65.5% is still modest in absolute terms. We do not know how much further the current architecture can go. Some of the remaining gap may be addressable inside the current stack (rerankers, better embeddings); some may not.
Cross-stack integration cost. Python from Node is a known pattern (subprocess or local HTTP). It is not free, but it is also not a structural blocker. Until we try it we are guessing.

Our prior, conditional

Given the granularity argument and the workload profile we ship (stable codebases, repeated audits, cache hit rates around 95%), we lean against replacing the function-level vector store with PageIndex. That prior is moderately strong on the duplication axis, where granularity is the dominant constraint. It is weaker on long-form documentation retrieval, where TOC-tree navigation may actually fit better than the current heading-chunk approach in doc-indexer.ts.

We would update toward PageIndex if any of the following turned out to be true.

On a curated set of "hard" duplicates, an LLM with both function bodies in context produces a measurably better F1 than the current hybrid retriever.
The cache hit rate on a representative project drops well below the ~95% we assume, eroding the amortisation advantage.
The integration cost of a Python sidecar turns out to be lower than the marginal gain on long-form doc retrieval.

The experiment that would settle it

The most useful next step is not more analysis; it is a bounded experiment.

Pick 20 catalogued duplicate pairs from anatoly-bench covering easy, medium, and hard cases.
Run three retrievers on the same fixture: the current hybrid retriever, the current retriever plus a local reranker (see "What we'd do anyway" below), and a PageIndex-style LLM walk with Haiku.
Score F1 and total cost end-to-end. The cost number must include indexing-time spend on the Anatoly side and per-query spend on the PageIndex side.
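The scoring step above can be sketched as follows. This is a minimal illustration, assuming each retriever's output has already been reduced to a list of predicted duplicate pairs; `RetrieverRun`, `scoreRun`, and the cost fields are hypothetical names, not part of anatoly-bench.

```typescript
type Pair = [string, string];

// Order-independent key so that (a, b) and (b, a) count as the same pair.
const pairKey = ([a, b]: Pair): string => (a < b ? `${a}|${b}` : `${b}|${a}`);

// Hypothetical shape for one retriever's output on the fixture.
interface RetrieverRun {
  name: string;
  predicted: Pair[];    // pairs the retriever flagged as duplicates
  indexCostUsd: number; // indexing-time spend (for example, summary calls)
  queryCostUsd: number; // per-query spend summed over the whole run
}

interface Score {
  name: string;
  precision: number;
  recall: number;
  f1: number;
  totalCostUsd: number;
}

function scoreRun(gold: Pair[], run: RetrieverRun): Score {
  const goldKeys = new Set(gold.map(pairKey));
  // Deduplicate predictions so a pair reported twice is not counted twice.
  const predKeys = Array.from(new Set(run.predicted.map(pairKey)));
  const truePositives = predKeys.filter((k) => goldKeys.has(k)).length;
  const precision = predKeys.length === 0 ? 0 : truePositives / predKeys.length;
  const recall = goldKeys.size === 0 ? 0 : truePositives / goldKeys.size;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return {
    name: run.name,
    precision,
    recall,
    f1,
    totalCostUsd: run.indexCostUsd + run.queryCostUsd,
  };
}
```

Keeping both cost fields on the same record is deliberate: it forces every retriever to report indexing-time and query-time spend side by side, so neither profile can be scored as "free".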

The experiment fits in a weekend and would either falsify the conditional prior above or move it toward a real recommendation. Without that, "do not replace" is rhetoric.

What we'd do anyway

Three things are worth doing regardless of how the experiment turns out, because they extend the F1 trajectory of the current retriever and they are cheap. We treated them as alternative paths in the first draft of this note; on reflection they are baseline work, not alternatives.

Local lightweight reranker. Add a stage after the top-k hybrid retrieval that re-scores neighbours with a small ONNX reranker such as bge-reranker-v2-m3. Rerankers reliably improve F1 on noisy top-k lists at no LLM cost. This belongs in the comparison above as a fair baseline.
Tune thresholds and hybrid weights. The current 60/40 code/NLP split in searchByIdHybrid was set by intuition. Cross-validating weight, top-k, and the similarity floor on anatoly-bench is free and could move F1 by a measurable amount. The same empirical instinct that produced the concision-discipline result applies here.
Evaluate newer code embeddings. Compare the current jina-embeddings-v2-base-code against more recent options such as Qodo-Embed-1, one fixture run per model.
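As an illustration of the threshold and weight tuning item, the sketch below sweeps the code/NLP weight and the similarity floor over a set of labelled pairs and keeps the combination with the best F1. It is a plain grid search, not a full cross-validation: a real run would hold out folds to avoid fitting the fixture. `LabelledPair`, `f1At`, and `gridSearch` are hypothetical names, and the per-pair similarities are assumed to be precomputed.

```typescript
// Hypothetical labelled example: precomputed similarities plus a ground-truth label.
interface LabelledPair {
  codeSim: number; // cosine similarity of the code embeddings
  nlpSim: number;  // cosine similarity of the summary embeddings
  isDuplicate: boolean;
}

interface BestSetting {
  codeWeight: number;
  floor: number;
  f1: number;
}

// F1 of the rule "flag as duplicate when the weighted score reaches the floor".
function f1At(pairs: LabelledPair[], codeWeight: number, floor: number): number {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const p of pairs) {
    const score = codeWeight * p.codeSim + (1 - codeWeight) * p.nlpSim;
    const predicted = score >= floor;
    if (predicted && p.isDuplicate) tp++;
    else if (predicted && !p.isDuplicate) fp++;
    else if (!predicted && p.isDuplicate) fn++;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Exhaustive sweep; ties keep the first combination encountered.
function gridSearch(pairs: LabelledPair[], weights: number[], floors: number[]): BestSetting {
  if (weights.length === 0 || floors.length === 0) {
    throw new Error("weights and floors must be non-empty");
  }
  let best: BestSetting = { codeWeight: weights[0], floor: floors[0], f1: -1 };
  for (const codeWeight of weights) {
    for (const floor of floors) {
      const f1 = f1At(pairs, codeWeight, floor);
      if (f1 > best.f1) best = { codeWeight, floor, f1 };
    }
  }
  return best;
}
```

Because the similarities are precomputed, the sweep needs no embedding or LLM calls, which is why this kind of tuning costs nothing beyond a fixture run.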

Key files for reference

For a reader who wants to read the code this note discusses:

src/rag/vector-store.ts: LanceDB wrapper, heart of the retrieval.
src/rag/embeddings.ts: ONNX / SDK embedding orchestration.
src/rag/indexer.ts: AST extraction into FunctionCard records.
src/rag/doc-indexer.ts: Markdown chunking; the most plausible site for a PageIndex experiment.
src/core/file-evaluator.ts: the preResolveRag() entry point in the audit pipeline.
src/core/axes/duplication.ts: primary RAG consumer.

The broader market context for code-audit and code-review tooling is in AI Code Audit vs AI Code Review in 2026.