Routing the Claude Agent SDK to Local LLMs: A Dual-Tier Qwen Stack with TurboQuant 4-bit KV
An Anatoly exploration: running a multi-step pipeline on local llama.cpp instead of Anthropic. Single Qwen3.6-35B-A3B GGUF in two thinking modes (haiku no-think, sonnet thinking), four SDK bugs including a thinking-disable trick worth a 12-fold speedup, TurboQuant 4-bit KV cache on a 24 GB RTX 3090 Ti, 100-call benchmark with Opus-as-judge. Local is 4 to 9 times faster than Anthropic on the high-volume steps and at the Opus ceiling on verify-rag.
Note by: Rémi Viau (Anatoly maintainer), with Claude (Anthropic Opus 4.7) as analytical partner.
The Claude Agent SDK is a convenient front-end. The model behind it does not have to be a Claude model. This article is a field report from an Anatoly exploration: pointing the SDK at local llama.cpp servers running Qwen3 models on a single RTX 3090 Ti with 24 GB of VRAM. The trigger was the cost of a long-running multi-step document fact-check pipeline that was burning about five dollars per run on Anthropic Sonnet, most of it on the high-volume verification steps that hammer the SDK with thousands of small calls.
We started with a dual-model setup (a 4 B dense Qwen3 for the haiku tier, a Qwen3.6-35B-A3B for the sonnet tier) and converged on a single-model architecture after the mini-bench in §7.8: one Qwen3.6-35B-A3B GGUF runs in two containers, distinguished only by a thinking flag. The haiku container (thinking OFF) absorbs the thousand-plus verify calls per run; the sonnet container (thinking ON) handles the few high-stakes correction passes. Same model file, ~10 second container restart to switch tiers (the GGUF stays in the OS page cache). The opus role stays on Anthropic, and the production "best-of-both" wiring routes the correct-phase rewrite back to Anthropic Opus where the local 35B falls 2 to 3 Opus-judge points short.
The interesting parts are not the architecture diagram. They are the four SDK integration bugs that nobody warns you about, the one-line container flag that gave us a 12-fold speedup once we found it, and the discovery that landed the final architecture: a 35B-A3B MoE model in no-think mode beats a dedicated 4B model on the haiku-tier workloads, at the same latency, because MoE only activates ~3 B parameters per token. The benchmark below (100 calls, 5 providers, 4 workloads, N=5, Opus-as-judge) shows local-turbo with --parallel 2 running 4 to 9 times faster than Anthropic on the high-volume steps, at parity on verify, beating Anthropic on importance, and a few points below on extract and rewrite. Cost per run drops to zero on pure local, $4 on the recommended hybrid setup.
Quick verdict: is the local stack good enough?#
Yes on the high-volume Haiku-tier workloads. The Sonnet-tier rewrite still needs Anthropic Opus. One Qwen3.6-35B-A3B GGUF runs in two containers; the only difference is the thinking flag. Here is the Opus-judge verdict, broken out by which Anthropic model each local container replaces:
| Local container | Model & mode | Replaces | Workloads (calls per run) | Opus-judge verdict |
|---|---|---|---|---|
| `llm-haiku` | Qwen3.6-35B-A3B, thinking OFF | Anthropic Haiku | verify-rag (~1 300), extract (~88), importance (~300) | verify: parity at 9/10. Importance: 9/10, one point above Anthropic Haiku itself. Extract: 4 to 5/10 vs a 6/10 ceiling (the 4 B benched at 4 to 5/10; not retested on the 35 B, expected to match or exceed since the 35 B beats the 4 B on every workload tested in §7.8). Anthropic Haiku itself does not score higher than 6/10 on this rubric. |
| `llm-sonnet` | Qwen3.6-35B-A3B, thinking ON | Anthropic Sonnet | correct-section-rewrite (~8) | 6/10 vs the 9/10 ceiling. Applies every requested fix but misses the [#filename] citation tags that Opus produces by default. The gap does not close with thinking, prompt tweaks, or the larger model: Opus stays mandatory for this workload in production. |
The bottom line. A single 35 B MoE model in no-think mode meets Anthropic Haiku on every high-volume workload (and beats it on importance, where the 4 B alternative we initially shipped was scoring 2/10). The same model with thinking ON on the rewrite phase lands 3 Opus-judge points below Anthropic Sonnet/Opus and we have not been able to close the gap. The shipped production setup is therefore hybrid: local for the Haiku-tier middle (extract, verify, importance, omissions), Anthropic Opus for the eight correct-phase rewrites. End-to-end ~59 minutes local-with-Opus-correct vs ~4 hours full Anthropic; ~$4 per run vs ~$5 full Anthropic or $0 if you accept the rewrite gap.
At a glance: how close is local to Anthropic?#
Quality. Opus-judge rating per workload, on a 0-to-10 scale where 10 means "indistinguishable from the Anthropic reference". Gold bars = Anthropic ceiling (Anthropic run twice on the same prompt, second run rated against the first). Mint bars = the local 35 B-A3B in its production thinking config (no-think for haiku-tier, thinking ON for rewrite). Importance and correct values come from the §7.8 mini-bench on the 35 B; extract and verify are taken from the §7.4 4 B bench since we did not rerun them on the 35 B (the chart picks the upper bound of the measured range for extract).
Local lands at parity on verify-rag (1 300 calls per run, the cost-saving workload) and one point above the Anthropic-vs-Anthropic ceiling on importance thanks to the 35B-A3B no-think config. Extract is one to two points below ceiling. The persistent gap is on correct-rewrite (6/10 vs 9/10): the locals miss the [#filename] citation tags Opus produces, and neither thinking nor prompting closes it. The production recipe (§7.9) routes those eight calls back to Anthropic Opus.
Speed. Wall-clock seconds per call. Gold bars = Anthropic. Mint bars = local 35 B-A3B on local-turbo-parallel. The correct bar shown is the pure-local fallback (35 B thinking ON, 17.57 s); production actually routes correct to Anthropic Opus at ~13 s per call (see §7.9), so neither bar reflects the shipped path for that workload.
The local bars are 4 to 9 times shorter on the three Haiku-tier workloads and a little over 2 times shorter on Sonnet-tier correct-rewrite. The importance number moved from 1.52 s (the 4 B model we initially shipped) to 2.42 s (the 35 B no-think we ship now): still 4.3× faster than Anthropic, with the +1 quality point that comes with it. End-to-end the local pipeline runs in roughly one hour versus four hours on Anthropic.
Key findings (TL;DR)#
If you only read one section beyond the verdict above, read this one.
- Drop-in routing works. The Claude Agent SDK reads `ANTHROPIC_BASE_URL` and `ANTHROPIC_DEFAULT_*_MODEL` on every call, so pointing it at a local `llama-server` Anthropic-compatible endpoint requires no SDK fork.
- One GGUF, two thinking modes. Production runs a single Qwen3.6-35B-A3B file in two containers: the haiku container (thinking OFF) handles high-volume JSON workloads; the sonnet container (thinking ON) handles the rewrite. The 35B-A3B MoE only activates ~3 B parameters per token, so it is the same latency as a 4 B model with one point higher quality on the haiku workloads. This is the architecture we ship after a follow-up mini-bench rejected the initial dual-model setup.
- Four SDK integration bugs to know. Per-tier `--alias` for `/v1/models`; `parallel=1, ctx_per_slot=32768` for long prompts; ban all 27 built-in tools, not the 12 you remember; and disable thinking at the engine, not in the prompt.
- Thinking-disable at the engine is the biggest single win. The prompt-level `/no_think` directive is ignored 92% of the time on Qwen3.5 and Qwen3.6. Passing `--jinja --reasoning off --chat-template-kwargs '{"enable_thinking": false}'` to `llama-server` delivers a 12-fold speedup on the high-volume haiku workloads. The same flag is also a quality lever: on strict-JSON workloads, thinking ON drops the Opus rating because the model prefixes the JSON with prose.
- Recommended config: TurboQuant with `--parallel 2`, not mainline llama.cpp. 4 to 9 times faster than Anthropic on the high-volume steps, at the Opus-judge 9/10 ceiling on verify-rag (full parity) and one point above the ceiling on importance.
- Correct-phase stays on Anthropic Opus. A local 35B (thinking ON or OFF) lands 3 Opus-judge points below Opus on rewrite, mainly because it omits the `[#filename]` citation tags. The production setup is hybrid: local for the high-volume middle, Anthropic Opus for the 8 rewrite calls. ~$4 per run vs ~$5 full Anthropic or $0 if you accept the rewrite gap.
- End-to-end: ~59 minutes local vs ~4 hours Anthropic (~4× faster, near-zero cost modulo electricity and the 8 Opus calls).
Benchmark methodology: 100 calls, 5 providers, 4 workloads, N=5 trials, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: RTX 3090 Ti, llama.cpp build pinned. A follow-up mini-bench (§7.8) compared 4 B vs 35 B in both thinking modes on a single workload, and motivated the single-model architecture.
1. Why the Claude Agent SDK lets you do this at all#
The Claude Agent SDK reads its target from a small set of environment variables it consults at every call: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL, and the three ANTHROPIC_DEFAULT_*_MODEL for haiku/sonnet/opus. If you point the base URL at a server that speaks the /v1/messages shape, the SDK does not know or care that the responder is not Anthropic. llama.cpp's server has implemented an Anthropic-compatible endpoint for some months now (see the server README), and it works.
The pipeline keeps two local containers behind the SDK's two budget tiers, with the embedding container as the third occupant of the same GPU. The haiku and sonnet containers run the same Qwen3.6-35B-A3B GGUF; only the thinking flag differs at startup:
Three containers, one up at a time. Switching from the haiku container to the sonnet container is a container restart with different flags, not a different model: the GGUF stays in the OS page cache, so the swap takes about 10 seconds in practice. A small context manager swaps the eight relevant environment variables on entry, restores them on exit. Calling opus clears the overrides and lets the SDK hit Anthropic native; calling haiku or sonnet redirects to the matching local server. The pipeline can chain opus → haiku → sonnet → opus in a single run safely.
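The swap-and-restore behaviour described above can be sketched as a small context manager. This is an illustration, not the pipeline's actual code: `use_tier`, `TIER_ENV_VARS`, and `LOCAL_HAIKU` are hypothetical names, the token value is a placeholder, and only the six variables named in this article are managed (the pipeline swaps eight, but the other two are not listed here). It also assumes native Anthropic credentials live outside the managed set.

```python
import os
from contextlib import contextmanager

# The six variables this article names. The real pipeline manages eight;
# the remaining two are not named, so this sketch leaves them out.
TIER_ENV_VARS = (
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
)


@contextmanager
def use_tier(overrides=None):
    """Apply `overrides` for the duration of the block, then restore.

    Any managed variable missing from `overrides` is cleared, so a stale
    haiku setting cannot leak into a sonnet call. Passing None clears all
    of them (assumed here to be how the opus tier reaches Anthropic).
    """
    overrides = dict(overrides or {})
    unknown = set(overrides) - set(TIER_ENV_VARS)
    if unknown:
        raise ValueError(f"unmanaged variables: {sorted(unknown)}")

    # Snapshot the full set first so nested or consecutive swaps restore cleanly.
    snapshot = {name: os.environ.get(name) for name in TIER_ENV_VARS}
    try:
        for name in TIER_ENV_VARS:
            if name in overrides:
                os.environ[name] = overrides[name]
            else:
                os.environ.pop(name, None)
        yield
    finally:
        for name, value in snapshot.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


# Hypothetical override set for the haiku tier (placeholder token).
LOCAL_HAIKU = {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:11451",
    "ANTHROPIC_AUTH_TOKEN": "local-placeholder",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "local-haiku",
}
```

Wrapping each SDK call in `with use_tier(LOCAL_HAIKU): ...` keeps the redirect scoped to that call. Restoring from a snapshot, rather than unsetting what was set, is what makes chained tiers safe.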
That is the easy half. The hard half is everything the SDK does before and around the call.
2. Bug 1: expose a stable per-tier model alias via /v1/models#
The SDK validates the model name at session init, before any /v1/messages call. It does this by hitting /v1/models and checking the requested ID is in the returned list. With llama.cpp's defaults, /v1/models returns the GGUF filename, qwen3.6-35b-a3b-Q4_K_M.gguf, which is neither stable nor what the SDK was told to ask for. The SDK then refuses the call with a misleading "model not found" error.
The fix is to pass --alias <stable-name> to llama-server at startup, with a different name on each tier:
```bash
# haiku container
llama-server ... --alias local-haiku
# sonnet container
llama-server ... --alias local-sonnet
```

After that, `curl -s http://127.0.0.1:11451/v1/models | jq -r '.data[0].id'` returns `local-haiku`, the sonnet port returns `local-sonnet`, and matching `ANTHROPIC_DEFAULT_HAIKU_MODEL=local-haiku` + `ANTHROPIC_DEFAULT_SONNET_MODEL=local-sonnet` on the client side validates cleanly.
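The same check can run as a pre-flight step before the pipeline starts, so a missing alias fails fast instead of surfacing as a "model not found" error mid-run. The sketch below works on an already-fetched `/v1/models` response body; `served_model_ids` and `check_alias` are hypothetical helper names, and the sample payloads are trimmed to the `data[].id` field, the only part of the response shape this article relies on.

```python
import json


def served_model_ids(models_payload):
    """Return the model IDs listed in a /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_alias(models_payload, expected_alias):
    """Raise if the server does not advertise the alias the client will request."""
    ids = served_model_ids(models_payload)
    if expected_alias not in ids:
        raise LookupError(
            f"server advertises {ids!r}, client will ask for {expected_alias!r}"
        )
    return True


# Illustrative payloads, reduced to the one field that matters here.
without_alias = json.loads('{"data": [{"id": "qwen3.6-35b-a3b-Q4_K_M.gguf"}]}')
with_alias = json.loads('{"data": [{"id": "local-haiku"}]}')
```

Calling `check_alias(with_alias, "local-haiku")` passes, while the `without_alias` payload raises and names both sides of the mismatch.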
A subtle trap that bit us when we added the second tier: when you swap from haiku to sonnet, you must also clear the haiku model variable in the environment. Otherwise the SDK can route a residual haiku-tagged call to a haiku container that is no longer running. The fix is a clean snapshot/restore of the full eight-variable set on every tier swap, and a dedicated regression test for consecutive swaps.
3. Bug 2: per-slot context vs total context#
The next error you hit is more brutal:
```text
API Error: 400 request (22 584 tokens) exceeds the available context size (4 096 tokens)
```

llama.cpp divides `--ctx-size` by `--parallel` to compute the context per slot. The defaults are `parallel=4, ctx_size=16384`, which gives each slot 4 096 tokens. A correction prompt with 22 k of input plus 10 k of expected output blows through that on the first token.
There are two ways to fix this: raise --ctx-size, or lower --parallel. For a single-GPU box where the pipeline serialises calls anyway (the orchestrator launches sections via asyncio.gather but the GPU is the bottleneck), parallel=1, ctx_per_slot=32768 is the simpler choice. llama.cpp's internal --cont-batching already amortises expert loads on a MoE model like Qwen3.6-35B-A3B (3 B active parameters out of 35 B), so you do not lose much throughput by setting parallel=1.
If you do want concurrency, the right move is to raise --ctx-size to parallel × desired_per_slot. Just remember the KV cache grows with total context, not per-slot context.
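The slot arithmetic is easy to get wrong under time pressure, so here is a minimal sketch of it. The function names are hypothetical, and integer division is an assumption about how the split rounds; the numbers are the ones quoted in this section.

```python
def ctx_per_slot(ctx_size, parallel):
    """Context available to one slot: total context split evenly across slots."""
    return ctx_size // parallel


def required_ctx_size(parallel, desired_per_slot):
    """Total --ctx-size needed so that every slot gets `desired_per_slot` tokens."""
    return parallel * desired_per_slot


def fits(prompt_tokens, max_output_tokens, ctx_size, parallel):
    """True if prompt plus expected output fits inside a single slot."""
    return prompt_tokens + max_output_tokens <= ctx_per_slot(ctx_size, parallel)


# Defaults quoted above: 16 384 tokens split over 4 slots.
default_slot = ctx_per_slot(16384, 4)

# The failing correction prompt: 22 584 input tokens plus ~10 000 output tokens.
fits_default = fits(22584, 10000, ctx_size=16384, parallel=4)
fits_single_slot = fits(22584, 10000, ctx_size=32768, parallel=1)

# Keeping 32 768 tokens per slot with two slots doubles the total context.
two_slot_total = required_ctx_size(2, 32768)
```

The last line is the one to watch on a 24 GB card: the KV cache is sized for `two_slot_total`, not for the per-slot figure.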
4. Bug 3: ban all 27 built-in tools, not the 12 you remember#
The most expensive bug we shipped against, and the most surprising. On the extraction phase (which expects JSON output, not tool calls), we suddenly saw 84 tool.builtin ERR AskUserQuestion errors in five minutes, cascading into "Reached maximum number of turns (3)" failures and empty extractions.
The mechanism is this. The Claude Agent SDK exposes a default set of built-in tools to every model unless you explicitly disable them. The list of tools we had been disabling was 12 long (file ops, web, task). The actual default set is around 27, and it includes AskUserQuestion, EnterPlanMode, Skill, CronCreate, TaskOutput, PushNotification, ScheduleWakeup, and a dozen others. Setting allowed_tools=[] does not suffice: the SDK still exposes the built-ins; you have to list them in disallowed_tools explicitly.
Why did this break with the local Qwen model but not with Sonnet? Two reasons. First, Sonnet has been trained to use built-in tools sparingly and to favour direct JSON output when the prompt asks for it. The Qwen model has not, and it cheerfully called AskUserQuestion to "think out loud" when uncertain (this was observed with the 4 B variant we initially shipped; the 35 B with thinking OFF is better but the disable-list discipline remains the right default). Second, our max_turns=3 cap meant three calls to AskUserQuestion were enough to exhaust the budget without ever emitting the requested JSON.
The fix is a single constant listing all 27 built-ins, applied to every JSON-only step:
```python
BUILTIN_TOOLS_ALL = [
    "AskUserQuestion", "EnterPlanMode", "ExitPlanMode",
    "Skill", "CronCreate", "CronDelete", "CronList",
    "TaskOutput", "TaskStop", "PushNotification",
    "ScheduleWakeup", "RemoteTrigger", "Monitor",
    "WebFetch", "WebSearch", "TodoWrite",
    "Read", "Write", "Edit", "NotebookEdit",
    "Bash", "Glob", "Grep", "ToolSearch",
    "EnterWorktree", "ExitWorktree", "BashOutput",
]
```

For steps that genuinely need one tool (the corrector needs `Write`), the disallow list becomes `[t for t in BUILTIN_TOOLS_ALL if t != "Write"]`.
The before/after on the extraction phase tells the story:
| | Before | After |
|---|---|---|
| `tool.builtin ERR` over 5 minutes | 84 | 0 |
| Chunk failures | 11+ | 0 |
| Facts extracted per chunk | 1 (always failing) | 14 to 31 |
The lesson generalises to anyone running a non-Anthropic model behind the SDK: do not assume the default tool disabling list is exhaustive. Print what the SDK actually exposes to the model and ban the full list.
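One way to make that audit repeatable is to diff the tools the model can see against the ban list on every JSON-only step. The helper below is a sketch: `unbanned_tools` is a hypothetical name, and it takes the exposed tool names as a plain list because this article does not say how that list was captured from the SDK.

```python
def unbanned_tools(exposed, disallowed, needed=()):
    """Return tools the model can still call that the step did not ask for.

    exposed:    tool names the SDK presents to the model
    disallowed: the ban list passed for this step
    needed:     tools the step legitimately uses (e.g. "Write" for the corrector)
    """
    return sorted(set(exposed) - set(disallowed) - set(needed))


# Illustrative inputs only: a short ban list that forgets two built-ins.
exposed_example = ["Read", "Write", "Bash", "AskUserQuestion", "EnterPlanMode"]
short_ban_list = ["Read", "Write", "Bash"]

leaks = unbanned_tools(exposed_example, short_ban_list)
```

An empty result is the passing condition. Anything left in `leaks` is a tool the model can spend a turn on instead of emitting JSON.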
5. Bug 4: disable thinking at the container, not in the prompt#
This was the most expensive bug we did not ship against initially, and the one that quietly cost us hours per run before we found the fix. Qwen3.5 and Qwen3.6 are unified "thinking" models: they emit an internal reasoning trace before the final answer, the same shape as DeepSeek-R1 or o1-style reasoning. For long-form rewriting on the sonnet tier this is helpful. For a verify call that has to decide yes / no on a single fact in 35 output tokens, it is catastrophic.
The first attempt was to suffix /no_think to every user prompt, a convention shipped in the original Qwen3 chat template. Across 24 sample calls, the directive was honoured only 8% of the time; the other 92% kept thinking. Result: each verify call spent 15 to 50 seconds reasoning before emitting the JSON, blowing the verify phase out to several hours per run.
The diagnosis took longer than it should have. Qwen3.5 and Qwen3.6 dropped the /no_think directive from their chat template (it was present in Qwen3 original, removed in 3.5+). The supported path is chat_template_kwargs: {enable_thinking: false}, which is documented for the OpenAI endpoint but ignored on the Anthropic endpoint of llama.cpp.
The fix is to pass the flag at container startup, not per request, via the llama-server arguments:
```bash
llama-server \
  --jinja \
  --reasoning off \
  --chat-template-kwargs '{"enable_thinking": false}' \
  ...
```

`--jinja` forces the use of the chat template embedded in the GGUF (the only template that knows how to interpret `enable_thinking`). `--reasoning off` is the recent llama.cpp flag that disables the reasoning block at the engine level. `--chat-template-kwargs` is the portable form. The three flags together are belt-and-suspenders; the Qwen documentation recommends combining them for robustness.
The right policy is per-container, both pointing at the same GGUF:
- `llm-haiku` container (`disable_thinking=True` by default): high-volume workloads, low-latency, strict-JSON output. Verify, extract, importance scoring. Thinking ON costs both latency and quality here, since the model prefixes the JSON with reasoning prose (see §7.8 mini-bench).
- `llm-sonnet` container (`disable_thinking=False` by default): low-volume, longer outputs. Correct-phase rewrite, cohesion checks. Reasoning depth helps the final text, though even with thinking ON the local 35 B does not match Anthropic Opus on this workload (production routes correct to Opus anyway, see §7.9).
The empirical result on the importance-scoring workload, measured under TurboQuant, tells the whole story:
| | Before (thinking leak) | After (flag honoured) | Speedup |
|---|---|---|---|
| Wall mean | 21.7 s | 1.83 s | ×12 |
| Output tokens (mean) | 358 | 9 | ×40 less waste |
| `usage.output_tokens` from llama.cpp | 0 (broken) | 9 (correct) | fixed |
A twelvefold speedup on a workload that runs 300 times per pipeline run, and a similar speedup on verify (1 300 calls per run) which suffered from the exact same thinking leak. On verify, the math works out to roughly 40 minutes after the fix (1 300 × 1.87 s) versus about 8 hours before (1 300 × 21.7 s).
The generalised lesson: with thinking models, the disable-thinking knob has to be at the engine level, not in the prompt. Prompt-level directives are a courtesy the model can refuse. Engine-level flags are not.
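The verify-phase estimate quoted in this section is plain multiplication, and it is worth keeping as a reusable check when call counts or latencies change. The sketch assumes calls run strictly one after another, which matches the `parallel=1` setup; `phase_wall_clock_seconds` is a hypothetical helper name.

```python
def phase_wall_clock_seconds(calls, seconds_per_call):
    """Serial wall-clock estimate: every call waits for the previous one."""
    return calls * seconds_per_call


VERIFY_CALLS_PER_RUN = 1300

# Per-call latencies quoted in this section.
before_s = phase_wall_clock_seconds(VERIFY_CALLS_PER_RUN, 21.7)  # thinking leak
after_s = phase_wall_clock_seconds(VERIFY_CALLS_PER_RUN, 1.87)   # flag honoured

before_hours = before_s / 3600
after_minutes = after_s / 60
```

With two slots filled concurrently the real figure would be lower, so treat this as an upper bound for the serial case.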
6. TurboQuant: 4-bit KV cache to fit the workload on a 24 GB card#
With Qwen3.6-35B-A3B UD-IQ4_XS occupying about 18 GB of VRAM, a 32 k-token context window with FP16 KV cache consumes another 3 GB. That leaves roughly 3 GB of headroom on a 24 GB card, no margin for batching, and zero room for the embedding model the same pipeline uses for retrieval. The same trade-off shows up in any system that mixes embeddings and chat on one card: you either swap models in and out (which we also do, see §8) or shrink the runtime memory of one of them.
TurboQuant is a KV cache quantization scheme. The model weights stay in their original quant (UD-IQ4_XS in our case); only the KV cache is recoded at runtime to 4.25 bits per value (turbo4, about 3.8× compression vs FP16) or 3.25 bits per value (turbo3_tcq, about 5× compression). The technique was first described in the TurboQuant paper and has since landed in several llama.cpp forks. It is invoked through the standard llama.cpp interface, -ctk turbo4 -ctv turbo4, so swapping it in is one flag away once the binary is built with the right kernels.
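The compression figures follow directly from the bit widths. A minimal sketch of the arithmetic, assuming the baseline is a 16-bit value and ignoring any metadata or buffers the implementation may keep at higher precision (`compression_ratio` is a hypothetical helper name):

```python
FP16_BITS = 16.0


def compression_ratio(bits_per_value, baseline_bits=FP16_BITS):
    """How many times smaller the KV cache is than the baseline encoding."""
    return baseline_bits / bits_per_value


turbo4_ratio = compression_ratio(4.25)      # turbo4
turbo3_tcq_ratio = compression_ratio(3.25)  # turbo3_tcq
```

Both values round to the "about 3.8×" and "about 5×" quoted above.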
6.1 Choosing a fork#
Four llama.cpp forks ship TurboQuant kernels at the time of writing:
| Fork | Status | Platform | Notes |
|---|---|---|---|
| TheTom/turboquant_plus | Metal-first | Apple Silicon | Parity with q8_0 on M5 |
| Aaryan-Kapoor/turboquant-tq3_0 | CPU only | x86_64 | Not useful with a CUDA card |
| Madreag/turbo3-cuda | RTX 5090 (Ada Lovelace) | sm_89 | Wrong architecture target for us |
| spiritbuun/llama-cpp-turboquant-cuda | RTX 3090 (Ampere) | sm_86 | Benchmarks published on a 3090 |
We picked spiritbuun because the README reports a constant ~30 tok/s decode rate from 4 K to 128 K context on a 3090, and a perplexity that beats q8_0 thanks to a documented "norm correction". There is some bus factor risk on a fork with ~600 stars; the mitigation is to pin the upstream commit and to be able to rebuild without depending on the fork's CI.
6.2 Multi-stage Dockerfile#
The build needs the CUDA devel image (nvcc, cuBLAS); the runtime only needs the runtime image. Multi-stage cuts the final image to about 3 GB.
```dockerfile
# builder stage
FROM nvidia/cuda:12.6.3-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y \
    git cmake build-essential ninja-build libcurl4-openssl-dev
# Stub libcuda.so.1 for the link step; see §6.3.1
RUN ln -sf libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}
RUN git clone https://github.com/spiritbuun/llama-cpp-turboquant-cuda.git /src
WORKDIR /src
RUN cmake -B build -G Ninja \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DCMAKE_CUDA_ARCHITECTURES=86 \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link=/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link=/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target llama-server

# runtime stage
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
    libcurl4 libgomp1 ca-certificates wget
COPY --from=builder /src/build/bin/llama-server /usr/local/bin/
COPY --from=builder /src/build/bin/*.so /usr/local/lib/
COPY --from=builder /src/codebooks /opt/codebooks
HEALTHCHECK CMD wget -qO- http://localhost:8080/health || exit 1
ENTRYPOINT ["llama-server"]
```

6.3 Build gotchas worth remembering#
6.3.1 libcuda.so.1 not found at link time#
The linker looks for libcuda.so.1 (the versioned soname), but the CUDA devel image only ships libcuda.so (unversioned) in /usr/local/cuda/lib64/stubs/. The real libcuda.so.1 comes from the host driver at runtime via --gpus all. The link still has to resolve the symbol, so you need three things combined:
- A symlink to fake the versioned soname: `ln -sf libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1`.
- `ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs:...` so `gcc` finds `-lcuda` at compile-link time.
- `-Wl,-rpath-link=/usr/local/cuda/lib64/stubs` so `ld` can resolve `NEEDED` entries that point at libcuda transitively (`libggml` depends on libcuda, and the final `llama-server` link checks the chain).
Miss any one of the three and the link fails with undefined reference to 'cuMemCreate' and friends.
6.3.2 libgomp.so.1 missing at runtime#
The CUDA runtime image does not ship OpenMP. The llama-server binary links against libgomp.so.1 for CPU-side helper parallelism. Boot the container without it and you get:
```text
/usr/local/bin/llama-server: error while loading shared libraries:
libgomp.so.1: cannot open shared object file
```

The fix is one line in the runtime stage: `apt-get install -y libgomp1`.
7. Benchmark: five providers, four workloads, Opus-as-judge#
The fixes above are only worth shipping if the local stack is actually competitive on the workloads that matter. The benchmark below is what we use to settle the question. It is in the repo, reproducible from a single command, and reads from the same JSONL files that the rest of the pipeline produces, so the workloads are the real workloads.
7.1 Methodology#
The benchmark cycles sequentially through five providers × four workloads × N=5 trials, for 100 calls total. The five providers:
- anthropic: Anthropic native API. Reference.
- local-mainline: Qwen on llama.cpp mainline, FP16 KV cache.
- local-turbo: Qwen on the spiritbuun fork, turbo4 KV cache, `parallel=1`.
- local-turbo-parallel: same as local-turbo but `parallel=2`. The 1.5 GB freed by turbo4 lets us batch two concurrent slots, which the orchestrator's `asyncio.gather` can fill.
- local-turbo-haiku-think: ablation that runs the high-volume haiku-tier workloads on the sonnet model (Qwen3.6-35B-A3B) with thinking ON. Tests the question "does deeper reasoning improve verify/extract quality enough to be worth the latency?".
Trial 1 of Anthropic is the quality reference. Local outputs are scored three ways:
- Cosine similarity between the local output's embedding (Qwen3-Embedding-8B) and the Anthropic reference embedding.
- ROUGE-L F1 on lowercased word tokens (complementary lexical signal).
- Opus LLM-as-judge on a 0-to-10 absolute equivalence scale. One judgement per (provider, workload) cell. Crucially, the same judge rates Anthropic trial 2 against Anthropic trial 1 as the empirical ceiling: the "indistinguishable" score for two runs of the same backend on the same input. Any local provider rated at the ceiling is at parity with Anthropic.
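For readers who want to reproduce the two proxy metrics, here is a minimal standard-library sketch of both. It is not the bench's own code. Whitespace splitting stands in for "word tokens", the F1 uses equal weight on precision and recall, and the embedding vectors are assumed to arrive as plain lists of floats from whatever embedding client is in use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lcs_length(a, b):
    """Length of the longest common subsequence, one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased, whitespace-split word tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

ROUGE-L rewards shared word order and cosine rewards shared meaning, which is why the two can disagree sharply on short JSON outputs.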
The workloads:
| Label | Tier | Profile | Description |
|---|---|---|---|
| extract-chunk | haiku | ~3 k in, ~700 out, JSON | Extract atomic facts (88 calls per run) |
| verify-rag-fact | haiku | ~3 k in, ~50 out, JSON | Verify-RAG inner call (~1 300 per run) |
| importance-score | haiku | ~1 k in, ~15 out, JSON | Score 0 to 1 on a single fact |
| correct-section-rewrite | sonnet | ~4 k in, ~1.5 to 3 k out, markdown | Rewrite a section to fix issues |
Host: RTX 3090 Ti, WSL2 Ubuntu, Python 3.12, ctx_per_slot=32 768, llama.cpp build pinned in the bench manifest. 100 calls total, 0 errors.
7.2 Performance results#
| Workload | Provider | Wall mean (s) | Output tps | In tokens | Out tokens |
|---|---|---|---|---|---|
| extract-chunk | anthropic | 53.80 | 182.8 | 3 284 | 9 801 |
| extract-chunk | local-mainline | 8.08 | 100.2 | 1 232 | 809 |
| extract-chunk | local-turbo | 7.40 | 100.3 | 1 230 | 745 |
| extract-chunk | local-turbo-parallel | 5.79 | 101.4 | 1 230 | 590 |
| extract-chunk | local-turbo-haiku-think | 82.36 | 68.1 | 1 228 | 2 737 |
| importance-score | anthropic | 10.39 | 67.5 | 2 966 | 713 |
| importance-score | local-mainline | 1.95 | 8.1 | 968 | 16 |
| importance-score | local-turbo | 1.85 | 8.4 | 965 | 15 |
| importance-score | local-turbo-parallel | 1.52 | 9.9 | 965 | 15 |
| importance-score | local-turbo-haiku-think | 41.34 | 10.5 | 963 | 412 |
| verify-rag-fact | anthropic | 10.90 | 97.5 | 3 297 | 1 064 |
| verify-rag-fact | local-mainline | 2.50 | 20.1 | 1 279 | 50 |
| verify-rag-fact | local-turbo | 2.13 | 23.0 | 1 277 | 49 |
| verify-rag-fact | local-turbo-parallel | 1.87 | 27.3 | 1 277 | 50 |
| verify-rag-fact | local-turbo-haiku-think | 14.13 | 53.1 | 1 275 | 633 |
| correct-section-rewrite | anthropic | 39.78 | 82.1 | 3 956 | 3 230 |
| correct-section-rewrite | local-mainline | 34.59 | 62.1 | 2 021 | 2 418 |
| correct-section-rewrite | local-turbo | 24.12 | 66.4 | 2 018 | 1 611 |
| correct-section-rewrite | local-turbo-parallel | 17.57 | 93.0 | 2 018 | 1 642 |
| correct-section-rewrite | local-turbo-haiku-think | 29.16 | 59.4 | 2 018 | 1 475 |
Three things to notice. First, local-turbo-parallel is fastest on every workload: the second slot fills under asyncio.gather and the MoE expert loads amortise across the two requests. Second, Anthropic's output token counts are now correct (9 801 on extract, 3 230 on correct): a previous version of this bench had a propagation bug that pinned them to 0 to 8, which has since been fixed. Third, the thinking-haiku ablation is dramatically slower (×5 to ×20) because the model burns 400 to 2 700 reasoning tokens before emitting the 15- to 50-token JSON answer.
7.3 Quality results: proxy metrics#
| Workload | Provider | Cosine vs ref | ROUGE-L F1 | JSON valid | JSON key OK |
|---|---|---|---|---|---|
| extract-chunk | local-mainline | 0.639 | 0.223 | 60% | 60% |
| extract-chunk | local-turbo | 0.659 | 0.203 | 80% | 80% |
| extract-chunk | local-turbo-parallel | 0.634 | 0.174 | 60% | 60% |
| extract-chunk | local-turbo-haiku-think | 0.857 | 0.298 | 40% | 40% |
| importance-score | local-mainline | 0.427 | 0.019 | 100% | 100% |
| importance-score | local-turbo | 0.434 | 0.018 | 100% | 100% |
| importance-score | local-turbo-parallel | 0.450 | 0.014 | 100% | 100% |
| importance-score | local-turbo-haiku-think | 0.926 | 0.368 | 80% | 80% |
| verify-rag-fact | local-mainline | 0.873 | 0.107 | 100% | 100% |
| verify-rag-fact | local-turbo | 0.867 | 0.101 | 100% | 100% |
| verify-rag-fact | local-turbo-parallel | 0.870 | 0.107 | 100% | 100% |
| verify-rag-fact | local-turbo-haiku-think | 0.932 | 0.403 | 80% | 80% |
| correct-section-rewrite | local-mainline | 0.889 | 0.460 | n/a | n/a |
| correct-section-rewrite | local-turbo | 0.925 | 0.505 | n/a | n/a |
| correct-section-rewrite | local-turbo-parallel | 0.901 | 0.480 | n/a | n/a |
| correct-section-rewrite | local-turbo-haiku-think | 0.921 | 0.513 | n/a | n/a |
Two findings stand out. First, local-turbo beats local-mainline on the correct-section-rewrite cosine (0.925 vs 0.889), reversing the conclusion of an earlier single-trial bench. The KV q8 to turbo4 trade-off does not cost measurable quality on long outputs once you average over N=5. Second, local-turbo-haiku-think has the highest cosine on all three haiku-tier workloads (on correct-section-rewrite it trails local-turbo by a hair, 0.921 vs 0.925), because the larger model with thinking produces output that is semantically closer to Anthropic's. Whether that closeness translates into actual correctness is the question the Opus judge answers next.
A note on the importance-score cosine values (0.43 to 0.45): the workload's output is about 15 tokens, and cosine on outputs that short is statistical noise. The 100% JSON validity tells the real story: the locals produce the right shape, just with different wording from Anthropic. The same trap shows up in any retrieval system that scores on short fragments, and it is the reason our semantic conflict detection pipeline does NLI on full sections rather than cosine on chunks.
7.4 Quality results: Opus-as-judge#
Opus rates each output against the Anthropic reference on a 0 to 10 absolute equivalence scale. The anthropic-baseline row is Anthropic trial 2 judged against trial 1 (same backend, different sample): it sets the empirical ceiling for "indistinguishable" on this workload.
| Workload | Provider | Opus rating | Note |
|---|---|---|---|
| extract-chunk | anthropic-baseline (ceiling) | 6/10 | Verbatim quotes, comparable fact count, minor variations |
| extract-chunk | local-mainline | 5/10 | Some compound facts violating atomicity |
| extract-chunk | local-turbo | 5/10 | Fewer atomic facts than reference (8 vs 17) |
| extract-chunk | local-turbo-parallel | 4/10 | One quote uses '...' to splice non-contiguous text |
| extract-chunk | local-turbo-haiku-think | 4/10 | Violates atomicity (compound facts) |
| importance-score | anthropic-baseline (ceiling) | 8/10 | Identical JSON output |
| importance-score | local-mainline | 2/10 | Score 0.7 / bucket 'important' diverges from reference 'marquant' |
| importance-score | local-turbo | 2/10 | Bucket wrong relative to reference |
| importance-score | local-turbo-parallel | 4/10 | Score close (0.8 vs 0.9) but bucket wrong |
| importance-score | local-turbo-haiku-think | 8/10 | Identical JSON, at the ceiling |
| verify-rag-fact | anthropic-baseline (ceiling) | 9/10 | Same verdict, quote, src_idx, notes |
| verify-rag-fact | local-mainline | 9/10 | Correct verdict, source, quote |
| verify-rag-fact | local-turbo | 9/10 | At parity with Anthropic |
| verify-rag-fact | local-turbo-parallel | 9/10 | At parity with Anthropic |
| verify-rag-fact | local-turbo-haiku-think | 7/10 | Correct but emits prose reasoning before JSON |
| correct-section-rewrite | anthropic-baseline (ceiling) | 9/10 | All four hallucinations fixed, all three omissions integrated |
| correct-section-rewrite | local-mainline | 6/10 | All fixes applied, but unsolicited reasoning preamble |
| correct-section-rewrite | local-turbo | 6/10 | All fixes applied, light formatting differences |
| correct-section-rewrite | local-turbo-parallel | 6/10 | French version, all fixes applied |
| correct-section-rewrite | local-turbo-haiku-think | 7/10 | All fixes applied, slightly cleaner output |
The Opus judge changes the reading of every other table. The right way to read it is as a percentage of the ceiling, not as an absolute deficit, because the ceiling itself varies by workload: Anthropic-vs-Anthropic only scores 6/10 on extract (it penalises its own prose violations), while it scores 9/10 on verify and rewrite.
- verify-rag-fact: parity with Anthropic. Local-mainline, local-turbo, and local-turbo-parallel all score 9/10, the same as Anthropic-vs-Anthropic. This is the workload with the most calls per run (~1 300) and the one with the largest cost saving, so this is the headline result. Re-rolling Anthropic on the same prompt does no better.
- extract-chunk: 67 to 83% of ceiling. Locals score 4 to 5/10 against a ceiling of 6/10 (Opus is strict on this workload even with Anthropic itself). On a ceiling-relative scale the locals are at 67 to 83% of the achievable, not 40 to 50% as the absolute number suggests. Atomicity violations (the 4 B emits 8 atomic facts where Anthropic Haiku emits 17) are the recurring failure mode; the real margin is one to two points.
- correct-section-rewrite: a real two-to-three-point gap. Ceiling 9, locals 6 to 7. The locals do apply all requested fixes (Opus: "all four hallucinations softened and all three omissions integrated with correct citations"), but they lose points to formatting drift and the occasional reasoning preamble. This is the one workload where the gap is measurable and not just a ceiling artifact, and where upgrading the sonnet tier would pay off the most.
- importance-score: the only true quality failure, and the one with a clean fix. Locals score 2 to 4/10 against an 8/10 ceiling without thinking, because the 4 B systematically picks the wrong bucket on borderline cases (Opus: "the candidate's score and bucket diverge from the reference's 'marquant'"). Flipping thinking ON for this workload alone lifts the rating to 8/10 = ceiling, identical JSON included. Thinking is the single available quality lever here.
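The ceiling-relative reading can be reproduced with a few lines of arithmetic. The sketch below copies the ratings from the table above (local order: mainline, turbo, turbo-parallel, haiku-think) and divides each by its workload's Anthropic-vs-Anthropic ceiling. The function name `pct_of_ceiling` is just an illustration.

```python
# Opus ratings copied from the table above: (ceiling, [local ratings]).
RATINGS = {
    "extract-chunk": (6, [5, 5, 4, 4]),
    "importance-score": (8, [2, 2, 4, 8]),
    "verify-rag-fact": (9, [9, 9, 9, 7]),
    "correct-section-rewrite": (9, [6, 6, 6, 7]),
}

def pct_of_ceiling(rating, ceiling):
    """Share of the achievable score, as a whole percentage."""
    return round(100 * rating / ceiling)

for workload, (ceiling, local_ratings) in RATINGS.items():
    shares = [pct_of_ceiling(r, ceiling) for r in local_ratings]
    print(f"{workload}: {min(shares)} to {max(shares)}% of ceiling")
```

Reading the output per workload rather than as one averaged number keeps the strict extract ceiling from dragging down the verify result.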
7.5 Anthropic vs local-turbo-parallel, the headline#
| Metric | Anthropic | local-turbo-parallel | Delta |
|---|---|---|---|
| Extract wall | 53.80 s | 5.79 s | ×9.3 faster |
| Importance wall | 10.39 s | 1.52 s | ×6.8 faster |
| Verify wall | 10.90 s | 1.87 s | ×5.8 faster |
| Correct wall | 39.78 s | 17.57 s | ×2.3 faster |
| Verify Opus rating | 9/10 (ceiling) | 9/10 | at ceiling |
| Correct Opus rating | 9/10 (ceiling) | 6/10 | three points below |
| Cost per run (pure local) | ~$5 | $0 | free; the production hybrid (Opus on correct) costs ~$4, see §7.9 |
7.5 bis Caveat: N=5 does not converge on the correct phase#
Before we draw conclusions from §7.5, one honest caveat. The correct-section-rewrite workload has a wide trial-to-trial variance: in an isolated A/B run on local-mainline with N=5, individual trials ranged from 22 to 56 seconds (stdev 11.9 s without optimisations, 6.6 s with). At N=5 the mean has not converged, and the gap between local-mainline and local-turbo on this single workload is within the noise.
The other three workloads (extract, verify, importance) are stable enough at N=5 to read confidently. Any comparative claim that hinges on correct-phase wall times should be checked against a higher-N rerun or bootstrapped confidence intervals before being treated as definitive. The Opus-judge ratings on correct-phase are less noisy than the wall times (the rubric is deterministic enough that all four local variants land at 6 or 7), so those are still readable.
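For the bootstrapped confidence intervals mentioned above, a minimal percentile-bootstrap sketch looks like the following. The five trial values are hypothetical placeholders chosen inside the reported 22 to 56 s range, not the measured trials, and `bootstrap_mean_ci` is an illustrative helper rather than part of the bench harness.

```python
import random
import statistics

def bootstrap_mean_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical wall times in seconds: placeholders, NOT the measured trials.
trials = [22.0, 31.0, 38.0, 44.0, 56.0]
low, high = bootstrap_mean_ci(trials)
print(f"mean {statistics.fmean(trials):.1f} s, 95% CI [{low:.1f}, {high:.1f}] s")
```

If the intervals of two providers overlap heavily at N=5, the comparison on that workload should be treated as undecided until more trials are in.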
7.6 End-to-end estimate on one full run#
A representative run on a medium-sized source document is 88 document chunks × extract + 1 300 facts × verify + 300 × importance + 8 sections × correct.
| Phase | Mode | Local-turbo-parallel | Anthropic |
|---|---|---|---|
| Extract | haiku no-think | ~9 min | ~30 min |
| Verify-RAG | haiku no-think | ~40 min | ~3 h |
| Importance | haiku no-think | ~8 min | ~25 min |
| Correct | sonnet thinking ON | ~2 min | ~5 min |
| Total | | ~59 min | ~4 h |
The local pipeline is roughly four times faster than Anthropic end-to-end while staying free, modulo electricity. The Anthropic estimate uses observed concurrency rather than a strict serial extrapolation of per-call wall times. The bulk of the local speedup comes from the thinking-disable flag of §5 combined with parallel=2 on TurboQuant.
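The local column can be cross-checked with simple arithmetic. The sketch below assumes the local estimate is a serial extrapolation of the per-call wall means from §7.5 multiplied by the call counts above; that assumption is ours, but it reproduces the table to within rounding. The Anthropic column is not recomputed here because it relies on observed concurrency.

```python
# Per-call wall means for local-turbo-parallel, copied from the §7.5 table (seconds).
PER_CALL_S = {"extract": 5.79, "verify": 1.87, "importance": 1.52, "correct": 17.57}
# Call counts for the representative run described above.
CALLS = {"extract": 88, "verify": 1300, "importance": 300, "correct": 8}

def phase_minutes(per_call_s, calls):
    """Serial wall-time estimate per phase, in minutes."""
    return {phase: per_call_s[phase] * calls[phase] / 60 for phase in calls}

local = phase_minutes(PER_CALL_S, CALLS)
total_min = sum(local.values())
for phase, minutes in local.items():
    print(f"{phase}: ~{minutes:.1f} min")
print(f"total: ~{total_min:.0f} min")
```

Verify dominates the total, which is why the per-call latency of the haiku tier matters more than anything that happens in the eight correct calls.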
7.7 Which local provider to pick#
The earlier draft of this article said "mainline by default, turbo when VRAM forces it". With N=5 trials and Opus-as-judge added, that recommendation flips. The list below is the post-bench reco for the dual-model dual-tier setup; §7.8 then collapses the architecture further to a single 35 B model, and §7.9 fixes the final production wiring. Read this section as the framework and §7.9 as the shipped recipe.
- Default: local-turbo with --parallel 2 (i.e. the local-turbo-parallel provider). Fastest on every workload, beats local-mainline on correct-rewrite quality (cosine 0.901 vs 0.889, modulo the §7.5 bis caveat), and lands at the 9/10 Opus ceiling on verify. The 1.5 GB freed by turbo4 KV is what makes parallel=2 viable on a 24 GB card.
- Skip mainline: it is slower than turbo on every workload in this run, and the quality advantage from the earlier draft disappeared once we averaged over more trials.
- If you must run correct locally, turn thinking ON on sonnet: it bumps the 35 B from 5/10 to 6/10 vs the 9/10 ceiling (see §7.8). Production routes correct to Anthropic Opus instead (§7.9) and skips the sonnet container entirely, so this lever only matters if you accept the local quality gap.
- Avoid thinking-ON on verify and extract: the cosine boost is real but the Opus judge rates these variants lower than thinking-OFF on the same models. Reasoning leaks pollute the JSON output, costing more in parsing failures than it gains in semantic depth.
7.8 Mini-bench: from dual-model to single-model#
After the v3 bench above, two follow-up mini-benches (28 May) overturned the dual-model architecture we initially shipped. The headline result: a 35 B-A3B MoE model in no-think mode beats a dedicated 4 B model on the haiku-tier workloads at the same latency, and thinking ON is counterproductive on strict-JSON workloads.
Mini-bench A, importance workload, four local configs:
| Config | Wall mean | Stdev | Opus rating | Trial 1 output |
|---|---|---|---|---|
| 4 B no-think | 2.13 s | 0.02 s | 8/10 | {"score": 0.8, "bucket": "marquant"} |
| 4 B thinking ON | 17.14 s | 16.43 s | 6/10 | prose then JSON (penalised) |
| 35 B no-think | 2.42 s | 0.15 s | 9/10 | {"score": 1.0, "bucket": "marquant"} |
| 35 B thinking ON | 13.76 s | 1.94 s | 6/10 | prose then JSON (penalised) |
Three takeaways:
- The 35 B no-think beats the 4 B no-think on quality at effectively the same latency: one more Opus point (9 vs 8), a stable bucket choice across trials (the 4 B still varies trial-to-trial on borderline buckets), and near-identical wall-clock (2.42 s vs 2.13 s; the 35 B's 0.15 s stdev is wider than the 4 B's 0.02 s but still tight). The MoE only activates ~3 B parameters per token, so its decode rate is comparable to a dense 4 B model.
- The 4 B no-think is inconsistent across trials. On the importance workload it sometimes returns 'important' instead of the reference 'marquant' on borderline cases. The 35 B picks the right bucket every time we measured.
- Thinking ON is harmful on strict-JSON workloads. Both the 4 B and the 35 B drop to 6/10 when thinking is ON, because they prefix the JSON with prose ("Let me analyse...") which violates the prompt's "No prose. No fences." instruction. Opus deducts points accordingly. Thinking is a quality lever, not a free upgrade: on workloads where the rubric forbids prose, it costs you.
Decision from mini-bench A: drop the 4 B from production and run the 35 B-A3B no-think on the haiku tier. The on-disk GGUF inventory becomes one chat model plus the embedder.
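The "prose then JSON" failure is easy to detect mechanically. The following is a minimal sketch of a strict check, assuming the workload expects a single JSON object and nothing else; `is_strict_json` is a hypothetical helper, not the validator the pipeline ships.

```python
import json

def is_strict_json(output: str) -> bool:
    """True only if the whole output parses as one JSON object: no prose, no fences."""
    try:
        return isinstance(json.loads(output.strip()), dict)
    except json.JSONDecodeError:
        return False

fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
clean = '{"score": 0.8, "bucket": "marquant"}'
leaky = 'Let me analyse...\n' + clean
fenced = f"{fence}json\n{clean}\n{fence}"

for name, text in [("clean", clean), ("leaky", leaky), ("fenced", fenced)]:
    print(name, is_strict_json(text))
```

A check like this only tells you whether the shape is acceptable; whether the bucket is the right one still needs a judge or a reference.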
Mini-bench B, correct workload, can the 35 B replace Anthropic Opus on rewrite?
| Config | Wall mean | Opus rating | Verdict |
|---|---|---|---|
| 35 B no-think | 9.61 s | 5/10 | fast but quality insufficient |
| 35 B thinking ON | 23.50 s | 6/10 | thinking helps marginally |
| Anthropic Opus (direct) | 12.88 s | 9/10 | ceiling quality, $4/run |
| Anthropic Sonnet | 26.01 s | 9/10 | Opus quality, double latency |
Verdict: Opus stays mandatory for the correct phase. The 35 B locals plateau at 5 to 6/10 regardless of thinking. The Opus judge points at one repeatable failure mode: the local 35 B does not include the [#filename] citation tags in the omission sentences it inserts, and Opus does. That is an instruction the local model ignores no matter how we prompt it, and the gap is visible in the final markdown. Measured shortfall: 3 to 4 points on the 0-10 scale.
7.9 Production architecture and recipe#
The two mini-benches above define the architecture we ship today:
| Phase | Backend | Thinking | Why |
|---|---|---|---|
| extract, verify, importance, omissions | local Qwen3.6-35B-A3B (haiku container) | OFF | 9/10 on importance and verify, 5/10 on extract (one point below ceiling), ~2 to 3 s per call, free |
| correct (8 calls) | Anthropic Opus | n/a | 9/10 ceiling, ~13 s per call, ~$4 per run |
The sonnet container exists in the code for completeness, but the production "best-of-both" mode never loads it: when the correct phase fires, the SDK routes those eight calls back to Anthropic Opus and the sonnet container is skipped, saving 30 seconds of GPU loading and 18 GB of VRAM.
The economics:
| Configuration | Cost / run | Correct-phase Opus rating |
|---|---|---|
| Full Anthropic | ~$5 | 9/10 (ceiling) |
| Pure local (sonnet container with thinking ON) | $0 | 6/10 |
| Hybrid (local middle + Opus correct) | ~$4 | 9/10 (ceiling) |
You save about a dollar over full Anthropic, get the full ceiling on the content-touching phase, and keep the 4 to 9× speedup on the high-volume middle. The provider switch is already in place (just provider_for("opus") on the correct phase). This is the shipped default in --correct-via-opus mode.
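As a rough picture of the hybrid wiring, the phase-to-tier decision fits in one small function. This is a sketch under assumptions: `tier_for_phase` and the `correct_via_opus` parameter are hypothetical names that mirror the --correct-via-opus mode, and the real code calls provider_for("opus") instead of returning a string.

```python
# Phases that stay on the local haiku container (thinking OFF), per the table above.
LOCAL_PHASES = {"extract", "verify", "importance", "omissions"}

def tier_for_phase(phase: str, correct_via_opus: bool = True) -> str:
    """Return the tier a phase should run on under the hybrid wiring."""
    if phase in LOCAL_PHASES:
        return "haiku"  # local Qwen3.6-35B-A3B, no-think
    if phase == "correct":
        # Anthropic Opus in hybrid mode, local sonnet container (thinking ON) otherwise.
        return "opus" if correct_via_opus else "sonnet"
    raise ValueError(f"unknown phase: {phase}")

for phase in ["extract", "verify", "importance", "omissions", "correct"]:
    print(phase, "->", tier_for_phase(phase))
```

Raising on an unknown phase is a deliberate choice in this sketch: a silently defaulted phase could end up on the paid backend without anyone noticing.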
8. Operational notes that bit us#
A handful of things that are not in the bug list above but cost us time anyway.
Three-way VRAM mutex on the same GGUF. The same 24 GB card runs the NLP container (Qwen3-Embedding-8B, ~5 GB), the haiku LLM container (Qwen3.6-35B-A3B no-think, ~18 GB), and the sonnet LLM container (same GGUF with thinking ON). Pairs do not co-exist; the active container has to stop before the next one starts. The haiku-to-sonnet swap is fast (~10 s) because the GGUF stays in the OS page cache and only the llama-server process restarts with new flags. The nlp-to-llm swap, where the chat model has to be loaded into VRAM from cold, takes about 30 s. Each phase that needs a tier calls ensure_llm_running(tier=...) itself rather than relying on a top-level orchestrator. The same "each task manages its own container" pattern shows up in our semantic conflict detection pipeline.
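The "stop the active container before starting the next" rule can be sketched as a tiny state holder. This is not the implementation of ensure_llm_running; `VramMutex` and its injected start and stop callables are hypothetical, and the example records events instead of touching real containers.

```python
from typing import Callable, Optional

class VramMutex:
    """Keeps at most one GPU container running; start/stop are injected callables."""

    def __init__(self, start: Callable[[str], None], stop: Callable[[str], None]):
        self._start, self._stop = start, stop
        self.active: Optional[str] = None

    def ensure_running(self, name: str) -> None:
        if self.active == name:
            return                   # already up, nothing to swap
        if self.active is not None:
            self._stop(self.active)  # free VRAM before the next container loads
        self._start(name)
        self.active = name

events = []
mutex = VramMutex(start=lambda n: events.append(f"start {n}"),
                  stop=lambda n: events.append(f"stop {n}"))
for container in ["nlp", "haiku", "haiku", "sonnet"]:
    mutex.ensure_running(container)
print(events)
```

Because each phase calls the equivalent of `ensure_running` itself, a phase that is skipped (for example when facts are cached) never triggers a swap it does not need.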
Swap cost is small. A full run does two to three tier swaps total (nlp → haiku → sonnet, or nlp → sonnet direct if the extracted facts are cached from a previous run). At ~30 s for an nlp ↔ llm swap and ~10 s for a same-GGUF haiku ↔ sonnet swap, the total swap overhead is about 1 to 1.5 minutes on a ~59 minute run.
Opt-in routing. Local routing has to be explicit. A single environment variable, USE_LOCAL_LLM=1, enables it; without it, the SDK hits Anthropic native. The reason is operational: if the local containers are down or the GPU is busy, the pipeline should still run on Sonnet rather than hang.
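A minimal sketch of the opt-in check, assuming the variable is read once at start-up. `resolve_backend` and the localhost URL are hypothetical; only the USE_LOCAL_LLM=1 convention comes from the pipeline.

```python
import os

LOCAL_BASE_URL = "http://localhost:8080"  # hypothetical llama-server address

def resolve_backend(env=None) -> dict:
    """Local routing only when USE_LOCAL_LLM is exactly "1"; otherwise Anthropic."""
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM") == "1":
        return {"backend": "local", "base_url": LOCAL_BASE_URL}
    # None means: leave the SDK's default endpoint untouched.
    return {"backend": "anthropic", "base_url": None}

print(resolve_backend({}))
print(resolve_backend({"USE_LOCAL_LLM": "1"}))
```

Matching on the exact string "1" rather than on truthiness keeps values such as "0" or "false" from enabling local routing by accident.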
Critical jobs stay on Anthropic. provider_for("opus") clears the override and forces Anthropic. Anything we cannot accept a regression on (the final verification pass, the user-facing summary) stays on a frontier model. The local stack handles the high-volume, lower-stakes middle of the pipeline.
Fork freshness. spiritbuun's fork moves fast. Building with --no-cache --ref <new-sha> rebuilds against a specific upstream commit. Pinning is not optional with a fork that pushes daily on the critical path.
Run N ≥ 5 trials before drawing conclusions. An earlier single-trial version of the bench had local-mainline beating local-turbo on rewrite quality. With N=5 that flipped, and the recommendation in §7.7 inverted. Single-trial benches on stochastic decoders are anecdotes; we treat them as such now.
9. What we would tell someone starting today#
If you are planning to route the Claude Agent SDK to one or more local LLMs, the order we would recommend:
- Pick one mid-size MoE model and use it on both tiers. A 30 B-class MoE that activates ~3 B parameters per token (Qwen3.6-35B-A3B in our case) has the latency of a 4 B dense model and the quality of a 30 B dense model. Running it on both tiers with different thinking flags is simpler than maintaining a separate 4 B for the high-volume tier and a 30 B for the low-volume tier, and our mini-bench (§7.8) shows it is also higher quality.
- Get a per-tier model alias right first. Stable names in /v1/models are the difference between "model not found" and "model works". Add --alias <tier-name> to each llama-server command before anything else, and make sure your tier swap clears the residual model variables.
- Disable thinking at the engine level, not in the prompt. Prompt-level directives are a courtesy the model will refuse 90% of the time. The --jinja --reasoning off --chat-template-kwargs '{"enable_thinking": false}' combo is what worked for us on Qwen3.6. The speedup is twelvefold and the quality improves too on strict-JSON workloads, because reasoning prose violates "No prose. No fences." rubrics.
- Audit the SDK's built-in tools list. Print what the SDK actually exposes to your model and ban the full set. Do not trust your memory of "the 12 tools"; the real number drifts as the SDK ships features.
- Benchmark with N ≥ 5 trials and an Opus-as-judge ceiling. Single-trial benches lie about stochastic decoders. The Anthropic-vs-Anthropic baseline rating is the ceiling that lets you say "at parity" honestly instead of chasing a 0.96 cosine that does not mean anything.
- Default to TurboQuant with --parallel 2. Mainline llama.cpp is slower and not measurably higher quality once you average over enough trials. TurboQuant + flash attention + the 1.5 GB of headroom the 4-bit KV gives you for the second parallel slot is the configuration to start from.
- Treat thinking as a tool, not a default. It helps long-form rewriting and hurts strict-JSON output. The default policy is OFF on the haiku-tier container and ON on the sonnet-tier container; resist the urge to flip thinking ON globally "for quality", it will silently degrade the high-volume workloads.
- Accept that some workloads stay on Anthropic. Our correct-phase rewrite plateaus at 6/10 against a 9/10 ceiling on every local config we tried, because the locals consistently omit the [#filename] citation tags. Route those calls back to Anthropic Opus and stop chasing the gap.
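Several items in the list above come down to the llama-server command line, so here is a sketch of how the two tiers might differ. It uses only the flags quoted in this article plus -m for the model path. The helper `llama_server_args`, the model path, and the choice to pass "enable_thinking": true while omitting --reasoning off on the sonnet tier are assumptions, not the shipped command.

```python
import json

def llama_server_args(model_path: str, tier: str, parallel: int = 2) -> list[str]:
    """Build a llama-server argv for one tier (illustrative, not the shipped command)."""
    thinking = tier == "sonnet"  # policy from the list: OFF on haiku, ON on sonnet
    args = [
        "llama-server",
        "-m", model_path,
        "--alias", tier,         # stable name exposed in /v1/models
        "--parallel", str(parallel),
        "--jinja",
        "--chat-template-kwargs", json.dumps({"enable_thinking": thinking}),
    ]
    if not thinking:
        args += ["--reasoning", "off"]  # engine-level disable, as quoted in the list
    return args

haiku_args = llama_server_args("model.gguf", "haiku")    # placeholder path
sonnet_args = llama_server_args("model.gguf", "sonnet")
print(" ".join(haiku_args))
print(" ".join(sonnet_args))
```

Generating both commands from one function keeps the two containers identical apart from the alias and the thinking flags, which is the whole premise of the single-GGUF setup.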
The bigger pattern is that LLM-agnostic execution is at the centre of any serious production stack. The same instinct shows up in our PageIndex vs Anatoly RAG note (the same retrieval pipeline runs on whichever LLM is cheapest at the moment) and in AI Code Audit vs AI Code Review in 2026 (the audit harness has to outlive any single model generation). If your architecture only works against one provider, the model release cycle will eat you alive.
For implementation questions, open a discussion on the GitHub repo or reach the author directly. The full research index has the surrounding context on retrieval, conflict detection, and the audit harness this routing layer feeds into.