
Routing the Claude Agent SDK to Local LLMs: A Dual-Tier Qwen Stack with TurboQuant 4-bit KV

An Anatoly exploration: running a multi-step pipeline on local llama.cpp instead of Anthropic. Single Qwen3.6-35B-A3B GGUF in two thinking modes (haiku no-think, sonnet thinking), four SDK bugs including a thinking-disable trick worth a 12-fold speedup, TurboQuant 4-bit KV cache on a 24 GB RTX 3090 Ti, 100-call benchmark with Opus-as-judge. Local is 4 to 9 times faster than Anthropic on the high-volume steps and at the Opus ceiling on verify-rag.

May 27, 2026 · 43 min read · Rémi Viau


Note by: Rémi Viau (Anatoly maintainer), with Claude (Anthropic Opus 4.7) as analytical partner.

The Claude Agent SDK is a convenient front-end. The model behind it does not have to be a Claude model. This article is a field report from an Anatoly exploration: pointing the SDK at local llama.cpp servers running Qwen3 models on a single RTX 3090 Ti with 24 GB of VRAM. The trigger was the cost of a long-running multi-step document fact-check pipeline that was burning about five dollars per run on Anthropic Sonnet, most of it on the high-volume verification steps that hammer the SDK with thousands of small calls.

We started with a dual-model setup (a 4 B dense Qwen3 for the haiku tier, a Qwen3.6-35B-A3B for the sonnet tier) and converged on a single-model architecture after the mini-bench in §7.8: one Qwen3.6-35B-A3B GGUF runs in two containers, distinguished only by a thinking flag. The haiku container (thinking OFF) absorbs the thousand-plus verify calls per run; the sonnet container (thinking ON) handles the few high-stakes correction passes. Same model file, ~10 second container restart to switch tiers (the GGUF stays in the OS page cache). The opus role stays on Anthropic, and the production "best-of-both" wiring routes the correct-phase rewrite back to Anthropic Opus where the local 35B falls 2 to 3 Opus-judge points short.

The interesting parts are not the architecture diagram. They are the four SDK integration bugs that nobody warns you about, the one-line container flag that gave us a 12-fold speedup once we found it, and the discovery that landed the final architecture: a 35B-A3B MoE model in no-think mode beats a dedicated 4B model on the haiku-tier workloads, at comparable latency, because the MoE activates only ~3 B parameters per token. The benchmark below (100 calls, 5 providers, 4 workloads, N=5, Opus-as-judge) shows local-turbo with --parallel 2 running 4 to 9 times faster than Anthropic on the high-volume steps, at parity on verify, beating Anthropic on importance, and a few points below on extract and rewrite. Cost per run drops to zero on pure local, $4 on the recommended hybrid setup.

Quick verdict: is the local stack good enough?

Yes on the high-volume Haiku-tier workloads. The Sonnet-tier rewrite still needs Anthropic Opus. One Qwen3.6-35B-A3B GGUF runs in two containers; the only difference is the thinking flag. Here is the Opus-judge verdict, broken out by which Anthropic model each local container replaces:

Local container	Model & mode	Replaces	Workloads (calls per run)	Opus-judge verdict
`llm-haiku`	Qwen3.6-35B-A3B, thinking OFF	Anthropic Haiku	verify-rag (~1 300), extract (~88), importance (~300)	verify: parity at 9/10. Importance: 9/10, one point above Anthropic Haiku itself. Extract: 4 to 5/10 vs a 6/10 ceiling (the 4 B benched at 4 to 5/10; not retested on the 35 B, expected to match or exceed since the 35 B beats the 4 B on every workload tested in §7.8). Anthropic Haiku itself does not score higher than 6/10 on this rubric.
`llm-sonnet`	Qwen3.6-35B-A3B, thinking ON	Anthropic Sonnet	correct-section-rewrite (~8)	6/10 vs the 9/10 ceiling. Applies every requested fix but misses the `[#filename]` citation tags that Opus produces by default. The gap does not close with thinking, prompt tweaks, or the larger model: Opus stays mandatory for this workload in production.

The bottom line. A single 35 B MoE model in no-think mode meets Anthropic Haiku on every high-volume workload (and beats it on importance, where the 4 B alternative we initially shipped was scoring 2/10). The same model with thinking ON on the rewrite phase lands 3 Opus-judge points below Anthropic Sonnet/Opus and we have not been able to close the gap. The shipped production setup is therefore hybrid: local for the Haiku-tier middle (extract, verify, importance, omissions), Anthropic Opus for the eight correct-phase rewrites. End-to-end ~59 minutes local-with-Opus-correct vs ~4 hours full Anthropic; ~$4 per run vs ~$5 full Anthropic or $0 if you accept the rewrite gap.

At a glance: how close is local to Anthropic?

Quality. Opus-judge rating per workload, on a 0-to-10 scale where 10 means "indistinguishable from the Anthropic reference". Gold bars = Anthropic ceiling (Anthropic run twice on the same prompt, second run rated against the first). Mint bars = the local 35 B-A3B in its production thinking config (no-think for haiku-tier, thinking ON for rewrite). Importance and correct values come from the §7.8 mini-bench on the 35 B; extract and verify are taken from the §7.4 4 B bench since we did not rerun them on the 35 B (the chart picks the upper bound of the measured range for extract).

Local lands at parity on verify-rag (1 300 calls per run, the cost-saving workload) and one point above the Anthropic-vs-Anthropic ceiling on importance thanks to the 35B-A3B no-think config. Extract is one to two points below ceiling. The persistent gap is on correct-rewrite (6/10 vs 9/10): the locals miss the [#filename] citation tags Opus produces, and neither thinking nor prompting closes it. The production recipe (§7.9) routes those eight calls back to Anthropic Opus.

Speed. Wall-clock seconds per call. Gold bars = Anthropic. Mint bars = local 35 B-A3B on local-turbo-parallel. The correct bar shown is the pure-local fallback (35 B thinking ON, 17.57 s); production actually routes correct to Anthropic Opus at ~13 s per call (see §7.9), so neither bar reflects the shipped path for that workload.

The local bars are 4 to 9 times shorter on the three Haiku-tier workloads and a little over 2 times shorter on Sonnet-tier correct-rewrite. The importance number moved from 1.52 s (the 4 B model we initially shipped) to 2.42 s (the 35 B no-think we ship now): still 4.3× faster than Anthropic, with the +1 quality point that comes with it. End-to-end the local pipeline runs in roughly one hour versus four hours on Anthropic.

Key findings (TL;DR)

If you only read one section beyond the verdict above, read this one.

Drop-in routing works. The Claude Agent SDK reads ANTHROPIC_BASE_URL and ANTHROPIC_DEFAULT_*_MODEL on every call, so pointing it at a local llama-server Anthropic-compatible endpoint requires no SDK fork.
One GGUF, two thinking modes. Production runs a single Qwen3.6-35B-A3B file in two containers: the haiku container (thinking OFF) handles high-volume JSON workloads; the sonnet container (thinking ON) handles the rewrite. The 35B-A3B MoE activates only ~3 B parameters per token, so its latency stays close to that of a 4 B model (2.42 s vs 1.52 s per importance call) while its quality on the haiku workloads is higher. This is the architecture we ship after a follow-up mini-bench rejected the initial dual-model setup.
Four SDK integration bugs to know. Per-tier --alias for /v1/models; parallel=1, ctx_per_slot=32768 for long prompts; ban all 27 built-in tools, not the 12 you remember; and disable thinking at the engine, not in the prompt.
Thinking-disable at the engine is the biggest single win. The prompt-level /no_think directive is ignored 92% of the time on Qwen3.5 and Qwen3.6. Passing --jinja --reasoning off --chat-template-kwargs '{"enable_thinking": false}' to llama-server delivers a 12-fold speedup on the high-volume haiku workloads. The same flag is also a quality lever: on strict-JSON workloads, thinking ON drops the Opus rating because the model prefixes the JSON with prose.
Recommended config: TurboQuant with --parallel 2, not mainline llama.cpp. 4 to 9 times faster than Anthropic on the high-volume steps, at the Opus-judge 9/10 ceiling on verify-rag (full parity) and one point above the ceiling on importance.
Correct-phase stays on Anthropic Opus. A local 35B (thinking ON or OFF) lands 3 Opus-judge points below Opus on rewrite, mainly because it omits the [#filename] citation tags. The production setup is hybrid: local for the high-volume middle, Anthropic Opus for the 8 rewrite calls. ~$4 per run vs ~$5 full Anthropic or $0 if you accept the rewrite gap.
End-to-end: ~59 minutes local vs ~4 hours Anthropic (~4× faster, near-zero cost modulo electricity and the 8 Opus calls).

Benchmark methodology: 100 calls, 5 providers, 4 workloads, N=5 trials, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: RTX 3090 Ti, llama.cpp build pinned. A follow-up mini-bench (§7.8) compared 4 B vs 35 B in both thinking modes on a single workload, and motivated the single-model architecture.

Caveats: how much to trust these numbers

Four things to flag before the deep dive, so you read the rest with the right calibration.

N=5 is small, and the Opus-as-judge is a single judgement per cell. The whole bench is N=5 trials, and the quality column is one Opus-as-judge rating per (provider, workload) cell on a 0-to-10 scale that Opus itself produces stochastically. That is closer to a quantified anecdote than a measurement. One-point gaps like 4/10 vs 5/10 or 8/10 vs 9/10 sit below what the protocol can reliably distinguish. The Anthropic-vs-Anthropic ceiling anchors what "indistinguishable" looks like, but it does not rescue the small sample size. Mentally apply a ±1 band on every Opus rating you read in this article.

The "at a glance" quality chart mixes measured and inferred values. The extract and verify bars come from the 4 B run in §7.4. We did not rerun those two workloads on the 35 B we now ship, on the assumption that the 35 B "should match or exceed" the 4 B given the importance and correct results we did rerun in §7.8. The verify parity (9/10 = ceiling) is a real measured result on both models and the headline of the article. The extract value is extrapolated. Only readers who read the chart caption catch this.

"4 to 9 times faster" is partly an output-volume artifact. On extract, Anthropic produces about 9 800 output tokens (17 atomic facts) in 53 seconds; local produces 590 to 809 tokens (8 facts) in 6 seconds. Part of the wall-clock win comes from local doing less work, not from being faster at the same work. In tokens per second, Anthropic actually leads (~183 tps vs ~100 for local-turbo). The latency ratio is the thing that matters operationally when a pipeline waits on each call to return, but presenting it as "9 times faster" without flagging the throughput gap is flattering. The honest framing is: same shape of work, less of it, faster wall-clock per call.

The dollar saving is not the real win. "Cost drops to zero" is only true for the pure-local mode, which we explicitly do not ship because it leaves rewrite quality at 6/10. The actually-recommended hybrid setup (Opus on the eight correct-phase calls) costs roughly $4 per run versus ~$5 full Anthropic. That is about a dollar per run, in exchange for a dedicated RTX 3090 Ti, ~59 minutes of local compute, the electricity to drive it, and ongoing maintenance of a llama.cpp fork that pushes daily on the critical path (§8). The dollar is real but small. The real gains are latency (a pipeline that returns in an hour rather than four) and keeping the documents on local hardware. We should have led with those, not with the dollar.

1. Why the Claude Agent SDK lets you do this at all

The Claude Agent SDK reads its target from a small set of environment variables it consults at every call: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL, and the three ANTHROPIC_DEFAULT_*_MODEL for haiku/sonnet/opus. If you point the base URL at a server that speaks the /v1/messages shape, the SDK does not know or care that the responder is not Anthropic. llama.cpp's server has implemented an Anthropic-compatible endpoint for some months now (see the server README), and it works.

The pipeline keeps two local containers behind the SDK's two budget tiers, with the embedding container as the third occupant of the same GPU. The haiku and sonnet containers run the same Qwen3.6-35B-A3B GGUF; only the thinking flag differs at startup.

Three containers, one up at a time. Switching from the haiku container to the sonnet container is a container restart with different flags, not a different model: the GGUF stays in the OS page cache, so the swap takes about 10 seconds in practice. A small context manager swaps the eight relevant environment variables on entry, restores them on exit. Calling opus clears the overrides and lets the SDK hit Anthropic native; calling haiku or sonnet redirects to the matching local server. The pipeline can chain opus → haiku → sonnet → opus in a single run safely.
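The snapshot-and-restore behaviour described above can be sketched as a small context manager. This is a minimal illustration, not the pipeline's actual code: it covers only the six variables named in this section (the text mentions eight, the other two are not listed), and the sonnet port, the dummy token, and the `routed_to` name are placeholders.

```python
import os
from contextlib import contextmanager

# The six variables named in this section. The article mentions eight in
# total; the remaining two are not listed, so they are left out here.
ROUTING_VARS = (
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
)

# Hypothetical per-tier overrides. Port 11451 and the aliases appear in §2;
# the sonnet port and the token value are placeholders.
TIER_OVERRIDES = {
    "haiku": {
        "ANTHROPIC_BASE_URL": "http://127.0.0.1:11451",
        "ANTHROPIC_AUTH_TOKEN": "local-dummy-token",
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": "local-haiku",
    },
    "sonnet": {
        "ANTHROPIC_BASE_URL": "http://127.0.0.1:11452",
        "ANTHROPIC_AUTH_TOKEN": "local-dummy-token",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "local-sonnet",
    },
    "opus": {},  # no overrides: the SDK talks to Anthropic directly
}


@contextmanager
def routed_to(tier: str):
    """Point the SDK at one tier for the duration of the with-block."""
    snapshot = {name: os.environ.get(name) for name in ROUTING_VARS}
    try:
        # Clear every routing variable first so nothing leaks across tiers.
        for name in ROUTING_VARS:
            os.environ.pop(name, None)
        os.environ.update(TIER_OVERRIDES[tier])
        yield
    finally:
        # Restore the exact pre-entry state, including "was unset".
        for name, value in snapshot.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
```

Clearing the whole set before applying a tier's overrides is what keeps a stale haiku alias from surviving a swap to sonnet, and restoring from the snapshot in `finally` makes chained swaps safe even when a call raises.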

That is the easy half. The hard half is everything the SDK does before and around the call.

2. Bug 1: expose a stable per-tier model alias via `/v1/models`

The SDK validates the model name at session init, before any /v1/messages call. It does this by hitting /v1/models and checking the requested ID is in the returned list. With llama.cpp's defaults, /v1/models returns the GGUF filename, qwen3.6-35b-a3b-Q4_K_M.gguf, which is neither stable nor what the SDK was told to ask for. The SDK then refuses the call with a misleading "model not found" error.

The fix is to pass --alias <stable-name> to llama-server at startup, with a different name on each tier:

# haiku container
llama-server ... --alias local-haiku
# sonnet container
llama-server ... --alias local-sonnet

After that, curl -s http://127.0.0.1:11451/v1/models | jq .data[0].id returns local-haiku, the sonnet port returns local-sonnet, and matching ANTHROPIC_DEFAULT_HAIKU_MODEL=local-haiku + ANTHROPIC_DEFAULT_SONNET_MODEL=local-sonnet on the client side validates cleanly.
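The same check can be scripted as a pre-flight step before a run. The sketch below only parses a response body; it assumes the `data[].id` shape implied by the `jq` command above, and the sample bodies are trimmed stand-ins for what a real server returns.

```python
import json


def served_model_ids(models_payload: str) -> list[str]:
    """Extract model IDs from the JSON body returned by GET /v1/models."""
    body = json.loads(models_payload)
    return [entry["id"] for entry in body.get("data", [])]


def alias_is_served(models_payload: str, expected_alias: str) -> bool:
    """True when the alias the client will request is actually advertised."""
    return expected_alias in served_model_ids(models_payload)


# Trimmed sample bodies; a real response carries more fields per entry.
with_alias = '{"object": "list", "data": [{"id": "local-haiku"}]}'
without_alias = '{"object": "list", "data": [{"id": "qwen3.6-35b-a3b-Q4_K_M.gguf"}]}'

print(alias_is_served(with_alias, "local-haiku"))     # True
print(alias_is_served(without_alias, "local-haiku"))  # False
```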

A subtle trap that bit us when we added the second tier: when you swap from haiku to sonnet, you must also clear the haiku model variable in the environment. Otherwise the SDK can route a residual haiku-tagged call to a haiku container that is no longer running. The fix is a clean snapshot/restore of the full eight-variable set on every tier swap, and a dedicated regression test for consecutive swaps.

3. Bug 2: per-slot context vs total context

The next error you hit is more brutal:

API Error: 400 request (22 584 tokens) exceeds the available context size (4 096 tokens)

llama.cpp divides --ctx-size by --parallel to compute the context per slot. The defaults are parallel=4, ctx_size=16384, which gives each slot 4 096 tokens. A correction prompt with 22 k of input plus 10 k of expected output blows through that on the first token.

There are two ways to fix this: raise --ctx-size, or lower --parallel. For a single-GPU box where the pipeline serialises calls anyway (the orchestrator launches sections via asyncio.gather but the GPU is the bottleneck), parallel=1, ctx_per_slot=32768 is the simpler choice. llama.cpp's internal --cont-batching already amortises expert loads on a MoE model like Qwen3.6-35B-A3B (3 B active parameters out of 35 B), so you do not lose much throughput by setting parallel=1.

If you do want concurrency, the right move is to raise --ctx-size to parallel × desired_per_slot. Just remember the KV cache grows with total context, not per-slot context.
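The slot arithmetic is simple enough to check before starting a container. The helper names below are illustrative; the numbers are the ones quoted in this section.

```python
def ctx_per_slot(ctx_size: int, parallel: int) -> int:
    """Context available to each slot: total context split across slots."""
    return ctx_size // parallel


def ctx_size_for(per_slot: int, parallel: int) -> int:
    """Total --ctx-size needed to give every slot `per_slot` tokens."""
    return per_slot * parallel


def fits(prompt_tokens: int, expected_output_tokens: int, per_slot: int) -> bool:
    """A request fits when input plus expected output stays within one slot."""
    return prompt_tokens + expected_output_tokens <= per_slot


print(ctx_per_slot(16_384, 4))                         # 4096, the failing default
print(fits(22_584, 10_000, ctx_per_slot(16_384, 4)))   # False
print(fits(22_584, 10_000, ctx_per_slot(32_768, 1)))   # True
print(ctx_size_for(32_768, 2))                         # 65536 for two slots
```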

4. Bug 3: ban all 27 built-in tools, not the 12 you remember

The most expensive bug we shipped against, and the most surprising. On the extraction phase (which expects JSON output, not tool calls), we suddenly saw 84 tool.builtin ERR AskUserQuestion errors in five minutes, cascading into "Reached maximum number of turns (3)" failures and empty extractions.

The mechanism is this. The Claude Agent SDK exposes a default set of built-in tools to every model unless you explicitly disable them. The list of tools we had been disabling was 12 long (file ops, web, task). The actual default set is around 27, and it includes AskUserQuestion, EnterPlanMode, Skill, CronCreate, TaskOutput, PushNotification, ScheduleWakeup, and a dozen others. Setting allowed_tools=[] does not suffice: the SDK still exposes the built-ins; you have to list them in disallowed_tools explicitly.

Why did this break with the local Qwen model but not with Sonnet? Two reasons. First, Sonnet has been trained to use built-in tools sparingly and to favour direct JSON output when the prompt asks for it. The Qwen model has not, and it cheerfully called AskUserQuestion to "think out loud" when uncertain (this was observed with the 4 B variant we initially shipped; the 35 B with thinking OFF is better but the disable-list discipline remains the right default). Second, our max_turns=3 cap meant three calls to AskUserQuestion were enough to exhaust the budget without ever emitting the requested JSON.

The fix is a single constant listing all 27 built-ins, applied to every JSON-only step:

BUILTIN_TOOLS_ALL = [
    "AskUserQuestion", "EnterPlanMode", "ExitPlanMode",
    "Skill", "CronCreate", "CronDelete", "CronList",
    "TaskOutput", "TaskStop", "PushNotification",
    "ScheduleWakeup", "RemoteTrigger", "Monitor",
    "WebFetch", "WebSearch", "TodoWrite",
    "Read", "Write", "Edit", "NotebookEdit",
    "Bash", "Glob", "Grep", "ToolSearch",
    "EnterWorktree", "ExitWorktree", "BashOutput",
]

For steps that genuinely need one tool (the corrector needs Write), the disallow list becomes [t for t in BUILTIN_TOOLS_ALL if t != "Write"].
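One way to keep the per-step lists honest is a small helper that derives them from the single constant and rejects typos. The `disallowed_tools` function name is illustrative; the tool names are the 27 listed above.

```python
BUILTIN_TOOLS_ALL = [
    "AskUserQuestion", "EnterPlanMode", "ExitPlanMode",
    "Skill", "CronCreate", "CronDelete", "CronList",
    "TaskOutput", "TaskStop", "PushNotification",
    "ScheduleWakeup", "RemoteTrigger", "Monitor",
    "WebFetch", "WebSearch", "TodoWrite",
    "Read", "Write", "Edit", "NotebookEdit",
    "Bash", "Glob", "Grep", "ToolSearch",
    "EnterWorktree", "ExitWorktree", "BashOutput",
]


def disallowed_tools(keep: tuple[str, ...] = ()) -> list[str]:
    """Every built-in except the ones a step genuinely needs."""
    unknown = set(keep) - set(BUILTIN_TOOLS_ALL)
    if unknown:
        # A misspelt name would otherwise silently ban the tool it meant to keep.
        raise ValueError(f"not a known built-in tool: {sorted(unknown)}")
    return [tool for tool in BUILTIN_TOOLS_ALL if tool not in keep]


json_only_step = disallowed_tools()            # bans all 27
corrector_step = disallowed_tools(("Write",))  # bans 26, keeps Write
print(len(json_only_step), len(corrector_step))  # 27 26
```

The resulting list is what goes into the `disallowed_tools` option of each step.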

The before/after on the extraction phase tells the story:

	Before	After
`tool.builtin ERR` over 5 minutes	84	0
Chunk failures	11+	0
Facts extracted per chunk	1 (always failing)	14 to 31

The lesson generalises to anyone running a non-Anthropic model behind the SDK: do not assume the default tool disabling list is exhaustive. Print what the SDK actually exposes to the model and ban the full list.

5. Bug 4: disable thinking at the container, not in the prompt

This was the most expensive bug we had no fix for initially, and the one that quietly cost us hours per run before we found it. Qwen3.5 and Qwen3.6 are unified "thinking" models: they emit an internal reasoning trace before the final answer, the same shape as DeepSeek-R1 or o1-style reasoning. For long-form rewriting on the sonnet tier this is helpful. For a verify call that has to decide yes / no on a single fact in 35 output tokens, it is catastrophic.

The first attempt was to suffix /no_think to every user prompt, a convention shipped in the original Qwen3 chat template. The directive was honoured in only 8% of 24 sample calls; the other 92% kept thinking. Result: each verify call spent 15 to 50 seconds reasoning before emitting the JSON, blowing the verify phase out to several hours per run.

The diagnosis took longer than it should have. Qwen3.5 and Qwen3.6 dropped the /no_think directive from their chat template (it was present in Qwen3 original, removed in 3.5+). The supported path is chat_template_kwargs: {enable_thinking: false}, which is documented for the OpenAI endpoint but ignored on the Anthropic endpoint of llama.cpp.

The fix is to pass the flag at container startup, not per request, via the llama-server arguments:

llama-server \
  --jinja \
  --reasoning off \
  --chat-template-kwargs '{"enable_thinking": false}' \
  ...

--jinja forces the use of the chat template embedded in the GGUF (the only template that knows how to interpret enable_thinking). --reasoning off is the recent llama.cpp flag that disables the reasoning block at the engine level. --chat-template-kwargs is the portable form. The three flags together are belt-and-suspenders; the Qwen documentation recommends combining them for robustness.

The right policy is per-container, both pointing at the same GGUF:

llm-haiku container (disable_thinking=True by default): high-volume workloads, low-latency, strict-JSON output. Verify, extract, importance scoring. Thinking ON costs both latency and quality here, since the model prefixes the JSON with reasoning prose (see §7.8 mini-bench).
llm-sonnet container (disable_thinking=False by default): low-volume, longer outputs. Correct-phase rewrite, cohesion checks. Reasoning depth helps the final text, though even with thinking ON the local 35 B does not match Anthropic Opus on this workload (production routes correct to Opus anyway, see §7.9).
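The per-container policy can be captured in one place so both tiers are launched from the same description. This is a sketch under assumptions: the model path is hypothetical, the flags are the ones quoted in this section plus `--model` and `--alias`, and the sonnet tier is assumed to simply omit the thinking-off flags.

```python
import json
import shlex

# Engine-level flags that switch thinking off, as listed in this section.
THINKING_OFF_FLAGS = [
    "--jinja",
    "--reasoning", "off",
    "--chat-template-kwargs", json.dumps({"enable_thinking": False}),
]

# Per-tier policy: alias from §2, thinking disabled only on the haiku tier.
TIERS = {
    "haiku": {"alias": "local-haiku", "disable_thinking": True},
    "sonnet": {"alias": "local-sonnet", "disable_thinking": False},
}


def llama_server_args(tier: str, model_path: str) -> list[str]:
    """Build the llama-server argument list for one tier."""
    policy = TIERS[tier]
    args = ["llama-server", "--model", model_path, "--alias", policy["alias"]]
    if policy["disable_thinking"]:
        args += THINKING_OFF_FLAGS
    return args


# Hypothetical path; both tiers point at the same GGUF file.
MODEL = "/models/qwen3.6-35b-a3b.gguf"
print(shlex.join(llama_server_args("haiku", MODEL)))
print(shlex.join(llama_server_args("sonnet", MODEL)))
```

Keeping the JSON payload behind `json.dumps` avoids hand-quoting mistakes in the `--chat-template-kwargs` value.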

The empirical result on the importance-scoring workload, measured under TurboQuant, tells the whole story:

	Before (thinking leak)	After (flag honoured)	Speedup
Wall mean	21.7 s	1.83 s	×12
Output tokens (mean)	358	9	×40 less waste
`usage.output_tokens` from llama.cpp	0 (broken)	9 (correct)	fixed

A twelvefold speedup on a workload that runs 300 times per pipeline run, and a similar speedup on verify (1 300 calls per run) which suffered from the exact same thinking leak. On verify, the math works out to roughly 40 minutes after the fix (1 300 × 1.87 s) versus about 8 hours before (1 300 × 21.7 s).
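The phase-level arithmetic above can be reproduced in a few lines. It assumes calls run strictly one after another, which is a simplification of the real orchestrator.

```python
def phase_seconds(calls: int, seconds_per_call: float) -> float:
    """Total wall-clock time for a phase whose calls run back to back."""
    return calls * seconds_per_call


VERIFY_CALLS = 1_300

after_min = phase_seconds(VERIFY_CALLS, 1.87) / 60     # flag honoured
before_h = phase_seconds(VERIFY_CALLS, 21.7) / 3600    # thinking leak
speedup = 21.7 / 1.83                                  # importance workload

print(f"verify after the fix:  {after_min:.1f} min")   # 40.5 min
print(f"verify before the fix: {before_h:.1f} h")      # 7.8 h
print(f"importance speedup:    x{speedup:.1f}")        # x11.9
```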

The generalised lesson: with thinking models, the disable-thinking knob has to be at the engine level, not in the prompt. Prompt-level directives are a courtesy the model can refuse. Engine-level flags are not.

6. TurboQuant: 4-bit KV cache to fit the workload on a 24 GB card

With Qwen3.6-35B-A3B UD-IQ4_XS occupying about 18 GB of VRAM, a 32 k-token context window with FP16 KV cache consumes another 3 GB. That leaves roughly 3 GB of headroom on a 24 GB card, no margin for batching, and zero room for the embedding model the same pipeline uses for retrieval. The same trade-off shows up in any system that mixes embeddings and chat on one card: you either swap models in and out (which we also do, see §8) or shrink the runtime memory of one of them.

TurboQuant is a KV cache quantization scheme. The model weights stay in their original quant (UD-IQ4_XS in our case); only the KV cache is recoded at runtime to 4.25 bits per value (turbo4, about 3.8× compression vs FP16) or 3.25 bits per value (turbo3_tcq, about 5× compression). The technique was first described in the TurboQuant paper and has since landed in several llama.cpp forks. It is invoked through the standard llama.cpp interface, -ctk turbo4 -ctv turbo4, so swapping it in is one flag away once the binary is built with the right kernels.
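The compression figures follow directly from the bits-per-value numbers. The sketch below recomputes them and scales an FP16 cache size accordingly; it assumes the cache footprint is proportional to bits per value and ignores any fixed overhead such as codebooks.

```python
FP16_BITS = 16.0


def compression_ratio(bits_per_value: float) -> float:
    """How many times smaller the KV cache is compared with FP16."""
    return FP16_BITS / bits_per_value


def quantized_cache_gb(fp16_cache_gb: float, bits_per_value: float) -> float:
    """Approximate cache size after recoding, assuming linear scaling."""
    return fp16_cache_gb * bits_per_value / FP16_BITS


print(f"turbo4:     x{compression_ratio(4.25):.2f}")   # x3.76
print(f"turbo3_tcq: x{compression_ratio(3.25):.2f}")   # x4.92
print(f"3 GB FP16 cache at turbo4: {quantized_cache_gb(3.0, 4.25):.2f} GB")  # 0.80 GB
```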

6.1 Choosing a fork

Four llama.cpp forks ship TurboQuant kernels at the time of writing:

Fork	Status	Platform	Notes
TheTom/turboquant_plus	Metal-first	Apple Silicon	Parity with q8_0 on M5
Aaryan-Kapoor/turboquant-tq3_0	CPU only	x86_64	Not useful with a CUDA card
Madreag/turbo3-cuda	RTX 5090 (Ada Lovelace)	sm_89	Wrong architecture target for us
spiritbuun/llama-cpp-turboquant-cuda	RTX 3090 (Ampere)	sm_86	Benchmarks published on a 3090

We picked spiritbuun because the README reports a constant ~30 tok/s decode rate from 4 K to 128 K context on a 3090, and a perplexity that beats q8_0 thanks to a documented "norm correction". There is some bus factor risk on a fork with ~600 stars; the mitigation is to pin the upstream commit and to be able to rebuild without depending on the fork's CI.

6.2 Multi-stage Dockerfile

The build needs the CUDA devel image (nvcc, cuBLAS); the runtime only needs the runtime image. Multi-stage cuts the final image to about 3 GB.

# builder stage: needs the CUDA devel image for nvcc and cuBLAS
FROM nvidia/cuda:12.6.3-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y \
    git cmake build-essential ninja-build libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Stub libcuda.so.1 for the link step; see §6.3.1
RUN ln -sf libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}

# Clone the fork; check out a pinned commit here to limit the bus-factor risk (§6.1)
RUN git clone https://github.com/spiritbuun/llama-cpp-turboquant-cuda.git /src
WORKDIR /src

# Ampere (sm_86) build with flash-attention kernels for all KV quant types
RUN cmake -B build -G Ninja \
        -DGGML_CUDA=ON \
        -DGGML_CUDA_FA_ALL_QUANTS=ON \
        -DCMAKE_CUDA_ARCHITECTURES=86 \
        -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link=/usr/local/cuda/lib64/stubs" \
        -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link=/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target llama-server

# runtime stage: the CUDA runtime image is enough
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04

# libgomp1 is required at runtime; see §6.3.2
RUN apt-get update && apt-get install -y \
    libcurl4 libgomp1 ca-certificates wget \
    && rm -rf /var/lib/apt/lists/*

COPY --from=builder /src/build/bin/llama-server /usr/local/bin/
COPY --from=builder /src/build/bin/*.so /usr/local/lib/
COPY --from=builder /src/codebooks /opt/codebooks

# Refresh the linker cache so the copied shared libraries are found
RUN ldconfig

HEALTHCHECK CMD wget -qO- http://localhost:8080/health || exit 1
ENTRYPOINT ["llama-server"]

6.3 Build gotchas worth remembering

6.3.1 `libcuda.so.1` not found at link time

The linker looks for libcuda.so.1 (the versioned soname), but the CUDA devel image only ships libcuda.so (unversioned) in /usr/local/cuda/lib64/stubs/. The real libcuda.so.1 comes from the host driver at runtime via --gpus all. The link still has to resolve the symbol, so you need three things combined:

A symlink to fake the versioned soname: ln -sf libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1.
ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs:... so gcc finds -lcuda at compile-link time.
-Wl,-rpath-link=/usr/local/cuda/lib64/stubs so ld can resolve NEEDED entries that point at libcuda transitively (libggml depends on libcuda, and the final llama-server link checks the chain).

Miss any one of the three and the link fails with undefined reference to 'cuMemCreate' and friends.

6.3.2 `libgomp.so.1` missing at runtime

The CUDA runtime image does not ship OpenMP. The llama-server binary links against libgomp.so.1 for CPU-side helper parallelism. Boot the container without it and you get:

/usr/local/bin/llama-server: error while loading shared libraries:
libgomp.so.1: cannot open shared object file

The fix is one line in the runtime stage: apt-get install -y libgomp1.

7. Benchmark: five providers, four workloads, Opus-as-judge

The fixes above are only worth shipping if the local stack is actually competitive on the workloads that matter. The benchmark below is what we use to settle the question. It is in the repo, reproducible from a single command, and reads from the same JSONL files that the rest of the pipeline produces, so the workloads are the real workloads.

7.1 Methodology

The benchmark cycles sequentially through five providers × four workloads × N=5 trials, for 100 calls total. The five providers:

anthropic: Anthropic native API. Reference.
local-mainline: Qwen on llama.cpp mainline, FP16 KV cache.
local-turbo: Qwen on the spiritbuun fork, turbo4 KV cache, parallel=1.
local-turbo-parallel: same as local-turbo but parallel=2. The 1.5 GB freed by turbo4 lets us batch two concurrent slots, which the orchestrator's asyncio.gather can fill.
local-turbo-haiku-think: ablation that runs the high-volume haiku-tier workloads on the sonnet model (Qwen3.6-35B-A3B) with thinking ON. Tests the question "does deeper reasoning improve verify/extract quality enough to be worth the latency?".

Trial 1 of Anthropic is the quality reference. Local outputs are scored three ways:

Cosine similarity between the local output's embedding (Qwen3-Embedding-8B) and the Anthropic reference embedding.
ROUGE-L F1 on lowercased word tokens (complementary lexical signal).
Opus LLM-as-judge on a 0-to-10 absolute equivalence scale. One judgement per (provider, workload) cell. Crucially, the same judge rates Anthropic trial 2 against Anthropic trial 1 as the empirical ceiling: the "indistinguishable" score for two runs of the same backend on the same input. Any local provider rated at the ceiling is at parity with Anthropic.
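The two proxy metrics can be sketched in plain Python. This is an illustration rather than the bench's actual code: tokenisation is assumed to be a lowercase whitespace split, and the embedding step (Qwen3-Embedding-8B) is out of scope, so `cosine_similarity` takes vectors that have already been computed.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def _lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased, whitespace-split word tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = _lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("The cat sat", "the cat sat"))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```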

The workloads:

Label	Tier	Profile	Description
extract-chunk	haiku	~3 k in, ~700 out, JSON	Extract atomic facts (88 calls per run)
verify-rag-fact	haiku	~3 k in, ~50 out, JSON	Verify-RAG inner call (~1 300 per run)
importance-score	haiku	~1 k in, ~15 out, JSON	Score 0 to 1 on a single fact
correct-section-rewrite	sonnet	~4 k in, ~1.5 to 3 k out, markdown	Rewrite a section to fix issues

Host: RTX 3090 Ti, WSL2 Ubuntu, Python 3.12, ctx_per_slot=32 768, llama.cpp build pinned in the bench manifest. 100 calls total, 0 errors.

7.2 Performance results

Workload	Provider	Wall mean (s)	Output tps	In tokens	Out tokens
extract-chunk	anthropic	53.80	182.8	3 284	9 801
extract-chunk	local-mainline	8.08	100.2	1 232	809
extract-chunk	local-turbo	7.40	100.3	1 230	745
extract-chunk	local-turbo-parallel	5.79	101.4	1 230	590
extract-chunk	local-turbo-haiku-think	82.36	68.1	1 228	2 737
importance-score	anthropic	10.39	67.5	2 966	713
importance-score	local-mainline	1.95	8.1	968	16
importance-score	local-turbo	1.85	8.4	965	15
importance-score	local-turbo-parallel	1.52	9.9	965	15
importance-score	local-turbo-haiku-think	41.34	10.5	963	412
verify-rag-fact	anthropic	10.90	97.5	3 297	1 064
verify-rag-fact	local-mainline	2.50	20.1	1 279	50
verify-rag-fact	local-turbo	2.13	23.0	1 277	49
verify-rag-fact	local-turbo-parallel	1.87	27.3	1 277	50
verify-rag-fact	local-turbo-haiku-think	14.13	53.1	1 275	633
correct-section-rewrite	anthropic	39.78	82.1	3 956	3 230
correct-section-rewrite	local-mainline	34.59	62.1	2 021	2 418
correct-section-rewrite	local-turbo	24.12	66.4	2 018	1 611
correct-section-rewrite	local-turbo-parallel	17.57	93.0	2 018	1 642
correct-section-rewrite	local-turbo-haiku-think	29.16	59.4	2 018	1 475

Three things to notice. First, local-turbo-parallel is fastest on every workload: the second slot fills under asyncio.gather and the MoE expert loads amortise across the two requests. Second, Anthropic's output token counts are now correct (9 801 on extract, 3 230 on correct): a previous version of this bench had a propagation bug that pinned them to 0 to 8, which has since been fixed. Third, the thinking-haiku ablation is dramatically slower on the three haiku-tier workloads (roughly ×7 to ×27 against local-turbo-parallel) because the model burns 400 to 2 700 reasoning tokens before emitting the JSON answer.
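The ratios behind these observations can be recomputed from the table. The sketch below copies the wall-clock and output-token means for three providers on the haiku-tier workloads; throughput derived from those means differs slightly from the table's "Output tps" column, which is presumably averaged per trial.

```python
# (wall mean in seconds, mean output tokens), copied from the table above.
RESULTS = {
    "extract-chunk": {
        "anthropic": (53.80, 9801),
        "local-turbo-parallel": (5.79, 590),
        "local-turbo-haiku-think": (82.36, 2737),
    },
    "importance-score": {
        "anthropic": (10.39, 713),
        "local-turbo-parallel": (1.52, 15),
        "local-turbo-haiku-think": (41.34, 412),
    },
    "verify-rag-fact": {
        "anthropic": (10.90, 1064),
        "local-turbo-parallel": (1.87, 50),
        "local-turbo-haiku-think": (14.13, 633),
    },
}


def wall_ratio(workload: str, slower: str, faster: str) -> float:
    """How many times longer `slower` takes per call than `faster`."""
    return RESULTS[workload][slower][0] / RESULTS[workload][faster][0]


def tokens_per_second(workload: str, provider: str) -> float:
    """Output throughput derived from the two means."""
    wall, tokens = RESULTS[workload][provider]
    return tokens / wall


for workload in RESULTS:
    vs_anthropic = wall_ratio(workload, "anthropic", "local-turbo-parallel")
    think_cost = wall_ratio(workload, "local-turbo-haiku-think", "local-turbo-parallel")
    print(f"{workload}: {vs_anthropic:.1f}x shorter than Anthropic, "
          f"thinking ON costs {think_cost:.1f}x")
```

On extract, the same data gives roughly 182 tokens per second for Anthropic against roughly 102 for local-turbo-parallel, which is the throughput gap flagged in the caveats.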

7.3 Quality results: proxy metrics

Workload	Provider	Cosine vs ref	ROUGE-L F1	JSON valid	JSON key OK
extract-chunk	local-mainline	0.639	0.223	60%	60%
extract-chunk	local-turbo	0.659	0.203	80%	80%
extract-chunk	local-turbo-parallel	0.634	0.174	60%	60%
extract-chunk	local-turbo-haiku-think	0.857	0.298	40%	40%
importance-score	local-mainline	0.427	0.019	100%	100%
importance-score	local-turbo	0.434	0.018	100%	100%
importance-score	local-turbo-parallel	0.450	0.014	100%	100%
importance-score	local-turbo-haiku-think	0.926	0.368	80%	80%
verify-rag-fact	local-mainline	0.873	0.107	100%	100%
verify-rag-fact	local-turbo	0.867	0.101	100%	100%
verify-rag-fact	local-turbo-parallel	0.870	0.107	100%	100%
verify-rag-fact	local-turbo-haiku-think	0.932	0.403	80%	80%
correct-section-rewrite	local-mainline	0.889	0.460	n/a	n/a
correct-section-rewrite	local-turbo	0.925	0.505	n/a	n/a
correct-section-rewrite	local-turbo-parallel	0.901	0.480	n/a	n/a
correct-section-rewrite	local-turbo-haiku-think	0.921	0.513	n/a	n/a

Two findings stand out. First, local-turbo beats local-mainline on the correct-section-rewrite cosine (0.925 vs 0.889), reversing the conclusion of an earlier single-trial bench. Moving the KV cache to turbo4 does not cost measurable quality on long outputs once you average over N=5. Second, local-turbo-haiku-think wins the cosine metric on the three haiku-tier workloads (on correct-section-rewrite it sits just behind local-turbo, 0.921 vs 0.925) because the larger model with thinking produces output that is semantically closer to Anthropic's. Whether that closeness translates into actual correctness is the question the Opus judge answers next.

A note on the importance-score cosine values (0.43 to 0.45): the workload's output is about 15 tokens, and cosine on outputs that short is statistical noise. The 100% JSON validity tells the real story: the locals produce the right shape, just with different wording from Anthropic. The same trap shows up in any retrieval system that scores on short fragments, and it is the reason our semantic conflict detection pipeline does NLI on full sections rather than cosine on chunks.
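The two JSON columns in the table could be computed along the following lines. This is a guess at the checks, not the bench's code: it assumes the whole output must parse as JSON, and it uses the verify fields named in §7.4 (`verdict`, `quote`, `src_idx`, `notes`) as the required keys. The sample outputs and the `supported` verdict value are made up.

```python
import json

VERIFY_KEYS = {"verdict", "quote", "src_idx", "notes"}


def json_checks(raw_output: str, required_keys: set[str]) -> tuple[bool, bool]:
    """Return (json_valid, keys_ok) for one model output."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, False
    keys_ok = isinstance(parsed, dict) and required_keys.issubset(parsed.keys())
    return True, keys_ok


clean = '{"verdict": "supported", "quote": "...", "src_idx": 0, "notes": ""}'
leaky = "Let me check the source first.\n" + clean   # reasoning prose before JSON
partial = '{"verdict": "supported"}'

print(json_checks(clean, VERIFY_KEYS))    # (True, True)
print(json_checks(leaky, VERIFY_KEYS))    # (False, False)
print(json_checks(partial, VERIFY_KEYS))  # (True, False)
```

Under a strict check like this, a model that prefixes the JSON with prose fails validity even when the JSON itself is correct, which is consistent with the lower validity rates on the thinking-ON rows.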

7.4 Quality results: Opus-as-judge#

Opus rates each output against the Anthropic reference on a 0 to 10 absolute equivalence scale. The anthropic-baseline row is Anthropic trial 2 judged against trial 1 (same backend, different sample): it sets the empirical ceiling for "indistinguishable" on this workload.

Workload	Provider	Opus rating	Note
extract-chunk	anthropic-baseline (ceiling)	6/10	Verbatim quotes, comparable fact count, minor variations
extract-chunk	local-mainline	5/10	Some compound facts violating atomicity
extract-chunk	local-turbo	5/10	Fewer atomic facts than reference (8 vs 17)
extract-chunk	local-turbo-parallel	4/10	One quote uses '...' to splice non-contiguous text
extract-chunk	local-turbo-haiku-think	4/10	Violates atomicity (compound facts)
importance-score	anthropic-baseline (ceiling)	8/10	Identical JSON output
importance-score	local-mainline	2/10	Score 0.7 / bucket 'important' diverges from reference 'marquant'
importance-score	local-turbo	2/10	Bucket wrong relative to reference
importance-score	local-turbo-parallel	4/10	Score close (0.8 vs 0.9) but bucket wrong
importance-score	local-turbo-haiku-think	8/10	Identical JSON, at the ceiling
verify-rag-fact	anthropic-baseline (ceiling)	9/10	Same verdict, quote, src_idx, notes
verify-rag-fact	local-mainline	9/10	Correct verdict, source, quote
verify-rag-fact	local-turbo	9/10	At parity with Anthropic
verify-rag-fact	local-turbo-parallel	9/10	At parity with Anthropic
verify-rag-fact	local-turbo-haiku-think	7/10	Correct but emits prose reasoning before JSON
correct-section-rewrite	anthropic-baseline (ceiling)	9/10	All four hallucinations fixed, all three omissions integrated
correct-section-rewrite	local-mainline	6/10	All fixes applied, but unsolicited reasoning preamble
correct-section-rewrite	local-turbo	6/10	All fixes applied, light formatting differences
correct-section-rewrite	local-turbo-parallel	6/10	French version, all fixes applied
correct-section-rewrite	local-turbo-haiku-think	7/10	All fixes applied, slightly cleaner output

The Opus judge changes the reading of every other table. The right way to read it is as a percentage of the ceiling, not as an absolute deficit, because the ceiling itself varies by workload: Anthropic-vs-Anthropic only scores 6/10 on extract (it penalises its own prose violations), while it scores 9/10 on verify and rewrite.

verify-rag-fact: parity with Anthropic. Local-mainline, local-turbo, and local-turbo-parallel all score 9/10, the same as Anthropic-vs-Anthropic. This is the workload with the most calls per run (~1 300) and the one with the largest cost saving, so this is the headline result. Re-rolling Anthropic on the same prompt does no better.
extract-chunk: 67 to 83% of ceiling. Locals score 4 to 5/10 against a ceiling of 6/10 (Opus is strict on this workload even with Anthropic itself). On a ceiling-relative scale the locals are at 67 to 83% of the achievable, not 40 to 50% as the absolute number suggests. Atomicity violations (the 4 B emits 8 atomic facts where Anthropic Haiku emits 17) are the recurring failure mode; the real margin is one to two points.
correct-section-rewrite: a real two-to-three-point gap. Ceiling 9, locals 6 to 7. The locals do apply all requested fixes (Opus: "all four hallucinations softened and all three omissions integrated with correct citations"), but they lose points to formatting drift and the occasional reasoning preamble. This is the one workload where the gap is measurable and not just a ceiling artifact, and where upgrading the sonnet tier would pay off the most.
importance-score: the only true quality failure, and the one with a clean fix. Without thinking, locals score 2 to 4/10 against an 8/10 ceiling, because the 4 B systematically picks the wrong bucket on borderline cases (Opus: "the candidate's score and bucket diverge from the reference's 'marquant'"). Flipping thinking ON for this workload alone lifts the rating to 8/10, the ceiling, with identical JSON. Within this bench, thinking is the only quality lever available for this workload; the mini-bench in §7.8 revisits that conclusion.
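
Reading the ratings as a share of the ceiling is simple arithmetic. The sketch below uses the Opus ratings from the table above for the three thinking-OFF local providers (local-mainline, local-turbo, local-turbo-parallel); the function name is illustrative.

```python
# Opus ratings from the §7.4 table: workload -> (ceiling, [mainline, turbo, turbo-parallel])
RATINGS = {
    "extract-chunk": (6, [5, 5, 4]),
    "importance-score": (8, [2, 2, 4]),
    "verify-rag-fact": (9, [9, 9, 9]),
    "correct-section-rewrite": (9, [6, 6, 6]),
}


def ceiling_ranges(ratings: dict) -> dict:
    """Map each workload to (min %, max %) of its Anthropic-vs-Anthropic ceiling."""
    out = {}
    for workload, (ceiling, local_scores) in ratings.items():
        pcts = [round(100 * score / ceiling) for score in local_scores]
        out[workload] = (min(pcts), max(pcts))
    return out


for workload, (low, high) in ceiling_ranges(RATINGS).items():
    print(f"{workload}: {low}% to {high}% of ceiling")
```

Normalising this way keeps a strict judge on one workload from looking like a model failure on that workload.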

7.5 Anthropic vs local-turbo-parallel, the headline#

Metric	Anthropic	local-turbo-parallel	Delta
Extract wall	53.80 s	5.79 s	×9.3 faster
Importance wall	10.39 s	1.52 s	×6.8 faster
Verify wall	10.90 s	1.87 s	×5.8 faster
Correct wall	39.78 s	17.57 s	×2.3 faster
Verify Opus rating	9/10 (ceiling)	9/10	at ceiling
Correct Opus rating	9/10 (ceiling)	6/10	three points below
Cost per run (pure local)	~$5	$0	free; the production hybrid (Opus on correct) costs ~$4, see §7.9

7.5 bis Caveat: N=5 does not converge on the correct phase#

Before we draw conclusions from §7.5, one honest caveat. The correct-section-rewrite workload has a wide trial-to-trial variance: in an isolated A/B run on local-mainline with N=5, individual trials ranged from 22 to 56 seconds (stdev 11.9 s without optimisations, 6.6 s with). At N=5 the mean has not converged, and the gap between local-mainline and local-turbo on this single workload is within the noise.

The other three workloads (extract, verify, importance) are stable enough at N=5 to read confidently. Any comparative claim that hinges on correct-phase wall times should be checked against a higher-N rerun or bootstrapped confidence intervals before being treated as definitive. The Opus-judge ratings on correct-phase are less noisy than the wall times (the rubric is deterministic enough that all four local variants land at 6 or 7), so those are still readable.
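
For the bootstrapped confidence intervals mentioned above, a percentile bootstrap over the per-trial wall times is enough for a first look. This is a minimal sketch using only the standard library. The function name and parameters are ours, and the trial values are placeholders chosen to sit inside the reported 22 to 56 s range, not measured data.

```python
import random
import statistics


def bootstrap_mean_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # seeded so reruns give the same interval
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# PLACEHOLDER wall times in seconds, not measurements from the bench.
trials = [22.0, 31.0, 38.0, 44.0, 56.0]
low, high = bootstrap_mean_ci(trials)
print(f"mean={statistics.fmean(trials):.1f}s  95% CI=({low:.1f}, {high:.1f})")
```

If the intervals of two providers overlap heavily, the difference between their means should not drive a recommendation. With only five trials the interval is itself rough, so it is a sanity check rather than a substitute for a higher-N rerun.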

7.6 End-to-end estimate on one full run#

A representative run on a medium-sized source document is 88 document chunks × extract + 1 300 facts × verify + 300 × importance + 8 sections × correct.

Phase	Mode	Local-turbo-parallel	Anthropic
Extract	haiku no-think	~9 min	~30 min
Verify-RAG	haiku no-think	~40 min	~3 h
Importance	haiku no-think	~8 min	~25 min
Correct	sonnet thinking ON	~2 min	~5 min
Total		~59 min	~4 h

The local pipeline is roughly four times faster than Anthropic end-to-end while staying free, modulo electricity. The Anthropic estimate uses observed concurrency rather than a strict serial extrapolation of per-call wall times. The bulk of the local speedup comes from the thinking-disable flag of §5 combined with parallel=2 on TurboQuant.
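
The local column can be sanity-checked by multiplying the call counts by the per-call wall means from §7.5. The sketch assumes a strict serial extrapolation and assumes the 17.57 s correct-phase figure applies to the sonnet container. It lands within about a minute of each local phase figure and on the same ~59 minute total.

```python
# Call counts from the representative run; per-call wall means (seconds)
# for local-turbo-parallel from the §7.5 table.
CALLS = {"extract": 88, "verify": 1300, "importance": 300, "correct": 8}
WALL_S = {"extract": 5.79, "verify": 1.87, "importance": 1.52, "correct": 17.57}


def serial_minutes(calls: dict, wall_s: dict) -> dict:
    """Strict serial extrapolation: calls x mean wall time, in minutes per phase."""
    return {phase: calls[phase] * wall_s[phase] / 60 for phase in calls}


per_phase = serial_minutes(CALLS, WALL_S)
for phase, minutes in per_phase.items():
    print(f"{phase:<10} ~{minutes:.1f} min")
print(f"{'total':<10} ~{sum(per_phase.values()):.1f} min")
```

The same multiplication on the Anthropic per-call walls would overshoot the Anthropic column, which is consistent with that column being based on observed concurrency.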

7.7 Which local provider to pick#

The earlier draft of this article said "mainline by default, turbo when VRAM forces it". With N=5 trials and Opus-as-judge added, that recommendation flips. The list below is the post-bench reco for the dual-model dual-tier setup; §7.8 then collapses the architecture further to a single 35 B model, and §7.9 fixes the final production wiring. Read this section as the framework and §7.9 as the shipped recipe.

Default: local-turbo with --parallel 2 (i.e. the local-turbo-parallel provider). Fastest on every workload, beats local-mainline on correct-rewrite quality (cosine 0.901 vs 0.889, modulo the §7.5 bis caveat), and lands at the 9/10 Opus ceiling on verify. The 1.5 GB freed by turbo4 KV is what makes parallel=2 viable on a 24 GB card.
Skip mainline: it is slower than turbo on every workload in this run, and the quality advantage from the earlier draft disappeared once we averaged over more trials.
If you must run correct locally, turn thinking ON on sonnet: it bumps the 35 B from 5/10 to 6/10 vs the 9/10 ceiling (see §7.8). Production routes correct to Anthropic Opus instead (§7.9) and skips the sonnet container entirely, so this lever only matters if you accept the local quality gap.
Avoid thinking-ON on verify and extract: the cosine boost is real but the Opus judge rates these variants lower than thinking-OFF on the same models. Reasoning leaks pollute the JSON output and cost more in parsing failures than thinking gains in semantic depth.

7.8 Mini-bench: from dual-model to single-model#

After the v3 bench above, two follow-up mini-benches (28 May) overturned the dual-model architecture we initially shipped. The headline result: a 35 B-A3B MoE model in no-think mode beats a dedicated 4 B model on the haiku-tier workloads at the same latency, and thinking ON is counterproductive on strict-JSON workloads.

Mini-bench A, importance workload, four local configs:

Config	Wall mean	Stdev	Opus rating	Trial 1 output
4 B no-think	2.13 s	0.02 s	8/10	`{"score": 0.8, "bucket": "marquant"}`
4 B thinking ON	17.14 s	16.43 s	6/10	prose then JSON (penalised)
35 B no-think	2.42 s	0.15 s	9/10	`{"score": 1.0, "bucket": "marquant"}`
35 B thinking ON	13.76 s	1.94 s	6/10	prose then JSON (penalised)

Three takeaways:

The 35 B no-think beats the 4 B no-think on quality at near-identical latency: one more Opus point (9 vs 8) for about 0.3 s more wall-clock per call (2.42 s vs 2.13 s). Its latency stdev is larger than the 4 B's (0.15 s vs 0.02 s) but still small in absolute terms. The MoE only activates ~3 B parameters per token, so its decode rate is comparable to a dense 4 B model.
The 4 B no-think is inconsistent across trials. On the importance workload it sometimes returns important instead of the reference marquant on borderline cases. The 35 B picks the right bucket every time we measured.
Thinking ON is harmful on strict-JSON workloads. Both the 4 B and the 35 B drop to 6/10 when thinking is ON, because they prefix the JSON with prose ("Let me analyse...") which violates the prompt's No prose. No fences. instruction. Opus deducts points accordingly. Thinking is a quality lever, not a free upgrade: on workloads where the rubric forbids prose, it costs you.

Decision from mini-bench A: drop the 4 B from production and run the 35 B-A3B no-think on the haiku tier. The on-disk GGUF inventory becomes one chat model plus the embedder.

Mini-bench B, correct workload, can the 35 B replace Anthropic Opus on rewrite?

Config	Wall mean	Opus rating	Verdict
35 B no-think	9.61 s	5/10	fast but quality insufficient
35 B thinking ON	23.50 s	6/10	thinking helps marginally
Anthropic Opus (direct)	12.88 s	9/10	ceiling quality, $4/run
Anthropic Sonnet	26.01 s	9/10	Opus quality, double latency

Verdict: Opus stays mandatory for the correct phase. The 35 B locals plateau at 5 to 6/10 regardless of thinking. The Opus judge points at one repeatable failure mode: the local 35 B does not include the [#filename] citation tags in the omission sentences it inserts, and Opus does. That is an instruction the local model ignores no matter how we prompt it, and the gap is visible in the final markdown. Measured shortfall: 3 to 4 points on the 0-10 scale.

7.9 Production architecture and recipe#

The two mini-benches above define the architecture we ship today:

Phase	Backend	Thinking	Why
extract, verify, importance, omissions	local Qwen3.6-35B-A3B (haiku container)	OFF	9/10 on importance and verify, 5/10 on extract (one point below ceiling), ~2 to 3 s per call, free
correct (8 calls)	Anthropic Opus	n/a	9/10 ceiling, ~13 s per call, ~$4 per run

The sonnet container exists in the code for completeness, but the production "best-of-both" mode never loads it: when the correct phase fires, the SDK routes those eight calls back to Anthropic Opus and the sonnet container is skipped, saving 30 seconds of GPU loading and 18 GB of VRAM.

The economics:

Configuration	Cost / run	Correct-phase Opus rating
Full Anthropic	~$5	9/10 (ceiling)
Pure local (sonnet container with thinking ON)	$0	6/10
Hybrid (local middle + Opus correct)	~$4	9/10 (ceiling)

You save about a dollar over full Anthropic, get the full ceiling on the content-touching phase, and keep the roughly 6 to 9× per-call speedup on the high-volume middle (§7.5). The provider switch is already in place (just provider_for("opus") on the correct phase). This is the shipped default in --correct-via-opus mode.

8. Operational notes that bit us#

A handful of things that are not in the bug list above but cost us time anyway.

Three-way VRAM mutex on the same GGUF. The same 24 GB card runs the NLP container (Qwen3-Embedding-8B, ~5 GB), the haiku LLM container (Qwen3.6-35B-A3B no-think, ~18 GB), and the sonnet LLM container (same GGUF with thinking ON). Pairs do not co-exist; the active container has to stop before the next one starts. The haiku-to-sonnet swap is fast (~10 s) because the GGUF stays in the OS page cache and only the llama-server process restarts with new flags. The nlp-to-llm swap, where the chat model has to be loaded into VRAM from cold, takes about 30 s. Each phase that needs a tier calls ensure_llm_running(tier=...) itself rather than relying on a top-level orchestrator. The same "each task manages its own container" pattern shows up in our semantic conflict detection pipeline.
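
The "each phase ensures its own tier" pattern can be sketched as a small state holder that stops the active container before starting the next one. This is an illustration under assumptions, not Anatoly's ensure_llm_running: the class, the container names, and the injected start/stop callables (which would wrap whatever container runtime is in use) are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical container names; the real ones are not given in this article.
TIER_TO_CONTAINER = {"nlp": "nlp", "haiku": "llm-haiku", "sonnet": "llm-sonnet"}


class GpuMutex:
    """Keeps at most one GPU container running at a time."""

    def __init__(self, start: Callable[[str], None], stop: Callable[[str], None]):
        self._start = start
        self._stop = stop
        self.active: Optional[str] = None

    def ensure_running(self, tier: str) -> bool:
        """Make the container for `tier` the active one. True if a swap happened."""
        wanted = TIER_TO_CONTAINER[tier]
        if self.active == wanted:
            return False  # already up, nothing to do
        if self.active is not None:
            self._stop(self.active)  # free VRAM before the next load
        self._start(wanted)
        self.active = wanted
        return True


# Usage with fake start/stop functions that only record what would happen.
log = []
gpu = GpuMutex(start=lambda c: log.append(("start", c)),
               stop=lambda c: log.append(("stop", c)))
gpu.ensure_running("nlp")
gpu.ensure_running("haiku")
gpu.ensure_running("haiku")  # second call is a no-op
print(log)
```

Because the call is idempotent, every phase can invoke it unconditionally, and phases can be re-run in isolation without a top-level orchestrator knowing which container was left running.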

Swap cost is small. A full run does two to three tier swaps total (nlp → haiku → sonnet, or nlp → sonnet direct if the extracted facts are cached from a previous run). At ~30 s for an nlp ↔ llm swap and ~10 s for a same-GGUF haiku ↔ sonnet swap, the total swap overhead is about 1 to 1.5 minutes on a ~59 minute run.

Opt-in routing. Local routing has to be explicit. A single environment variable, USE_LOCAL_LLM=1, enables it; without it, the SDK hits Anthropic native. The reason is operational: if the local containers are down or the GPU is busy, the pipeline should still run on Sonnet rather than hang.

Critical jobs stay on Anthropic. provider_for("opus") clears the override and forces Anthropic. Anything we cannot accept a regression on (the final verification pass, the user-facing summary) stays on a frontier model. The local stack handles the high-volume, lower-stakes middle of the pipeline.
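
The two routing rules above (opt-in via USE_LOCAL_LLM=1, and opus always on Anthropic) fit in a few lines. This is a sketch of the idea, not the real provider_for: the return shape, the local URLs and ports, and the use of ANTHROPIC_BASE_URL and ANTHROPIC_MODEL as the override mechanism are assumptions for illustration.

```python
import os

# Hypothetical local endpoints; hosts and ports are placeholders.
LOCAL_BASE_URLS = {
    "haiku": "http://127.0.0.1:8081",
    "sonnet": "http://127.0.0.1:8082",
}


def provider_for(tier: str, environ=os.environ) -> dict:
    """Return the env overrides to apply before an SDK call for `tier`.

    An empty dict means "no override": the SDK talks to Anthropic directly.
    """
    if tier == "opus":
        return {}  # critical jobs always stay on Anthropic
    if environ.get("USE_LOCAL_LLM") != "1":
        return {}  # local routing is opt-in
    return {
        "ANTHROPIC_BASE_URL": LOCAL_BASE_URLS[tier],
        "ANTHROPIC_MODEL": tier,  # assumed to match the llama-server --alias
    }


print(provider_for("haiku", {"USE_LOCAL_LLM": "1"}))
print(provider_for("opus", {"USE_LOCAL_LLM": "1"}))
```

Returning an empty override for the default case keeps the failure mode safe: forgetting to set the variable sends traffic to Anthropic rather than to a container that may not be running. The sketch does no health check on the local endpoints.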

Fork freshness. spiritbuun's fork moves fast. Building with --no-cache --ref <new-sha> rebuilds against a specific upstream commit. Pinning is not optional with a fork that pushes daily on the critical path.

Run N ≥ 5 trials before drawing conclusions. An earlier single-trial version of the bench had local-mainline beating local-turbo on rewrite quality. With N=5 that flipped, and the recommendation in §7.7 inverted. Single-trial benches on stochastic decoders are anecdotes; we treat them as such now.

9. What we would tell someone starting today#

If you are planning to route the Claude Agent SDK at one or more local LLMs, the order we would recommend:

Pick one mid-size MoE model and use it on both tiers. A 30 B-class MoE that activates ~3 B parameters per token (Qwen3.6-35B-A3B in our case) runs at close to the latency of a 4 B dense model (2.42 s vs 2.13 s per call in §7.8) and scores higher with the Opus judge. Running it on both tiers with different thinking flags is simpler than maintaining a separate 4 B for the high-volume tier and a 30 B for the low-volume tier, and our mini-bench (§7.8) shows it is also higher quality.
Get a per-tier model alias right first. Stable names in /v1/models are the difference between "model not found" and "model works". Add --alias <tier-name> to each llama-server command before anything else, and make sure your tier swap clears the residual model variables.
Disable thinking at the engine level, not in the prompt. Prompt-level directives are a courtesy the model will refuse 90% of the time. The --jinja --reasoning off --chat-template-kwargs '{"enable_thinking": false}' combo is what worked for us on Qwen3.6. The speedup is twelvefold and the quality improves too on strict-JSON workloads, because reasoning prose violates No prose. No fences. rubrics.
Audit the SDK's built-in tools list. Print what the SDK actually exposes to your model and ban the full set. Do not trust your memory of "the 12 tools"; the real number drifts as the SDK ships features.
Benchmark with N ≥ 5 trials and an Opus-as-judge ceiling. Single-trial benches lie about stochastic decoders. The Anthropic-vs-Anthropic baseline rating is the ceiling that lets you say "at parity" honestly instead of chasing a 0.96 cosine that does not mean anything.
Default to TurboQuant with --parallel 2. Mainline llama.cpp is slower and not measurably higher quality once you average over enough trials. TurboQuant + flash attention + the 1.5 GB of headroom the 4-bit KV gives you for the second parallel slot is the configuration to start from.
Treat thinking as a tool, not a default. It helps long-form rewriting and hurts strict-JSON output. The default policy is OFF on the haiku-tier container and ON on the sonnet-tier container. Resist the urge to flip thinking ON globally "for quality": it will silently degrade the high-volume workloads.
Accept that some workloads stay on Anthropic. Our correct-phase rewrite plateaus at about 6/10 against a 9/10 ceiling on every local config we tried, because the locals consistently omit the [#filename] citation tags. Route those calls back to Anthropic Opus and stop chasing the gap.
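
The alias, thinking, and parallel recommendations above can be combined into one launcher helper. The sketch only assembles the flags quoted in this article. The model path is hypothetical, passing enable_thinking true on the sonnet tier is our assumption, and the TurboQuant KV-cache and flash-attention flags are left out because they are not spelled out in this section.

```python
import json
import shlex

MODEL_PATH = "/models/qwen3.6-35b-a3b.gguf"  # hypothetical path


def llama_server_args(tier: str, thinking: bool, parallel: int = 2) -> list:
    """Build a llama-server argv for one tier from the flags quoted above."""
    args = [
        "llama-server",
        "-m", MODEL_PATH,
        "--alias", tier,  # stable name exposed in /v1/models
        "--parallel", str(parallel),
        "--jinja",
        "--chat-template-kwargs", json.dumps({"enable_thinking": thinking}),
    ]
    if not thinking:
        args += ["--reasoning", "off"]  # engine-level switch, as quoted in the text
    return args


for tier, thinking in (("haiku", False), ("sonnet", True)):
    print(shlex.join(llama_server_args(tier, thinking)))
```

Generating both command lines from one function makes it explicit that the two tiers differ only in alias and thinking flags, so the single-GGUF design stays visible in the config.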

The bigger pattern is that LLM-agnostic execution is at the centre of any serious production stack. The same instinct shows up in our PageIndex vs Anatoly RAG note (the same retrieval pipeline runs on whichever LLM is cheapest at the moment) and in AI Code Audit vs AI Code Review in 2026 (the audit harness has to outlive any single model generation). If your architecture only works against one provider, the model release cycle will eat you alive.

For implementation questions, open a discussion on the GitHub repo or reach the author directly. The full research index has the surrounding context on retrieval, conflict detection, and the audit harness this routing layer feeds into.
