Embedding Providers

Choose the lite, advanced, or external tier; configure OpenAI, Voyage, OpenRouter (Qwen3-8B), Cohere, Mistral, or custom OpenAI-compatible endpoints; review the enterprise dedicated deployment patterns.

Choose the embedding tier and (optionally) the third-party provider that powers Anatoly's semantic RAG index — from zero-config local CPU to best-of-breed cloud APIs.

Overview#

Anatoly produces two embedding vectors for every indexed function — a code vector (structural/syntactic semantics) and an NLP vector (natural-language semantics). Three execution paths are available, in increasing order of recall and operational complexity:

Tier	Engine	Setup	Hardware	Recall	Cost	Use case
`lite`	ONNX in-process via `@huggingface/transformers`	None (auto on first run)	CPU only	Good	Free	Default. Works everywhere, no external services.
`advanced`	GGUF llama.cpp Docker container (`anatoly-local`)	Run `anatoly local-embeddings upgrade` once	NVIDIA GPU + ≥ 12 GB VRAM + Docker	Best	Free (after model download, ~10 GB)	Local power users with a capable GPU who want maximum quality without sending code to a third party.
`external`	Vercel AI SDK → any OpenAI-compatible API (OpenAI, Voyage, OpenRouter, Cohere, Mistral, custom)	Set provider + API key in `.anatoly.yml`	None (CPU only)	Provider-dependent	Per-token billed by the provider	Cloud-friendly, zero local infra. Best when you have a Voyage/OpenRouter/OpenAI account or a corporate inference endpoint.

The active tier is selected at first run via the embedded wizard or by editing .anatoly.yml directly. The CLI flags --rag-lite and --rag-advanced override the persisted choice for a single run; the external tier requires an explicit YAML config.

The lite tier is unaffected by anything in this document: it runs in-process with no provider concept. The remainder of this guide covers the external tier. The same provider abstraction also underlies the local advanced tier, where the GGUF Docker container is modelled internally as a provider named anatoly-local.

Configuration shape#

The provider is declared under rag.embedding in .anatoly.yml, split per axis so you can mix providers (best-of-breed):

rag:
  embedding:
    code:
      provider: voyage              # required
      model: voyage-code-3          # optional, registry default applies if omitted
      base_url: https://...         # optional, registry default applies for known providers
      env_key: VOYAGE_API_KEY       # optional, registry default applies for known providers
    nlp:
      provider: openrouter
      model: qwen/qwen3-embedding-8b

The code and nlp sections are each optional. If only one is set, the other axis reuses its settings at runtime. If both are absent, Anatoly falls back to lite or advanced based on .anatoly/embeddings-ready.json.
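
The duplication rule can be sketched in a few lines of TypeScript. This is an illustration only: the type names and the resolveAxes helper are hypothetical and are not taken from Anatoly's source; only the field names mirror the YAML keys shown above.

```typescript
// Hypothetical types; field names mirror the YAML keys in this guide.
interface AxisConfig {
  provider: string;
  model?: string;
  base_url?: string;
  env_key?: string;
}

interface EmbeddingConfig {
  code?: AxisConfig;
  nlp?: AxisConfig;
}

interface ResolvedAxes {
  code: AxisConfig;
  nlp: AxisConfig;
}

// Returns both axes, or null when neither is configured. In the null case
// the caller would fall back to the local lite/advanced tier.
function resolveAxes(cfg: EmbeddingConfig | undefined): ResolvedAxes | null {
  const code = cfg?.code ?? cfg?.nlp;
  const nlp = cfg?.nlp ?? cfg?.code;
  if (code === undefined || nlp === undefined) return null;
  // Copy each axis so a later mutation of one cannot leak into the other.
  return { code: { ...code }, nlp: { ...nlp } };
}

const onlyCode = resolveAxes({ code: { provider: "voyage", model: "voyage-code-3" } });
console.log(onlyCode?.nlp.provider); // "voyage"
console.log(resolveAxes({}));        // null
```

Note that a duplicated axis inherits the model too, so a code-only config would embed natural-language text with the code model unless nlp is set explicitly.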

The schema is EmbeddingConfigSchema in src/schemas/config.ts. Custom fields are accepted via .passthrough() for forward compatibility.

Supported providers#

The registry lives at src/rag/known-embedding-providers.ts. Each entry provides default URLs, env var names, batch constraints, and recommended models — so a YAML config containing only provider: openai works out of the box.
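
As a rough illustration of how registry defaults could fill in a minimal config, the sketch below merges an axis entry with per-provider defaults. The object layout, the applyDefaults helper, and the rule that unknown providers must spell out every field are assumptions made for this example; only the default values are copied from the provider tables in this guide.

```typescript
type Axis = "code" | "nlp";

interface ProviderDefaults {
  base_url: string | null; // null means the native SDK is used
  env_key: string;
  models: Record<Axis, string>;
}

// Values copied from this guide; the layout is NOT the real registry structure.
const KNOWN_PROVIDERS: Record<string, ProviderDefaults> = {
  openai: {
    base_url: null,
    env_key: "OPENAI_API_KEY",
    models: { code: "text-embedding-3-large", nlp: "text-embedding-3-large" },
  },
  voyage: {
    base_url: "https://api.voyageai.com/v1",
    env_key: "VOYAGE_API_KEY",
    models: { code: "voyage-code-3", nlp: "voyage-3-large" },
  },
  openrouter: {
    base_url: "https://openrouter.ai/api/v1",
    env_key: "OPENROUTER_API_KEY",
    models: { code: "qwen/qwen3-embedding-8b", nlp: "qwen/qwen3-embedding-8b" },
  },
};

interface AxisInput {
  provider: string;
  model?: string;
  base_url?: string;
  env_key?: string;
}

interface AxisResolved {
  provider: string;
  model: string;
  base_url: string | null;
  env_key: string;
}

function applyDefaults(axis: Axis, input: AxisInput): AxisResolved {
  const known = KNOWN_PROVIDERS[input.provider];
  if (known === undefined) {
    // Assumption: a custom provider has no defaults, so every field is required.
    if (input.model === undefined || input.base_url === undefined || input.env_key === undefined) {
      throw new Error(`Unknown provider "${input.provider}": model, base_url and env_key are required`);
    }
    return { provider: input.provider, model: input.model, base_url: input.base_url, env_key: input.env_key };
  }
  // Explicit YAML values win; registry defaults fill the gaps.
  return {
    provider: input.provider,
    model: input.model ?? known.models[axis],
    base_url: input.base_url ?? known.base_url,
    env_key: input.env_key ?? known.env_key,
  };
}

console.log(applyDefaults("nlp", { provider: "voyage" }).model); // "voyage-3-large"
```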

`openai`#

Native via @ai-sdk/openai. The fastest path for users who already have an OpenAI account.

Field	Value
`base_url`	`null` (native SDK)
`env_key`	`OPENAI_API_KEY`
Default code model	`text-embedding-3-large` (3072d)
Default NLP model	`text-embedding-3-large` (3072d)
Notes	`text-embedding-3-small` (1536d) is the cheaper alternative if vector store size matters.

rag:
  embedding:
    code: { provider: openai, model: text-embedding-3-large }
    nlp:  { provider: openai, model: text-embedding-3-large }

`voyage`#

Voyage AI — the recommended code retrieval provider. voyage-code-3 is SOTA on CoIR and CodeSearchNet benchmarks.

Field	Value
`base_url`	`https://api.voyageai.com/v1`
`env_key`	`VOYAGE_API_KEY`
Default code model	`voyage-code-3` (1024d, Matryoshka 256/512/1024/2048)
Default NLP model	`voyage-3-large` (1024d)
Notes	Voyage is the embedding partner recommended by Anthropic. Strong on multi-language code.

rag:
  embedding:
    code: { provider: voyage, model: voyage-code-3 }
    nlp:  { provider: voyage, model: voyage-3-large }

`openrouter`#

Aggregator route to the open-weights Qwen3-Embedding-8B (4096d, strict parity with the local advanced GGUF tier on the NLP axis). Empirically verified 2026-05-04: response is OpenAI-strict, batch ordering preserved, pricing trivial (~$0.01 per 1M tokens). Reuses the same env var as the LLM openrouter entry from Epic 43, so users who already authenticate against OpenRouter for completions get embeddings out of the box.

Field	Value
`base_url`	`https://openrouter.ai/api/v1`
`env_key`	`OPENROUTER_API_KEY`
Default code model	`qwen/qwen3-embedding-8b` (4096d)
Default NLP model	`qwen/qwen3-embedding-8b` (4096d)
Notes	OpenRouter exposes other embedding models too (e.g. `openai/text-embedding-3-large` routed via OpenRouter). Override `model:` to use them. The Qwen3-8B route is preferred for parity with the local advanced tier.

rag:
  embedding:
    code: { provider: openrouter, model: qwen/qwen3-embedding-8b }
    nlp:  { provider: openrouter, model: qwen/qwen3-embedding-8b }

Direct DashScope routing (base_url: https://dashscope-intl.aliyuncs.com/compatible-mode/v1, env_key: DASHSCOPE_API_KEY) remains available as a custom provider — see the Custom provider section.

`cohere`#

Cohere Embed v3 — strong on multilingual NLP retrieval. Less specialised on code.

Field	Value
`base_url`	`https://api.cohere.com/v1`
`env_key`	`COHERE_API_KEY`
Default code model	`embed-english-v3.0` (1024d)
Default NLP model	`embed-english-v3.0` (1024d)
Notes	Use `embed-multilingual-v3.0` for non-English codebases.

`mistral`#

Mistral Embed — single model, simplest setup.

Field	Value
`base_url`	`https://api.mistral.ai/v1`
`env_key`	`MISTRAL_API_KEY`
Default code model	`mistral-embed` (1024d)
Default NLP model	`mistral-embed` (1024d)

Recommended combo — best-of-breed quality#

For users who want the highest semantic recall without running a local GPU, mix Voyage for code with OpenRouter-routed Qwen3-8B for NLP:

# .anatoly.yml
rag:
  embedding:
    code:
      provider: voyage
      model: voyage-code-3
    nlp:
      provider: openrouter
      model: qwen/qwen3-embedding-8b

Axis	Provider/Model	Why
Code	`voyage/voyage-code-3` (1024d)	SOTA on CoIR for code retrieval, ~13% above OpenAI text-embedding-3-large on aggregate
NLP	`openrouter/qwen/qwen3-embedding-8b` (4096d)	Same open-weights model as the local `advanced` GGUF tier: comparable recall, no GPU required

Required env vars: VOYAGE_API_KEY and OPENROUTER_API_KEY.

This combo is the closest cloud equivalent to running the GGUF advanced tier locally: no GPU is required, the NLP axis keeps the same vector dimension and comparable recall, and OpenRouter pricing on Qwen3-8B is roughly $0.01 per 1M tokens (negligible at typical audit scale).
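
To put the per-token price in perspective, here is a back-of-the-envelope estimate. The helper name and the audit size (20,000 functions at about 300 tokens each) are hypothetical figures chosen for illustration; only the $0.01 per 1M tokens rate comes from this guide.

```typescript
// Linear cost model: tokens are billed per million at a flat rate.
function estimateCostUsd(tokens: number, usdPerMillionTokens: number): number {
  if (tokens < 0 || usdPerMillionTokens < 0) {
    throw new RangeError("inputs must be non-negative");
  }
  return (tokens / 1000000) * usdPerMillionTokens;
}

// Hypothetical audit: 20,000 functions * ~300 tokens = 6M tokens on the NLP axis.
const auditTokens = 20000 * 300;
console.log(estimateCostUsd(auditTokens, 0.01).toFixed(2)); // "0.06"
```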

Custom provider#

Any OpenAI-compatible /v1/embeddings endpoint can be used by declaring the provider name plus base_url and env_key:

rag:
  embedding:
    code:
      provider: my-internal-embed
      base_url: https://embed.internal.corp/v1
      env_key: INTERNAL_EMBED_KEY
      model: nomic-embed-code-v2
    nlp:
      provider: my-internal-embed
      base_url: https://embed.internal.corp/v1
      env_key: INTERNAL_EMBED_KEY
      model: gte-large-en-v1.5

The endpoint must:

Accept POST /embeddings (relative to base_url) with body { model, input, encoding_format: "float" }, where input may be a string or an array of strings.
Return { data: [{ embedding: number[], index: number }], usage: { prompt_tokens: number } } — the OpenAI-strict shape.
Accept a model field in the request body. It is sent verbatim; servers that ignore it (such as llama.cpp) are tolerated.

Endpoints that diverge from this shape (nested embedding[[...]], missing data[], etc.) are not supported. Run them behind a thin proxy that normalises the response.
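
When writing such a proxy, or when checking an internal endpoint before pointing Anatoly at it, the contract above can be expressed as a small type guard. This is a sketch under the shape stated in this section; the function names are hypothetical and nothing here is taken from Anatoly's code.

```typescript
interface EmbeddingItem {
  embedding: number[];
  index: number;
}

interface EmbeddingResponse {
  data: EmbeddingItem[];
  usage: { prompt_tokens: number };
}

// Request body in the shape described above.
function buildRequestBody(model: string, input: string | string[]) {
  return { model, input, encoding_format: "float" as const };
}

// True only for the OpenAI-strict shape: flat number[] embeddings plus usage.
function isEmbeddingResponse(value: unknown): value is EmbeddingResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { data?: unknown; usage?: unknown };
  if (!Array.isArray(v.data)) return false;
  const itemsOk = v.data.every((item: unknown) => {
    if (typeof item !== "object" || item === null) return false;
    const it = item as { embedding?: unknown; index?: unknown };
    return (
      typeof it.index === "number" &&
      Array.isArray(it.embedding) &&
      it.embedding.every((x: unknown) => typeof x === "number")
    );
  });
  if (!itemsOk) return false;
  if (typeof v.usage !== "object" || v.usage === null) return false;
  return typeof (v.usage as { prompt_tokens?: unknown }).prompt_tokens === "number";
}

// Sorts by index so vectors line up with the input array.
function vectorsInInputOrder(res: EmbeddingResponse): number[][] {
  return [...res.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

const nested = { data: [{ embedding: [[0.1, 0.2]], index: 0 }], usage: { prompt_tokens: 3 } };
console.log(isEmbeddingResponse(nested)); // false
```

A normalising proxy would reshape the upstream response until isEmbeddingResponse accepts it, then return it unchanged to the client.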

Cloud Anatoly (SaaS)#

The hosted SaaS version of Anatoly routes embeddings server-side to a provider chosen by Anatoly (HuggingFace Inference Endpoints, Modal, Voyage, etc. — selected for cost and quality, subject to change). The client does not see or configure the provider; the cloud workspace consumes an authenticated Anatoly endpoint and the embedding "just happens".

This means:

No rag.embedding configuration required when running against anatoly.cloud.
No client-side API keys for embeddings — billing is rolled into the SaaS subscription.
Provider opacity is intentional: Anatoly may switch backends to optimise margins or recall without changing the client experience.

If you need to know or control which provider runs your embeddings, choose the Enterprise dedicated deployment path below.

Enterprise dedicated deployment#

For organisations that require data sovereignty, a custom provider, or audit isolation, Anatoly runs as the same CLI binary inside the customer's VPC or private cloud, configured via .anatoly.yml. Three deployment patterns are supported:

(a) Azure OpenAI internal#

Route both axes to Azure-hosted OpenAI deployments. Azure exposes embeddings under https://{resource}.openai.azure.com/openai/deployments/{deployment}/embeddings?api-version=... — pass the full URL as base_url.

rag:
  embedding:
    code:
      provider: azure-openai-internal
      base_url: https://contoso.openai.azure.com/openai/deployments/text-embedding-3-large/embeddings?api-version=2024-02-01
      env_key: AZURE_OPENAI_KEY
      model: text-embedding-3-large
    nlp:
      provider: azure-openai-internal
      base_url: https://contoso.openai.azure.com/openai/deployments/text-embedding-3-large/embeddings?api-version=2024-02-01
      env_key: AZURE_OPENAI_KEY
      model: text-embedding-3-large

code and nlp may point at different Azure deployments of the same resource if you want to mix model sizes.
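
If you manage several deployments, the base_url can be assembled from the URL template given above rather than typed by hand. The helper below is a hypothetical convenience for generating the YAML value; it is not part of Anatoly.

```typescript
// Builds the Azure embeddings URL from the template in this section.
function azureEmbeddingsUrl(resource: string, deployment: string, apiVersion: string): string {
  const path = `/openai/deployments/${encodeURIComponent(deployment)}/embeddings`;
  const query = `api-version=${encodeURIComponent(apiVersion)}`;
  return `https://${resource}.openai.azure.com${path}?${query}`;
}

console.log(azureEmbeddingsUrl("contoso", "text-embedding-3-large", "2024-02-01"));
// https://contoso.openai.azure.com/openai/deployments/text-embedding-3-large/embeddings?api-version=2024-02-01
```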

(b) Self-hosted GGUF cluster#

If you already operate llama.cpp or TEI containers behind an internal load balancer, point Anatoly at them. The serving stack is the same one Anatoly uses locally for advanced — just on your infrastructure.

rag:
  embedding:
    code:
      provider: anatoly-local-cluster
      base_url: https://embed-code.internal.corp/v1
      env_key: INTERNAL_EMBED_KEY
      model: nomic-embed-code
    nlp:
      provider: anatoly-local-cluster
      base_url: https://embed-nlp.internal.corp/v1
      env_key: INTERNAL_EMBED_KEY
      model: qwen3-embedding-8b

This pattern delivers the same recall as the local advanced tier without running the GPU container on the audit machine.

(c) HuggingFace Inference Endpoints (customer account)#

Deploy Qwen/Qwen3-Embedding-8B and nomic-ai/nomic-embed-code (or any other model) on dedicated HF Inference Endpoints inside your AWS/Azure account, then point Anatoly at them.

rag:
  embedding:
    code:
      provider: hf-internal
      base_url: https://abc123-code.eu-west-1.aws.endpoints.huggingface.cloud/v1
      env_key: HF_INTERNAL_TOKEN
      model: nomic-embed-code
    nlp:
      provider: hf-internal
      base_url: https://abc123-nlp.eu-west-1.aws.endpoints.huggingface.cloud/v1
      env_key: HF_INTERNAL_TOKEN
      model: Qwen3-Embedding-8B

In all three patterns the customer retains full control of the data path: code chunks never leave the customer's network. Anatoly's CLI has no embedded telemetry on the embedding axis.

Operational notes#

Dimension probe and signature cache#

For models not in the registry (custom providers or new model IDs), Anatoly probes the dimension at boot time with a single embed("anatoly probe") call, then caches the result in .anatoly/embeddings-ready.json under dim_code / dim_nlp plus an embedding_signature (SHA-256 of {provider, code_model, nlp_model}). Subsequent runs skip the probe unless the signature changes.
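
The signature check can be pictured as follows. This is a minimal sketch: the canonical JSON serialisation, the helper names, and the ReadyFile type are assumptions, so the hashes it produces will not necessarily match what Anatoly writes to .anatoly/embeddings-ready.json.

```typescript
import { createHash } from "node:crypto";

interface SignatureInput {
  provider: string;
  code_model: string;
  nlp_model: string;
}

// Subset of the cache file described above (field names from this guide).
interface ReadyFile {
  dim_code: number;
  dim_nlp: number;
  embedding_signature: string;
}

function embeddingSignature(input: SignatureInput): string {
  // Fixed key order so the same logical config always serialises identically.
  const canonical = JSON.stringify({
    provider: input.provider,
    code_model: input.code_model,
    nlp_model: input.nlp_model,
  });
  return createHash("sha256").update(canonical).digest("hex");
}

// A probe is needed on first run or whenever the signature no longer matches.
function needsProbe(cached: ReadyFile | null, current: SignatureInput): boolean {
  return cached === null || cached.embedding_signature !== embeddingSignature(current);
}

const current = { provider: "voyage", code_model: "voyage-code-3", nlp_model: "voyage-3-large" };
const cached = { dim_code: 1024, dim_nlp: 1024, embedding_signature: embeddingSignature(current) };
console.log(needsProbe(cached, current)); // false
console.log(needsProbe(cached, { ...current, nlp_model: "voyage-3.5" })); // true
```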

Batch limits#

The Vercel AI SDK handles automatic chunking. For external providers the default batch size is 2048 (SDK default). For the local anatoly-local provider, the registry pins max_per_call: 16 and supports_parallel: false to match the llama.cpp container's context window and the sequential code/NLP swap pattern.
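
To make the two constraints concrete, the sketch below shows what a per-call limit and a parallelism flag imply for a caller. It does not reproduce the Vercel AI SDK's chunking; chunk, embedAll, and the embedBatch callback are hypothetical names, with maxPerCall and supportsParallel standing in for the registry fields max_per_call and supports_parallel.

```typescript
// Splits items into consecutive batches of at most maxPerCall elements.
function chunk<T>(items: readonly T[], maxPerCall: number): T[][] {
  if (!Number.isInteger(maxPerCall) || maxPerCall < 1) {
    throw new RangeError("maxPerCall must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxPerCall) {
    batches.push(items.slice(i, i + maxPerCall));
  }
  return batches;
}

type EmbedBatch = (batch: string[]) => Promise<number[][]>;

// Embeds all texts, either one batch at a time or all batches concurrently.
// Output order matches input order in both modes.
async function embedAll(
  texts: readonly string[],
  maxPerCall: number,
  supportsParallel: boolean,
  embedBatch: EmbedBatch,
): Promise<number[][]> {
  const batches = chunk(texts, maxPerCall);
  const out: number[][] = [];
  if (supportsParallel) {
    const results = await Promise.all(batches.map((b) => embedBatch(b)));
    for (const r of results) out.push(...r);
    return out;
  }
  for (const batch of batches) {
    out.push(...(await embedBatch(batch)));
  }
  return out;
}

const texts = Array.from({ length: 40 }, (_, i) => `fn-${i}`);
console.log(chunk(texts, 16).map((b) => b.length)); // [ 16, 16, 8 ]
```

With a limit of 16 and no parallelism, 40 functions take three sequential calls; with the 2048 default they fit in one.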

Missing API keys#

If a provider's env_key is referenced but process.env[env_key] is not set, the wizard writes the YAML anyway and prints a warning. The audit then fails at the first embedding call with an error of the form: No API key for embedding provider "X". Set {ENV_KEY} in your environment. To fix it, export the key and re-run.
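
A pre-flight check along these lines can surface the problem before an audit starts, for example in a CI step. The requireApiKey helper is hypothetical, and treating an empty or whitespace-only value as missing is an assumption; the message text follows the error quoted above.

```typescript
type Env = Record<string, string | undefined>;

// Returns the key, or throws with the same wording as the runtime error.
function requireApiKey(provider: string, envKey: string, env: Env): string {
  const value = env[envKey];
  // Assumption: an empty or whitespace-only value counts as "not set".
  if (value === undefined || value.trim() === "") {
    throw new Error(
      `No API key for embedding provider "${provider}". Set ${envKey} in your environment.`,
    );
  }
  return value;
}

try {
  requireApiKey("voyage", "VOYAGE_API_KEY", {});
} catch (err) {
  console.log((err as Error).message);
  // No API key for embedding provider "voyage". Set VOYAGE_API_KEY in your environment.
}
```

In a real script you would pass process.env as the third argument.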

Switching providers post-setup#

Edit rag.embedding in .anatoly.yml and re-run anatoly run. The signature cache invalidates automatically; the dim probe runs once for the new provider and caches the result. No manual cleanup of .anatoly/ is required unless dimensions change in a way that breaks the existing LanceDB index — in that case anatoly clean rag-index rebuilds it.
