Model Boons

Gateway-granted capabilities for models that lack them natively — vision (image-to-text relay), structured output (JSON-schema enforcement), and compression (token reduction before dispatch) — plus the gateway tool loop, which runs registered MCP tools on a model's behalf.

A boon is a capability obleth grants a model at the gateway, on top of what the model can do on its own. Instead of every model needing native support for every modality — or every caller wiring up extra plumbing — obleth detects when a request needs a capability the target model lacks, fulfils it at the gateway, and rewrites the request (and, where needed, the response) so the original model can answer.

There are three boons today, all built on the same engine and opted into through a model's boons list:

Boon	`boons` value	Grants	Rewrites
Vision	`vision`	Image input on a text-only model, by relaying images to a describer model	Request
Structured output	`structured_output`	`response_format` JSON-schema adherence	Request + response
Compression	`compression`	Token reduction (structural JSON/code compaction, dedup, deterministic lossy text) before dispatch	Request

A further gateway capability, the gateway tool loop, is configured from the same Settings → Model boons panel but works differently: rather than opting in through the boons list, a model is granted registered MCP servers via its tool_servers list, and obleth runs those tools on the model's behalf. It is covered in its own section below, after the vision boon.

A related gateway capability, guardrails, is built on the same engine and shares the fail-open posture, but it is enabled per tenant (by setting a content policy) rather than per model — see its own guide.

How boons work

A few rules apply to every boon:

Opt-in, twice. A boon never fires unless it is enabled globally (Settings → Model boons, stored in app_settings and hot-reloadable) and the target model has the boon in its per-model boons list. Nothing is granted by default.
Only when the model lacks it natively. A boon that grants a capability checks the matching capability flag and steps aside if the model already has it: vision skips models flagged supports_vision, and structured output skips models flagged supports_response_schema. Native capability always wins; the boon is a fallback. (The gateway tool loop is the reverse case: it requires supports_function_calling rather than substituting for it.)
Fail-open, always. Any error — no helper model configured, an upstream failure, a timeout, an unparseable reply — leaves the request or response unchanged. A flaky helper must never block traffic the target model might still handle on its own.
Applied in order. For a single chat request obleth applies vision, then lossless compression, then the gateway tool loop, then the deterministic dedup and lossy text passes, then structured output. The structured-output boon also rewrites the response: when it is active, obleth forces a non-streaming upstream call, buffers the completion, transforms it, and — for clients that asked for stream: true — re-emits the result as synthesized SSE. Streaming requests are therefore buffered while the structured-output boon is active; consider raising your client's request timeout. (The gateway tool loop, by contrast, can stream live — see its section.)
Observable per request. Responses carry an x-obleth-boons header listing the boons that acted on the request. A non-fatal issue (for example, structured-output validation that could not be repaired) is reported in x-obleth-boons-warning while the original completion still passes through.
Escape hatch. Send the request header x-obleth-boons: off to bypass all boon processing for that single request.

Boons that call a helper model (the vision describer, the structured-output fixer) meter that call against the calling tenant as its own usage record, so the extra cost is attributed and visible in the request log.
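
The opt-in and native-capability rules above can be read as a single gate. The following is a minimal sketch of that reading, not obleth's implementation: the Model class, the boon_applies function, and the shape of the globally_enabled mapping are hypothetical, and header names are assumed to be lowercased before the check.

```python
from dataclasses import dataclass, field

# Capability flag that makes each boon unnecessary (taken from the rules above).
# Compression has no native flag, so it maps to None.
NATIVE_FLAG = {
    "vision": "supports_vision",
    "structured_output": "supports_response_schema",
    "compression": None,
}


@dataclass
class Model:
    """Hypothetical stand-in for a registered model record."""
    boons: list = field(default_factory=list)
    flags: dict = field(default_factory=dict)


def boon_applies(boon, model, globally_enabled, request_headers):
    """Return True only if every opt-in and capability rule passes."""
    # Escape hatch: "x-obleth-boons: off" bypasses all boons for this request.
    if request_headers.get("x-obleth-boons", "").strip().lower() == "off":
        return False
    if not globally_enabled.get(boon, False):  # first opt-in: global switch
        return False
    if boon not in model.boons:                # second opt-in: per-model list
        return False
    flag = NATIVE_FLAG[boon]
    if flag is not None and model.flags.get(flag, False):
        return False                           # native capability wins
    return True
```

A text-only model with boons=["vision"] passes the gate only while the global vision switch is on and the request does not carry the escape-hatch header.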

The vision boon

When a text-only model receives a chat request that contains an image, the vision boon:

Detects the image_url content part(s) in the request.
Relays each image to a configured describer model — a vision-capable model such as glm-4-5v.
Replaces the image part inline with the returned text, as [Image description: …].
Forwards the rewritten, now text-only request to the originally requested model.

The target model never sees the image bytes; it sees a faithful text description in their place and answers as if it could see.

client ──▶ obleth                                  (chat request with image_url)
             │  target lacks vision + boon enabled
             ├──▶ describer (glm-4-5v)             "describe this image"
             │◀── "A 3D voxel render of…"
             │  image part → "[Image description: …]"
             ├──▶ target model (text-only)         (rewritten, text-only request)
             │◀── answer
client ◀─────┘                                     answer

When it applies

The boon runs for a request only when all of the following hold:

The vision boon is enabled globally.
A describer model is configured.
The target model has opted into the vision boon (its boons list includes vision). Boons are off by default and granted per model — obleth never applies a boon to a model that hasn't asked for it.
The target model is not flagged supports_vision (models that can see images are left untouched — their images pass straight through).
The request actually contains an image_url content part.

If any condition is false, the request is forwarded unchanged.

Fail-open by design

The vision boon never blocks or fails a request. If the describer is unreachable, returns an error, times out, or returns an empty description, the affected image is left unchanged and the request is forwarded as-is. A flaky describer must not take down traffic the target model might still handle.

Each image is described independently, so one failed image does not discard the descriptions already produced for the others in the same request.
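
The rewrite and its per-image fail-open behaviour can be sketched as follows. This is an illustration under assumptions, not obleth's code: rewrite_images and the describe callable are hypothetical names, the replacement is assumed to be an OpenAI-style text part, and a failed describe attempt is assumed to count toward the per-request image cap.

```python
def rewrite_images(messages, describe, max_images=6):
    """Replace image_url parts with text descriptions, failing open per image."""
    described = 0
    out = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):  # plain string content: nothing to do
            out.append(msg)
            continue
        parts = []
        for part in content:
            if part.get("type") == "image_url" and described < max_images:
                described += 1
                try:
                    text = describe(part["image_url"]["url"])
                except Exception:
                    text = ""  # describer error: treat like an empty description
                if text:
                    part = {"type": "text",
                            "text": f"[Image description: {text}]"}
                # An empty or failed description leaves the image part unchanged.
            parts.append(part)
        out.append({**msg, "content": parts})
    return out
```

Because each image is handled inside its own try block, a describer failure on one image does not discard the descriptions already produced for the others.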

Billing and attribution

Every describe call is metered against the calling tenant and written to the usage ledger as its own record:

model is the describer model name (so its cost lands on the describer's line).
admission is boon and request_type is vision_boon, so boon traffic is easy to isolate in the request log and cost breakdown.
Cost is computed from the describer's input_cost_per_token / output_cost_per_token using the token usage the describer reports.

The original request is billed normally on top, against the target model.
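
As a worked example of the cost rule above, the sketch below assumes the usual linear formula (prompt tokens times input price plus completion tokens times output price). The function name, the cost field, and the per-token prices are hypothetical; model, admission, and request_type are the record fields named in this section.

```python
def describe_cost(prompt_tokens, completion_tokens,
                  input_cost_per_token, output_cost_per_token):
    """Cost of one describe call from the token usage the describer reports."""
    return (prompt_tokens * input_cost_per_token
            + completion_tokens * output_cost_per_token)


# Hypothetical usage record for one describe call (prices are made up).
record = {
    "model": "glm-4-5v",            # the describer, not the target model
    "admission": "boon",
    "request_type": "vision_boon",
    "cost": describe_cost(1200, 300, 1e-6, 2e-6),  # "cost" is an assumed field name
}
```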

Prerequisites

You need two models registered:

A describer — a vision-capable chat model. Give it the vision tag (or set supports_vision: true via the API).
One or more target models that lack native vision (supports_vision: false, the default) that you opt into the vision boon.

curl -X POST http://localhost:9180/api/v1/models \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "glm-4-5v",
    "model_type": "chat",
    "upstream_model": "glm-4-5v",
    "api_base": "https://provider.example/v1",
    "api_key": "sk_upstream",
    "tags": ["vision"],
    "supports_vision": true,
    "enabled": true
  }'

supports_vision is a capability flag on every chat model. It defaults to false, so existing routes need no changes. In the dashboard it is derived from the vision routing tag (Models → Routing tags → vision) — ticking vision marks the model as natively image-capable and eligible to serve as a system-wide describer. The Management API still accepts supports_vision directly.

Enabling the boon

Vision boon settings live in the app_settings store (key boons) and are hot-reloadable — the proxy picks up changes within its refresh interval, no restart required.

From the control plane, open Settings → Model boons:

Enable vision boon — the master switch.
Describer model — a dropdown of your vision-capable models.
Describe prompt — the instruction sent to the describer.
Max images per request — cap on how many images are described per request (default 6).
Describe timeout (ms) — per-image timeout (default 30000).

Or via the Management API:

curl -X PUT http://localhost:9180/api/v1/settings/boons \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "vision_enabled": true,
    "vision_fallback_model": "glm-4-5v",
    "vision_describe_prompt": "Describe this image in thorough, faithful detail: all visible text (verbatim), UI elements, code, diagrams, charts, and layout.",
    "vision_max_images": 6,
    "vision_timeout_ms": 30000
  }'

Send vision_fallback_model as "" to clear the describer (which deactivates the boon, since no describer is set). See the Management API for the full settings shape.

The boon is disabled by default. If you configure a describer but leave Enable vision boon off, images pass straight through to the target model — which, if it is text-only, will typically reject them. Make sure the master switch is on.

Granting the boon to a model

The global switch turns the vision boon on; each model then opts in individually. A model only receives a boon when its boons list contains that boon's name — nothing is granted by default, so you choose exactly which text-only models should fall back to the describer.

From the dashboard, open a model's config and tick the boon under the Boons group (Models → a row → Boons → vision). Via the Management API, set the boons array on create or update:

curl -X PUT http://localhost:9180/api/v1/models/$MODEL_ID \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "boons": ["vision"] }'

boons is a fixed-vocabulary list (vision, structured_output, compression) stored per model. Leave it empty (the default) to keep a model boon-free — images sent to a text-only model with no boon are forwarded unchanged.

Using it from a client

Send a normal OpenAI-style chat request with an image to a text-only model. No client changes are required — the boon is transparent.

curl http://localhost/v1/chat/completions \
  -H "Authorization: Bearer sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2-7-fast",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}}
      ]
    }]
  }'

obleth relays the image to the describer, rewrites the request, and the text-only minimax-m2-7-fast answers. The request log shows two entries: a vision_boon call against glm-4-5v, and the chat call against minimax-m2-7-fast.
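
For clients that build the request in code rather than with curl, the same payload might be assembled like this. It is a sketch using only the standard library; image_chat_payload is a hypothetical helper, and the eight bytes passed in are just the PNG file signature standing in for a real image.

```python
import base64
import json


def image_chat_payload(model, question, png_bytes):
    """Build an OpenAI-style chat body with one inline PNG as a data URL."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + encoded}},
            ],
        }],
    }


payload = image_chat_payload("minimax-m2-7-fast", "What is this?",
                             b"\x89PNG\r\n\x1a\n")
body = json.dumps(payload).encode("utf-8")  # POST this to /v1/chat/completions
```

The request body is identical whether or not the boon fires, which is why no client change is needed.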

Verifying

After a request, confirm the boon ran by looking for the vision_boon record in the usage ledger:

SELECT toDateTime(ts_ms / 1000) AS t, model, request_type, status_code
FROM obleth.usage
WHERE request_type = 'vision_boon'
ORDER BY ts_ms DESC
LIMIT 5;

If you see the target model return errors but no vision_boon record, the boon did not fire — re-check the conditions under When it applies: the boon is enabled, a describer is set, the target model has opted in (boons includes vision), and it is flagged supports_vision: false.

The gateway tool loop

The gateway tool loop gives a model actual tools, not just the tool-calling capability. When an operator grants a model access to one or more registered MCP servers, obleth injects those servers' tools into the model's chat requests, executes the model's tool calls against the MCP upstream itself, appends the results, and re-asks the model — looping (bounded) until the model produces a final answer. The client sends a plain OpenAI chat request and gets a grounded answer back; it never has to run a tool itself.

client ──▶ obleth                         (plain chat request, no tools)
             │  model granted MCP servers + native function calling
             │  inject discovered tools, add the nudge
             ├──▶ target model             "I should search…" → tool_calls
             │  execute tool calls against the MCP server(s)
             ├──▶ MCP server               tools/call → result
             │  append results, re-ask
             ├──▶ target model             (loops up to max_turns)
             │◀── final answer
client ◀─────┘                            grounded answer (no tool_calls)

This is a distinct mechanism from the boons. Capabilities vs tools: a capability is what a model does natively (function calling); a tool is something registered at the gateway (an MCP server). The vision and structured-output boons add a capability a model lacks; the tool loop hands real tools to a model that already has the function-calling capability.

When it applies

The loop runs for a request only when all of the following hold:

The gateway tool loop is enabled globally.
The target model has one or more MCP servers granted via its tool_servers list.
The target model is flagged supports_function_calling. A model that is granted tool servers but lacks native function calling gets no tools injected, and obleth logs a loud warning — enable function calling on the model to use the loop.

How it works

Discover & inject. obleth fetches each granted server's tool list (cached) and merges the definitions into the request's tools array. For plain chat clients it also injects a system nudge so the model actually reaches for a tool when a question needs external information.
Execute & loop. When the model returns tool_calls, obleth executes each one against its MCP server, appends the result as a tool message, and re-dispatches the conversation. It repeats until the model answers with no tool calls, or until max_turns is reached — at which point it strips the tools and asks the model to conclude from what it gathered, so a plain chat client never receives unexpected tool_calls.
Bounded & efficient. One MCP session is opened per server for the whole request (a rate-limited server sees a single initialization, not one per call), and each tool result is truncated to bound context growth.
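
The execute-and-loop behaviour described above can be sketched as a bounded loop. This is a simplified illustration, not obleth's implementation: call_model and call_tool are hypothetical callables standing in for the upstream dispatch and the MCP tools/call, and max_result_chars is an assumed character cap standing in for the real truncation rule.

```python
import json


def run_tool_loop(messages, tools, call_model, call_tool,
                  max_turns=4, max_result_chars=4000):
    """Execute tool calls at the gateway until the model gives a final answer."""
    messages = list(messages)  # never mutate the caller's list
    for _ in range(max_turns):
        reply = call_model(messages, tools)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply  # final answer, no tool_calls
        messages.append(reply)
        for call in calls:
            try:
                args = json.loads(call["function"]["arguments"] or "{}")
                result = call_tool(call["function"]["name"], args)
            except Exception as exc:
                result = f"tool error: {exc}"  # errors become readable text
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result)[:max_result_chars],  # bound context growth
            })
    # Turn budget exhausted: strip the tools and ask the model to conclude.
    return call_model(messages, None)
```

Passing None for the tools on the last call mirrors the rule that a plain chat client should not receive tool_calls once max_turns is reached.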

Streaming, errors, and billing

Streamed live or buffered. Unlike the structured-output boon, the tool loop can stream the final answer live, token-by-token; only the tool execution between turns pauses the stream. Non-streaming clients get the buffered answer.
Fail-open. A tool error becomes a text result the model can read and recover from. A dispatch failure returns the last completion with an x-obleth-boons-warning. The loop never fails a request outright.
Client-owned tools. Clients that bring their own tools (agentic clients, IDE assistants) keep control of them: the granted MCP tools are merged into the client's set, the gateway executes only its own tools, and any client-owned tool call is handed straight back to the client untouched.
Billing. Each model round trip inside the loop is metered against the tenant as a tool_loop usage record. Tool-loop answers are never cached.
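
The split between gateway-owned and client-owned tool calls might look like the sketch below. split_tool_calls is a hypothetical helper, and it assumes tool names are enough to tell the two sets apart; how obleth resolves a name collision is not covered here.

```python
def split_tool_calls(tool_calls, gateway_tool_names):
    """Partition tool calls into those the gateway runs and those it hands back."""
    gateway, client = [], []
    for call in tool_calls:
        name = call["function"]["name"]
        (gateway if name in gateway_tool_names else client).append(call)
    return gateway, client
```

Only the first list would be executed against MCP servers; the second is returned to the client unchanged.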

Enabling and granting it

From the control plane, open Settings → Model boons and turn on Enable gateway tool loop, optionally adjusting Max tool turns (1–8, default 4), Tool execution timeout, and the Tool nudge. Or via the Management API:

curl -X PUT http://localhost:9180/api/v1/settings/boons \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tool_loop_enabled": true,
    "tool_loop_max_turns": 4,
    "tool_loop_tool_timeout_ms": 30000
  }'

Then grant the MCP servers to each model. In the dashboard, open the model and tick the servers under its Tools section; or set tool_servers via the API:

curl -X PUT http://localhost:9180/api/v1/models/$MODEL_ID \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "tool_servers": ["searxng"] }'

Send tool_loop_nudge as "" to reset it to the built-in default. See the Management API for the full settings shape.

Web search end to end

A common use of the tool loop is giving a model live web search. The examples/searxng/ compose file runs a private SearXNG metasearch instance fronted by an MCP server, both joined to obleth's docker network:

docker compose -f examples/searxng/docker-compose.yml up -d

Register the MCP server in obleth (MCP Servers → Register, or the API) with upstream URL http://mcp-searxng:8765/mcp, grant it to a function-calling model's tool_servers, and enable the loop. From then on the model can run live web searches mid-conversation — the client just asks a question and obleth handles the search, the tool call, and the follow-up turn. See the MCP Gateway guide for the full walkthrough.

The structured output boon

The structured_output boon enforces response_format JSON schemas at the gateway for a model without native support. The schema is rendered into the prompt, the reply is validated at the gateway, and invalid JSON is repaired by a configurable fixer model — so callers reliably get schema-conforming JSON even from a model that would otherwise return prose-wrapped or malformed output.

When it applies

The structured-output boon is enabled globally.
The target model has opted into structured_output and is not flagged supports_response_schema.
The request is a chat completion whose response_format.type is json_schema or json_object.

How it works

Request. obleth removes the response_format field and injects a system section instructing the model to reply with a single JSON document — the provided JSON Schema for json_schema, or a generic "valid JSON object" instruction for json_object. Schemas larger than 64 KB are rendered into the prompt but not validated (a guard against pathological documents).
Response. obleth extracts the JSON document from the reply (tolerating markdown fences and stray prose) and validates it against the schema. If it passes, the canonical JSON replaces the message content.
Repair. If validation fails, obleth re-prompts a fixer model (the configured one, or the request's own model when none is set) with the invalid output and the validation errors, up to max_repair_attempts times. Each repair call is billed to the tenant as a structured_output_boon record.
Fail-open. If every attempt still fails, the original completion passes through unchanged and the response carries x-obleth-boons-warning: structured_output_validation_failed.
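
The extract, validate, repair, and fail-open steps above can be sketched end to end. This is a toy under stated assumptions: extract_json, check, and enforce are hypothetical names, check is a deliberately tiny stand-in for real JSON Schema validation (it looks only at the top-level type and required keys), and fix is a hypothetical callable standing in for the fixer-model call.

```python
import json
import re

# Matches a fenced block, optionally tagged json (three backticks on each side).
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json(reply):
    """Pull a JSON object out of a reply that may include fences or prose."""
    match = FENCE.search(reply)
    candidate = match.group(1) if match else reply
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(candidate[start:end + 1])


def check(doc, schema):
    """Stand-in validator: top-level 'type' and 'required' only."""
    if schema.get("type") == "object" and not isinstance(doc, dict):
        return ["expected an object"]
    return [f"missing required property '{key}'"
            for key in schema.get("required", []) if key not in doc]


def enforce(reply, schema, fix, max_repair_attempts=1):
    """Return (content, warning); on failure the original reply passes through."""
    attempt = reply
    for remaining in range(max_repair_attempts, -1, -1):
        try:
            doc = extract_json(attempt)
            errors = check(doc, schema)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            errors = [str(exc)]
        if not errors:
            return json.dumps(doc), None
        if remaining == 0:
            break
        try:
            attempt = fix(attempt, errors)  # re-prompt with output + errors
        except Exception:
            break  # a failing fixer must not fail the request
    return reply, "structured_output_validation_failed"
```

The warning string matches the x-obleth-boons-warning value named above; everything else about the control flow is an assumption made for illustration.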

Enabling it

From Settings → Model boons, turn on Enable structured output boon, choose a Fixer model, and set the repair attempts and timeout. Or via the Management API:

curl -X PUT http://localhost:9180/api/v1/settings/boons \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "structured_output_enabled": true,
    "structured_output_fixer_model": "qwen3-235b",
    "structured_output_max_repair_attempts": 1,
    "structured_output_timeout_ms": 30000
  }'

structured_output_max_repair_attempts is clamped to a maximum of 3. Send structured_output_fixer_model as "" to repair with the request's own model instead of a dedicated fixer. Then grant structured_output to each model that should be enforced.

The compression boon

The compression boon reduces the number of tokens a model has to read — tool outputs, JSON, code, and bloated replayed history — before the request is dispatched upstream. Savings are measured before upstream billing, so they cut cost and latency for every client with no client-side changes.

Compression has four pieces, layered from always-safe to opt-in:

Piece	What it does	Lossless?
Structural JSON	Rewrite a JSON array of like objects as a compact `OBLETH_TABLE` (schema header + CSV rows); fall back to whitespace minification	Lossless — reconstruct-validated, always on when the boon is enabled
Code compaction	Strip trailing whitespace and collapse blank-line runs in fenced code	Conservative, opt-in (`code_compaction`)
Cross-turn dedup	Replace a large block repeated across messages with a `[ref:HASH]` marker	Opt-in (`dedup`); original recoverable
Lossy text	Compact long prose (salience-based sentence extraction) and logs (template collapse). Prose uses a built-in deterministic heuristic by default, or a trained extractive scorer when the compressor sidecar is deployed	Lossy, opt-in (`allow_lossy`)

The dedup and lossy passes are deterministic and, out of the box, never call a helper model or touch the network on the request path. The one optional exception is neural prose scoring: if you deploy the compressor sidecar, the lossy prose pass makes a single in-cluster scoring call — still deterministic, still extractive (it only selects existing sentences, never rewrites them). The boon is fail-open: any error, an unreachable sidecar, or any segment that wouldn't actually get smaller leaves the content untouched.
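
To make the structural JSON piece concrete, here is a sketch of a header-plus-CSV encoding with a reconstruct check and a minification fallback. The real OBLETH_TABLE layout is not specified on this page, so the format below (plain header row, JSON-encoded cells) is an assumption, as are the function names and the use of character length as a size measure.

```python
import csv
import io
import json


def to_table(rows):
    """Encode a list of like objects as a header line plus CSV rows, else None."""
    if not rows or not all(isinstance(r, dict) for r in rows):
        return None
    keys = list(rows[0])
    if any(list(r) != keys for r in rows):
        return None  # objects differ in shape: not table-like
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(keys)
    for r in rows:
        # JSON-encode each cell so numbers, booleans, and null round-trip.
        writer.writerow(json.dumps(r[k], separators=(",", ":")) for k in keys)
    return buf.getvalue()


def from_table(text):
    """Rebuild the original list of objects from the table form."""
    reader = csv.reader(io.StringIO(text))
    keys = next(reader)
    return [dict(zip(keys, (json.loads(c) for c in row))) for row in reader]


def compact(payload):
    """Table form if it reconstructs exactly and is smaller, else minified JSON."""
    data = json.loads(payload)
    minified = json.dumps(data, separators=(",", ":"))
    table = to_table(data) if isinstance(data, list) else None
    if table is not None and from_table(table) == data and len(table) < len(minified):
        return table
    return minified
```

The reconstruct comparison is what makes the pass lossless: the table is used only when decoding it yields the original data, and anything else falls back to whitespace minification.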

When it applies

Like every boon, compression is opt-in twice — enabled globally and granted to the model via its boons list (compression). Beyond that, each piece is gated independently:

Structural JSON runs whenever the boon is enabled for the model and the tenant hasn't opted out (chat completions only; internal probe keys are exempt).
Code compaction runs only when code_compaction is on — globally as the default, or per tenant.
Cross-turn dedup and lossy text require the tenant to opt in (dedup / allow_lossy). They are model-free and run on any model — there is no function-calling or tool-loop requirement.

Compression targets any large segment: the latest user message (so a "here's a huge file, now answer" turn is compacted), older history, and tool outputs. The only thing never modified is a trailing assistant message. Segments below min_tokens are always left alone.

`retrieve_original` (a bonus, not a requirement)

When dedup or the lossy pass replaces a segment, obleth first stores the original in Redis (keyed by a content hash, with a TTL) and leaves a [ref:HASH] marker in its place. If the model supports function calling and the gateway tool loop is enabled, obleth also injects a gateway-executed retrieve_original tool plus a short system note, so the model can recover the full text on demand — the call is a Redis lookup executed at the gateway, never forwarded upstream or to the client. On a model without function calling the compaction still happens; only the recovery tool is omitted.

The original is always stashed before the segment is replaced (and only once the segment is confirmed to actually shrink); if the stash fails, the segment is left verbatim. A lookup miss (expired TTL, unknown hash) returns a clear "no longer available" message rather than failing the request.
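
The stash-before-replace ordering can be sketched with an in-memory store standing in for Redis. Everything here is illustrative: OriginalStore, replace_with_ref, the truncated SHA-256 digest, the character-length shrink check, and the exact miss message are assumptions, and TTL expiry is not modelled.

```python
import hashlib


class OriginalStore:
    """In-memory stand-in for the Redis stash (no TTL handling in this sketch)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def replace_with_ref(segment, compacted, store):
    """Swap a segment for its compacted form plus a [ref:HASH] marker."""
    digest = hashlib.sha256(segment.encode("utf-8")).hexdigest()[:12]
    replacement = f"{compacted} [ref:{digest}]"
    if len(replacement) >= len(segment):
        return segment  # would not shrink: leave verbatim
    try:
        store.put(digest, segment)  # stash first
    except Exception:
        return segment  # stash failed: fail open, keep the original
    return replacement


def retrieve_original(digest, store):
    """Gateway-side lookup behind the retrieve_original tool."""
    original = store.get(digest)
    if original is None:
        return "Original content is no longer available."
    return original
```

Checking for shrinkage before stashing, and stashing before replacing, means a marker can never point at content that was not stored.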

A/B compression from the API

Two request headers let a client compare compression on a single request, without touching any settings:

x-obleth-boons: off — bypass all boons (the uncompressed baseline).
x-obleth-boons: lossy — force the lossy pass on for this request even where the tenant hasn't opted in. The boon must still be granted to the model — the header widens a granted boon, it never enables an ungranted one.

Whenever the compression boon ran, the response carries an x-obleth-compression header summarizing what it did:

x-obleth-compression: before=18423;after=11278;saved=7145

Diff the same request with and without the headers to measure savings back-to-back — no dashboard or trace round-trip needed. obench compression automates exactly this A/B across payload corpora.
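
A client can read the savings straight from the header. The sketch below parses the sample value shown above; parse_compression_header is a hypothetical helper, and it assumes the header always carries semicolon-separated key=integer pairs.

```python
def parse_compression_header(value):
    """Parse 'before=..;after=..;saved=..' into a dict of integers."""
    fields = dict(item.split("=", 1) for item in value.split(";") if item)
    return {key: int(num) for key, num in fields.items()}


stats = parse_compression_header("before=18423;after=11278;saved=7145")
ratio = stats["saved"] / stats["before"]
print(f"saved {stats['saved']} tokens ({ratio:.1%})")  # saved 7145 tokens (38.8%)
```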

Fail-open, billing, and observability

Fail-open. Redis unavailable or a parse error skips that segment; the boon never blocks or fails a request.
Billing. Compression makes no helper-model calls, so it adds no per-segment usage charges — the savings simply lower the upstream token bill.
Savings & spans. Tokens saved are counted in the obleth_compression_tokens_saved_total metric, and the tracer records a single boon:compression span with json_compacted, dedup_refs, lossy_segments, and before/after token totals.

Enabling it (global)

Compression settings live in the app_settings store (key boons) and are hot-reloadable. From the control plane, open the Settings → Compression tab — it also shows live neural sidecar status (configured / reachable, plus the model name and revision it reports) — or use the Management API:

curl -X PUT http://localhost:9180/api/v1/settings/boons \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "compression_enabled": true,
    "compression_min_tokens": 512,
    "compression_max_segments": 64,
    "compression_max_lossy_segments": 4,
    "compression_code_compaction": false,
    "compression_dedup": false,
    "compression_compact_logs": false,
    "compression_allow_lossy": false,
    "compression_original_ttl_secs": 3600,
    "compression_neural_keep_ratio": 0.5
  }'

compression_min_tokens (default 512) — skip segments smaller than this; the overhead isn't worth it.
compression_max_segments (default 64) — cap on lossless segments compacted per request.
compression_max_lossy_segments (default 4) — cap on dedup + lossy segments per request (a guard against over-rewriting a single request).
compression_code_compaction (default false) — the global default for code compaction; a tenant policy overrides it.
compression_dedup / compression_compact_logs / compression_allow_lossy (all default false) — global defaults for the three per-piece toggles (cross-turn dedup, near-lossless log template-collapse, and lossy prose compaction). A tenant with no policy inherits these; a per-tenant policy overrides them. Flip compression_compact_logs on to collapse verbose logs fleet-wide; compression_allow_lossy on to trim prose everywhere.
compression_original_ttl_secs (default 3600) — how long stashed originals live in Redis for retrieve_original.
compression_neural_keep_ratio (default 0.5, range (0.0, 1.0]) — fraction of sentences the lossy prose pass keeps; lower is more aggressive. Applies to both the built-in heuristic and the neural compressor sidecar. Out-of-range values are ignored (left unchanged).

Granting and tuning per tenant

Grant the boon to a model by adding compression to its boons list (dashboard: Models → a row → Boons → compression, or PUT /models/{id} with { "boons": ["compression"] }).

Each tenant then controls which pieces apply via a per-tenant policy. In the dashboard, open a tenant and use the Compression tab — four independent toggles: Enabled (master), Code compaction, Cross-turn dedup, and Allow lossy. Or via the Management API:

curl -X PATCH http://localhost:9180/api/v1/tenants/$ID/compression \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "policy": { "enabled": true, "code_compaction": true, "dedup": true, "allow_lossy": true } }'

Send { "policy": null } to clear a tenant's policy and fall back to the global defaults. A tenant with no policy inherits the global per-piece defaults (compression_dedup / compression_compact_logs / compression_allow_lossy / compression_code_compaction from Settings → Compression). All ship false, so out of the box a policy-less tenant gets lossless structural compaction only — nothing lossy happens until an operator flips a global default or the tenant opts in with its own policy.
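
The inheritance rule can be sketched as a small resolver. This is one possible reading, not confirmed behaviour: it assumes that a field missing from a tenant policy falls back to the global default, and that a tenant without a policy counts as enabled. effective_policy and the key names in GLOBAL_DEFAULTS are hypothetical shorthand for the compression_* settings.

```python
GLOBAL_DEFAULTS = {  # shipped defaults, per the settings described above
    "code_compaction": False,
    "dedup": False,
    "compact_logs": False,
    "allow_lossy": False,
}


def effective_policy(tenant_policy, global_defaults=GLOBAL_DEFAULTS):
    """Resolve which compression pieces run for a tenant."""
    if tenant_policy is None:
        # No policy: lossless structural compaction stays on,
        # and every other piece follows the global defaults.
        return {"enabled": True, **global_defaults}
    resolved = {"enabled": tenant_policy.get("enabled", True)}
    for piece, default in global_defaults.items():
        # Assumption: a missing field inherits the global default.
        resolved[piece] = tenant_policy.get(piece, default)
    return resolved
```

With the shipped defaults, effective_policy(None) enables nothing beyond lossless structural compaction, which matches the out-of-the-box behaviour described above.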

Neural prose scoring (optional compressor sidecar)

The lossy prose pass ranks sentences by importance and keeps the top ones. By default it uses a built-in deterministic heuristic. If you'd rather use a trained extractive scorer, deploy the compressor sidecar — a small self-hosted service that scores sentence importance with a trained ModernBERT-based model (kompress-v2-base, Apache-2.0) served as pre-built ONNX on CPU. obleth still owns all the logic (sentence splitting, keep-selection, the token-gain gate, the Redis stash, and the retrieve_original machinery); the sidecar only scores sentences.

It keeps every guarantee of the boon:

Extractive — it selects existing sentences, never rewrites or fabricates text.
Reversible — the original is stashed in Redis exactly as with the heuristic path.
Fail-open — if the sidecar is unset, times out, or errors, obleth silently uses the built-in heuristic. Enabling it never fails or degrades a request.
On your own infrastructure — the model runs in a container you deploy; no request text ever leaves your cluster.

It is gated by the same tenant allow_lossy opt-in as the heuristic prose pass — there is no extra tenant toggle. The operator switch is simply deploying the sidecar and pointing obleth at it.

Docker Compose. Add compressor to COMPOSE_PROFILES and set the URL in deploy/docker/.env:

COMPOSE_PROFILES=benchmark,edge,observability,compressor
OBLETH_COMPRESSOR_URL=http://compressor:8080
# OBLETH_COMPRESSOR_TIMEOUT_MS=800   # optional; default 800ms

Then docker compose up -d --build from deploy/docker/. The first build is slow — the image bakes a ~600 MB ONNX model.

Kubernetes (Helm). Set compressor.enabled: true; the chart deploys the sidecar and wires OBLETH_COMPRESSOR_URL into obleth for you. The service is stateless, so turn on compressor.autoscaling.enabled to scale replicas under load. See Modular deploy → Neural prose compression sidecar.

The gateway makes one batched scoring call per request (all eligible prose segments together), with a short timeout; on any failure it falls straight back to the heuristic.
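
The scoring fallback and the extractive keep-selection can be sketched together. This is a toy, not the shipped logic: heuristic_scores is a placeholder (word count) for the built-in heuristic, sidecar is a hypothetical callable standing in for the batched scoring call, and rounding the kept count up with at least one sentence kept is an assumption.

```python
import math


def heuristic_scores(sentences):
    """Toy stand-in for the built-in heuristic: longer sentences score higher."""
    return [float(len(s.split())) for s in sentences]


def score_sentences(sentences, sidecar=None):
    """Use the sidecar scorer when it works; otherwise fall back to the heuristic."""
    if sidecar is not None:
        try:
            scores = sidecar(sentences)
            if len(scores) == len(sentences):
                return list(scores)
        except Exception:
            pass  # unreachable, timed out, or errored: fall through
    return heuristic_scores(sentences)


def keep_top(sentences, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of sentences, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0.0, 1.0]")
    keep = max(1, math.ceil(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

The selection only ever returns sentences from the input, which is what keeps the pass extractive regardless of which scorer produced the scores.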

Relationship to multi-modal routes

The vision boon is distinct from registering a natively vision-capable chat model. If a model can see images itself, give it the vision tag (which sets supports_vision: true) and the boon leaves its requests alone. The boon exists specifically to extend text-only models that opt in. For serving images, audio, and embeddings as first-class modalities, see Multi-modal Models.
