
Auto Model Routing

How obleth picks a concrete model when a client sends model: "auto" — hard filters, capacity and cost scoring, and the optional intent classifier.

When a client sends model: "auto" instead of a registered model name, obleth chooses a concrete model for the request at admission time. This lets callers ask for "the right model for this prompt" without hard-coding a model name, while operators keep control over which models are eligible and how they are ranked.

Only chat routes participate in auto selection. Requests on other modalities (embeddings, audio, image) must name a registered model.

How selection works

Selection is a two-stage process over the live model registry.

1. Hard filters

A model is removed from consideration if it cannot serve the request at all:

  • It is disabled, unhealthy, or inside a maintenance window.
  • Its context_window is too small for the estimated prompt plus requested max_tokens.
  • It lacks a capability the request needs: function calling, tool_choice, or a JSON response schema.
  • The tenant has a model allowlist and this model is not on it.
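
The four checks above can be sketched as a single predicate. This is a minimal illustration, not obleth's implementation: the `Model` and `Request` records, their field names, and the capability strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Model:
    # Hypothetical registry record; field names are assumptions.
    name: str
    context_window: int
    enabled: bool = True
    healthy: bool = True
    in_maintenance: bool = False
    capabilities: set = field(default_factory=set)


@dataclass
class Request:
    # Hypothetical view of an incoming auto request.
    estimated_prompt_tokens: int
    max_tokens: int
    required_capabilities: set = field(default_factory=set)
    tenant_allowlist: Optional[set] = None  # None means the tenant has no allowlist


def passes_hard_filters(model: Model, req: Request) -> bool:
    """Return True if the model can serve the request at all."""
    # Disabled, unhealthy, or inside a maintenance window.
    if not model.enabled or not model.healthy or model.in_maintenance:
        return False
    # Context window must hold the estimated prompt plus the requested completion.
    if model.context_window < req.estimated_prompt_tokens + req.max_tokens:
        return False
    # Every capability the request needs must be present on the model.
    if not req.required_capabilities <= model.capabilities:
        return False
    # A tenant allowlist, when set, must include the model.
    if req.tenant_allowlist is not None and model.name not in req.tenant_allowlist:
        return False
    return True
```

Filtering the registry is then a list comprehension such as `[m for m in registry if passes_hard_filters(m, req)]`, and only the survivors go on to scoring.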

2. Scoring

Survivors are ranked by a blend of:

  • Spare capacity (60%) — models that are not busy score higher, so auto traffic spreads across idle backends instead of piling onto one.
  • Cost (40%) — cheaper models (by per-token pricing) score higher.

When the request has desired tags (see below), a tag-match layer is mixed in at 50% weight, so a model carrying the right tags is preferred — but a busy or expensive tagged model can still lose to a cheaper idle one. Ties break deterministically on model name.
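
The blend can be sketched as follows. Several details here are assumptions rather than documented behavior: both inputs are taken to be pre-normalised to the range 0 to 1, the tag match is taken to be the fraction of desired tags the model carries, and "mixed in at 50% weight" is read as an even blend of the base score and the tag match. The `Candidate` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical, pre-normalised inputs; names are assumptions.
    name: str
    spare_capacity: float  # 0.0 = saturated, 1.0 = idle
    cost_score: float      # 0.0 = most expensive, 1.0 = cheapest
    tags: frozenset = frozenset()


def score(c, desired_tags=frozenset()):
    # Base layer: 60% spare capacity, 40% cost.
    base = 0.6 * c.spare_capacity + 0.4 * c.cost_score
    if not desired_tags:
        return base
    # Assumption: tag match is the fraction of desired tags the model carries.
    tag_match = len(desired_tags & c.tags) / len(desired_tags)
    # Assumption: "50% weight" read as an even blend with the base score.
    return 0.5 * base + 0.5 * tag_match


def pick(candidates, desired_tags=frozenset()):
    """Return the best candidate, or None if no model survived the filters."""
    if not candidates:
        return None
    # Highest score wins; ties break on model name in ascending order.
    return min(candidates, key=lambda c: (-score(c, desired_tags), c.name))
```

Under these assumptions, a saturated, expensive model that matches only one of two desired tags scores below an idle, cheap model with no tags at all, which is the trade-off the paragraph above describes.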

Routing tags

Each model can be tagged with a subset of a fixed vocabulary. These describe what a model is good at, and bias selection toward the best match:

coding · general · reasoning · math · vision · long-context · fast · creative

Tags are set per model from the dashboard (Models) or the Management API. Tagging is optional — with no tags anywhere, auto falls back to pure capacity-and-cost scoring.

The vision tag does double duty: besides biasing routing, it marks the model as natively image-capable (sets supports_vision), making it eligible to serve as a system-wide describer for the vision boon.

The intent classifier (optional)

To map a request to tags, obleth can consult a small, fast "brain" model whose only job is to classify the prompt into 1–3 tags from the vocabulary. The classifier is deliberately defensive:

  • It is bounded by a hard timeout (classifier_timeout_ms, default 250 ms).
  • Results are cached briefly per prompt.
  • On timeout, transport error, or unparseable output it returns no tags, and obleth falls back to cheap keyword heuristics. An auto request is never blocked or failed because the brain is slow or down.
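
The defensive behavior above can be sketched as a wrapper around the brain call. This is an illustration under stated assumptions: `call_brain` is a hypothetical callable that returns a comma-separated tag string, the keyword table and the 30-second cache lifetime are invented for the example, and caching only successful classifications is a choice made here, not something this page specifies.

```python
import concurrent.futures
import time

VOCAB = {"coding", "general", "reasoning", "math", "vision",
         "long-context", "fast", "creative"}

# Hypothetical keyword table; the real heuristics are not described on this page.
KEYWORDS = {
    "coding": ("python", "function", "bug", "compile"),
    "math": ("integral", "equation", "prove"),
}

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_cache = {}  # prompt -> (expiry on the monotonic clock, tags)


def keyword_tags(prompt):
    """Cheap fallback: tag by substring match against the keyword table."""
    text = prompt.lower()
    return [tag for tag, words in KEYWORDS.items() if any(w in text for w in words)]


def parse_tags(raw):
    """Accept 1 to 3 comma-separated tags from the vocabulary, else raise."""
    tags = [t.strip() for t in raw.split(",")]
    if not 1 <= len(tags) <= 3 or any(t not in VOCAB for t in tags):
        raise ValueError("unparseable classifier output")
    return tags


def desired_tags(prompt, call_brain, timeout_ms=250, cache_ttl_s=30.0):
    now = time.monotonic()
    hit = _cache.get(prompt)
    if hit and hit[0] > now:
        return hit[1]
    try:
        future = _pool.submit(call_brain, prompt)
        # result() raises on timeout and re-raises any error from the brain call.
        tags = parse_tags(future.result(timeout=timeout_ms / 1000))
    except Exception:
        # Timeout, transport error, or bad output: never fail the request.
        return keyword_tags(prompt)
    _cache[prompt] = (now + cache_ttl_s, tags)
    return tags
```

Note that a timed-out call keeps running on its worker thread in this sketch; a production version would also want to cancel or bound the underlying request.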

Configuring the classifier

The classifier is seeded from environment variables on first boot and managed at runtime through the Management API. Once a setting has been saved, the persisted value is authoritative.

Bootstrap env vars (see Environment Variables):

OBLETH_AUTO_CLASSIFIER_ENABLED=true
OBLETH_AUTO_CLASSIFIER_MODEL=qwen3-0.6b      # a registered, fast chat model
OBLETH_AUTO_CLASSIFIER_TIMEOUT_MS=250

Read and update it live:

# Read current settings
curl http://localhost:9180/api/v1/settings/auto-router \
  -H "Authorization: Bearer $TOKEN"

# Enable the classifier and point it at a small model
curl -X PUT http://localhost:9180/api/v1/settings/auto-router \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "classifier_enabled": true,
    "classifier_model": "qwen3-0.6b",
    "classifier_timeout_ms": 250
  }'

Set classifier_model to "" to clear it. Changes apply on the next admission decision without a restart.
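
For callers that script this from Python rather than curl, the same update can be built with the standard library. This is a sketch that mirrors the curl example above; the helper name `build_settings_update` is hypothetical, and the base URL assumes the local management port shown in that example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9180"  # management port from the curl example


def build_settings_update(token, *, enabled, model, timeout_ms=250):
    """Build (but do not send) a PUT request for the auto-router settings."""
    body = json.dumps({
        "classifier_enabled": enabled,
        "classifier_model": model,  # pass "" to clear the model
        "classifier_timeout_ms": timeout_ms,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/settings/auto-router",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it is a matter of passing the result to `urllib.request.urlopen(...)`; building the request separately keeps the payload easy to inspect or log before it goes out.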

Using it from a client

Use the data-plane URL for your deployment: http://localhost via HAProxy (Compose edge profile), http://localhost:8088 for the direct Compose host port, or http://localhost:8080 when running the gateway natively with cargo run.

curl http://localhost/v1/chat/completions \
  -H "Authorization: Bearer sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a Python function to parse a CSV file."}]
  }'

obleth resolves auto to a concrete model (here, likely one tagged coding), admits the request through the normal fairshare pipeline, and records the chosen model in the usage ledger.