How obleth picks a concrete model when a client sends model: "auto" — hard filters, capacity and cost scoring, and the optional intent classifier.
When a client sends model: "auto" instead of a registered model name, obleth chooses a concrete model for the request at admission time. This lets callers ask for "the right model for this prompt" without hard-coding a model name, while operators keep control over which models are eligible and how they are ranked.
Only chat routes participate in auto selection. Requests on other modalities (embeddings, audio, image) must name a registered model.
Selection is a two-stage process over the live model registry.
First, hard filters. A model is removed from consideration if it cannot serve the request at all:
- Its context_window is too small for the estimated prompt plus the requested max_tokens.
- It lacks a capability the request requires, such as tool_choice or a JSON response schema.
Second, scoring. Survivors are ranked by a blend of:
- Capacity, so that auto traffic spreads across idle backends instead of piling onto one.
- Cost, so that cheaper models are preferred.
When the request has desired tags (see below), a tag-match layer is mixed in at 50% weight, so a model carrying the right tags is preferred, but a busy or expensive tagged model can still lose to a cheaper idle one. Ties break deterministically on model name.
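The two stages can be sketched in a few lines of Python. This is an illustrative model only, not obleth's implementation: the `Model` and `Request` types, the `load` and `cost` fields (assumed normalized to 0..1), and the equal-weight blend of capacity and cost are all assumptions. Only the 50% tag weight and the name-based tie-break come from the description above.

```python
from dataclasses import dataclass

TAG_WEIGHT = 0.5  # tag-match layer is mixed in at 50% weight (from the text)


@dataclass(frozen=True)
class Model:
    name: str
    context_window: int
    load: float  # hypothetical: 0.0 = idle, 1.0 = saturated
    cost: float  # hypothetical: 0.0 = cheapest, 1.0 = most expensive
    supports_tools: bool = True
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Request:
    estimated_prompt_tokens: int
    max_tokens: int
    needs_tools: bool = False
    desired_tags: frozenset = frozenset()


def passes_hard_filters(model: Model, req: Request) -> bool:
    """Stage 1: drop models that cannot serve the request at all."""
    if model.context_window < req.estimated_prompt_tokens + req.max_tokens:
        return False
    if req.needs_tools and not model.supports_tools:
        return False
    return True


def score(model: Model, req: Request) -> float:
    """Stage 2: blend capacity and cost, then mix in tag match if tags are desired."""
    # Assumed equal-weight blend; the real weights are not documented here.
    base = 0.5 * (1.0 - model.load) + 0.5 * (1.0 - model.cost)
    if not req.desired_tags:
        return base
    tag_match = len(model.tags & req.desired_tags) / len(req.desired_tags)
    return (1.0 - TAG_WEIGHT) * base + TAG_WEIGHT * tag_match


def select(models, req: Request):
    """Return the best surviving model, or None if every model was filtered out."""
    survivors = [m for m in models if passes_hard_filters(m, req)]
    if not survivors:
        return None
    # Highest score wins; equal scores break deterministically on model name.
    return min(survivors, key=lambda m: (-score(m, req), m.name))
```

With these assumed weights, a busy, expensive model tagged coding can still beat an idle, cheap untagged one on a coding request, but a model with only a partial tag match may not. The exact crossover depends on weights that this sketch invents.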
Each model can be tagged with a subset of a fixed vocabulary. These describe what a model is good at, and bias selection toward the best match:
coding · general · reasoning · math · vision · long-context · fast · creative
Tags are set per model from the dashboard (Models) or the Management API. Tagging is optional — with no tags anywhere, auto falls back to pure capacity-and-cost scoring.
The vision tag does double duty: besides biasing routing, it marks the model as natively image-capable (sets supports_vision), making it eligible to serve as a system-wide describer for the vision boon.
To map a request to tags, obleth can consult a small, fast "brain" model whose only job is to classify the prompt into 1–3 tags from the vocabulary. The classifier is deliberately defensive:
- Each call is bounded by a short timeout (classifier_timeout_ms, default 250ms).
- An auto request is never blocked or failed because the brain is slow or down.
The classifier is seeded on first boot from environment variables, then managed at runtime through the Management API (the persisted setting is authoritative once saved).
Bootstrap env vars (see Environment Variables):
OBLETH_AUTO_CLASSIFIER_ENABLED=true
OBLETH_AUTO_CLASSIFIER_MODEL=qwen3-0.6b # a registered, fast chat model
OBLETH_AUTO_CLASSIFIER_TIMEOUT_MS=250
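The precedence rule (environment variables seed the settings on first boot; a saved setting wins afterwards) might look roughly like the sketch below. The `AutoRouterSettings` type and both function names are hypothetical, and treating an unset OBLETH_AUTO_CLASSIFIER_ENABLED as disabled is an assumption. Only the variable names and the 250ms default come from this page.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class AutoRouterSettings:
    classifier_enabled: bool = False  # assumed default when the env var is unset
    classifier_model: str = ""
    classifier_timeout_ms: int = 250  # documented default


def settings_from_env(env: Mapping[str, str]) -> AutoRouterSettings:
    """Build the first-boot seed from the bootstrap environment variables."""
    enabled = env.get("OBLETH_AUTO_CLASSIFIER_ENABLED", "false").strip().lower()
    return AutoRouterSettings(
        classifier_enabled=(enabled == "true"),
        classifier_model=env.get("OBLETH_AUTO_CLASSIFIER_MODEL", ""),
        classifier_timeout_ms=int(env.get("OBLETH_AUTO_CLASSIFIER_TIMEOUT_MS", "250")),
    )


def effective_settings(
    persisted: Optional[AutoRouterSettings], env: Mapping[str, str]
) -> AutoRouterSettings:
    """A saved setting is authoritative; the env vars only apply when nothing is saved."""
    return persisted if persisted is not None else settings_from_env(env)
```

Calling `effective_settings(None, os.environ)` would return the seed, and any later saved value would replace it regardless of what the environment says.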
Read and update it live:
# Read current settings
curl http://localhost:9180/api/v1/settings/auto-router \
-H "Authorization: Bearer $TOKEN"
# Enable the classifier and point it at a small model
curl -X PUT http://localhost:9180/api/v1/settings/auto-router \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"classifier_enabled": true,
"classifier_model": "qwen3-0.6b",
"classifier_timeout_ms": 250
}'
Set classifier_model to "" to clear it. Changes apply on the next admission decision without a restart.
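The defensive behavior of the classifier (a hard timeout, and never failing the request) can be illustrated with a small wrapper. This is a sketch under assumptions, not obleth's code: `classify_with_timeout` is a hypothetical name, the `classify` callable stands in for the call to the brain model, and discarding out-of-vocabulary tags is an assumed sanitization step. An empty result here means "no desired tags".

```python
import concurrent.futures

# Fixed tag vocabulary listed earlier on this page.
VOCABULARY = {
    "coding", "general", "reasoning", "math",
    "vision", "long-context", "fast", "creative",
}
MAX_TAGS = 3  # the classifier emits 1-3 tags


def classify_with_timeout(classify, prompt, timeout_ms=250):
    """Run classify(prompt) under a deadline; return [] on timeout or any error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(classify, prompt)
    try:
        raw = future.result(timeout=timeout_ms / 1000)
    except Exception:
        # Timeout or classifier failure: degrade to "no tags", never raise.
        return []
    finally:
        # Do not wait for a slow classifier call to finish.
        pool.shutdown(wait=False)

    if not isinstance(raw, (list, tuple)):
        return []
    tags = []
    for tag in raw:
        # Assumed sanitization: keep known tags only, without duplicates.
        if isinstance(tag, str) and tag in VOCABULARY and tag not in tags:
            tags.append(tag)
    return tags[:MAX_TAGS]
```

The caller never sees an exception, so a slow or unavailable brain only costs the request its tag bias, which is the property the timeout is meant to protect.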
Use the data-plane URL for your deployment: http://localhost via HAProxy (Compose edge profile), http://localhost:8088 for the direct Compose host port, or http://localhost:8080 when running the gateway natively with cargo run.
curl http://localhost/v1/chat/completions \
-H "Authorization: Bearer sk_..." \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Write a Python function to parse a CSV file."}]
}'
obleth resolves auto to a concrete model (here, likely one tagged coding), admits the request through the normal fairshare pipeline, and records the chosen model in the usage ledger.
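The same request can be built from Python with only the standard library. This is a minimal sketch: `build_auto_request` is a hypothetical helper, the base URL and API key are placeholders for your deployment, and the shape of the response is not covered on this page, so the example stops at constructing the request.

```python
import json
import urllib.request


def build_auto_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build a chat completion request that asks obleth to pick the model."""
    body = {
        "model": "auto",  # let the gateway choose a concrete model
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it, which requires a running gateway at the chosen data-plane URL.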