Base URL: https://api.lxg2it.com
All API requests require a Bearer token. Include your API key in the
`Authorization` header. Generate keys at /profile; keys start with `mr_sk_`.
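A minimal sketch of the header construction in Python. The key shown is a placeholder; real keys come from /profile:

```python
# Build the auth headers for any Model Router request.
# "mr_sk_example" is a placeholder, not a real key.
API_KEY = "mr_sk_example"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# Use these headers with any HTTP client, e.g. against
# https://api.lxg2it.com/v1/chat/completions.
```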
Request body:
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. Either a tier name (`economy`, `standard`, `premium`, `auto`) or an exact model ID to pin routing (e.g. `gpt-4.1`, `claude-sonnet-4-6`). See Tiers and Model pinning. |
| `messages` | array | Required. Array of message objects with `role` and `content`. Roles: `system`, `user`, `assistant`. |
| `prefer` | string | Optimisation direction within the tier: `cheap` (lowest cost), `fast` (lowest latency), `balanced` (default), `quality` (highest quality score). |
| `stream` | boolean | Stream response chunks via SSE. Default: `false`. |
| `temperature` | number | Sampling temperature (0–2). Passed through to the provider. |
| `max_tokens` | integer | Maximum tokens to generate. Passed through to the provider. |
| `top_p` | number | Nucleus sampling parameter. Passed through to the provider. |
| `stop` | string \| array | Stop sequence(s). Passed through to the provider. |
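For instance, a request body using the parameters above might be built like this. The tier and `prefer` values come from the tables in this document; the message content is illustrative:

```python
import json

# A minimal chat-completions payload; parameter names follow the table above.
payload = {
    "model": "standard",   # tier name, or an exact model ID to pin
    "prefer": "cheap",     # rank models within the tier by cost
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this paragraph."},
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)  # POST this to /v1/chat/completions
```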
Response: Standard OpenAI chat completion object with `id`, `choices`, `usage`, etc.
The `model` field in the response contains the actual model that served the request.
Response headers:
| Header | Description |
|---|---|
| `X-Model-Router-Model` | The model that served the request. |
| `X-Model-Router-Provider` | The provider that served the request. |
| `X-Request-Id` | Unique request ID. Use this to correlate with telemetry traces. |
| `X-Model-Router-Auto-Score` | Auto-routing score (0–100). Only present when `model: "auto"`. |
| `X-Model-Router-Auto-Tier` | Tier selected by auto-routing. Only present when `model: "auto"`. |
Tiers group models by capability. The router selects the best model within
the tier based on your prefer setting, provider
availability, and context-window fit.
| Tier | Description | Example models |
|---|---|---|
| `economy` | Fast, cheap, good for simple tasks | GPT-4.1 Mini, Claude 3.5 Haiku, Gemini 2.0 Flash |
| `standard` | Balanced capability and cost | GPT-4.1, Claude Sonnet 4, Gemini 2.5 Pro |
| `premium` | Maximum capability, reasoning models | GPT-4.5, Claude Opus 3, o1 |
| `auto` | Heuristic classifier analyses your full conversation context to select the appropriate tier (see Auto-routing) | Varies by context |
See the live model list at /v1/models.
Set `model: "auto"` to let the router infer the right tier from your
full conversation context. Unlike single-message classifiers, auto-routing analyses the entire
`messages` array — system prompt, conversation history, code blocks,
tool calls, and reasoning markers — then produces a complexity score from 0–100 that maps to a tier.
Every auto-routed response includes two extra headers:
| Header | Description |
|---|---|
| `X-Model-Router-Auto-Score` | Complexity score 0–100 computed from your request context |
| `X-Model-Router-Auto-Tier` | Tier selected by auto-routing (`economy` / `standard` / `premium`) |
The score is built from seven weighted signals:
| Signal | Weight | What it measures |
|---|---|---|
| Code blocks | 20% | Fenced code, inline code, and code-like lines across all messages |
| Technical keywords | 20% | Premium terms (consensus, compiler, theorem) and standard terms (API, database, function) |
| Reasoning markers | 15% | Phrases like "step by step", "trade-offs", "design a system", "prove that" |
| System prompt length | 15% | Longer system prompts indicate specialised agents |
| Conversation depth | 10% | Number of prior turns — accumulated context raises complexity |
| Tool usage | 10% | Presence of tool_calls and tool role messages |
| Message complexity | 10% | Maximum user message length |
The final score combines a weighted average with a strongest-signal boost
(score = weighted_avg × 0.6 + max_signal × 0.4), so a single strong
indicator is enough to push past a tier threshold even when other signals are zero.
Score thresholds: 0–20 → economy, 21–55 → standard, 56–100 → premium.
The economy ceiling is intentionally low — strong confidence is required before
routing to cheaper models.
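The scoring arithmetic above can be sketched as follows. The signal key names and the example values are illustrative; the weights, the 0.6/0.4 blend, and the tier thresholds are the ones documented above:

```python
def auto_score(signals: dict, weights: dict) -> float:
    """Blend a weighted average with a strongest-signal boost."""
    weighted_avg = sum(signals[k] * weights[k] for k in weights)
    max_signal = max(signals.values())
    return weighted_avg * 0.6 + max_signal * 0.4

def tier_for(score: float) -> str:
    """Map a 0-100 score to a tier using the documented thresholds."""
    if score <= 20:
        return "economy"
    if score <= 55:
        return "standard"
    return "premium"

# Weights from the signal table (illustrative key names).
weights = {
    "code_blocks": 0.20, "keywords": 0.20, "reasoning": 0.15,
    "system_prompt": 0.15, "depth": 0.10, "tools": 0.10, "message": 0.10,
}

# One strong signal, everything else zero: the max-signal boost alone
# (0.6 * 18 + 0.4 * 90 = 46.8) is enough to reach the standard tier.
signals = {k: 0.0 for k in weights}
signals["code_blocks"] = 90.0
score = auto_score(signals, weights)
```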
Auto-routing is deterministic: the same input always produces the same score and tier. It adds under 1 ms of overhead (no ML model, no embeddings, no external calls).
Auto-routing analysis runs entirely in-process. No request content is stored, logged, or used for training — only the derived numeric score and selected tier are recorded for observability.
Pass an exact model ID in the model field to bypass
tier routing and target a specific model. The ID must match a model in our
catalog (visible at /v1/models).
When pinning, the prefer parameter is ignored.
If the pinned model’s provider is unavailable, the request fails rather
than falling back to another model.
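A small sketch of the tier-vs-pin distinction, assuming the four tier aliases above are the only reserved names:

```python
# Tier aliases from the Tiers table; any other model value pins routing.
TIERS = {"economy", "standard", "premium", "auto"}

def is_pinned(model: str) -> bool:
    """A request is pinned when model is not one of the tier aliases."""
    return model not in TIERS
```

For a pinned request, `prefer` is ignored and there is no fallback; for a tier request, the router chooses within the tier.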
The prefer field controls how the router ranks
models within the resolved tier. It does not change which tier is used.
| Value | Behaviour |
|---|---|
| `cheap` | Lowest cost per token. |
| `fast` | Lowest latency (time to first token). |
| `balanced` | Default. Cheapest first, breaking ties by quality. |
| `quality` | Highest quality score, breaking ties by cost. |
| `coding` | Highest SWE-bench score. Routes to models with the strongest software engineering performance. |
Tool calls work the same as the OpenAI API — pass a tools array
and the router handles the format translation to each provider automatically.
You never need to handle Anthropic’s tool_use blocks or
Google’s functionCall parts; everything comes back in
standard OpenAI format.
The response contains a standard tool_calls array. Submit tool results
back using role: "tool" messages as you normally would.
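A sketch of the round trip in OpenAI format. The `get_weather` function, its schema, and the call id are all made up for illustration:

```python
# Tool definition passed in the request's tools array (illustrative schema).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Suppose the assistant responded with a tool call:
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
    }],
}

# Submit the result back with a role:"tool" message referencing the call id.
tool_result = {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 4}'}
```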
Several models in the router are reasoning models: they think through a problem internally before writing their response. By default this thinking is hidden — you only see the final answer.
Set "include_reasoning": true to receive the thinking alongside
the response. This works in both streaming and non-streaming modes, across all providers.
Economy tier reasoning models: grok-3-mini-beta, gemini-2.5-flash. Standard/premium: o4-mini, o3, gemini-2.5-pro, claude-opus-4-6 (extended thinking).
For streaming, reasoning_content arrives as delta chunks
before the regular content chunks. Filter by which field
is present to separate them:
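A minimal sketch of that filtering, using hand-written mock chunks in place of a real SSE stream:

```python
# Mock delta chunks: each carries either reasoning_content or content.
chunks = [
    {"choices": [{"delta": {"reasoning_content": "First, parse the input. "}}]},
    {"choices": [{"delta": {"reasoning_content": "Then check edge cases. "}}]},
    {"choices": [{"delta": {"content": "Here is the answer."}}]},
]

reasoning, answer = [], []
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    if "reasoning_content" in delta:
        reasoning.append(delta["reasoning_content"])
    elif "content" in delta:
        answer.append(delta["content"])

thinking = "".join(reasoning)
final = "".join(answer)
```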
Note: include_reasoning increases token usage and latency.
For models billed by output tokens, thinking tokens count toward your usage.
Use the same API key and base URL as chat completions. Billed at input tokens only — there are no output tokens for embeddings.
Request body:
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. An embedding tier alias or exact model ID. See the table below for available tiers. |
| `input` | string \| array | Required. Text to embed. Pass a single string or an array of strings for batch embedding. |
| `dimensions` | integer | Optional. Truncate output dimensions. Supported by `embed-large` (up to 3072) and `embed-titan` (256, 512, or 1024). |
Embedding tiers:
| Alias | Model | Dimensions | Price | Best for |
|---|---|---|---|---|
| `embed-small` | text-embedding-3-small | 1536 | $0.02 / 1M tokens | High-volume, cost-sensitive workloads |
| `embed-large` | text-embedding-3-large | up to 3072 | $0.13 / 1M tokens | Maximum retrieval accuracy |
| `embed-titan` | amazon.titan-embed-text-v2:0 | 256 / 512 / 1024 | $0.10 / 1M tokens | AWS-native workloads, flexible dimensions |
Example:
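A sketch of an embeddings request body. The tier alias comes from the table above; the input strings are placeholders:

```python
import json

# Batch embedding request; POST this to /v1/embeddings.
payload = {
    "model": "embed-small",                          # tier alias or exact model ID
    "input": ["first document", "second document"],  # array of strings for batching
}

body = json.dumps(payload)
```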
Response: Standard OpenAI embeddings object.
Most models route automatically through POST /v1/chat/completions.
Two models have different API surfaces and are excluded from auto-routing — they must be
pinned by name.
`gpt-5.1-codex-mini`: send a `prompt` string instead of a
`messages` array. The response shape is OpenAI's
`text_completion` object.
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. Must be a completions-type model ID (e.g. `gpt-5.1-codex-mini`). Chat models are rejected on this endpoint. |
| `prompt` | string | Required. Text prefix to complete. |
| `max_tokens` | integer | Maximum tokens to generate. |
| `temperature` | number | Sampling temperature, 0–2. |
| `stop` | string \| array | Stop sequences. |
Example:
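A sketch of a completions request body, using the parameters above (the prompt text is a placeholder):

```python
# Completions payload: a prompt string, not a messages array.
payload = {
    "model": "gpt-5.1-codex-mini",   # completions-type model, pinned by name
    "prompt": "def fibonacci(n):",   # text prefix to complete
    "max_tokens": 128,
    "temperature": 0.2,
}
```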
`gpt-5.3-codex`: the router handles format conversion (`messages` → Responses API input).
gpt-5.3-codex uses OpenAI’s
Responses API internally, which has a different request shape
to the chat completions API. The router converts your messages array
into the Responses API format automatically — but because this comes with limitations,
these models must be pinned explicitly and are never selected by auto-routing.
Limitations: `stream: true` is not supported (returns 400), and auto-routing
will never select these models — you must specify the model by name.
Example:
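A sketch of a pinned request for this model, using the standard chat-completions shape described above (message content is illustrative):

```python
# Pinned request for a Responses-API-backed model.
payload = {
    "model": "gpt-5.3-codex",   # must be pinned; auto-routing never selects it
    "messages": [
        {"role": "system", "content": "You are a terse coding assistant."},
        {"role": "user", "content": "Write a one-line hello world in Go."},
    ],
    # Note: "stream": True would return 400 for this model.
}
```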
The system message becomes OpenAI’s instructions field.
The response is a standard chat completion object.
Export request traces to your own observability platform. Model Router supports OTLP/HTTP — the OpenTelemetry standard — so you can use any compatible backend: Axiom, Grafana Cloud, Honeycomb, Datadog, and more.
Configure your OTLP endpoint and auth headers in your profile settings. Once enabled, every request generates a span with full routing metadata:
| Span attribute | Description |
|---|---|
| `model_router.request_id` | Unique request ID (matches `X-Request-Id` response header) |
| `model_router.provider` | Provider that served the request |
| `model_router.model` | Model that served the request |
| `model_router.tier` | Tier used for routing |
| `model_router.prefer` | Prefer value used |
| `model_router.prompt_tokens` | Input token count |
| `model_router.completion_tokens` | Output token count |
| `model_router.cost_cents` | Cost of the request in cents |
| `model_router.latency_ms` | Total request latency |
| `model_router.streaming` | Whether the request was streamed |
| `model_router.auto_score` | Auto-routing score (when using `auto`) |
| `model_router.failover_from` | Original provider if a failover occurred |
Telemetry export is fully async — it never adds latency to your API calls. If your OTLP endpoint is unreachable, requests proceed normally.
Use the X-Request-Id response header to correlate
any individual request with its trace in your observability platform.
Before routing, the router estimates your input token count and filters out any model whose context window is too small. You never get a “context length exceeded” error from the provider — the router handles it.
If a provider returns repeated errors, its circuit breaker opens and the router stops sending traffic to it. After a cooldown, one test request is allowed through. If it succeeds, the circuit closes and the provider is back in the pool.
This is automatic and invisible to clients. You get transparent failover across providers within a tier.
Rate limits are enforced per API key using a token bucket — tokens refill continuously rather than resetting at a hard window boundary, so bursts are handled smoothly.
The limit applied depends on your credit balance:
| Balance | Limit |
|---|---|
| ≥ $10.00 | 60 RPM |
| < $10.00 | 10 RPM |
Per-key overrides are available on request — contact support@api.lxg2it.com if you need a higher limit.
Every response includes rate limit headers so you can track consumption:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Your key's RPM limit |
| `X-RateLimit-Remaining` | Tokens remaining in the current bucket |
| `X-RateLimit-Reset` | Unix timestamp when the bucket is fully refilled |
| `Retry-After` | Seconds to wait before retrying (only on 429 responses) |
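A sketch of client-side backoff driven by these headers. The helper name and the fallback delays are illustrative, not part of the API:

```python
def backoff_seconds(status: int, headers: dict) -> float:
    """Decide how long to wait before the next request."""
    if status == 429:
        # Prefer the explicit Retry-After hint when present.
        return float(headers.get("Retry-After", 1))
    if int(headers.get("X-RateLimit-Remaining", 1)) == 0:
        return 1.0  # bucket drained; pause briefly while tokens refill
    return 0.0
```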
When rate-limited, the response is HTTP 429:

{
  "error": {
    "message": "Rate limit exceeded. Your key is limited to 10 requests per minute.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
Daily spend limits are a separate control. If your account has a daily spend cap
configured, requests made after hitting it return HTTP 429 with
code: "daily_spend_limit_exceeded" and reset at UTC midnight.
| Status | Meaning |
|---|---|
| 400 | Bad request — missing or invalid parameters. |
| 401 | Unauthorised — missing or invalid API key. |
| 402 | Insufficient credits. |
| 429 | Rate limited. |
| 502 | Provider error — upstream model returned an error. |
| 503 | No available model — all providers in the tier are down or context too large. |
Error responses follow the OpenAI format:
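For example, a 400 response might look like the following. The specific message and code values here are illustrative; only the field names are fixed by the format:

```json
{
  "error": {
    "message": "Missing required parameter: messages",
    "type": "invalid_request_error",
    "code": "invalid_request"
  }
}
```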