
API Reference

Base URL: https://api.lxg2it.com

Authentication

All API requests require a Bearer token. Include your API key in the Authorization header:

Authorization: Bearer mr_sk_...

Generate keys at /profile. Keys start with mr_sk_.
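As a minimal sketch, the required headers can be assembled like this in Python (the key value below is a placeholder, not a real key):

```python
# Build the auth headers for a Model Router request.
# "mr_sk_example123" is a placeholder -- real keys come from /profile.
API_KEY = "mr_sk_example123"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```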

Chat completions
POST /v1/chat/completions
Create a chat completion. OpenAI-compatible request and response format.

Request body:

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Required. Either a tier name (`economy`, `standard`, `premium`, `auto`) or an exact model ID to pin routing (e.g. `gpt-4.1`, `claude-sonnet-4-6`). See Tiers and Model pinning. |
| `messages` | array | Required. Array of message objects with `role` and `content`. Roles: `system`, `user`, `assistant`. |
| `prefer` | string | Optimisation direction within the tier: `cheap` (lowest cost), `fast` (lowest latency), `balanced` (default), `quality` (highest quality score), `coding` (strongest coding performance). See Prefer parameter. |
| `stream` | boolean | Stream response chunks via SSE. Default: `false`. |
| `temperature` | number | Sampling temperature (0–2). Passed through to the provider. |
| `max_tokens` | integer | Maximum tokens to generate. Passed through to the provider. |
| `top_p` | number | Nucleus sampling parameter. Passed through to the provider. |
| `stop` | string \| array | Stop sequence(s). Passed through to the provider. |

Response: Standard OpenAI chat completion object with id, choices, usage, etc. The model field in the response contains the actual model that served the request.

Response headers:

| Header | Description |
|--------|-------------|
| `X-Model-Router-Model` | The model that served the request. |
| `X-Model-Router-Provider` | The provider that served the request. |
| `X-Request-Id` | Unique request ID. Use this to correlate with telemetry traces. |
| `X-Model-Router-Auto-Score` | Auto-routing score (0–100). Only present when `model: "auto"`. |
| `X-Model-Router-Auto-Tier` | Tier selected by auto-routing. Only present when `model: "auto"`. |
Example response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4.1",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20 }
}
```

Tiers

Tiers group models by capability. The router selects the best model within the tier based on your prefer setting, provider availability, and context-window fit.

| Tier | Description | Example models |
|------|-------------|----------------|
| `economy` | Fast, cheap, good for simple tasks | GPT-4.1 Mini, Claude 3.5 Haiku, Gemini 2.0 Flash |
| `standard` | Balanced capability and cost | GPT-4.1, Claude Sonnet 4, Gemini 2.5 Pro |
| `premium` | Maximum capability, reasoning models | GPT-4.5, Claude Opus 3, o1 |
| `auto` | Heuristic classifier analyses your full conversation context to select the appropriate tier (see Auto-routing) | Varies by context |

See the live model list at /v1/models.

Auto-routing

Set model: "auto" to let the router infer the right tier from your full conversation context. Unlike single-message classifiers, auto-routing analyses the entire messages array — system prompt, conversation history, code blocks, tool calls, and reasoning markers — then produces a complexity score from 0–100 that maps to a tier.

```bash
# Auto-routing: let the router choose the tier
curl https://api.lxg2it.com/v1/chat/completions \
  -H "Authorization: Bearer $MR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      { "role": "system", "content": "You are a senior software architect." },
      { "role": "user", "content": "Design a distributed consensus algorithm for a financial ledger." }
    ]
  }'
```

Every auto-routed response includes two extra headers:

| Header | Description |
|--------|-------------|
| `X-Model-Router-Auto-Score` | Complexity score 0–100 computed from your request context |
| `X-Model-Router-Auto-Tier` | Tier selected by auto-routing (`economy` / `standard` / `premium`) |

The score is built from seven weighted signals:

| Signal | Weight | What it measures |
|--------|--------|------------------|
| Code blocks | 20% | Fenced code, inline code, and code-like lines across all messages |
| Technical keywords | 20% | Premium terms (consensus, compiler, theorem) and standard terms (API, database, function) |
| Reasoning markers | 15% | Phrases like "step by step", "trade-offs", "design a system", "prove that" |
| System prompt length | 15% | Longer system prompts indicate specialised agents |
| Conversation depth | 10% | Number of prior turns — accumulated context raises complexity |
| Tool usage | 10% | Presence of `tool_calls` and `tool` role messages |
| Message complexity | 10% | Maximum user message length |

The final score combines a weighted average with a strongest-signal boost (score = weighted_avg × 0.6 + max_signal × 0.4), so a single strong indicator is enough to push past a tier threshold even when other signals are zero. Score thresholds: 0–20 → economy, 21–55 → standard, 56–100 → premium. The economy ceiling is intentionally low — strong confidence is required before routing to cheaper models.
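As an illustration of how the blend works, here is a small Python sketch of the scoring arithmetic described above (the signal names mirror the table; the 0–100 signal values are hypothetical inputs, and this is not the router's actual implementation):

```python
# Weighted blend of auto-routing signals, per the published formula:
#   score = weighted_avg * 0.6 + max_signal * 0.4
WEIGHTS = {
    "code_blocks": 0.20,
    "technical_keywords": 0.20,
    "reasoning_markers": 0.15,
    "system_prompt_length": 0.15,
    "conversation_depth": 0.10,
    "tool_usage": 0.10,
    "message_complexity": 0.10,
}

def auto_score(signals):
    """signals maps each signal name to a 0-100 value."""
    weighted_avg = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    strongest = max(signals[name] for name in WEIGHTS)
    return weighted_avg * 0.6 + strongest * 0.4

def tier_for(score):
    # Thresholds: 0-20 economy, 21-55 standard, 56-100 premium.
    if score <= 20:
        return "economy"
    if score <= 55:
        return "standard"
    return "premium"

# A single maxed-out signal (heavy code content) with everything else
# at zero already lands in the standard tier: 20*0.6 + 100*0.4 = 52.
signals = {name: 0.0 for name in WEIGHTS}
signals["code_blocks"] = 100.0
```

This shows why one strong indicator is enough: the `max_signal × 0.4` term alone can contribute up to 40 points, regardless of the other signals.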

Auto-routing is deterministic: the same input always produces the same score and tier. It adds under 1 ms of overhead (no ML model, no embeddings, no external calls).

Auto-routing analysis runs entirely in-process. No request content is stored, logged, or used for training — only the derived numeric score and selected tier are recorded for observability.

Model pinning

Pass an exact model ID in the model field to bypass tier routing and target a specific model. The ID must match a model in our catalog (visible at /v1/models).

```json
// Pin to Claude Sonnet 4 specifically
{ "model": "claude-sonnet-4-6", "messages": [...] }
```

When pinning, the prefer parameter is ignored. If the pinned model’s provider is unavailable, the request fails rather than falling back to another model.

Prefer parameter

The prefer field controls how the router ranks models within the resolved tier. It does not change which tier is used.

| Value | Behaviour |
|-------|-----------|
| `cheap` | Lowest cost per token. |
| `fast` | Lowest latency (time to first token). |
| `balanced` | Default. Cheapest model, with ties broken by quality. |
| `quality` | Highest quality score, with ties broken by cost. |
| `coding` | Highest SWE-bench score. Routes to models with the strongest software engineering performance. |

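For example, to keep tier routing but rank candidates by coding performance (the `messages` array is elided, as in the pinning example):

```json
{ "model": "standard", "prefer": "coding", "messages": [...] }
```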
Tool calls

Tool calls work the same as the OpenAI API — pass a tools array and the router handles the format translation to each provider automatically. You never need to handle Anthropic’s tool_use blocks or Google’s functionCall parts; everything comes back in standard OpenAI format.

```json
// Tool call request (same across all tiers/providers)
{
  "model": "standard",
  "messages": [{ "role": "user", "content": "What is the weather in Sydney?" }],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}
```

The response contains a standard tool_calls array. Submit tool results back using role: "tool" messages as you normally would.
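As a sketch of that round trip in Python, here is how the follow-up request body can be assembled once a `tool_calls` response arrives (the tool call ID and the weather result are hypothetical values for illustration):

```python
import json

# Hypothetical assistant turn returned by the router, containing a tool call.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",  # hypothetical ID
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": '{"city": "Sydney"}',
        },
    }],
}

# Run the tool locally, then echo the result back as a role:"tool" message.
args = json.loads(assistant_message["tool_calls"][0]["function"]["arguments"])
tool_result = {"city": args["city"], "temp_c": 21}  # stand-in tool output

messages = [
    {"role": "user", "content": "What is the weather in Sydney?"},
    assistant_message,
    {
        "role": "tool",
        "tool_call_id": assistant_message["tool_calls"][0]["id"],
        "content": json.dumps(tool_result),
    },
]
```

This `messages` array is then sent back to `POST /v1/chat/completions` for the model to produce its final answer.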

Reasoning / thinking

Several models in the router are reasoning models: they think through a problem internally before writing their response. By default this thinking is hidden — you only see the final answer.

Set "include_reasoning": true to receive the thinking alongside the response. This works in both streaming and non-streaming modes, across all providers.

Economy tier reasoning models: grok-3-mini-beta, gemini-2.5-flash. Standard/premium: o4-mini, o3, gemini-2.5-pro, claude-opus-4-6 (extended thinking).

```json
// Non-streaming — reasoning_content in the message
{
  "model": "economy",
  "include_reasoning": true,
  "messages": [{ "role": "user", "content": "Is 17 prime?" }]
}

// Response
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Yes, 17 is prime.",
      "reasoning_content": "17 is only divisible by 1 and itself..."
    }
  }]
}
```

For streaming, reasoning_content arrives as delta chunks before the regular content chunks. Filter by which field is present to separate them:

```javascript
// Streaming — two chunk types, reasoning arrives first
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;
  if (delta?.reasoning_content) {
    process.stdout.write(`[thinking] ${delta.reasoning_content}`);
  } else if (delta?.content) {
    process.stdout.write(delta.content);
  }
}
```

Note: include_reasoning increases token usage and latency. For models billed by output tokens, thinking tokens count toward your usage.

Embeddings
POST /v1/embeddings
Generate vector embeddings. OpenAI-compatible request and response format.

Use the same API key and base URL as chat completions. Billed at input tokens only — there are no output tokens for embeddings.

Request body:

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Required. An embedding tier alias or exact model ID. See the table below for available tiers. |
| `input` | string \| array | Required. Text to embed. Pass a single string or an array of strings for batch embedding. |
| `dimensions` | integer | Optional. Truncate output dimensions. Supported by `embed-large` (up to 3072) and `embed-titan` (256, 512, or 1024). |

Embedding tiers:

| Alias | Model | Dimensions | Price | Best for |
|-------|-------|------------|-------|----------|
| `embed-small` | text-embedding-3-small | 1536 | $0.02 / 1M tokens | High-volume, cost-sensitive workloads |
| `embed-large` | text-embedding-3-large | up to 3072 | $0.13 / 1M tokens | Maximum retrieval accuracy |
| `embed-titan` | amazon.titan-embed-text-v2:0 | 256 / 512 / 1024 | $0.10 / 1M tokens | AWS-native workloads, flexible dimensions |

Example:

```bash
curl https://api.lxg2it.com/v1/embeddings \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embed-small",
    "input": ["The quick brown fox", "jumps over the lazy dog"]
  }'
```

Response: Standard OpenAI embeddings object.

```json
// Example response
{
  "object": "list",
  "model": "text-embedding-3-small",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0023, -0.0141, ...]
  }],
  "usage": { "prompt_tokens": 9, "total_tokens": 9 }
}
```

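A common next step with the returned vectors is similarity scoring. A minimal cosine-similarity helper in Python (pure stdlib; the two short vectors below are made-up stand-ins for real embedding output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two data[i]["embedding"] arrays.
v1 = [0.0023, -0.0141, 0.031]
v2 = [0.0019, -0.0150, 0.028]

similarity = cosine_similarity(v1, v2)  # close to 1.0 for similar texts
```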
Specialist models

Most models route automatically through POST /v1/chat/completions. Two models have different API surfaces and are excluded from auto-routing — they must be pinned by name.

POST /v1/completions
Legacy text-completion endpoint for models that complete a prompt rather than a conversation. Currently: gpt-5.1-codex-mini.

Send a prompt string instead of a messages array. The response shape is OpenAI's text_completion object.

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Required. Must be a completions-type model ID (e.g. `gpt-5.1-codex-mini`). Chat models are rejected on this endpoint. |
| `prompt` | string | Required. Text prefix to complete. |
| `max_tokens` | integer | Maximum tokens to generate. |
| `temperature` | number | Sampling temperature, 0–2. |
| `stop` | string \| array | Stop sequences. |

Example:

```bash
curl https://api.lxg2it.com/v1/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.1-codex-mini",
    "prompt": "def fibonacci(n):",
    "max_tokens": 256,
    "temperature": 0
  }'
```

```json
// Response
{
  "object": "text_completion",
  "model": "gpt-5.1-codex-mini",
  "choices": [{
    "index": 0,
    "text": "\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)",
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 8, "completion_tokens": 42, "total_tokens": 50 }
}
```

POST /v1/chat/completions
Access Responses API models by pinning them explicitly with model: "gpt-5.3-codex". The router handles format conversion (messages → Responses API input).

gpt-5.3-codex uses OpenAI’s Responses API internally, which has a different request shape to the chat completions API. The router converts your messages array into the Responses API format automatically — but because this comes with limitations, these models must be pinned explicitly and are never selected by auto-routing.

Limitations: stream: true is not supported (returns 400). Auto-routing will not select these models — you must specify the model by name.

Example:

```bash
curl https://api.lxg2it.com/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.3-codex",
    "messages": [
      { "role": "system", "content": "You are an expert software engineer." },
      { "role": "user", "content": "Implement a binary search tree in Python." }
    ]
  }'
```

The system message becomes OpenAI’s instructions field. The response is a standard chat completion object.

Observability

Export request traces to your own observability platform. Model Router supports OTLP/HTTP — the OpenTelemetry standard — so you can use any compatible backend: Axiom, Grafana Cloud, Honeycomb, Datadog, and more.

Configure your OTLP endpoint and auth headers in your profile settings. Once enabled, every request generates a span with full routing metadata:

| Span attribute | Description |
|----------------|-------------|
| `model_router.request_id` | Unique request ID (matches `X-Request-Id` response header) |
| `model_router.provider` | Provider that served the request |
| `model_router.model` | Model that served the request |
| `model_router.tier` | Tier used for routing |
| `model_router.prefer` | Prefer value used |
| `model_router.prompt_tokens` | Input token count |
| `model_router.completion_tokens` | Output token count |
| `model_router.cost_cents` | Cost of the request in cents |
| `model_router.latency_ms` | Total request latency |
| `model_router.streaming` | Whether the request was streamed |
| `model_router.auto_score` | Auto-routing score (when using `auto`) |
| `model_router.failover_from` | Original provider if a failover occurred |

Telemetry export is fully async — it never adds latency to your API calls. If your OTLP endpoint is unreachable, requests proceed normally.

Use the X-Request-Id response header to correlate any individual request with its trace in your observability platform.

Other endpoints
GET /v1/models
List all available models, tiers, and pricing. Public — no auth required.
GET /v1/account/credits
Check your current credit balance. Requires session auth.
GET /v1/account/usage
Usage history for the last 30 days, broken down by day and model. Requires session auth.
GET /health
Server health check. Returns provider status and open circuit breakers. No auth required.
Context-window guard

Before routing, the router estimates your input token count and filters out any model whose context window is too small. You never get a “context length exceeded” error from the provider — the router handles it.
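The guard is server-side, but a rough client-side estimate can still help anticipate which models will fit. A common heuristic is about 4 characters per token for English text (an approximation only; this is not the router's actual tokeniser):

```python
def rough_token_estimate(messages):
    """Very rough input-token estimate: ~4 characters per token.

    A client-side approximation only; real tokenisers (and the
    router's own estimate) will differ.
    """
    total_chars = sum(len(m.get("content") or "") for m in messages)
    return total_chars // 4

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},  # 28 chars
    {"role": "user", "content": "Summarise this document."},        # 24 chars
]
```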

Circuit breaker

If a provider returns repeated errors, its circuit breaker opens and the router stops sending traffic to it. After a cooldown, one test request is allowed through. If it succeeds, the circuit closes and the provider is back in the pool.

This is automatic and invisible to clients. You get transparent failover across providers within a tier.

Rate limits

Rate limits are enforced per API key using a token bucket — tokens refill continuously rather than resetting at a hard window boundary, so bursts are handled smoothly.
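To illustrate the mechanics, here is a simplified Python sketch of a token bucket (not the service's actual implementation): a 60 RPM key refills one token per second and can burst up to the bucket capacity.

```python
class TokenBucket:
    """Simplified token bucket: capacity = RPM, continuous refill."""

    def __init__(self, rpm):
        self.capacity = float(rpm)
        self.tokens = float(rpm)          # start full: a burst is allowed
        self.refill_per_sec = rpm / 60.0  # continuous refill, no hard window
        self.last = 0.0

    def allow(self, now):
        """Return True if a request at time `now` (seconds) is admitted."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rpm=60)
burst = [bucket.allow(0.0) for _ in range(61)]
# The first 60 requests in the burst pass; the 61st is rejected until
# roughly one second later, when one token has refilled.
```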

The limit applied depends on your credit balance:

| Balance | Limit |
|---------|-------|
| ≥ $10.00 | 60 RPM |
| < $10.00 | 10 RPM |

Per-key overrides are available on request — contact support@api.lxg2it.com if you need a higher limit.

Every response includes rate limit headers so you can track consumption:

| Header | Description |
|--------|-------------|
| `X-RateLimit-Limit` | Your key's RPM limit |
| `X-RateLimit-Remaining` | Tokens remaining in the current window |
| `X-RateLimit-Reset` | Unix timestamp when the bucket is fully refilled |
| `Retry-After` | Seconds to wait before retrying (only on 429 responses) |

When rate-limited, the response is HTTP 429:

```json
{
  "error": {
    "message": "Rate limit exceeded. Your key is limited to 10 requests per minute.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```
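A client-side sketch of honouring these responses (a hypothetical helper, assuming the response headers are available as a plain dict):

```python
def backoff_seconds(status, headers, attempt, base=1.0, cap=30.0):
    """How long to wait before retrying.

    Prefer the server's Retry-After header on 429s; otherwise fall
    back to capped exponential backoff. `attempt` starts at 0.
    """
    if status == 429 and "Retry-After" in headers:
        return float(headers["Retry-After"])
    return min(cap, base * (2 ** attempt))
```

For example, a 429 with `Retry-After: 7` yields a 7-second wait, while a 502 on the third retry yields the exponential fallback.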

Daily spend limits are a separate control. If your account has a daily spend cap configured, requests made after hitting it return HTTP 429 with code: "daily_spend_limit_exceeded" and reset at UTC midnight.

Error codes
| Status | Meaning |
|--------|---------|
| 400 | Bad request — missing or invalid parameters. |
| 401 | Unauthorised — missing or invalid API key. |
| 402 | Insufficient credits. |
| 429 | Rate limited. |
| 502 | Provider error — upstream model returned an error. |
| 503 | No available model — all providers in the tier are down or context too large. |

Error responses follow the OpenAI format:

```json
{
  "error": {
    "message": "Insufficient credits",
    "type": "billing_error",
    "code": "insufficient_credits"
  }
}
```