Base URL: https://api.lxg2it.com
All API requests require a Bearer token. Include your API key in the
`Authorization` header. Generate keys at /profile; keys start with `mr_sk_`.
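A minimal sketch of the header construction in Python. The key shown is a placeholder; real keys come from /profile:

```python
# Build the auth headers for any Model Router request.
# "mr_sk_example" is a placeholder, not a real key.
API_KEY = "mr_sk_example"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# Use these headers with any HTTP client, e.g. against
# https://api.lxg2it.com/v1/chat/completions.
```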
Request body:
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. Either a tier name (`economy`, `standard`, `premium`, `auto`) or an exact model ID to pin routing (e.g. `gpt-4.1`, `claude-sonnet-4-6`). See Tiers and Model pinning. |
| `messages` | array | Required. Array of message objects with `role` and `content`. Roles: `system`, `user`, `assistant`. |
| `prefer` | string | Optimisation direction within the tier: `cheap` (lowest cost), `fast` (lowest latency), `balanced` (default), `quality` (highest quality score). |
| `stream` | boolean | Stream response chunks via SSE. Default: `false`. |
| `temperature` | number | Sampling temperature (0–2). Passed through to the provider. |
| `max_tokens` | integer | Maximum tokens to generate. Passed through to the provider. |
| `top_p` | number | Nucleus sampling parameter. Passed through to the provider. |
| `stop` | string \| array | Stop sequence(s). Passed through to the provider. |
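For instance, a request body using the parameters above might be built like this. The tier and `prefer` values come from the tables in this document; the message content is illustrative:

```python
import json

# A minimal chat-completions payload; parameter names follow the table above.
payload = {
    "model": "standard",   # tier name, or an exact model ID to pin
    "prefer": "cheap",     # rank models within the tier by cost
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this paragraph."},
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)  # POST this to /v1/chat/completions
```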
Response: Standard OpenAI chat completion object with `id`, `choices`, `usage`, etc.
The `model` field in the response contains the actual model that served the request.
Response headers:
| Header | Description |
|---|---|
| `X-Model-Router-Model` | The model that served the request. |
| `X-Model-Router-Provider` | The provider that served the request. |
| `X-Request-Id` | Unique request ID. Use this to correlate with telemetry traces. |
| `X-Model-Router-Auto-Score` | Auto-routing score (0–100). Only present when `model: "auto"`. |
| `X-Model-Router-Auto-Tier` | Tier selected by auto-routing. Only present when `model: "auto"`. |
Tiers group models by capability. The router selects the best model within
the tier based on your prefer setting, provider
availability, and context-window fit.
| Tier | Description | Example models |
|---|---|---|
| `economy` | Fast, cheap, good for simple tasks | GPT-4.1 Mini, Claude 3.5 Haiku, Gemini 2.0 Flash |
| `standard` | Balanced capability and cost | GPT-4.1, Claude Sonnet 4, Gemini 2.5 Pro |
| `premium` | Maximum capability, reasoning models | GPT-4.5, Claude Opus 3, o1 |
| `auto` | Heuristic classifier analyses your full conversation context to select the appropriate tier (see Auto-routing) | Varies by context |
See the live model list at /v1/models.
Set `model: "auto"` to let the router infer the right tier from your
full conversation context. Unlike single-message classifiers, auto-routing analyses the entire
`messages` array — system prompt, conversation history, code blocks,
tool calls, and reasoning markers — then produces a complexity score from 0–100 that maps to a tier.
Every auto-routed response includes two extra headers:
| Header | Description |
|---|---|
| `X-Model-Router-Auto-Score` | Complexity score 0–100 computed from your request context |
| `X-Model-Router-Auto-Tier` | Tier selected by auto-routing (`economy` / `standard` / `premium`) |
The score is built from seven weighted signals:
| Signal | Weight | What it measures |
|---|---|---|
| Code blocks | 20% | Fenced code, inline code, and code-like lines across all messages |
| Technical keywords | 20% | Premium terms (consensus, compiler, theorem) and standard terms (API, database, function) |
| Reasoning markers | 15% | Phrases like "step by step", "trade-offs", "design a system", "prove that" |
| System prompt length | 15% | Longer system prompts indicate specialised agents |
| Conversation depth | 10% | Number of prior turns — accumulated context raises complexity |
| Tool usage | 10% | Presence of tool_calls and tool role messages |
| Message complexity | 10% | Maximum user message length |
The final score combines a weighted average with a strongest-signal boost
(score = weighted_avg × 0.6 + max_signal × 0.4), so a single strong
indicator is enough to push past a tier threshold even when other signals are zero.
Score thresholds: 0–20 → economy, 21–55 → standard, 56–100 → premium.
The economy ceiling is intentionally low — strong confidence is required before
routing to cheaper models.
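The scoring arithmetic above can be sketched as follows. The signal key names and the example values are illustrative; the weights, the 0.6/0.4 blend, and the tier thresholds are the ones documented above:

```python
def auto_score(signals: dict, weights: dict) -> float:
    """Blend a weighted average with a strongest-signal boost."""
    weighted_avg = sum(signals[k] * weights[k] for k in weights)
    max_signal = max(signals.values())
    return weighted_avg * 0.6 + max_signal * 0.4

def tier_for(score: float) -> str:
    """Map a 0-100 score to a tier using the documented thresholds."""
    if score <= 20:
        return "economy"
    if score <= 55:
        return "standard"
    return "premium"

# Weights from the signal table (illustrative key names).
weights = {
    "code_blocks": 0.20, "keywords": 0.20, "reasoning": 0.15,
    "system_prompt": 0.15, "depth": 0.10, "tools": 0.10, "message": 0.10,
}

# One strong signal, everything else zero: the max-signal boost alone
# (0.6 * 18 + 0.4 * 90 = 46.8) is enough to reach the standard tier.
signals = {k: 0.0 for k in weights}
signals["code_blocks"] = 90.0
score = auto_score(signals, weights)
```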
Auto-routing is deterministic: the same input always produces the same score and tier. It adds under 1 ms of overhead (no ML model, no embeddings, no external calls).
Auto-routing analysis runs entirely in-process. No request content is stored, logged, or used for training — only the derived numeric score and selected tier are recorded for observability.
Pass an exact model ID in the model field to bypass
tier routing and target a specific model. The ID must match a model in our
catalog (visible at /v1/models).
When pinning, the prefer parameter is ignored.
If the pinned model’s provider is unavailable, the request fails rather
than falling back to another model.
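A small sketch of the tier-vs-pin distinction, assuming the four tier aliases above are the only reserved names:

```python
# Tier aliases from the Tiers table; any other model value pins routing.
TIERS = {"economy", "standard", "premium", "auto"}

def is_pinned(model: str) -> bool:
    """A request is pinned when model is not one of the tier aliases."""
    return model not in TIERS
```

For a pinned request, `prefer` is ignored and there is no fallback; for a tier request, the router chooses within the tier.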
The prefer field controls how the router ranks
models within the resolved tier. It does not change which tier is used.
| Value | Behaviour |
|---|---|
| `cheap` | Lowest cost per token. |
| `fast` | Lowest latency (time to first token). |
| `balanced` | Default. Cheapest first, breaking ties by quality. |
| `quality` | Highest quality score, breaking ties by cost. |
| `coding` | Highest SWE-bench score. Routes to models with the strongest software engineering performance. |
Tool calls work the same as the OpenAI API — pass a tools array
and the router handles the format translation to each provider automatically.
You never need to handle Anthropic’s tool_use blocks or
Google’s functionCall parts; everything comes back in
standard OpenAI format.
The response contains a standard tool_calls array. Submit tool results
back using role: "tool" messages as you normally would.
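A sketch of the round trip in OpenAI format. The `get_weather` function, its schema, and the call id are all made up for illustration:

```python
# Tool definition passed in the request's tools array (illustrative schema).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Suppose the assistant responded with a tool call:
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
    }],
}

# Submit the result back with a role:"tool" message referencing the call id.
tool_result = {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 4}'}
```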
Several models in the router are reasoning models: they think through a problem internally before writing their response. By default this thinking is hidden — you only see the final answer.
Set "include_reasoning": true to receive the thinking alongside
the response. This works in both streaming and non-streaming modes, across all providers.
Economy tier reasoning models: grok-3-mini-beta, gemini-2.5-flash. Standard/premium: o4-mini, o3, gemini-2.5-pro, claude-opus-4-6 (extended thinking).
For streaming, reasoning_content arrives as delta chunks
before the regular content chunks. Filter by which field
is present to separate them:
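A minimal sketch of that filtering, using hand-written mock chunks in place of a real SSE stream:

```python
# Mock delta chunks: each carries either reasoning_content or content.
chunks = [
    {"choices": [{"delta": {"reasoning_content": "First, parse the input. "}}]},
    {"choices": [{"delta": {"reasoning_content": "Then check edge cases. "}}]},
    {"choices": [{"delta": {"content": "Here is the answer."}}]},
]

reasoning, answer = [], []
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    if "reasoning_content" in delta:
        reasoning.append(delta["reasoning_content"])
    elif "content" in delta:
        answer.append(delta["content"])

thinking = "".join(reasoning)
final = "".join(answer)
```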
Note: include_reasoning increases token usage and latency.
For models billed by output tokens, thinking tokens count toward your usage.
Use the same API key and base URL as chat completions. Billed at input tokens only — there are no output tokens for embeddings.
Request body:
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. An embedding tier alias or exact model ID. See the table below for available tiers. |
| `input` | string \| array | Required. Text to embed. Pass a single string or an array of strings for batch embedding. |
| `dimensions` | integer | Optional. Truncate output dimensions. Supported by `embed-large` (up to 3072) and `embed-titan` (256, 512, or 1024). |
Embedding tiers:
| Alias | Model | Dimensions | Price | Best for |
|---|---|---|---|---|
| `embed-small` | text-embedding-3-small | 1536 | $0.02 / 1M tokens | High-volume, cost-sensitive workloads |
| `embed-large` | text-embedding-3-large | up to 3072 | $0.13 / 1M tokens | Maximum retrieval accuracy |
| `embed-titan` | amazon.titan-embed-text-v2:0 | 256 / 512 / 1024 | $0.10 / 1M tokens | AWS-native workloads, flexible dimensions |
Example:
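A sketch of an embeddings request body. The tier alias comes from the table above; the input strings are placeholders:

```python
import json

# Batch embedding request; POST this to /v1/embeddings.
payload = {
    "model": "embed-small",                          # tier alias or exact model ID
    "input": ["first document", "second document"],  # array of strings for batching
}

body = json.dumps(payload)
```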
Response: Standard OpenAI embeddings object.
Most models route automatically through POST /v1/chat/completions.
Two models have different API surfaces and are excluded from auto-routing — they must be
pinned by name.
`gpt-5.1-codex-mini`: send a `prompt` string instead of a
`messages` array. The response shape is OpenAI's
`text_completion` object.
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. Must be a completions-type model ID (e.g. `gpt-5.1-codex-mini`). Chat models are rejected on this endpoint. |
| `prompt` | string | Required. Text prefix to complete. |
| `max_tokens` | integer | Maximum tokens to generate. |
| `temperature` | number | Sampling temperature, 0–2. |
| `stop` | string \| array | Stop sequences. |
Example:
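A sketch of a completions request body, using the parameters above (the prompt text is a placeholder):

```python
# Completions payload: a prompt string, not a messages array.
payload = {
    "model": "gpt-5.1-codex-mini",   # completions-type model, pinned by name
    "prompt": "def fibonacci(n):",   # text prefix to complete
    "max_tokens": 128,
    "temperature": 0.2,
}
```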
`gpt-5.3-codex`: the router handles format conversion (`messages` → Responses API input).
gpt-5.3-codex uses OpenAI’s
Responses API internally, which has a different request shape
to the chat completions API. The router converts your messages array
into the Responses API format automatically — but because this comes with limitations,
these models must be pinned explicitly and are never selected by auto-routing.
Limitations: `stream: true` is not supported (returns 400), and auto-routing
will never select these models — you must specify the model by name.
Example:
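A sketch of a pinned request for this model, using the standard chat-completions shape described above (message content is illustrative):

```python
# Pinned request for a Responses-API-backed model.
payload = {
    "model": "gpt-5.3-codex",   # must be pinned; auto-routing never selects it
    "messages": [
        {"role": "system", "content": "You are a terse coding assistant."},
        {"role": "user", "content": "Write a one-line hello world in Go."},
    ],
    # Note: "stream": True would return 400 for this model.
}
```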
The system message becomes OpenAI’s instructions field.
The response is a standard chat completion object.
Export request traces to your own observability platform. Model Router supports OTLP/HTTP — the OpenTelemetry standard — so you can use any compatible backend: Axiom, Grafana Cloud, Honeycomb, Datadog, and more.
Configure your OTLP endpoint and auth headers in your profile settings. Once enabled, every request generates a span with full routing metadata:
| Span attribute | Description |
|---|---|
| `model_router.request_id` | Unique request ID (matches `X-Request-Id` response header) |
| `model_router.provider` | Provider that served the request |
| `model_router.model` | Model that served the request |
| `model_router.tier` | Tier used for routing |
| `model_router.prefer` | Prefer value used |
| `model_router.prompt_tokens` | Input token count |
| `model_router.completion_tokens` | Output token count |
| `model_router.cost_cents` | Cost of the request in cents |
| `model_router.latency_ms` | Total request latency |
| `model_router.streaming` | Whether the request was streamed |
| `model_router.auto_score` | Auto-routing score (when using `auto`) |
| `model_router.failover_from` | Original provider if a failover occurred |
Telemetry export is fully async — it never adds latency to your API calls. If your OTLP endpoint is unreachable, requests proceed normally.
Use the X-Request-Id response header to correlate
any individual request with its trace in your observability platform.
Before routing, the router estimates your input token count and filters out any model whose context window is too small. You never get a “context length exceeded” error from the provider — the router handles it.
If a provider returns repeated errors, its circuit breaker opens and the router stops sending traffic to it. After a cooldown, one test request is allowed through. If it succeeds, the circuit closes and the provider is back in the pool.
This is automatic and invisible to clients. You get transparent failover across providers within a tier.
Rate limits are enforced per API key using a token bucket — tokens refill continuously rather than resetting at a hard window boundary, so bursts are handled smoothly.
The limit applied depends on your credit balance:
| Balance | Limit |
|---|---|
| ≥ $10.00 | 60 RPM |
| < $10.00 | 10 RPM |
Per-key overrides are available on request — contact support@api.lxg2it.com if you need a higher limit.
Every response includes rate limit headers so you can track consumption:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Your key's RPM limit |
| `X-RateLimit-Remaining` | Tokens remaining in the current bucket |
| `X-RateLimit-Reset` | Unix timestamp when the bucket is fully refilled |
| `Retry-After` | Seconds to wait before retrying (only on 429 responses) |
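A sketch of client-side backoff driven by these headers. The helper name and the fallback delays are illustrative, not part of the API:

```python
def backoff_seconds(status: int, headers: dict) -> float:
    """Decide how long to wait before the next request."""
    if status == 429:
        # Prefer the explicit Retry-After hint when present.
        return float(headers.get("Retry-After", 1))
    if int(headers.get("X-RateLimit-Remaining", 1)) == 0:
        return 1.0  # bucket drained; pause briefly while tokens refill
    return 0.0
```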
When rate-limited, the response is HTTP 429:

{
  "error": {
    "message": "Rate limit exceeded. Your key is limited to 10 requests per minute.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
Daily spend limits are a separate control. If your account has a daily spend cap
configured, requests made after hitting it return HTTP 429 with
code: "daily_spend_limit_exceeded" and reset at UTC midnight.
| Status | Meaning |
|---|---|
| 400 | Bad request — missing or invalid parameters. |
| 401 | Unauthorised — missing or invalid API key. |
| 402 | Insufficient credits. |
| 429 | Rate limited. |
| 502 | Provider error — upstream model returned an error. |
| 503 | No available model — all providers in the tier are down or context too large. |
Error responses follow the OpenAI format:
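For example, a 400 response might look like the following. The specific message and code values here are illustrative; only the field names are fixed by the format:

```json
{
  "error": {
    "message": "Missing required parameter: messages",
    "type": "invalid_request_error",
    "code": "invalid_request"
  }
}
```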