LLM Gateway Overview

Supported regions: US & EU

Overview

AssemblyAI’s LLM Gateway is a unified interface to models from multiple LLM providers, including Anthropic Claude, OpenAI GPT, Google Gemini, and more. You can use the LLM Gateway to build AI applications on any of these models through a single API.

Region	Endpoint URL
US (default)	`https://llm-gateway.assemblyai.com/v1/chat/completions`
EU	`https://llm-gateway.eu.assemblyai.com/v1/chat/completions`

The LLM Gateway is available in both the US and EU regions. Use the EU endpoint to keep your data within the European Union. Anthropic Claude models and most Google Gemini models are currently supported in the EU; exceptions are noted in the model tables below. OpenAI models are available only in the US region. See Cloud Endpoints and Data Residency for more details.
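Because the region is chosen by URL, it can help to resolve the endpoint in one place and reject region/model combinations this page rules out. The following is a minimal sketch, not an official SDK: `endpoint_for` and `US_ONLY_MODELS` are hypothetical names, and the set only covers models this page explicitly marks as US-only (the OpenAI models, plus Gemini 3.1 Flash Lite Preview as noted in the Google Gemini table). Confirm other models against Cloud Endpoints and Data Residency.

```python
# Hypothetical helper: map a region to its LLM Gateway endpoint and reject
# models that this page lists as US-only. Not part of any official SDK.

ENDPOINTS = {
    "us": "https://llm-gateway.assemblyai.com/v1/chat/completions",
    "eu": "https://llm-gateway.eu.assemblyai.com/v1/chat/completions",
}

# OpenAI models are US-only; Gemini 3.1 Flash Lite Preview is also noted as US-only.
US_ONLY_MODELS = {
    "gpt-5.5", "gpt-5.2", "gpt-5.1", "gpt-5", "gpt-5-mini", "gpt-5-nano",
    "gpt-4.1", "gpt-oss-120b", "gpt-oss-20b",
    "gemini-3.1-flash-lite-preview",
}


def endpoint_for(model: str, region: str = "us") -> str:
    """Return the chat completions URL for `region`, or raise if `model` is US-only."""
    region = region.lower()
    if region not in ENDPOINTS:
        raise ValueError(f"Unknown region {region!r}; expected one of {sorted(ENDPOINTS)}")
    if region == "eu" and model in US_ONLY_MODELS:
        raise ValueError(f"{model} is only available in the US region")
    return ENDPOINTS[region]
```

Failing early on an EU request for a US-only model keeps the data-residency decision explicit in your code rather than discovering it from an error response.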

The LLM Gateway provides access to 25+ models across major AI providers with support for:

Basic Chat Completions - Simple request/response interactions
Streamed Responses - Stream output as it’s generated (OpenAI models)
Multi-turn Conversations - Maintain context across multiple exchanges
Structured Outputs - Constrain responses to a specific JSON schema
Tool/Function Calling - Enable models to execute custom functions
Agentic Workflows - Multi-step reasoning with automatic tool chaining
Unified Interface - One API for Claude, GPT, Gemini, Qwen, Kimi, and more
Post-processing - Automatically repair malformed JSON in model responses
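Multi-turn conversations work by resending earlier turns in the `messages` array. The sketch below assumes the gateway accepts earlier model replies as `"assistant"` messages alongside `"user"` messages, following the role/content shape used in the Select a model example; the Multi-turn Conversations guide is the authoritative reference. `build_payload` and `record_reply` are hypothetical helper names.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_payload(model: str, history: List[Message], user_text: str, max_tokens: int = 1000) -> dict:
    """Return a request body that replays `history` followed by a new user turn."""
    messages = list(history)  # copy so the caller's history is not mutated
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def record_reply(history: List[Message], user_text: str, reply_text: str) -> List[Message]:
    """Return a new history that includes the user turn and the model's reply."""
    return history + [
        {"role": "user", "content": user_text},
        # Assumption: prior model output is sent back with the "assistant" role
        {"role": "assistant", "content": reply_text},
    ]
```

For example, after the model answers "What is the capital of France?", call `record_reply` with that question and the answer, then `build_payload(model, history, "What is its population?")` so the follow-up question is sent with the context it refers to.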

Available models

By quality (LMArena Score)

Model	Provider	Parameter	LMArena Score	Latency per 10,000 tokens
Claude Opus 4.6	Anthropic	`claude-opus-4-6`	1498	7.4s
Claude Opus 4.7	Anthropic	`claude-opus-4-7`	1491	TBD
Gemini 3.5 Flash	Google	`gemini-3.5-flash`	1480	TBD
GPT-5.5	OpenAI	`gpt-5.5`	1475	TBD
Gemini 3 Flash Preview	Google	`gemini-3-flash-preview`	1474	4.2s
Claude Opus 4.5	Anthropic	`claude-opus-4-5-20251101`	1468	3.9s
Claude Sonnet 4.6	Anthropic	`claude-sonnet-4-6`	1466	7.2s
Claude 4.5 Sonnet	Anthropic	`claude-sonnet-4-5-20250929`	1453	5.6s
Gemini 2.5 Pro	Google	`gemini-2.5-pro`	1448	4.0s
GPT-5.1	OpenAI	`gpt-5.1`	1439	2.7s
Gemini 3.1 Flash Lite Preview	Google	`gemini-3.1-flash-lite-preview`	1438	TBD
GPT-5.2	OpenAI	`gpt-5.2`	1437	1.6s
GPT-5	OpenAI	`gpt-5`	1434	4.3s
Kimi K2.5	Moonshot AI	`kimi-k2.5`	1432	1.2s
GPT-4.1	OpenAI	`gpt-4.1`	1413	1.8s
Claude 4 Opus	Anthropic	`claude-opus-4-20250514`	1412	13.6s
Gemini 2.5 Flash	Google	`gemini-2.5-flash`	1411	2.6s
Claude 4.5 Haiku	Anthropic	`claude-haiku-4-5-20251001`	1409	4.1s
Qwen3 Next 80B A3B	Alibaba Cloud	`qwen3-next-80b-a3b`	1402	3.1s
GPT-5 mini	OpenAI	`gpt-5-mini`	1390	3.8s
Claude 4 Sonnet	Anthropic	`claude-sonnet-4-20250514`	1389	5.1s
Gemini 2.5 Flash-Lite	Google	`gemini-2.5-flash-lite`	1380	1.1s
gpt-oss-120b	OpenAI	`gpt-oss-120b`	1353	1.4s
Qwen3 32B	Alibaba Cloud	`qwen3-32B`	1347	3.7s
GPT-5 nano	OpenAI	`gpt-5-nano`	1337	3.2s
gpt-oss-20b	OpenAI	`gpt-oss-20b`	1317	1.1s

By latency (per 10,000 tokens)

Model	Provider	Parameter	Latency per 10,000 tokens	LMArena Score
Gemini 2.5 Flash-Lite	Google	`gemini-2.5-flash-lite`	1.1s	1380
gpt-oss-20b	OpenAI	`gpt-oss-20b`	1.1s	1317
Kimi K2.5	Moonshot AI	`kimi-k2.5`	1.2s	1432
gpt-oss-120b	OpenAI	`gpt-oss-120b`	1.4s	1353
GPT-5.2	OpenAI	`gpt-5.2`	1.6s	1437
GPT-4.1	OpenAI	`gpt-4.1`	1.8s	1413
Gemini 2.5 Flash	Google	`gemini-2.5-flash`	2.6s	1411
GPT-5.1	OpenAI	`gpt-5.1`	2.7s	1439
Qwen3 Next 80B A3B	Alibaba Cloud	`qwen3-next-80b-a3b`	3.1s	1402
GPT-5 nano	OpenAI	`gpt-5-nano`	3.2s	1337
Qwen3 32B	Alibaba Cloud	`qwen3-32B`	3.7s	1347
GPT-5 mini	OpenAI	`gpt-5-mini`	3.8s	1390
Claude Opus 4.5	Anthropic	`claude-opus-4-5-20251101`	3.9s	1468
Gemini 2.5 Pro	Google	`gemini-2.5-pro`	4.0s	1448
Claude 4.5 Haiku	Anthropic	`claude-haiku-4-5-20251001`	4.1s	1409
Gemini 3 Flash Preview	Google	`gemini-3-flash-preview`	4.2s	1474
GPT-5	OpenAI	`gpt-5`	4.3s	1434
Claude 4 Sonnet	Anthropic	`claude-sonnet-4-20250514`	5.1s	1389
Claude 4.5 Sonnet	Anthropic	`claude-sonnet-4-5-20250929`	5.6s	1453
Claude Sonnet 4.6	Anthropic	`claude-sonnet-4-6`	7.2s	1466
Claude Opus 4.6	Anthropic	`claude-opus-4-6`	7.4s	1498
Claude 4 Opus	Anthropic	`claude-opus-4-20250514`	13.6s	1412
Claude Opus 4.7	Anthropic	`claude-opus-4-7`	TBD	1491
GPT-5.5	OpenAI	`gpt-5.5`	TBD	1475
Gemini 3.1 Flash Lite Preview	Google	`gemini-3.1-flash-lite-preview`	TBD	1438
Gemini 3.5 Flash	Google	`gemini-3.5-flash`	TBD	1480
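The two tables above trade quality against latency. One way to read them programmatically is to pick the highest LMArena score whose listed latency fits a budget. This is a sketch of that heuristic, not an official recommendation: the numbers are a snapshot copied from the tables (TBD rows omitted) and will drift, latency is the table's seconds per 10,000 tokens rather than your end-to-end time, and `ModelStats` and `pick_model` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class ModelStats(NamedTuple):
    parameter: str
    lmarena_score: int
    latency_s: float  # seconds per 10,000 tokens, as listed in the tables


# Snapshot copied from the tables above; rows with TBD latency are omitted.
MODELS = [
    ModelStats("claude-opus-4-6", 1498, 7.4),
    ModelStats("gemini-3-flash-preview", 1474, 4.2),
    ModelStats("claude-opus-4-5-20251101", 1468, 3.9),
    ModelStats("claude-sonnet-4-6", 1466, 7.2),
    ModelStats("claude-sonnet-4-5-20250929", 1453, 5.6),
    ModelStats("gemini-2.5-pro", 1448, 4.0),
    ModelStats("gpt-5.1", 1439, 2.7),
    ModelStats("gpt-5.2", 1437, 1.6),
    ModelStats("gpt-5", 1434, 4.3),
    ModelStats("kimi-k2.5", 1432, 1.2),
    ModelStats("gpt-4.1", 1413, 1.8),
    ModelStats("claude-opus-4-20250514", 1412, 13.6),
    ModelStats("gemini-2.5-flash", 1411, 2.6),
    ModelStats("claude-haiku-4-5-20251001", 1409, 4.1),
    ModelStats("qwen3-next-80b-a3b", 1402, 3.1),
    ModelStats("gpt-5-mini", 1390, 3.8),
    ModelStats("claude-sonnet-4-20250514", 1389, 5.1),
    ModelStats("gemini-2.5-flash-lite", 1380, 1.1),
    ModelStats("gpt-oss-120b", 1353, 1.4),
    ModelStats("qwen3-32B", 1347, 3.7),
    ModelStats("gpt-5-nano", 1337, 3.2),
    ModelStats("gpt-oss-20b", 1317, 1.1),
]


def pick_model(max_latency_s: float, allowed: Optional[Sequence[str]] = None) -> Optional[str]:
    """Return the highest-scoring model whose listed latency fits the budget, or None."""
    candidates = [
        m for m in MODELS
        if m.latency_s <= max_latency_s and (allowed is None or m.parameter in allowed)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.lmarena_score).parameter
```

With a 2-second budget, `pick_model(2.0)` returns `"gpt-5.2"`. Pass `allowed` to restrict the choice, for example to models available in your region.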

By provider

Anthropic Claude

Model	Parameter	LMArena Score	Latency per 10,000 tokens
Claude Opus 4.7	`claude-opus-4-7`	1491	TBD
Claude Opus 4.6	`claude-opus-4-6`	1498	7.4s
Claude Sonnet 4.6	`claude-sonnet-4-6`	1466	7.2s
Claude Opus 4.5	`claude-opus-4-5-20251101`	1468	3.9s
Claude 4.5 Sonnet	`claude-sonnet-4-5-20250929`	1453	5.6s
Claude 4.5 Haiku	`claude-haiku-4-5-20251001`	1409	4.1s
Claude 4 Opus	`claude-opus-4-20250514`	1412	13.6s
Claude 4 Sonnet	`claude-sonnet-4-20250514`	1389	5.1s

OpenAI GPT

Model	Parameter	LMArena Score	Latency per 10,000 tokens
GPT-5.5	`gpt-5.5`	1475	TBD
GPT-5.2	`gpt-5.2`	1437	1.6s
GPT-5.1	`gpt-5.1`	1439	2.7s
GPT-5	`gpt-5`	1434	4.3s
GPT-5 nano	`gpt-5-nano`	1337	3.2s
GPT-5 mini	`gpt-5-mini`	1390	3.8s
GPT-4.1	`gpt-4.1`	1413	1.8s
gpt-oss-120b	`gpt-oss-120b`	1353	1.4s
gpt-oss-20b	`gpt-oss-20b`	1317	1.1s

Google Gemini

Model	Parameter	LMArena Score	Latency per 10,000 tokens
Gemini 3.5 Flash	`gemini-3.5-flash`	1480	TBD
Gemini 3 Flash Preview	`gemini-3-flash-preview`	1474	4.2s
Gemini 3.1 Flash Lite Preview	`gemini-3.1-flash-lite-preview`	1438	TBD
Gemini 2.5 Pro	`gemini-2.5-pro`	1448	4.0s
Gemini 2.5 Flash	`gemini-2.5-flash`	1411	2.6s
Gemini 2.5 Flash-Lite	`gemini-2.5-flash-lite`	1380	1.1s

Gemini 3.1 Flash Lite Preview is currently available in the US region only.

Alibaba Cloud Qwen

Model	Parameter	LMArena Score	Latency per 10,000 tokens
Qwen3 Next 80B A3B	`qwen3-next-80b-a3b`	1402	3.1s
Qwen3 32B	`qwen3-32B`	1347	3.7s

Moonshot AI Kimi

Model	Parameter	LMArena Score	Latency per 10,000 tokens
Kimi K2.5	`kimi-k2.5`	1432	1.2s

Via the LLM Gateway, Claude Opus 4.5 and Claude Opus 4.6 currently support only context windows under 200k tokens.
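If you send long transcripts to these two models, a rough pre-flight check can catch requests likely to exceed the limit. This sketch relies on a loudly labeled assumption: roughly 4 characters per token for English text, which is a rule of thumb and not a real tokenizer. It also conservatively counts the requested `max_tokens` against the same budget. `estimate_tokens` and `likely_fits_context` are hypothetical names.

```python
from typing import Dict, List

# Assumption: roughly 4 characters per token for English text. This is a
# rule of thumb, not a tokenizer, so treat the result as an estimate only.
CHARS_PER_TOKEN = 4
CONTEXT_LIMIT_TOKENS = 200_000
# Models this page says are limited to context windows under 200k tokens
LIMITED_MODELS = {"claude-opus-4-5-20251101", "claude-opus-4-6"}


def estimate_tokens(messages: List[Dict[str, str]]) -> int:
    """Roughly estimate prompt tokens from the total characters in message contents."""
    chars = sum(len(m["content"]) for m in messages)
    return -(-chars // CHARS_PER_TOKEN)  # ceiling division


def likely_fits_context(model: str, messages: List[Dict[str, str]], max_tokens: int) -> bool:
    """Return False if a limited model's estimated prompt plus max_tokens reaches 200k.

    Models outside LIMITED_MODELS are not checked and always return True.
    """
    if model not in LIMITED_MODELS:
        return True
    return estimate_tokens(messages) + max_tokens < CONTEXT_LIMIT_TOKENS
```

Because the estimate is approximate, leave a generous margin rather than sending prompts that land just under the threshold.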

For information on data retention and model training policies for each provider, see Data Retention and Model Training.

Head to our Playground to test the LLM Gateway without writing any code!

Select a model

You can specify which model to use by setting the `model` parameter in your request. The examples below use Claude Sonnet 4.6:

```python
import requests

# Authenticate with your AssemblyAI API key
headers = {
    "authorization": "<YOUR_API_KEY>"
}

response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "claude-sonnet-4-6",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 1000,
    },
)
# Raise on 4xx/5xx instead of failing later on a missing "choices" key
response.raise_for_status()

result = response.json()
print(result["choices"][0]["message"]["content"])
```

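```javascript
// Uses top-level await, so run this as an ES module
const response = await fetch(
  "https://llm-gateway.assemblyai.com/v1/chat/completions",
  {
    method: "POST",
    headers: {
      authorization: "<YOUR_API_KEY>",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-6",
      messages: [{ role: "user", content: "What is the capital of France?" }],
      max_tokens: 1000,
    }),
  }
);

// Fail loudly on non-2xx responses instead of reading a missing `choices` field
if (!response.ok) {
  throw new Error(`LLM Gateway error ${response.status}: ${await response.text()}`);
}

const result = await response.json();
console.log(result.choices[0].message.content);
```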

To use a different model, change the `model` parameter to any value listed in the Available models section above.

Want to compare models side-by-side? Try the Model Comparison Tool, a Lovable application, to test different LLM models and see how they perform.
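To compare models from your own code, note that only the `model` value changes between requests, so a comparison is a loop over the same payload. This is a minimal sketch: `compare_models` and `post_with_requests` are hypothetical names, the US endpoint is assumed, and the measured time is the full round trip including network overhead, so it is not directly comparable to the per-10,000-token latency in the tables. The `send` argument lets you swap in another HTTP client.

```python
import time
from typing import Callable, Dict, Iterable

GATEWAY_URL = "https://llm-gateway.assemblyai.com/v1/chat/completions"


def post_with_requests(payload: dict, api_key: str) -> dict:
    """Send one chat completions request and return the parsed JSON body."""
    import requests  # imported lazily so compare_models can be used with another client

    response = requests.post(
        GATEWAY_URL, headers={"authorization": api_key}, json=payload, timeout=60
    )
    response.raise_for_status()
    return response.json()


def compare_models(
    models: Iterable[str],
    prompt: str,
    api_key: str,
    send: Callable[[dict, str], dict] = post_with_requests,
    max_tokens: int = 1000,
) -> Dict[str, dict]:
    """Send the same prompt to each model; return each answer and wall-clock time."""
    results = {}
    for model in models:
        payload = {
            "model": model,  # the only field that changes between requests
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }
        start = time.perf_counter()
        body = send(payload, api_key)
        results[model] = {
            "answer": body["choices"][0]["message"]["content"],
            "seconds": time.perf_counter() - start,
        }
    return results
```

For example, `compare_models(["claude-sonnet-4-6", "gpt-5.2", "gemini-2.5-flash"], "What is the capital of France?", "<YOUR_API_KEY>")` returns one answer and timing per model.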

Logging and troubleshooting

Every LLM Gateway response includes a `request_id` field, a unique identifier for that request. Persist it, along with the model, the API region, and a timestamp, for every call you make, not just when something goes wrong. If you contact support@assemblyai.com about a specific request (latency spikes, unexpected output, rate-limit errors, or unexpected content moderation), this ID lets us locate the exact request in our logs immediately. At minimum, we recommend logging:

The `request_id` from the response body
The `model` parameter used
The API region (US: llm-gateway.assemblyai.com, EU: llm-gateway.eu.assemblyai.com)
A timestamp for when the request was sent
The full error response body when a non-2xx status code is returned
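The fields listed above can be collected into one structured record per call. This is a minimal sketch, not an official logging format: `build_log_record`, `log_record`, and the `"llm_gateway"` logger name are hypothetical, and `request_id` is read from the body only when the body is parsed JSON (it is recorded as `None` otherwise).

```python
import json
import logging
from datetime import datetime, timezone
from typing import Union

logger = logging.getLogger("llm_gateway")  # hypothetical logger name


def build_log_record(
    model: str,
    region_host: str,
    sent_at: datetime,
    status_code: int,
    body: Union[dict, str],
) -> dict:
    """Collect the recommended fields into one JSON-serializable record."""
    record = {
        # request_id comes from the parsed response body when available
        "request_id": body.get("request_id") if isinstance(body, dict) else None,
        "model": model,
        "region": region_host,
        "sent_at": sent_at.isoformat(),
        "status_code": status_code,
    }
    if not 200 <= status_code < 300:
        record["error_body"] = body  # keep the full error response for support requests
    return record


def log_record(record: dict) -> None:
    """Emit the record as a single JSON line so it is easy to search later."""
    logger.info(json.dumps(record))
```

With `requests`, capture `sent_at = datetime.now(timezone.utc)` before calling `requests.post`, then pass `response.status_code` and `response.json()`, or `response.text` when the body is not valid JSON.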

For details on debugging specific status codes (400/401/403/429/5xx) and what information to include when filing a support request, see the Troubleshooting page.

Next steps

Basic Chat Completions - Learn how to send simple messages and receive responses
Multi-turn Conversations - Maintain context across multiple exchanges
Structured Outputs - Constrain model responses to follow a specific JSON schema
Tool Calling - Enable models to execute custom functions
Agentic Workflows - Build multi-step reasoning applications
Post-processing - Automatically repair malformed JSON in model responses

The LLM Gateway API is separate from the Speech-to-Text and Speech Understanding APIs. It provides a unified interface to work with large language models across multiple providers.
