Prompt caching lets you avoid reprocessing the same prompt content on every request. When you send a long system prompt, tool definitions, or conversation history repeatedly, the LLM provider can cache that content and reuse it on subsequent requests, reducing both latency and cost.

The LLM Gateway supports prompt caching across all major providers, with each provider using its own caching mechanism:
| Provider | Caching behavior | Configuration required? |
| --- | --- | --- |
| Claude | Explicit opt-in | Yes, add cache_control to messages |
| OpenAI | Automatic | No, caching happens implicitly |
| Gemini | Automatic | No, caching happens implicitly |
| Kimi | Automatic | No, caching happens implicitly |
Cached input tokens are billed at a discounted rate compared to regular input tokens. The exact discount depends on the model and provider.
Prompt caching only activates when the cacheable portion of your prompt meets a minimum token threshold. Claude’s minimum varies by model — see Minimum cacheable prompt length for the per-model limits. OpenAI requires 1,024 tokens. Shorter prompts won’t benefit from caching.
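As a rough pre-flight check, you can estimate whether a prompt is likely to clear a provider's minimum before relying on caching. The sketch below is only an illustration: `estimate_tokens` uses a crude four-characters-per-token heuristic rather than a real tokenizer, and both helper names are hypothetical. Only the 1,024-token OpenAI minimum comes from the text above.

```python
# Rough pre-flight check: is this prompt plausibly long enough to be cached?
# ASSUMPTION: ~4 characters per token is a crude heuristic, not a tokenizer.

OPENAI_MIN_CACHEABLE_TOKENS = 1024  # minimum stated above for OpenAI models


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Return a rough token estimate for `text` (hypothetical helper)."""
    return int(len(text) / chars_per_token)


def likely_cacheable(text: str, min_tokens: int = OPENAI_MIN_CACHEABLE_TOKENS) -> bool:
    """Return True if the estimated token count meets `min_tokens`."""
    return estimate_tokens(text) >= min_tokens


short_prompt = "You are a helpful assistant."
long_prompt = "policy text " * 1000  # 12,000 characters

print(likely_cacheable(short_prompt))  # well below the minimum
print(likely_cacheable(long_prompt))   # comfortably above the minimum
```

Treat the result as a hint only. The authoritative signal is the cached token count returned in the response.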
Claude models require you to explicitly mark which content blocks to cache using the cache_control field. Add cache_control with type set to "ephemeral" on any message you want cached.
Python

```python
import os

import requests

headers = {
    "authorization": os.environ["ASSEMBLYAI_API_KEY"]
}

system_prompt = (
    "You are a customer support agent for Acme Corp. "
    "You have access to our full product catalog, pricing, "
    "and policy documentation. Always be helpful and concise."
)

response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "claude-sonnet-4-6",
        "messages": [
            {
                "role": "system",
                "content": system_prompt,
                "cache_control": {"type": "ephemeral"}
            },
            {
                "role": "user",
                "content": "What is your return policy?"
            }
        ],
        "max_tokens": 1000
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

# Check cache usage in the response
usage = result["usage"]
cache_details = usage.get("prompt_tokens_details", {})
print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")
```

JavaScript

```javascript
const response = await fetch(
  "https://llm-gateway.assemblyai.com/v1/chat/completions",
  {
    method: "POST",
    headers: {
      authorization: process.env.ASSEMBLYAI_API_KEY,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-6",
      messages: [
        {
          role: "system",
          content:
            "You are a customer support agent for Acme Corp. " +
            "You have access to our full product catalog, pricing, " +
            "and policy documentation. Always be helpful and concise.",
          cache_control: { type: "ephemeral" },
        },
        {
          role: "user",
          content: "What is your return policy?",
        },
      ],
      max_tokens: 1000,
    }),
  }
);

const result = await response.json();
console.log(result.choices[0].message.content);

// Check cache usage in the response
const cacheDetails = result.usage?.prompt_tokens_details;
console.log(`Cached tokens: ${cacheDetails?.cached_tokens ?? 0}`);
```
You can also set cache_control on tool result messages to cache tool interaction history in multi-turn agentic conversations.
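Here is a minimal sketch of what this might look like in a multi-turn agentic conversation. The message shapes assume the Gateway accepts OpenAI-style `tool_calls` and `tool` role messages on its chat completions endpoint. The tool name `lookup_order`, the call ID, and the helper `mark_cacheable` are made up for illustration.

```python
import copy


def mark_cacheable(message: dict) -> dict:
    """Return a copy of `message` with an ephemeral cache_control marker."""
    marked = copy.deepcopy(message)
    marked["cache_control"] = {"type": "ephemeral"}
    return marked


# ASSUMPTION: OpenAI-style tool call and tool result message shapes.
messages = [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "Where is order 1234?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",  # hypothetical call ID
                "type": "function",
                "function": {
                    "name": "lookup_order",  # hypothetical tool
                    "arguments": '{"order_id": "1234"}',
                },
            }
        ],
    },
    # Mark the tool result so the history up to this point can be reused
    mark_cacheable(
        {"role": "tool", "tool_call_id": "call_1", "content": '{"status": "shipped"}'}
    ),
    # The newest user turn comes after the breakpoint and is not cached
    {"role": "user", "content": "When will it arrive?"},
]

payload = {"model": "claude-sonnet-4-6", "messages": messages, "max_tokens": 1000}
```

The `payload` dictionary can then be sent as the JSON body of the same POST request shown in the example above.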
Claude only caches prompts that meet a minimum token threshold, and the threshold depends on the model. If the cacheable portion of your prompt falls below this threshold, the request is processed without caching and no error is returned.
The LLM Gateway extends Anthropic’s native cache_control with an optional ttl field for specifying cache duration. This is a Gateway-specific parameter — Anthropic’s native API does not support it.
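As an illustration, a cache_control object carrying a ttl might be built like this. This page does not list the accepted ttl values, so the `"1h"` string below is an assumption, inferred only from the 5-minute and 1-hour cache counters that responses report. Confirm the allowed values before relying on it. The helper name `build_cache_control` is hypothetical.

```python
from typing import Optional


def build_cache_control(ttl: Optional[str] = None) -> dict:
    """Build a cache_control object, adding the optional ttl only when given."""
    control = {"type": "ephemeral"}
    if ttl is not None:
        control["ttl"] = ttl
    return control


system_message = {
    "role": "system",
    "content": "You are a customer support agent for Acme Corp.",
    # ASSUMPTION: "1h" is a guessed value format, not confirmed by this page
    "cache_control": build_cache_control(ttl="1h"),
}

print(system_message["cache_control"])
```

Omitting `ttl` yields the plain `{"type": "ephemeral"}` marker used elsewhere on this page.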
You can set cache_control on the following message types:
System messages: Cache long system prompts that don’t change between requests
User and assistant messages: Cache conversation history in multi-turn flows
Tool result messages: Cache tool call outputs in agentic workflows
For best results with Claude, place cache_control on the content that stays the same across requests — typically the system prompt and any static context. Content after the last cache breakpoint is not cached.
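One way to apply this placement rule is to keep static and per-request messages in separate lists and mark only the last static message. The helper below is a sketch under that assumption. `build_messages` is a hypothetical name, not part of the Gateway API.

```python
def build_messages(static_messages: list, dynamic_messages: list) -> list:
    """Put static content first and mark its last message as the cache breakpoint.

    Input lists are not modified; shallow copies of each message are returned.
    """
    prefix = [dict(m) for m in static_messages]
    if prefix:
        # Everything up to and including this message is eligible for caching
        prefix[-1]["cache_control"] = {"type": "ephemeral"}
    # Content after the breakpoint changes per request and is not cached
    return prefix + [dict(m) for m in dynamic_messages]


static = [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "Reference material: full product catalog text."},
]
dynamic = [{"role": "user", "content": "What is your return policy?"}]

messages = build_messages(static, dynamic)
for message in messages:
    print(message["role"], "cache_control" in message)
```

Keeping the static prefix byte-for-byte identical across requests matters as much as the marker itself, since a changed prefix cannot match an earlier cache entry.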
OpenAI models cache prompts automatically. No configuration is needed — the gateway passes your requests through and caching happens on OpenAI’s infrastructure.
Python

```python
import os

import requests

headers = {
    "authorization": os.environ["ASSEMBLYAI_API_KEY"]
}

# OpenAI models cache automatically; no cache_control needed
response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "gpt-4.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a customer support agent..."
            },
            {
                "role": "user",
                "content": "What is your return policy?"
            }
        ],
        "max_tokens": 1000
    }
)

result = response.json()

# Cached tokens still appear in usage
cache_details = result["usage"].get("prompt_tokens_details", {})
print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")
```

JavaScript

```javascript
// OpenAI models cache automatically; no cache_control needed
const response = await fetch(
  "https://llm-gateway.assemblyai.com/v1/chat/completions",
  {
    method: "POST",
    headers: {
      authorization: process.env.ASSEMBLYAI_API_KEY,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4.1",
      messages: [
        {
          role: "system",
          content: "You are a customer support agent...",
        },
        {
          role: "user",
          content: "What is your return policy?",
        },
      ],
      max_tokens: 1000,
    }),
  }
);

const result = await response.json();

const cacheDetails = result.usage?.prompt_tokens_details;
console.log(`Cached tokens: ${cacheDetails?.cached_tokens ?? 0}`);
```
You can optionally configure cache behavior with two additional request-level fields:
| Field | Type | Description |
| --- | --- | --- |
| prompt_cache_retention | string | Controls how long cached content is retained on OpenAI’s infrastructure. These values are passed through to OpenAI’s API; refer to OpenAI’s documentation for current allowed values. |
| prompt_cache_key | string | A custom key to group related requests for caching. Requests with the same key are more likely to share cached content. |
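A request body using these optional fields could be assembled as sketched below. The helper `build_openai_payload` and the key `"support-bot-v1"` are hypothetical. No retention value is filled in, because the allowed values are defined by OpenAI and not listed here.

```python
from typing import Optional


def build_openai_payload(
    messages: list,
    cache_key: Optional[str] = None,
    cache_retention: Optional[str] = None,
) -> dict:
    """Build a chat completions body, adding cache fields only when provided."""
    payload = {"model": "gpt-4.1", "messages": messages, "max_tokens": 1000}
    if cache_key is not None:
        # Requests sharing a key are more likely to share cached content
        payload["prompt_cache_key"] = cache_key
    if cache_retention is not None:
        # Passed through to OpenAI unchanged; see OpenAI's docs for valid values
        payload["prompt_cache_retention"] = cache_retention
    return payload


messages = [
    {"role": "system", "content": "You are a customer support agent..."},
    {"role": "user", "content": "What is your return policy?"},
]

payload = build_openai_payload(messages, cache_key="support-bot-v1")
print(sorted(payload.keys()))
```

Both fields sit at the top level of the body, next to `model` and `messages`, not inside individual messages.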
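The response reports cache activity in the following usage fields:

| Field | Description |
| --- | --- |
| prompt_tokens_details.cached_tokens | Number of input tokens read from cache (cost savings). |
| cache_creation.ephemeral_5m_input_tokens | Tokens written to a 5-minute ephemeral cache (Claude only). |
| cache_creation.ephemeral_1h_input_tokens | Tokens written to a 1-hour ephemeral cache (Claude only). |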
When cached_tokens is greater than zero, those tokens were served from cache and billed at the discounted cached input rate rather than the standard input rate.
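To track how much of each prompt is served from cache, you can derive a hit ratio from the usage object. This sketch assumes `usage.prompt_tokens` holds the total input token count, including cached tokens, as in OpenAI-style responses. The helper name `summarize_cache_usage` is hypothetical.

```python
def summarize_cache_usage(usage: dict) -> dict:
    """Summarize cached vs. uncached input tokens from a response's usage object."""
    # ASSUMPTION: prompt_tokens is the total input count, cached tokens included
    prompt_tokens = usage.get("prompt_tokens", 0) or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    hit_ratio = cached / prompt_tokens if prompt_tokens else 0.0
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached,
        "uncached_tokens": prompt_tokens - cached,
        "hit_ratio": hit_ratio,
    }


# Example usage object with made-up numbers
example_usage = {
    "prompt_tokens": 2000,
    "completion_tokens": 150,
    "prompt_tokens_details": {"cached_tokens": 1500},
}

summary = summarize_cache_usage(example_usage)
print(summary["hit_ratio"])  # 0.75
```

A hit ratio that stays at zero across repeated requests usually means the prompt is below the minimum threshold or the prefix is changing between calls.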
Cache your system prompt — System prompts are the best candidates for caching since they stay the same across requests. Place cache_control on the system message for Claude, or rely on automatic caching for OpenAI and Gemini.
Cache tool definitions — If you use the same tools across multiple requests, the tool definitions in your prompt are automatically eligible for caching.
Order messages for maximum cache hits — Put static content (system prompt, tool definitions) at the beginning of the message array. Content before the cache breakpoint is more likely to match across requests.
Monitor cache metrics — Check prompt_tokens_details.cached_tokens in responses to verify caching is working and estimate your cost savings.
These fields are set at the top level of the request body:
| Key | Type | Required? | Description |
| --- | --- | --- | --- |
| cache_control | object | No | Default cache control applied to the entire request (Claude models). When set at the request level, it acts as a default for all messages. Contains type and optional ttl. |
| prompt_cache_retention | string | No | Controls cache retention duration (OpenAI models). Passed through to OpenAI’s API. |
| prompt_cache_key | string | No | Custom cache key for grouping requests (OpenAI models). |
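For Claude models, a request-level default might look like the sketch below, with cache_control placed next to `model` and `messages` instead of on a single message. The helper name `build_claude_payload` is hypothetical, and how the Gateway resolves a request-level default against message-level markers is not covered here.

```python
def build_claude_payload(messages: list, default_cache: bool = True) -> dict:
    """Build a request body with an optional request-level cache_control default."""
    payload = {
        "model": "claude-sonnet-4-6",
        "messages": messages,
        "max_tokens": 1000,
    }
    if default_cache:
        # Top-level cache_control acts as the default for all messages
        payload["cache_control"] = {"type": "ephemeral"}
    return payload


messages = [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "What is your return policy?"},
]

payload = build_claude_payload(messages)
print(payload["cache_control"])
```

None of the individual messages carries a marker in this form; the default lives only at the top level of the body.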
The cache_control field can also be set on individual messages. Message-level cache_control lets you mark specific cache breakpoints — the provider caches all content up to and including the marked message. This is the recommended approach for Claude models.
| Key | Type | Required? | Description |
| --- | --- | --- | --- |
| cache_control | object | No | Cache control for this specific message. Marks a cache breakpoint at this position in the conversation. |