Every message exchanged over the Voice Agent API WebSocket, grouped by direction. You’ll send session.update to configure, input.audio to stream mic audio, and tool.result to respond to tool calls. The server streams everything else back.
Configure the session. Send immediately on WebSocket connect (before session.ready). Can also be sent mid-conversation to update most fields. See Mutability after session.ready for which fields can change once the session is established.
All fields are optional. Include only what you want to set or change. After session.ready, only a subset of fields can be changed; changing greeting, session.output.voice, or session.output.format raises immutable_field. session.output.volume is mutable mid-session.
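As a rough client-side guard, a mid-session update can be checked for the immutable fields before it is sent. This is only a sketch: the helper name `immutable_fields_in` is hypothetical, and the dotted paths (in particular that greeting sits at the top level of the session object) are assumptions, not something the reference confirms.

```python
# Fields the text above lists as immutable after session.ready.
# ASSUMPTION: paths are relative to the "session" object and "greeting"
# lives at its top level; adjust to the real schema.
IMMUTABLE_AFTER_READY = ("greeting", "output.voice", "output.format")


def immutable_fields_in(session_patch: dict) -> list:
    """Return the dotted paths in session_patch that would raise immutable_field."""
    found = []
    for path in IMMUTABLE_AFTER_READY:
        node = session_patch
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                break  # this path is not present in the patch
            node = node[key]
        else:
            found.append(path)  # every key on the path was present
    return found
```

Calling this before sending a session.update after session.ready lets the client drop or reject the offending fields locally instead of waiting for the server error.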
Sessions are preserved for 30 seconds after each disconnection before expiring. If you try to resume a session that has already expired, the server returns a session.error with code session_not_found or session_forbidden. In that case, start a fresh connection without session.resume.
Example. Capture session_id from session.ready on the first connection, then send session.resume as the first message when reconnecting:
import json
import websockets

# URL and API_KEY are assumed to be defined elsewhere in your application.
session_id: str | None = None

async def connect():
    global session_id
    async with websockets.connect(URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}) as ws:
        # If we already have a session_id from a previous connection, resume it.
        if session_id:
            await ws.send(json.dumps({"type": "session.resume", "session_id": session_id}))
        else:
            await ws.send(json.dumps({"type": "session.update", "session": {...}}))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "session.ready":
                session_id = event["session_id"]  # save for next reconnect
            elif event["type"] == "session.error" and event["code"] in ("session_not_found", "session_forbidden"):
                session_id = None  # session expired - start fresh next time
            # ... handle other events

# On disconnect, call connect() again within 30 seconds to resume.
Send a tool result back to the agent. Send this when reply.done is the latest event you’ve received (and nothing has happened since). The simplest pattern is to accumulate on tool.call and drain inside the reply.done handler. See Tool calling.
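The accumulate-and-drain pattern could look roughly like the sketch below. The class name `ToolResultQueue` and the `call_id` and `result` field names are hypothetical placeholders, since this section does not show the tool.call or tool.result schema; see Tool calling for the real fields. Dropping results when the reply was interrupted follows the barge-in guidance in this reference.

```python
import json


class ToolResultQueue:
    """Collects tool results during a reply and releases them on reply.done."""

    def __init__(self):
        self._pending = []

    def on_tool_call(self, event: dict, result) -> None:
        # ASSUMPTION: "call_id" and "result" are illustrative field names,
        # not confirmed by this reference.
        self._pending.append({
            "type": "tool.result",
            "call_id": event.get("call_id"),
            "result": result,
        })

    def on_reply_done(self, event: dict) -> list:
        """Return JSON strings to send now; empty if the reply was interrupted."""
        pending, self._pending = self._pending, []
        if event.get("status") == "interrupted":
            return []  # discard results from the just-ended reply
        return [json.dumps(message) for message in pending]
```

In a receive loop, you would call `on_tool_call` for each tool.call event and send every string returned by `on_reply_done` over the socket from the reply.done handler.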
Ask the agent to generate a reply right now, optionally with custom instructions. Useful for delivering status updates during long-running hold-mode tool calls, or any time you want the agent to speak without a user utterance triggering it.
{ "type": "reply.create", "instructions": "Let the customer know we're still processing the transfer."}
| Field | Type | Description |
| --- | --- | --- |
| instructions | string | Optional. One-shot instruction the agent uses to compose this reply. Does not modify system_prompt. |
The agent generates a normal reply (reply.started → reply.audio → transcript.agent → reply.done) using the provided instructions on top of the existing system prompt and conversation history.
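A small helper for building this message might look like the following. It is a minimal sketch: the function name `reply_create` is made up for illustration, and it only relies on the type and instructions fields documented above.

```python
import json
from typing import Optional


def reply_create(instructions: Optional[str] = None) -> str:
    """Serialize a reply.create message, omitting instructions when not given."""
    message = {"type": "reply.create"}
    if instructions is not None:
        message["instructions"] = instructions  # one-shot; system_prompt is untouched
    return json.dumps(message)
```

For example, during a long-running hold-mode tool call you could send `reply_create("Let the customer know we're still processing the transfer.")` over the open socket.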
Partial transcript of what the user is saying, updating in real-time.
{ "type": "transcript.user.delta", "text": "What's the weather in"}
Live user transcripts pause while a hold-mode tool is in flight and resume once the hold ends. Anything the user said during the hold is preserved in the conversation context.
Full text of the agent’s response, sent after all audio for the response has been delivered. If the agent was interrupted, interrupted is true and text contains only what was actually spoken before the interruption.
{ "type": "transcript.agent", "text": "It's currently 22°C and sunny in Tokyo.", "reply_id": "reply_abc123", "item_id": "item_abc123", "interrupted": false}
| Field | Type | Description |
| --- | --- | --- |
| text | string | What the agent said (trimmed to interruption point if interrupted) |
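One way to consume this event is to append it to a local conversation log and keep the interrupted flag, so later code knows the text is only what was actually spoken. This is a sketch; `record_agent_transcript` and the shape of the log entries are hypothetical.

```python
def record_agent_transcript(history: list, event: dict) -> None:
    """Append a transcript.agent event to a local conversation log."""
    history.append({
        "role": "agent",
        "reply_id": event.get("reply_id"),
        "text": event["text"],  # already trimmed by the server if interrupted
        "interrupted": event.get("interrupted", False),
    })
```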
Session or protocol error. The payload always includes type, timestamp, code, and message. Some errors (like session.update validation failures) also include a param field naming the offending field.
Client message errors
Sent on the open socket when an inbound message is invalid. The session stays alive (except session_expired).
| Code | Description |
| --- | --- |
| invalid_format | Bad JSON, missing or unknown type, validation failure, or missing audio field on input.audio |
| invalid_audio | input.audio payload failed base64 decode or PCM conversion |
| invalid_value | session.update with an invalid voice or field type |
| immutable_field | session.update tried to change greeting, output.voice, or output.format after the first update was applied. output.volume is mutable and does not raise this error. |
| invalid_config | session.update raised a validation error |
| server_error | Unexpected exception while applying session.update |
Live session errors
| Code | Close code | Description |
| --- | --- | --- |
| session_expired | 1008 | Session duration TTL reached. There is no separate “closing soon” warning event before this, so run a client-side timer if you need to wrap up gracefully. |
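The error codes in this reference can be folded into a small dispatcher that tells the client what to do next. This is a sketch under the assumption that the codes listed here are the ones you care about; the function name and the returned labels are made up for illustration.

```python
# Codes documented as leaving the session alive.
CLIENT_MESSAGE_ERRORS = {
    "invalid_format", "invalid_audio", "invalid_value",
    "immutable_field", "invalid_config", "server_error",
}
# Codes returned when a session.resume targets an expired session.
RESUME_FAILED = {"session_not_found", "session_forbidden"}


def classify_session_error(event: dict) -> str:
    """Map a session.error event to a coarse next step for the client."""
    code = event.get("code")
    if code == "session_expired":
        return "closed"        # socket closes with 1008
    if code in RESUME_FAILED:
        return "start_fresh"   # reconnect without session.resume
    if code in CLIENT_MESSAGE_ERRORS:
        return "continue"      # session stays alive; fix the message
    return "unknown"
```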
If the server cancels the session due to an internal error, the WebSocket closes with code 1011 without any session.error payload. In browsers, pre-handshake failures (like UNAUTHORIZED) surface as a close event with code 1006. You won’t receive a session.error. Always fetch a fresh token immediately before each connection attempt.
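Because some failures arrive only as a WebSocket close code, a close handler can translate the codes mentioned in this reference into labels for logging or retry logic. This is a minimal sketch; `describe_close` and its return values are hypothetical names.

```python
def describe_close(code: int) -> str:
    """Label the WebSocket close codes mentioned in this reference."""
    if code == 1008:
        return "session_expired"
    if code == 1011:
        return "server_internal_error"  # no session.error payload precedes this
    if code == 1006:
        # In browsers this includes pre-handshake failures such as a bad token.
        return "abnormal_or_pre_handshake_failure"
    return "other"
```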
When the user speaks mid-response (barge-in), the server stops the agent and emits reply.done with status: "interrupted" and transcript.agent with interrupted: true. The decision is semantic. Back-channels like “uh-huh” don’t trigger an interruption.
On reply.done with status: "interrupted":
Flush your local audio playback buffer.
Discard any pending tool.result accumulators from the just-ended reply.
Restart the playback stream so it’s ready for the next response.
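The three steps above can be sketched as a single handler. The `Playback` class here is a hypothetical stand-in for whatever audio output your client uses (its restart only bumps a counter), and the pending tool results are assumed to be kept in a plain list.

```python
from collections import deque


class Playback:
    """Hypothetical stand-in for a local audio playback stream."""

    def __init__(self):
        self.chunks = deque()   # audio chunks waiting to be played
        self.generation = 0     # incremented each time the stream restarts

    def enqueue(self, pcm: bytes) -> None:
        self.chunks.append(pcm)

    def flush(self) -> None:
        self.chunks.clear()

    def restart(self) -> None:
        self.generation += 1


def handle_reply_done(event: dict, playback: Playback, pending_tool_results: list) -> bool:
    """Apply the barge-in cleanup; return True if the reply was interrupted."""
    if event.get("status") != "interrupted":
        return False
    playback.flush()               # 1. drop audio that has not been played yet
    pending_tool_results.clear()   # 2. discard results from the just-ended reply
    playback.restart()             # 3. get ready for the next response
    return True
```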