Every message exchanged over the Voice Agent API WebSocket, grouped by direction. You’ll send session.update to configure, input.audio to stream mic audio, and tool.result to respond to tool calls. The server streams everything else back.

Event flow

A typical voice agent session moves through the events in this order:
Client                              Server
  │                                   │
  │── WebSocket connect ─────────────►│
  │── session.update ────────────────►│  (system prompt + tools + greeting)
  │                                   │
  │◄─── session.ready ────────────────│  (save session_id)
  │                                   │
  │── input.audio (stream) ──────────►│  (only after session.ready)
  │── input.audio (stream) ──────────►│
  │                                   │
  │◄─── input.speech.started ─────────│
  │◄─── transcript.user.delta ────────│
  │◄─── input.speech.stopped ─────────│
  │◄─── transcript.user ──────────────│
  │                                   │
  │◄─── reply.started ────────────────│
  │◄─── reply.audio ──────────────────│
  │◄─── transcript.agent ─────────────│
  │◄─── reply.done ───────────────────│
  │                                   │
  │  [tool call flow]                 │
  │◄─── tool.call ────────────────────│  (arguments is a dict)
  │◄─── reply.done ───────────────────│  ← send tool.result here
  │── tool.result ───────────────────►│
  │◄─── reply.started ────────────────│
  │◄─── reply.audio ──────────────────│
  │◄─── reply.done ───────────────────│
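
The ordering constraint in the flow above (no input.audio before session.ready) could be enforced with a small gate like the sketch below. The class name `AudioGate` is hypothetical, and holding early audio rather than dropping it is an assumption; the page only states that audio must wait for session.ready.

```python
class AudioGate:
    """Holds outbound input.audio messages until session.ready arrives."""

    def __init__(self):
        self.ready = False
        self.session_id = None
        self._held = []

    def on_event(self, event: dict) -> list:
        """Feed every server event here; returns held messages to send now."""
        if event.get("type") == "session.ready":
            self.ready = True
            self.session_id = event["session_id"]  # save for session.resume
            released, self._held = self._held, []
            return released
        return []

    def submit(self, message: str) -> list:
        """Returns the messages that may be sent right now (possibly none)."""
        if self.ready:
            return [message]
        self._held.append(message)  # assumption: buffer instead of dropping
        return []
```

Whatever `submit` or `on_event` returns is what the caller would actually write to the socket, so the gate itself never touches the connection.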

Client → Server

input.audio

Stream PCM16 audio to the agent.
{
  "type": "input.audio",
  "audio": "<base64-encoded PCM16>"
}
| Field | Type | Description |
| --- | --- | --- |
| audio | string | Base64-encoded PCM16 mono 24 kHz audio |
See Audio format for the full format specification.
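
As a rough illustration, raw PCM16 bytes could be packaged into input.audio messages as follows. The function name `pcm16_to_messages` and the 50 ms chunk size are assumptions made for this sketch; only the message shape and the mono 24 kHz PCM16 format come from this page.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # from the field table: PCM16 mono 24 kHz
BYTES_PER_SAMPLE = 2     # 16-bit samples


def pcm16_to_messages(pcm: bytes, chunk_ms: int = 50) -> list[str]:
    """Split raw PCM16 bytes into JSON-encoded input.audio messages.

    chunk_ms is an illustrative choice, not a documented requirement.
    """
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    messages = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        messages.append(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return messages
```

Each returned string is one WebSocket text frame; one second of audio at the default chunk size yields 20 messages.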

session.update

Configure the session. Send immediately on WebSocket connect (before session.ready). Can also be sent mid-conversation to update most fields. See Mutability after session.ready for which fields can change once the session is established.
{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a concise assistant.",
    "greeting": "Hi! How can I help?",
    "input": {
      "format": { "encoding": "audio/pcm" },
      "turn_detection": { "vad_threshold": 0.5 }
    },
    "output": {
      "voice": "ivy",
      "format": { "encoding": "audio/pcm" },
      "volume": 100
    },
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    ]
  }
}
All fields are optional. Include only what you want to set or change. After session.ready, only a subset of fields can be changed; changing greeting, session.output.voice, or session.output.format raises immutable_field. session.output.volume is mutable mid-session.
| Field | Type | Description |
| --- | --- | --- |
| session.system_prompt | string | Sets the agent’s personality and context |
| session.greeting | string | Spoken aloud at the start of the conversation |
| session.input.format | object | Input audio format (encoding). See Audio format |
| session.input.keyterms | array | List of strings to boost in transcription. See Key terms |
| session.input.turn_detection | object | Turn detection configuration. See Session configuration |
| session.output.voice | string | The voice used for the agent’s speech. See Voices |
| session.output.format | object | Output audio format (encoding). See Audio format |
| session.output.volume | number | Playback volume for the agent’s speech, 0 (silent) to 100 (loudest). Mutable mid-session. See Output volume |
| session.tools | array | Tool definitions. See Tool calling |
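
To avoid a round trip that ends in immutable_field, a client might check an update locally before sending it. The sketch below is one way to do that; the helper names `immutable_fields_in` and `build_session_update` are hypothetical, and the list of immutable fields is taken only from the paragraph above, so it may not be exhaustive.

```python
import json

# Fields this page says raise immutable_field once the session is established.
IMMUTABLE_AFTER_READY = {("greeting",), ("output", "voice"), ("output", "format")}


def immutable_fields_in(session: dict) -> list[str]:
    """Return dotted names of fields in `session` that cannot change mid-session."""
    found = []
    for path in sorted(IMMUTABLE_AFTER_READY):
        node = session
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break
            node = node[key]
        else:
            # Every key on the path was present, so the field is being set.
            found.append(".".join(path))
    return found


def build_session_update(session: dict, session_ready: bool) -> str:
    """Serialize a session.update, refusing fields the server would reject."""
    if session_ready:
        blocked = immutable_fields_in(session)
        if blocked:
            raise ValueError(f"immutable after session.ready: {blocked}")
    return json.dumps({"type": "session.update", "session": session})
```

For example, `build_session_update({"output": {"volume": 50}}, session_ready=True)` serializes normally, while including `greeting` in the same call raises `ValueError` before anything reaches the server.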

session.resume

Reconnect to an existing session using the session_id from a previous session.ready. Preserves conversation context across dropped connections.
{
  "type": "session.resume",
  "session_id": "sess_abc123"
}
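Sessions are preserved for 30 seconds after each disconnection before expiring. If the resume fails, the server returns a session.error with a code such as session_not_found (the session_id is unknown or the grace window has expired) or session_forbidden (the session belongs to a different account). In that case, start a fresh connection without session.resume.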
Example. Capture session_id from session.ready on the first connection, then send session.resume as the first message when reconnecting:
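import json
import os

import websockets

# Placeholders: supply your own endpoint and key. These environment variable
# names are illustrative, not part of the API.
URL = os.environ["VOICE_AGENT_URL"]
API_KEY = os.environ["ASSEMBLYAI_API_KEY"]

# Error codes that mean the saved session can no longer be resumed.
RESUME_FAILED = ("session_not_found", "session_forbidden", "session_expired")

session_id: str | None = None

async def connect():
    global session_id
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # If we already have a session_id from a previous connection, resume it.
        if session_id:
            await ws.send(json.dumps({"type": "session.resume", "session_id": session_id}))
        else:
            # Minimal example config; replace with your own session settings.
            session = {"system_prompt": "You are a concise assistant."}
            await ws.send(json.dumps({"type": "session.update", "session": session}))

        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "session.ready":
                session_id = event["session_id"]  # save for next reconnect
            elif event["type"] == "session.error" and event["code"] in RESUME_FAILED:
                session_id = None  # session can't be resumed - start fresh next time
            # ... handle other events

# On disconnect, call connect() again within 30 seconds to resume.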

tool.result

Send a tool result back to the agent. Send it only while reply.done is the most recent event you’ve received, before any newer event arrives. The simplest pattern is to accumulate results on tool.call and drain them inside the reply.done handler. See Tool calling.
{
  "type": "tool.result",
  "call_id": "call_abc123",
  "result": "{\"temp_c\": 22, \"description\": \"Sunny\"}"
}
| Field | Type | Description |
| --- | --- | --- |
| call_id | string | The call_id from the tool.call event |
| result | string | JSON string containing the tool result |
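
The accumulate-then-drain pattern described above might look like the following sketch. `ToolResultQueue` and the `handlers` mapping are hypothetical names, and running handlers synchronously inside `on_tool_call` is a simplification; a real client would likely run them as async tasks. Dropping pending results when reply.done carries status "interrupted" follows the guidance in the Interruptions section.

```python
import json


class ToolResultQueue:
    """Accumulates results on tool.call and drains them on reply.done."""

    def __init__(self, handlers: dict):
        self._handlers = handlers  # tool name -> callable taking keyword args
        self._pending = []

    def on_tool_call(self, event: dict) -> None:
        handler = self._handlers[event["name"]]
        result = handler(**event["arguments"])  # arguments is already a dict
        self._pending.append({
            "type": "tool.result",
            "call_id": event["call_id"],
            "result": json.dumps(result),  # result must be a JSON string
        })

    def on_reply_done(self, event: dict) -> list[str]:
        """Return the JSON messages to send now; clears the queue either way."""
        pending, self._pending = self._pending, []
        if event.get("status") == "interrupted":
            return []  # discard results from the interrupted reply
        return [json.dumps(message) for message in pending]
```

The caller sends each string returned by `on_reply_done` over the socket, which keeps the send inside the reply.done handler as recommended.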

reply.create

Ask the agent to generate a reply right now, optionally with custom instructions. Useful for delivering status updates during long-running hold-mode tool calls, or any time you want the agent to speak without a user utterance triggering it.
{
  "type": "reply.create",
  "instructions": "Let the customer know we're still processing the transfer."
}
| Field | Type | Description |
| --- | --- | --- |
| instructions | string | Optional. One-shot instruction the agent uses to compose this reply. Does not modify system_prompt. |

The agent generates a normal reply (reply.started → reply.audio → transcript.agent → reply.done) using the provided instructions on top of the existing system prompt and conversation history.

Server → Client

session.ready

Session is established and ready to receive audio. Save session_id for reconnection. Start sending input.audio only after this event.
{
  "type": "session.ready",
  "session_id": "sess_abc123"
}
| Field | Type | Description |
| --- | --- | --- |
| session_id | string | Always present. Save this value to reconnect with session.resume. |

session.updated

Sent after session.update is applied successfully.
{ "type": "session.updated" }

input.speech.started

Turn detection determined the user has started speaking.
{ "type": "input.speech.started" }

input.speech.stopped

Turn detection determined the user has stopped speaking.
{ "type": "input.speech.stopped" }

transcript.user.delta

Partial transcript of what the user is saying, updating in real-time.
{
  "type": "transcript.user.delta",
  "text": "What's the weather in"
}
Live user transcripts pause while a hold-mode tool is in flight and resume once the hold ends. Anything the user said during the hold is preserved in the conversation context.

transcript.user

Final transcript of the user’s utterance.
{
  "type": "transcript.user",
  "text": "What's the weather in Tokyo?",
  "item_id": "item_abc123"
}

reply.started

Agent has begun generating a response.
{
  "type": "reply.started",
  "reply_id": "reply_abc123"
}

reply.audio

A chunk of the agent’s spoken response as base64 PCM16. Decode and play immediately.
{
  "type": "reply.audio",
  "data": "<base64-encoded PCM16>"
}
See Audio format for playback guidance.

transcript.agent

Full text of the agent’s response, sent after all audio for the response has been delivered. If the agent was interrupted, interrupted is true and text contains only what was actually spoken before the interruption.
{
  "type": "transcript.agent",
  "text": "It's currently 22°C and sunny in Tokyo.",
  "reply_id": "reply_abc123",
  "item_id": "item_abc123",
  "interrupted": false
}
| Field | Type | Description |
| --- | --- | --- |
| text | string | What the agent said (trimmed to interruption point if interrupted) |
| reply_id | string | ID of the reply |
| item_id | string | Conversation item ID |
| interrupted | boolean | true if the user interrupted mid-response |

reply.done

Agent has finished speaking. The optional status field indicates why the reply ended.
{ "type": "reply.done" }
{ "type": "reply.done", "status": "interrupted" }
| Field | Type | Description |
| --- | --- | --- |
| status | string | "interrupted" if the user barged in, absent for normal completion |

tool.call

Agent wants to call a registered tool. arguments is a dict, ready to use directly as-is.
{
  "type": "tool.call",
  "call_id": "call_abc123",
  "name": "get_weather",
  "arguments": { "city": "Tokyo" }
}
| Field | Type | Description |
| --- | --- | --- |
| call_id | string | Include this in tool.result |
| name | string | Tool name to call |
| arguments | object | Arguments as a dict (use directly) |
See Tool calling for the full pattern.

session.error

Session or protocol error. The payload always includes type, timestamp, code, and message. Some errors (like session.update validation failures) also include a param field naming the offending field.
{
  "type": "session.error",
  "code": "invalid_format",
  "message": "Invalid message format",
  "timestamp": "2025-01-01T00:00:00Z"
}
Connection and handshake errors

Sent before or instead of session.ready. The WebSocket closes after these with the indicated close code.

| Code | Close code | Description |
| --- | --- | --- |
| UNAUTHORIZED | 1008 | Missing or invalid Authorization token |
| FORBIDDEN | 1008 | Valid token, but insufficient permissions |
| server_error | 1008 | Service at capacity (try again later) |
| INTERNAL_ERROR | 1011 | Unexpected exception during connection setup |

Session resume errors

Sent when session.resume fails. The WebSocket closes after these.

| Code | Close code | Description |
| --- | --- | --- |
| session_not_found | 1008 | The session_id is unknown or the 30-second grace window expired |
| session_forbidden | 1008 | The session_id belongs to a different account |
| session_expired | 1008 | Session TTL elapsed during the grace window |

Agent startup errors

Sent after the WebSocket is accepted but before session.ready.

| Code | Description |
| --- | --- |
| agent_init_failed | Voice agent worker reported initialization failure |
| agent_timeout | Agent did not signal ready within 10 seconds |

Client message errors

Sent on the open socket when an inbound message is invalid. The session stays alive after these errors. The exception is session_expired, listed under Live session errors, which ends the session.

| Code | Description |
| --- | --- |
| invalid_format | Bad JSON, missing or unknown type, validation failure, or missing audio field on input.audio |
| invalid_audio | input.audio payload failed base64 decode or PCM conversion |
| invalid_value | session.update with an invalid voice or field type |
| immutable_field | session.update tried to change greeting, output.voice, or output.format after the first update was applied. output.volume is mutable and does not raise this error. |
| invalid_config | session.update raised a validation error |
| server_error | Unexpected exception while applying session.update |

Live session errors

| Code | Close code | Description |
| --- | --- | --- |
| session_expired | 1008 | Session duration TTL reached. There is no separate “closing soon” warning event before this, so run a client-side timer if you need to wrap up gracefully. |

If the server cancels the session due to an internal error, the WebSocket closes with code 1011 and no session.error payload. In browsers, pre-handshake failures (like UNAUTHORIZED) surface as a close event with code 1006, so you won’t receive a session.error. Always fetch a fresh token immediately before each connection attempt.
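
One way a client might route these codes is to map each one back to the group it belongs to above. The sketch below is illustrative only: `classify_error` and its category names are hypothetical, and using a `session_ready` flag to disambiguate the two codes that appear in more than one group (server_error and session_expired) is an assumption, not documented behavior.

```python
HANDSHAKE = {"UNAUTHORIZED", "FORBIDDEN", "INTERNAL_ERROR"}
RESUME = {"session_not_found", "session_forbidden"}
STARTUP = {"agent_init_failed", "agent_timeout"}
CLIENT_MESSAGE = {
    "invalid_format", "invalid_audio", "invalid_value",
    "immutable_field", "invalid_config",
}


def classify_error(code: str, session_ready: bool) -> str:
    """Map a session.error code to the group it is listed under on this page."""
    if code in HANDSHAKE:
        return "handshake"
    if code in RESUME:
        return "resume"
    if code in STARTUP:
        return "startup"
    if code in CLIENT_MESSAGE:
        return "client_message"
    if code == "server_error":
        # Listed twice: at capacity during handshake, or a failed session.update.
        return "client_message" if session_ready else "handshake"
    if code == "session_expired":
        # Listed twice: during a resume attempt, or when the live TTL is reached.
        return "live_session" if session_ready else "resume"
    return "unknown"
```

Only the "client_message" group is documented as leaving the session alive, so a caller could treat every other category as a signal to reconnect or start fresh.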

Interruptions

When the user speaks mid-response (barge-in), the server stops the agent and emits reply.done with status: "interrupted" and transcript.agent with interrupted: true. The decision is semantic: back-channels like “uh-huh” don’t trigger an interruption.
On reply.done with status: "interrupted":
  1. Flush your local audio playback buffer.
  2. Discard any pending tool.result accumulators from the just-ended reply.
  3. Restart the playback stream so it’s ready for the next response.
See Turn detection and interruptions for how the model decides what counts as an interruption, and Handling interruptions for the platform-specific flush pattern.
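
The first two steps above can be sketched in a platform-neutral way as shown below. `PlaybackBuffer` and `handle_reply_done` are hypothetical names, and the buffer is a plain in-memory queue standing in for a real audio output; step 3 (restarting the playback stream) depends on the audio library and is left as a comment.

```python
import base64
from collections import deque


class PlaybackBuffer:
    """In-memory stand-in for an audio output queue."""

    def __init__(self):
        self._chunks = deque()

    def on_reply_audio(self, event: dict) -> None:
        # reply.audio carries base64 PCM16 in its "data" field.
        self._chunks.append(base64.b64decode(event["data"]))

    def next_chunk(self):
        """Return the next chunk to play, or None if the buffer is empty."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> None:
        self._chunks.clear()


def handle_reply_done(event: dict, playback: PlaybackBuffer, pending_tool_results: list) -> bool:
    """Apply the interruption steps; returns True if the reply was interrupted."""
    if event.get("status") != "interrupted":
        return False
    playback.flush()              # step 1: drop audio not yet played
    pending_tool_results.clear()  # step 2: discard results from the ended reply
    # Step 3 (restart the playback stream) is platform-specific and omitted here.
    return True
```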