session.update is the central configuration surface for the Voice Agent API. Everything that shapes how your agent sounds and behaves lives here: its personality, its greeting, the tools it can call, how it interprets user speech, and what voice it speaks in. Getting this right is the single highest-leverage thing you do when building a voice agent.
Send a session.update as your first WebSocket message, and any time after, to update most fields. Some fields can only be set in the first update; see Mutability after session.ready below.
Here’s a full configuration showing every available field:
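As a rough sketch, the fields named on this page could be assembled in Python as shown here. The message envelope (a `type` key plus a nested `session` object), the `encoding` key inside `format`, the voice name, and the empty tools list are assumptions made for illustration, not confirmed API details.

```python
import json

# Hypothetical first message. The nesting mirrors the dotted field names used
# on this page (session.input.keyterms, session.output.voice, ...); the exact
# wire format is an assumption.
first_update = {
    "type": "session.update",
    "session": {
        "system_prompt": "You are Riley from Acme support. Max 2 sentences per turn.",
        "greeting": "Hey, this is Riley from Acme support. What's going on?",
        "tools": [],  # tool definitions omitted; their schema is not shown here
        "input": {
            "format": {"encoding": "audio/pcm"},  # assumed shape
            "keyterms": ["AssemblyAI", "FedRAMP"],
            "turn_detection": {
                "vad_threshold": 0.5,
                "min_silence": 1000,   # milliseconds
                "max_silence": 3000,   # milliseconds
                "interrupt_response": True,
            },
        },
        "output": {
            "voice": "example-voice",  # placeholder, not a real voice name
            "format": {"encoding": "audio/pcm"},  # assumed shape
            "volume": 80,
        },
    },
}

# Serialize to the JSON text you would send as the first WebSocket message.
payload = json.dumps(first_update)
```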
Mutability after session.ready
The first session.update you send before session.ready initializes the session. After session.ready, only a subset of fields can be changed. Changing an immutable field raises a session.error with code immutable_field, and the rejected change is ignored.
| Field | Mutable after session.ready? |
|---|---|
| session.system_prompt | Yes. Send a new prompt at any time to change the agent’s behavior on the next turn. |
| session.input.turn_detection | Yes. Adjust VAD thresholds, silence windows, and barge-in on the fly. |
| session.input.keyterms | Yes. Replace the keyterms list at any time. The new list takes effect on the next user utterance. |
| session.output.volume | Yes. Adjust playback volume on the fly. |
| session.greeting | No. Raises immutable_field. The greeting is spoken once at session start. |
| session.output.voice | No. Raises immutable_field. The voice is bound to the TTS connection at session start. |
| session.output.format | No. Raises immutable_field. The output audio encoding is fixed for the session. |
Other fields (session.tools, session.input.format) are also accepted in subsequent session.update messages and don’t raise immutable_field.
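One way to avoid an immutable_field error is to check an update on the client before sending it. The helper below is a sketch, assuming the session object is a nested dict whose keys follow the dotted field names above; `locked_fields` is a hypothetical name, not part of any SDK.

```python
# Dotted paths (relative to the session object) that this page lists as
# locked after session.ready.
IMMUTABLE_AFTER_READY = ("greeting", "output.voice", "output.format")


def locked_fields(session: dict) -> list[str]:
    """Return the immutable dotted paths that are present in `session`."""
    found = []
    for path in IMMUTABLE_AFTER_READY:
        node = session
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                break  # path is absent from this update
            node = node[key]
        else:
            found.append(path)  # every key along the path was present
    return found


# Example: this mid-session update would be rejected because of output.voice.
update = {"system_prompt": "Be brief.", "output": {"volume": 50, "voice": "v"}}
rejected = locked_fields(update)
```

If `rejected` is non-empty after session.ready, drop those keys or open a new session with the desired values.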
System prompt
Set the agent’s personality and behavior. Can be updated mid-session with another session.update.
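A minimal sketch of building a prompt from a persona line plus explicit rules, then wrapping it in a mid-session update. The envelope shape, the `build_system_prompt` helper, and the `lookup_order` tool name are assumptions for illustration.

```python
def build_system_prompt(persona: str, rules: list[str]) -> str:
    """Join a persona line and a bulleted rule list into one prompt string."""
    lines = [persona, "", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)


prompt = build_system_prompt(
    "You are Riley, a support agent for Acme.",
    [
        "Never say 'Certainly' or 'Absolutely'.",
        "Max 2 sentences per turn.",
        # lookup_order is a hypothetical tool name.
        "Call lookup_order only after the caller gives an order number.",
    ],
)

# Sending only system_prompt leaves every other field untouched (assumed
# partial-update semantics, consistent with "only include the ones you want
# to change" elsewhere on this page).
prompt_update = {"type": "session.update", "session": {"system_prompt": prompt}}
```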
- Ban specific phrases: "Never say 'Certainly' or 'Absolutely'"
- Enforce brevity: "Max 2 sentences per turn"
- Tell the agent when to use each tool
Greeting
What the agent says at the start of the conversation, spoken aloud. If omitted, the agent waits silently for the user to speak first.
The greeting is sent straight to the TTS engine. It is not run through the LLM first. Whatever string you put here is exactly what the user hears, word for word.
- Don’t write a meta-greeting like “Greet the user warmly and ask how you can help”. The TTS will literally speak that sentence. Write the exact words you want spoken.
- The system prompt’s tone/persona rules do not get applied to the greeting. Match the tone yourself.
Tips for writing the greeting:
- Short. One sentence. Voice users get impatient with anything longer.
- Conversational. The greeting sets the user’s expectation for how the agent talks. If the greeting is stiff, they’ll match stiff.
- Set the agent’s role in the first few words: “Hey, this is Riley from Acme support. What’s going on?”.
- Avoid “How can I help you today?”. It’s a chatbot tell. Try “What’s on your mind?”, “How can I help?”, or jump straight to a purposeful prompt: “Got your callback request. What did you want to go over?”.
Omit the greeting when:
- You want the agent to listen first (e.g. inbound call where the caller speaks first).
- Some out-of-band channel handles the greeting (e.g. an IVR menu).
The greeting cannot be changed after session.ready. Set it on your first session.update. Trying to change it mid-session returns immutable_field.
Voice and audio format
Choose a voice and configure the input/output audio encoding under session.output and session.input. The encoding determines the sample rate. Input and output encodings can differ. Both default to audio/pcm (24 kHz) if omitted.
session.output.voice, session.output.format, and session.greeting are locked after the first session.update is applied. Later attempts to change them return an immutable_field error. Set the voice and output format on your first session.update. session.output.volume is the exception: it can be updated mid-session.
Output volume
Adjust the playback volume of the agent’s speech via session.output.volume. Accepts a number from 0 (silent) to 100 (loudest). If omitted, the voice plays at its native level.
Unlike voice and format, volume can be updated mid-session. Send another session.update with a new value at any time; the change applies to subsequent reply.audio chunks.
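A small sketch of a mid-session volume change with a client-side range check. The 0 to 100 range comes from the text above; the envelope shape and the `volume_update` helper name are assumptions.

```python
def volume_update(volume: float) -> dict:
    """Build a session.update that changes only the output volume."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(volume, bool) or not isinstance(volume, (int, float)):
        raise TypeError("volume must be a number")
    if not 0 <= volume <= 100:
        raise ValueError("volume must be between 0 (silent) and 100 (loudest)")
    return {"type": "session.update", "session": {"output": {"volume": volume}}}


quieter = volume_update(40)
```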
Key terms
If your conversation involves rare or domain-specific words, like a person’s name, company name, or product, add them to session.input.keyterms to improve transcription accuracy. This works like a word boost, biasing the speech recognition model toward these terms.
session.input.keyterms accepts up to 100 strings. The list can be updated mid-session by sending another session.update; the new list replaces the previous one and takes effect on the next user utterance. Passing null or [] clears the boost.
What to add:
- Brand, product, and feature names that aren’t in everyday English ("AssemblyAI", "Lemur", "Ozempic").
- Proper nouns specific to this caller: their full name, their account holder’s name, the agent’s name if it’s unusual.
- Domain jargon that the model might otherwise transcribe as a common-word homophone ("hemochromatosis", "polysomnography").
- Acronyms you want spelled in full ("PCI DSS", "FedRAMP").
What not to add:
- Common English words. Each entry boosts that string, and adding common words at the same weight as your rare terms dilutes the boost.
- Whole sentences or phrases. The boost is per-term, not per-phrase.
- Punctuation, formatting, or instructions. The list is treated as transcription hints, not as prompt context.
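The list rules above can be sketched as a small builder. The 100-entry limit and the "null or [] clears the boost" behavior come from this page; trimming whitespace and dropping duplicates are design choices of this sketch, and the envelope shape and `keyterms_update` name are assumptions.

```python
MAX_KEYTERMS = 100  # limit stated on this page


def keyterms_update(terms):
    """Build a session.update that replaces the keyterms list.

    Pass None or an empty list to clear the boost.
    """
    if terms is None:
        cleaned = None
    else:
        cleaned = []
        for term in terms:
            term = term.strip()
            # Skip blanks and duplicates so they don't use up the limit.
            if term and term not in cleaned:
                cleaned.append(term)
        if len(cleaned) > MAX_KEYTERMS:
            raise ValueError(f"at most {MAX_KEYTERMS} keyterms are accepted")
    return {"type": "session.update", "session": {"input": {"keyterms": cleaned}}}


boost = keyterms_update([" AssemblyAI ", "FedRAMP", "FedRAMP"])
clear = keyterms_update(None)
```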
Turn detection
Turn detection and interruption handling are intelligent and semantic out of the box: back-channels like “uh-huh” don’t interrupt, but “wait, stop” does. This works with no configuration. See Turn detection and interruptions for the full explanation. If you do want to customize sensitivity or disable barge-in, override the underlying VAD knobs under session.input.turn_detection. All fields are optional. Only include the ones you want to change. Settings can be updated mid-session.
| Field | Type | Default | Description |
|---|---|---|---|
| vad_threshold | float | 0.5 | Speech detection sensitivity. Range 0.0–1.0. Lower is more sensitive to speech. |
| min_silence | integer | 1000 | Minimum silence to consider a confident end-of-turn, in milliseconds. Range 50–10000. Must be less than max_silence. |
| max_silence | integer | 3000 | Maximum silence before forcing end-of-turn, in milliseconds. Range 50–10000. Must be greater than min_silence. |
| interrupt_response | boolean | true | Whether user speech interrupts the agent. Set false to disable barge-in. |
A silence value outside [50, 10000] ms, or an update with min_silence >= max_silence, returns a session.error with code invalid_value and the change is ignored.
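These constraints can be checked on the client before sending. The sketch below uses the ranges and defaults from the table; checking min_silence against max_silence on the merged values (current settings plus overrides) is an assumption about how the server evaluates partial updates, and `validate_turn_detection` is a hypothetical helper name.

```python
DEFAULTS = {
    "vad_threshold": 0.5,
    "min_silence": 1000,   # milliseconds
    "max_silence": 3000,   # milliseconds
    "interrupt_response": True,
}


def validate_turn_detection(overrides: dict, current: dict = None) -> dict:
    """Return `overrides` unchanged if the resulting settings are valid."""
    base = DEFAULTS if current is None else current
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown turn_detection fields: {sorted(unknown)}")
    merged = {**base, **overrides}  # effective settings after the update
    if not 0.0 <= merged["vad_threshold"] <= 1.0:
        raise ValueError("vad_threshold must be within 0.0 to 1.0")
    for key in ("min_silence", "max_silence"):
        if not 50 <= merged[key] <= 10000:
            raise ValueError(f"{key} must be within 50 to 10000 ms")
    if merged["min_silence"] >= merged["max_silence"]:
        raise ValueError("min_silence must be less than max_silence")
    if not isinstance(merged["interrupt_response"], bool):
        raise ValueError("interrupt_response must be a boolean")
    return overrides


ok = validate_turn_detection({"min_silence": 500})
```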
When to override the defaults
Defaults are tuned for general conversational use. Override only when you have a specific user experience to solve for:
| Use case | Suggested change |
|---|---|
| Snappy back-and-forth (sales, short Q&A) | Lower min_silence (e.g. 500) for faster turn end |
| Accessibility, slower speakers, elderly callers | Raise min_silence (e.g. 1500) to avoid cutting off mid-thought |
| Noisy environments (call centers, drive-thru, factory floor) | Raise vad_threshold (e.g. 0.6–0.7) so background noise doesn’t trigger speech detection |
| Whisper-quiet input or distant mics | Lower vad_threshold (e.g. 0.3–0.4) |
| Fixed-script IVR flows where you don’t want the user to interrupt | interrupt_response: false |
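The table above can be captured as named presets that are merged into a single update. The preset names, the specific values picked from inside the suggested ranges (0.65 and 0.35), and the envelope shape are assumptions of this sketch.

```python
# Overrides taken from the table above; only the changed fields are included.
PRESETS = {
    "snappy": {"min_silence": 500},
    "patient": {"min_silence": 1500},
    "noisy": {"vad_threshold": 0.65},        # inside the suggested 0.6 to 0.7
    "quiet_input": {"vad_threshold": 0.35},  # inside the suggested 0.3 to 0.4
    "fixed_script": {"interrupt_response": False},
}


def turn_detection_update(*preset_names: str) -> dict:
    """Merge presets left to right; later presets win on conflicting keys."""
    settings = {}
    for name in preset_names:
        settings.update(PRESETS[name])
    return {
        "type": "session.update",
        "session": {"input": {"turn_detection": settings}},
    }


# A noisy drive-thru running a fixed script with barge-in disabled.
drive_thru = turn_detection_update("noisy", "fixed_script")
```

Because only the overridden fields are sent, everything else keeps its default or previously set value.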