session.update is the central configuration surface for the Voice Agent API. Everything that shapes how your agent sounds and behaves lives here: its personality, its greeting, the tools it can call, how it interprets user speech, and what voice it speaks in. Getting this right is the single highest-leverage thing you do when building a voice agent.

Send a session.update as your first WebSocket message, then again at any later point to update most fields. Some fields can only be set in the first update; see Mutability after session.ready below.

Here’s a full configuration showing every available field:
{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a friendly support agent. Keep responses under 2 sentences.",
    "greeting": "Hi! How can I help you today?",
    "tools": [],
    "input": {
      "format": { "encoding": "audio/pcm" },
      "keyterms": ["AssemblyAI", "Universal"],
      "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 1000,
        "max_silence": 3000,
        "interrupt_response": true
      }
    },
    "output": {
      "voice": "ivy",
      "format": { "encoding": "audio/pcm" },
      "volume": 100
    }
  }
}
Every field is optional. Include only what you want to set or change. Jump to any section below for details.
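Because every field is optional, a client can assemble the message from only the settings it wants to change. The following is a minimal Python sketch of that idea, not an official SDK: `build_session_update` and its parameters are hypothetical names, and it covers only a few of the fields shown above.

```python
import json
from typing import Any, Optional


def build_session_update(
    system_prompt: Optional[str] = None,
    greeting: Optional[str] = None,
    voice: Optional[str] = None,
    volume: Optional[int] = None,
) -> str:
    """Serialize a session.update containing only the fields that were passed."""
    session: dict[str, Any] = {}
    if system_prompt is not None:
        session["system_prompt"] = system_prompt
    if greeting is not None:
        session["greeting"] = greeting

    # Nested output settings are added only when at least one is set.
    output: dict[str, Any] = {}
    if voice is not None:
        output["voice"] = voice
    if volume is not None:
        output["volume"] = volume
    if output:
        session["output"] = output

    return json.dumps({"type": "session.update", "session": session})


# First message of the session: set everything that must be fixed up front.
first_message = build_session_update(
    system_prompt="You are a friendly support agent.",
    greeting="Hi! How can I help?",
    voice="ivy",
)

# A later message that changes a single field.
later_message = build_session_update(volume=60)
```

The returned strings are what you would send as WebSocket text frames; the transport itself is left out of the sketch.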

Mutability after session.ready

The first session.update you send before session.ready initializes the session. After session.ready, only a subset of fields can be changed. Changing an immutable field raises a session.error with code immutable_field, and the rejected change is ignored.
Field | Mutable after session.ready?
session.system_prompt | Yes. Send a new prompt at any time to change the agent’s behavior on the next turn.
session.input.turn_detection | Yes. Adjust VAD thresholds, silence windows, and barge-in on the fly.
session.input.keyterms | Yes. Replace the keyterms list at any time. The new list takes effect on the next user utterance.
session.output.volume | Yes. Adjust playback volume on the fly.
session.greeting | No. Raises immutable_field. The greeting is spoken once at session start.
session.output.voice | No. Raises immutable_field. The voice is bound to the TTS connection at session start.
session.output.format | No. Raises immutable_field. The output audio encoding is fixed for the session.
Other fields (session.tools, session.input.format) are also accepted in subsequent session.update messages and don’t raise immutable_field.
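One way to avoid immutable_field errors is to drop the locked fields client-side before sending a mid-session update. The sketch below assumes the three immutable fields listed above are the complete set; `strip_immutable` and `IMMUTABLE_PATHS` are hypothetical names, not part of the API.

```python
import copy

# Fields this section lists as rejected after session.ready.
IMMUTABLE_PATHS = [("greeting",), ("output", "voice"), ("output", "format")]


def strip_immutable(session: dict) -> tuple[dict, list[str]]:
    """Return a copy of `session` without immutable fields, plus the names removed."""
    cleaned = copy.deepcopy(session)
    removed = []
    for path in IMMUTABLE_PATHS:
        parent = cleaned
        # Walk down to the object that would hold the final key.
        for key in path[:-1]:
            parent = parent.get(key)
            if not isinstance(parent, dict):
                break
        else:
            if path[-1] in parent:
                del parent[path[-1]]
                removed.append("session." + ".".join(path))
    # Drop an "output" object left empty by the removals.
    if cleaned.get("output") == {}:
        del cleaned["output"]
    return cleaned, removed


cleaned, removed = strip_immutable(
    {"greeting": "Hi", "system_prompt": "Be brief.", "output": {"voice": "ivy", "volume": 50}}
)
```

Here `cleaned` keeps the prompt and volume, and `removed` names the fields that were dropped, so the caller can log them instead of discovering the problem through a session.error.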

System prompt

Set the agent’s personality and behavior. Can be updated mid-session with another session.update.
{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a friendly support agent. Keep responses under 2 sentences. Never make up information."
  }
}
Tips for voice-first prompts:
  • Ban specific phrases: "Never say 'Certainly' or 'Absolutely'"
  • Enforce brevity: "Max 2 sentences per turn"
  • Tell the agent when to use each tool
Prompt engineering for voice agents is iterative. Test your prompt in a live conversation, listen to how the agent responds, and refine it until the tone, length, and behavior match your use case. See the Prompting guide for patterns that improve instruction following, conversationality, and voice output quality.

Greeting

What the agent says at the start of the conversation, spoken aloud. If omitted, the agent waits silently for the user to speak first.
{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a helpful assistant.",
    "greeting": "Hi there! How can I help you today?"
  }
}
The greeting is sent straight to the TTS engine. It is not run through the LLM first. Whatever string you put here is exactly what the user hears, word for word.
This has two consequences:
  • Don’t write a meta-greeting like “Greet the user warmly and ask how you can help”. The TTS will literally speak that sentence. Write the exact words you want spoken.
  • The system prompt’s tone/persona rules do not get applied to the greeting. Match the tone yourself.
What makes a good greeting:
  • Short. One sentence. Voice users get impatient with anything longer.
  • Conversational. The greeting sets the user’s expectation for how the agent talks. If the greeting is stiff, they’ll match stiff.
  • Set the agent’s role in the first few words: “Hey, this is Riley from Acme support. What’s going on?”
  • Avoid “How can I help you today?”. It’s a chatbot tell. Try “What’s on your mind?”, “How can I help?”, or jump straight to a purposeful prompt: “Got your callback request. What did you want to go over?”
Omit the greeting entirely when:
  • You want the agent to listen first (e.g. inbound call where the caller speaks first).
  • Some out-of-band channel handles the greeting (e.g. an IVR menu).
The greeting is immutable after session.ready. Set it on your first session.update. Trying to change it mid-session returns immutable_field.
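Since the greeting is spoken verbatim and cannot be changed later, any personalization has to be resolved before the first session.update is sent. A small sketch of that, assuming a hypothetical `first_update` helper and a caller name your application already knows:

```python
import json
from string import Template

# The exact words to speak; $name is filled in client-side before sending.
GREETING = Template("Hey $name, this is Riley from Acme support. What's going on?")


def first_update(system_prompt: str, caller_name: str) -> str:
    """Build the first session.update with a fully rendered greeting."""
    # substitute() raises KeyError on a missing placeholder, so an unrendered
    # "$name" can never reach the TTS engine and be read aloud.
    greeting = GREETING.substitute(name=caller_name)
    return json.dumps(
        {
            "type": "session.update",
            "session": {"system_prompt": system_prompt, "greeting": greeting},
        }
    )


message = first_update("You are a helpful assistant.", "Sam")
```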

Voice and audio format

Choose a voice and configure the input/output audio encoding under session.output and session.input. The encoding determines the sample rate. Input and output encodings can differ. Both default to audio/pcm (24 kHz) if omitted.
session.output.voice, session.output.format, and session.greeting are locked after the first session.update is applied. Later attempts to change them return an immutable_field error. Set the voice and output format on your first session.update. session.output.volume is the exception: it can be updated mid-session.
{
  "type": "session.update",
  "session": {
    "input": {
      "format": { "encoding": "audio/pcm" }
    },
    "output": {
      "voice": "ivy",
      "format": { "encoding": "audio/pcm" }
    }
  }
}
See Voices for the voice catalog and Audio format for supported encodings and playback details.

Output volume

Adjust the playback volume of the agent’s speech via session.output.volume. Accepts a number from 0 (silent) to 100 (loudest). If omitted, the voice plays at its native level.
{
  "type": "session.update",
  "session": {
    "output": {
      "volume": 60
    }
  }
}
Unlike voice and format, volume can be updated mid-session. Send another session.update with a new value at any time; the change applies to subsequent reply.audio chunks.

Key terms

If your conversation involves rare or domain-specific words, like a person’s name, company name, or product, add them to session.input.keyterms to improve transcription accuracy. This works like a word boost, biasing the speech recognition model toward these terms.
{
  "type": "session.update",
  "session": {
    "input": {
      "keyterms": ["AssemblyAI", "Universal", "Ozempic"]
    }
  }
}
session.input.keyterms accepts up to 100 strings. The list can be updated mid-session by sending another session.update; the new list replaces the previous one and takes effect on the next user utterance. Passing null or [] clears the boost.
What to add:
  • Brand, product, and feature names that aren’t in everyday English ("AssemblyAI", "Lemur", "Ozempic").
  • Proper nouns specific to this caller: their full name, their account holder’s name, the agent’s name if it’s unusual.
  • Domain jargon that the model might otherwise transcribe as a common-word homophone ("hemochromatosis", "polysomnography").
  • Acronyms you want spelled in full ("PCI DSS", "FedRAMP").
What NOT to add:
  • Common English words. Each entry boosts that string, and adding common words at the same weight as your rare terms dilutes the boost.
  • Whole sentences or phrases. The boost is per-term, not per-phrase.
  • Punctuation, formatting, or instructions. The list is treated as transcription hints, not as prompt context.
When to refresh the list: when the conversation enters a new domain. If an inbound support call switches from “billing” to “technical”, swap the toolset and the keyterms list together.
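As a rough illustration of swapping the list when the domain changes, the sketch below builds a replacement keyterms update and enforces the 100-string limit client-side. `keyterms_update` and the per-domain lists are hypothetical, and the trimming and case-insensitive de-duplication are choices of this sketch rather than documented server behavior.

```python
import json

MAX_KEYTERMS = 100  # limit stated in this section


def keyterms_update(terms: list[str]) -> str:
    """Build a session.update that replaces the current keyterms list."""
    cleaned: list[str] = []
    seen: set[str] = set()
    for term in terms:
        term = term.strip()
        # Skip blanks and case-insensitive duplicates.
        if term and term.lower() not in seen:
            seen.add(term.lower())
            cleaned.append(term)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(f"{len(cleaned)} keyterms given; the limit is {MAX_KEYTERMS}")
    return json.dumps(
        {"type": "session.update", "session": {"input": {"keyterms": cleaned}}}
    )


# Hypothetical per-domain lists, swapped when the call changes topic.
DOMAIN_TERMS = {
    "billing": ["AssemblyAI", "PCI DSS"],
    "technical": ["AssemblyAI", "Universal", "FedRAMP"],
}

switch_to_technical = keyterms_update(DOMAIN_TERMS["technical"])
```

Each call sends the full list for the new domain, because a new list replaces the previous one rather than merging with it.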

Turn detection

Turn detection and interruption handling are intelligent and semantic out of the box: back-channels like “uh-huh” don’t interrupt, but “wait, stop” does. This works with no configuration. See Turn detection and interruptions for the full explanation.

If you do want to customize sensitivity or disable barge-in, override the underlying VAD knobs under session.input.turn_detection. All fields are optional. Only include the ones you want to change. Settings can be updated mid-session.
{
  "type": "session.update",
  "session": {
    "input": {
      "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 1000,
        "max_silence": 3000,
        "interrupt_response": true
      }
    }
  }
}
Field | Type | Default | Description
vad_threshold | float | 0.5 | Speech detection sensitivity. Range 0.0 to 1.0. Lower is more sensitive to speech.
min_silence | integer | 1000 | Minimum silence to consider a confident end-of-turn, in milliseconds. Range 50 to 10000. Must be less than max_silence.
max_silence | integer | 3000 | Maximum silence before forcing end-of-turn, in milliseconds. Range 50 to 10000. Must be greater than min_silence.
interrupt_response | boolean | true | Whether user speech interrupts the agent. Set false to disable barge-in.
Setting a silence value outside [50, 10000] ms, or setting min_silence >= max_silence, returns a session.error with code invalid_value, and the change is ignored.
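The documented ranges can also be checked client-side before sending, which surfaces mistakes earlier than a session.error. This is a sketch under assumptions: `validate_turn_detection` is a hypothetical helper, and it compares min_silence with max_silence only when both appear in the same update, since how the server treats a single changed bound is not described here.

```python
SILENCE_RANGE_MS = (50, 10000)   # documented range for both silence fields
VAD_RANGE = (0.0, 1.0)           # documented range for vad_threshold


def validate_turn_detection(td: dict) -> list[str]:
    """Return a list of problems; an empty list means the documented checks pass."""
    problems = []

    vad = td.get("vad_threshold")
    if vad is not None and not VAD_RANGE[0] <= vad <= VAD_RANGE[1]:
        problems.append(f"vad_threshold {vad} is outside {VAD_RANGE}")

    for name in ("min_silence", "max_silence"):
        value = td.get(name)
        if value is not None and not SILENCE_RANGE_MS[0] <= value <= SILENCE_RANGE_MS[1]:
            problems.append(f"{name} {value} is outside {SILENCE_RANGE_MS} ms")

    # Ordering is checked only when both bounds are present in this update.
    if "min_silence" in td and "max_silence" in td:
        if td["min_silence"] >= td["max_silence"]:
            problems.append("min_silence must be less than max_silence")

    return problems


ok = validate_turn_detection({"vad_threshold": 0.5, "min_silence": 1000, "max_silence": 3000})
bad = validate_turn_detection({"min_silence": 3000, "max_silence": 3000})
```

A non-empty result is a reason to fix the values rather than send the update; the server remains the authority on what it accepts.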

When to override the defaults

Defaults are tuned for general conversational use. Override only when you have a specific user experience to solve for:
Use case | Suggested change
Snappy back-and-forth (sales, short Q&A) | Lower min_silence (e.g. 500) for faster turn end
Accessibility, slower speakers, elderly callers | Raise min_silence (e.g. 1500) to avoid cutting off mid-thought
Noisy environments (call centers, drive-thru, factory floor) | Raise vad_threshold (e.g. 0.6 to 0.7) so background noise doesn’t trigger speech detection
Whisper-quiet input or distant mics | Lower vad_threshold (e.g. 0.3 to 0.4)
Fixed-script IVR flows where you don’t want the user to interrupt | interrupt_response: false
Setting min_silence or max_silence explicitly turns off the server’s adaptive end-of-turn behavior for the rest of the session. The values you provide are used as-is. If you want the agent to keep adapting to the user’s pace, leave both fields unset.
interrupt_response: false disables barge-in entirely. The user can yell “stop” and the agent will keep speaking. Only set this for use cases where you genuinely do not want the user to be able to interrupt, like reading legal disclosures.
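The suggested changes from the use-case table can be kept as named overrides and sent as partial updates. The preset names and the `preset_update` helper below are hypothetical, and the numbers are the example values from the table, not tuned recommendations.

```python
import json

# Example overrides taken from the use-case table; names are illustrative only.
PRESETS = {
    "snappy": {"min_silence": 500},
    "patient": {"min_silence": 1500},
    "noisy": {"vad_threshold": 0.6},
    "quiet_mic": {"vad_threshold": 0.4},
    "no_barge_in": {"interrupt_response": False},
}


def preset_update(name: str) -> str:
    """Build a session.update that changes only the fields in the named preset."""
    return json.dumps(
        {
            "type": "session.update",
            "session": {"input": {"turn_detection": PRESETS[name]}},
        }
    )


message = preset_update("snappy")
```

Note that the "snappy" and "patient" presets set min_silence explicitly, so per the note above they turn off the server’s adaptive end-of-turn behavior for the rest of the session.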
Not sure which turn detection settings to use? Check out the quick start configurations for turn detection to find the best preset for your use case.