Overview

This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS, with no LiveKit, Pipecat, or other orchestrator in the loop. Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages. This guide focuses on the voice-agent loop and on handling barge-in and interruptions correctly.

If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead. Semantic interruption handling is built into that API.

Quickstart

A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:
import json
import websocket
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT = (
    "wss://streaming.assemblyai.com/v3/ws?" + urlencode(CONNECTION_PARAMS)
)


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Begin":
        print(f"Session started: {data.get('id')}")

    elif msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)
        if end_of_turn:
            # Final transcript - send to your LLM
            print(f"Final: {transcript}")
        else:
            # Partial - optionally start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif msg_type == "SpeechStarted":
        # User started speaking - interrupt the agent's TTS if it's playing
        print("Speech detected, interrupt agent if speaking")

    elif msg_type == "Termination":
        print("Session ended")


ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)
ws.run_forever()
For the full message protocol, including all event fields, audio framing, and termination, see the Universal-3 Pro message sequence reference.
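The quickstart above only consumes events; it does not send any audio. As a rough sketch of the framing step, the helper below slices raw audio into fixed-duration chunks. It assumes 16-bit mono PCM at the quickstart's sample rate, and the 50 ms chunk duration and the `pcm_chunks` name are illustrative choices, not values taken from the protocol reference.

```python
from typing import Iterator

SAMPLE_RATE = 16000     # matches sample_rate in the quickstart
BYTES_PER_SAMPLE = 2    # assumption: 16-bit mono PCM
CHUNK_MS = 50           # assumption: illustrative chunk duration


def pcm_chunks(pcm: bytes, sample_rate: int = SAMPLE_RATE,
               chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM; the last slice may be shorter."""
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# 120 ms of silence -> two full 50 ms chunks plus a 20 ms remainder.
silence = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE * 120 // 1000)
sizes = [len(chunk) for chunk in pcm_chunks(silence)]
print(sizes)  # [1600, 1600, 640]
```

With the websocket-client library used in the quickstart, each chunk could then be sent from an on_open handler (typically on a separate thread) with ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY). Check the message sequence reference for the audio format and chunk sizes the endpoint actually expects.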

Turn detection

Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:
Parameter          Default    Description
min_turn_silence   100 ms     Silence before a speculative end-of-turn check fires.
max_turn_silence   1000 ms    Maximum silence before forcing the turn to end.
Lower values return final transcripts sooner, at the cost of occasionally splitting a single entity across two turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.
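To build intuition for how the two thresholds interact, here is a toy client-side model. It is not the service's actual algorithm: it assumes the speculative check ends the turn only when the transcript so far ends in terminal punctuation, and the `turn_should_end` function and `TERMINAL_PUNCTUATION` tuple are hypothetical names used only for illustration.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")  # assumption: what counts as "complete"


def turn_should_end(transcript: str, silence_ms: int,
                    min_turn_silence: int = 100,
                    max_turn_silence: int = 1000) -> bool:
    """Toy model of the two silence thresholds (not the real server logic)."""
    if silence_ms >= max_turn_silence:
        # Hard ceiling: the turn ends regardless of punctuation.
        return True
    if silence_ms >= min_turn_silence:
        # Speculative check: end only if the text looks like a finished sentence.
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return False


print(turn_should_end("Book a table for two.", 150))   # True
print(turn_should_end("My number is 555", 300))        # False
print(turn_should_end("My number is 555", 1000))       # True
print(turn_should_end("Yes.", 50))                     # False
```

In this model, the third call shows how a pause in the middle of a dictated number can be cut into its own turn once the pause reaches max_turn_silence, which is the entity-split trade-off described above.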

Interruption handling

While the agent is speaking, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt. The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Stop filtering shortly after the agent has finished speaking; the example below uses a 1-second grace window.
import json
import string
import time
import websocket
from urllib.parse import urlencode


# "yes" / "no" deliberately omitted - in a booking flow a bare "yes"
# is a real confirmation. Edit for your domain.
BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm",
    "uh", "uhhuh", "huh",
    "um", "umm", "uhm",
    "er", "erm",
    "hmm", "hm",
    "ah", "oh",
    "yeah", "yep", "yup",
    "okay", "ok",
    "right", "alright", "gotcha",
})

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2          # Utterances below this are treated as filler
FILTER_GRACE_S = 1.0   # Keep filtering for 1s after agent stops speaking


# These flags are owned by your TTS layer.
agent_speaking = False
last_speaking_at = 0.0


def _is_all_backchannel(text: str) -> bool:
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


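def _should_suppress_interrupt(text: str) -> bool:
    """Return True if this Turn should be dropped as a backchannel."""
    global last_speaking_at
    now = time.monotonic()
    if agent_speaking:
        # Refresh the timestamp so the grace window is measured from the
        # last Turn seen while the agent was still speaking.
        last_speaking_at = now
    elif now - last_speaking_at > FILTER_GRACE_S:
        # Agent is idle and the grace window has expired: never filter.
        return False

    word_count = len(text.split())
    return word_count < MIN_WORDS or _is_all_backchannel(text)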


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if _should_suppress_interrupt(transcript):
            # Backchannel during agent speech - drop it.
            return

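        # handle_user_turn / handle_partial are application-defined hooks.
        # If the agent is still speaking at this point, this is a real
        # barge-in: stop TTS playback before handling the turn.
        if end_of_turn:
            handle_user_turn(transcript)  # send to LLM
        else:
            handle_partial(transcript)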

    elif msg_type == "SpeechStarted":
        if agent_speaking:
            # Don't interrupt yet - wait for the Turn event,
            # which is gated by _should_suppress_interrupt above.
            return
        # Otherwise the agent is idle: treat this as the normal start of
        # user speech (nothing to interrupt).
How it works:
  1. While the agent is speaking (plus a 1-second grace window after speech ends), each Turn event is checked.
  2. _should_suppress_interrupt returns True when the transcript has fewer than MIN_WORDS tokens or when every token is a known backchannel. Either condition drops the event.
  3. Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
  4. SpeechStarted is gated through the same Turn-level check rather than firing barge-in directly. This prevents a race where a backchannel triggers SpeechStarted before the gating logic sees the transcript.
The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case. MIN_WORDS = 2 is a reasonable default. Two-word filler made up entirely of listed tokens (“uh okay”, “yeah right”) is already dropped by the backchannel check; raise MIN_WORDS, or extend the set, if short filler containing unlisted words slips through.
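The same rules can be packaged without module-level globals, which makes them easier to unit-test. The sketch below is one possible variant, not the guide's reference implementation: the `BackchannelFilter` class, its method names, and the injectable `clock` are hypothetical, the backchannel set is trimmed for brevity, and the grace window is measured from the moment the TTS layer reports that playback stopped rather than from the last Turn seen during speech.

```python
import string
import time
from typing import Callable

# Trimmed subset for the example; use the full set in practice.
BACKCHANNELS = frozenset({"mhm", "uh", "um", "yeah", "okay", "ok", "right"})
_PUNCT = str.maketrans("", "", string.punctuation)


class BackchannelFilter:
    """Decides whether a Turn transcript should be ignored as filler."""

    def __init__(self, min_words: int = 2, grace_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_words = min_words
        self.grace_s = grace_s
        self._clock = clock
        self._speaking = False
        self._stopped_at = float("-inf")  # agent has never spoken yet

    def agent_started(self) -> None:
        """Call when TTS playback begins."""
        self._speaking = True

    def agent_stopped(self) -> None:
        """Call when TTS playback ends; starts the grace window."""
        self._speaking = False
        self._stopped_at = self._clock()

    def should_suppress(self, text: str) -> bool:
        in_window = self._speaking or (
            self._clock() - self._stopped_at <= self.grace_s
        )
        if not in_window:
            return False
        tokens = text.lower().translate(_PUNCT).split()
        all_filler = bool(tokens) and all(t in BACKCHANNELS for t in tokens)
        return len(tokens) < self.min_words or all_filler


# Drive the filter with a fake clock so the grace window is deterministic.
now = [0.0]
flt = BackchannelFilter(clock=lambda: now[0])
flt.agent_started()
print(flt.should_suppress("Mhm, yeah."))               # True
print(flt.should_suppress("yeah I'd like the suite"))  # False
flt.agent_stopped()
now[0] = 0.5
print(flt.should_suppress("okay"))                     # True (inside grace window)
now[0] = 2.0
print(flt.should_suppress("okay"))                     # False (window expired)
```

Injecting the clock is a design choice for testability; in production the default time.monotonic is used and the TTS layer calls agent_started and agent_stopped around playback.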
If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling. See LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.
Three turn-detection presets cover most voice-agent use cases:
# Fast - quick confirmations, IVR, yes/no questions
fast_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 800,
}

# Balanced - most voice agent conversations (recommended)
balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Patient - entity dictation, complex instructions
patient_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 200,
    "max_turn_silence": 2000,
}
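One way to use these presets is to select one by name when building the connection URL. The sketch below reuses the endpoint and query parameters from the quickstart; the `PRESETS` mapping and the `build_endpoint` helper are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"  # endpoint from the quickstart

# Silence thresholds (ms) copied from the three presets above.
PRESETS = {
    "fast": {"min_turn_silence": 100, "max_turn_silence": 800},
    "balanced": {"min_turn_silence": 100, "max_turn_silence": 1000},
    "patient": {"min_turn_silence": 200, "max_turn_silence": 2000},
}


def build_endpoint(preset: str = "balanced", sample_rate: int = 16000) -> str:
    """Return the streaming URL with the chosen preset's query parameters."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    params = {
        "sample_rate": sample_rate,
        "speech_model": "u3-rt-pro",
        **PRESETS[preset],
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_endpoint("patient"))
# wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=u3-rt-pro&min_turn_silence=200&max_turn_silence=2000
```

The returned string can be passed to websocket.WebSocketApp in place of API_ENDPOINT from the quickstart.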
For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.