Build a custom voice agent on AssemblyAI’s streaming WebSocket with turn detection, barge-in, and interruption handling without a third-party orchestrator.
This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS, with no LiveKit, Pipecat, or other orchestrator in the loop.
Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages. This guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.
If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead. Semantic interruption handling is built in there.
Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| min_turn_silence | 100 ms | Silence before a speculative end-of-turn check fires. |
| max_turn_silence | 1000 ms | Maximum silence before forcing the turn to end. |
Lower values produce faster transcripts at the cost of occasional entity splits across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.
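As a rough illustration of where these two parameters go, the sketch below builds a connection URL with them as query parameters. The base URL, the `sample_rate` parameter, the helper name `build_streaming_url`, and the idea that the settings are passed as query parameters are all assumptions made for this example; confirm them against the Universal-3 Pro overview before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL for the streaming WebSocket; verify against the
# Universal-3 Pro overview before use.
BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(min_turn_silence_ms: int = 100,
                        max_turn_silence_ms: int = 1000,
                        sample_rate: int = 16000) -> str:
    """Return a WebSocket URL carrying the turn-detection settings.

    Hypothetical helper: it only assembles a query string.
    """
    # Local sanity check, not an API rule: the speculative check should
    # not wait longer than the forced end of turn.
    if min_turn_silence_ms > max_turn_silence_ms:
        raise ValueError("min_turn_silence must not exceed max_turn_silence")
    params = {
        "sample_rate": sample_rate,               # assumed parameter name
        "min_turn_silence": min_turn_silence_ms,  # milliseconds
        "max_turn_silence": max_turn_silence_ms,  # milliseconds
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Slightly more patient settings than the defaults, e.g. for callers who
# pause while reading out numbers.
url = build_streaming_url(min_turn_silence_ms=200, max_turn_silence_ms=1200)
```

Raising both values trades latency for fewer entities split across turns; lowering them does the opposite.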
While the agent is speaking, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.
The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Reset the filter once the agent has finished speaking.
    import json
    import string
    import time
    import websocket  # websocket-client; used by the connection code (not shown)
    from urllib.parse import urlencode  # used when building the connection URL (not shown)

    # "yes" / "no" deliberately omitted - in a booking flow a bare "yes"
    # is a real confirmation. Edit for your domain.
    BACKCHANNELS = frozenset({
        "mhm", "mm", "mmhm", "mmhmm", "uh", "uhhuh", "huh",
        "um", "umm", "uhm", "er", "erm", "hmm", "hm", "ah", "oh",
        "yeah", "yep", "yup", "okay", "ok", "right", "alright", "gotcha",
    })
    _PUNCT_STRIP = str.maketrans("", "", string.punctuation)

    MIN_WORDS = 2         # Utterances below this are treated as filler
    FILTER_GRACE_S = 1.0  # Keep filtering for 1s after agent stops speaking

    # These flags are owned by your TTS layer.
    agent_speaking = False
    last_speaking_at = 0.0


    def _is_all_backchannel(text: str) -> bool:
        """True when the transcript is non-empty and every token is a backchannel."""
        tokens = text.lower().translate(_PUNCT_STRIP).split()
        return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


    def _should_suppress_interrupt(text: str) -> bool:
        """True when this Turn should be dropped instead of triggering barge-in."""
        global last_speaking_at
        now = time.monotonic()
        if agent_speaking:
            # Refresh the timestamp so the grace window starts when speech ends.
            last_speaking_at = now
        elif now - last_speaking_at > FILTER_GRACE_S:
            # Agent is silent and the grace window has passed: filter is off.
            return False
        word_count = len(text.split())
        return word_count < MIN_WORDS or _is_all_backchannel(text)


    def on_message(ws, message):
        # handle_user_turn and handle_partial are your application's callbacks.
        data = json.loads(message)
        msg_type = data.get("type")
        if msg_type == "Turn":
            transcript = data.get("transcript", "")
            end_of_turn = data.get("end_of_turn", False)
            if _should_suppress_interrupt(transcript):
                # Backchannel during agent speech - drop it.
                return
            if end_of_turn:
                handle_user_turn(transcript)  # send to LLM
            else:
                handle_partial(transcript)
        elif msg_type == "SpeechStarted":
            if agent_speaking:
                # Don't interrupt yet - wait for the Turn event,
                # which is gated by _should_suppress_interrupt above.
                return
            # Otherwise: normal barge-in path.
How it works:
While the agent is speaking (plus a 1-second grace window after speech ends), each Turn event is checked.
_should_suppress_interrupt returns True when the transcript has fewer than MIN_WORDS tokens or when every token is a known backchannel. Either condition drops the event.
Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
SpeechStarted is gated through the same Turn-level check rather than firing barge-in directly. This prevents a race where a backchannel triggers SpeechStarted before the gating logic sees the transcript.
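To make the behaviour described above concrete, here is a small, self-contained sketch that classifies a few sample transcripts. It is not the guide's code: `should_suppress` is a hypothetical pure variant that takes the agent state as arguments instead of reading module-level flags, and the backchannel set is trimmed for brevity.

```python
import string

# Trimmed copy of the backchannel set, for illustration only.
BACKCHANNELS = frozenset({"mhm", "um", "uh", "yeah", "okay", "ok", "right"})
_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2
FILTER_GRACE_S = 1.0


def should_suppress(text: str, agent_speaking: bool,
                    seconds_since_agent_spoke: float) -> bool:
    """Hypothetical pure form of the filter: state is passed in explicitly."""
    if not agent_speaking and seconds_since_agent_spoke > FILTER_GRACE_S:
        return False  # filter inactive: every Turn goes through
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    all_backchannel = bool(tokens) and all(t in BACKCHANNELS for t in tokens)
    return len(text.split()) < MIN_WORDS or all_backchannel


samples = [
    # (transcript, agent_speaking, seconds since agent stopped)
    ("Mhm.", True, 0.0),                     # one word: dropped
    ("Yeah, okay.", True, 0.0),              # all backchannel: dropped
    ("yeah I'd like the suite", True, 0.0),  # real content: passes
    ("mhm", False, 0.5),                     # inside grace window: dropped
    ("mhm", False, 2.0),                     # grace window over: passes
]
results = [should_suppress(*sample) for sample in samples]
for (text, _, _), suppressed in zip(samples, results):
    print(f"{text!r:28} -> {'drop' if suppressed else 'pass through'}")
```

Passing the state in as arguments also makes the filter easy to unit-test without a running TTS layer.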
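The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios; edit the set for your use case. MIN_WORDS = 2 is a reasonable default, but note that the length check runs independently of the set: while the filter is active, any one-word utterance is dropped, including a bare “yes”. Set MIN_WORDS = 1 if single-word confirmations must get through during agent speech. Raise it if you see frequent two-word filler that the set does not cover (for example “oh sure”) slipping through; pairs such as “uh okay” and “yeah right” are already caught because both words are in the set.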
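One way to experiment with these two knobs is to bundle them into a small configuration object. The sketch below is an assumption-laden illustration: `BackchannelFilter`, `is_filler`, and the shortened default set are names invented for this example, and the grace-window logic is left out so that only the set and the word threshold vary.

```python
import string
from dataclasses import dataclass

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)

# Shortened default set, for illustration only.
DEFAULT_BACKCHANNELS = frozenset(
    {"mhm", "um", "uh", "oh", "yeah", "okay", "ok", "right"}
)


@dataclass(frozen=True)
class BackchannelFilter:
    """Hypothetical wrapper around the two tunables discussed above."""
    backchannels: frozenset = DEFAULT_BACKCHANNELS
    min_words: int = 2

    def is_filler(self, text: str) -> bool:
        tokens = text.lower().translate(_PUNCT_STRIP).split()
        all_backchannel = bool(tokens) and all(
            t in self.backchannels for t in tokens
        )
        return len(text.split()) < self.min_words or all_backchannel


default = BackchannelFilter()
# Let one-word confirmations through; known backchannels are still dropped.
confirmations = BackchannelFilter(min_words=1)
# Treat any utterance under three words as filler.
strict = BackchannelFilter(min_words=3)
# Keep the threshold but teach the set a new filler word.
extended = BackchannelFilter(backchannels=DEFAULT_BACKCHANNELS | {"sure"})

for name, flt in [("default", default), ("confirmations", confirmations),
                  ("strict", strict), ("extended", extended)]:
    print(name, {t: flt.is_filler(t) for t in ("yes", "mhm", "oh sure")})
```

Extending the set is the narrower change: it targets specific filler phrases, whereas raising `min_words` also drops short genuine commands.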
If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling. See LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.