Turn detection and barge-in are on by default. Decisions are semantic, based on what the user actually said, not just silence or volume. You don’t need to wire anything in or configure it on your end. This page covers the behaviors you can rely on and the events that go with them.
Semantic interruptions
While the agent is speaking, the API classifies user speech as either a back-channel or a true interruption.
Back-channeling
Short verbal acknowledgements that show the user is engaged but not trying to take the floor. The agent keeps speaking and the API does not emit an interruption. Examples:
- “Uh-huh”
- “Okay”
- “Awesome”
- “Yeah, makes sense”
- “Mm-hmm”
True interruptions
Phrases that signal the user wants the agent to stop. The API immediately interrupts the agent. Examples:
- “Wait, stop”
- “Sorry, that’s not right”
- “Okay, wait a minute”
- “Hold on”
When a true interruption occurs, the API emits:
- `reply.done` with `status: "interrupted"`
- `transcript.agent` with `interrupted: true` and `text` trimmed to what the user actually heard before being cut off.
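
A minimal sketch of how a client might react to these two events. It assumes each server message is a JSON object whose event name arrives under a `type` key; that key, the `PlaybackState` class, and the `handle_event` function are hypothetical names for illustration, not part of the documented API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PlaybackState:
    """Hypothetical client-side state; not part of the API."""
    queued_audio: list = field(default_factory=list)   # audio chunks not yet played
    agent_history: list = field(default_factory=list)  # what the user actually heard


def handle_event(raw: str, state: PlaybackState) -> None:
    event = json.loads(raw)
    kind = event.get("type")  # assumption: the event name is carried in "type"

    if kind == "reply.done" and event.get("status") == "interrupted":
        # The agent was cut off, so discard audio that has not been played yet.
        state.queued_audio.clear()
    elif kind == "transcript.agent":
        # When "interrupted" is true, "text" is already trimmed by the API
        # to what the user heard, so it can be stored as-is.
        state.agent_history.append(
            {
                "text": event.get("text", ""),
                "interrupted": bool(event.get("interrupted")),
            }
        )
```

Storing the trimmed `text` rather than the full generated reply keeps the client's conversation log consistent with what the user actually heard.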
Semantic turn detection
The API also decides when the user has finished a turn based on what they said, not just on silence. Instead of waiting for a fixed silence window, it uses the meaning of the user’s speech to decide whether they’re done, so the agent doesn’t cut you off mid-thought, and doesn’t sit on long pauses after you’ve clearly finished. A typical user turn produces:
- `input.speech.started` when the user begins speaking.
- `transcript.user.delta` events with partial transcripts as the user keeps talking.
- `input.speech.stopped` when the turn is detected as ended.
- `transcript.user` with the final transcript.
- `reply.started` as the agent begins generating a response.
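
The event sequence above can be followed on the client with a small state tracker. This is a sketch under assumptions: events are taken to be dictionaries with the event name under `type`, and both `transcript.user.delta` and `transcript.user` are assumed to carry their transcript in a `text` field, with each delta holding the latest full partial rather than an increment. `UserTurn` and `apply_event` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserTurn:
    """Hypothetical view of one user turn, built from server events."""
    speaking: bool = False
    partial: str = ""
    final: Optional[str] = None


def apply_event(turn: UserTurn, event: dict) -> UserTurn:
    kind = event.get("type")  # assumption: the event name is carried in "type"

    if kind == "input.speech.started":
        # A new turn begins, so start from a clean state.
        return UserTurn(speaking=True)
    if kind == "transcript.user.delta":
        # Assumption: "text" holds the latest partial transcript.
        turn.partial = event.get("text", "")
    elif kind == "input.speech.stopped":
        # The API has decided the turn is over.
        turn.speaking = False
    elif kind == "transcript.user":
        # Assumption: "text" holds the final transcript.
        turn.final = event.get("text", "")
    return turn
```

If deltas turn out to be incremental rather than cumulative, the `transcript.user.delta` branch would append to `turn.partial` instead of replacing it.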
Smart end-of-turn for structured answers
When the agent has just asked for something specific (a phone number, an email, a date, a name, a yes/no, a digit sequence, a choice from a list), turn detection adapts to the kind of answer expected. The agent doesn’t cut users off mid-answer when they pause inside a long string of digits, and it doesn’t sit waiting for more once a clean answer has clearly landed. You’ll notice this most on:
- Phone numbers, account numbers, and other digit sequences
- Email addresses
- Dates
- Yes/no questions and choices from a short list
- Names, places, companies, and other named entities
Adaptive endpointing
End-of-turn timing adapts to the user during a conversation. When a user tends to pause mid-thought, the agent learns to give them more room before responding. When the user speaks more crisply, the agent responds more quickly. You don’t have to tune anything; it just gets better as the conversation goes on. This is on by default. The moment you set `min_silence` or `max_silence` explicitly in `session.input.turn_detection`, the server respects your values and stops adapting for the rest of the session.
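
Because the switch from adaptive to fixed timing is easy to trigger by accident, a client might check its own settings before sending them. The helper below is a hypothetical pre-flight check that mirrors the rule stated above; it assumes that the mere presence of either key counts as setting it explicitly.

```python
def disables_adaptive_endpointing(turn_detection: dict) -> bool:
    """Return True if these settings would stop adaptive endpointing.

    Hypothetical helper: per the rule above, setting min_silence or
    max_silence makes the server use fixed values for the session.
    """
    return "min_silence" in turn_detection or "max_silence" in turn_detection
```

For example, a configuration that only sets `vad_threshold` leaves adaptation on, while one that adds `max_silence` turns it off.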
Configuration
Semantic turn detection and interruption handling are on by default and tuned for typical conversational use cases. For most agents, the right move is to leave them alone. If you do need to adjust sensitivity, for example to be more patient in a noisy environment or to disable barge-in entirely, you can override the underlying VAD knobs via `session.input.turn_detection`:
| Field | Description |
|---|---|
| `vad_threshold` | Speech detection sensitivity (0.0–1.0). Lower = more sensitive to speech. |
| `min_silence` | Minimum silence to consider a confident end-of-turn, in milliseconds. |
| `max_silence` | Maximum silence before forcing end-of-turn, in milliseconds. |
| `interrupt_response` | Whether user speech can interrupt the agent. Set `false` to disable barge-in. |
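
A sketch of how these overrides might be assembled into a JSON message. The nesting under `session.input.turn_detection` and the four field names come from the table above; the `"session.update"` message type and the `build_turn_detection_update` helper are assumptions for illustration, so check the session configuration reference for the actual message shape.

```python
import json
from typing import Optional


def build_turn_detection_update(
    vad_threshold: Optional[float] = None,
    min_silence: Optional[int] = None,   # milliseconds
    max_silence: Optional[int] = None,   # milliseconds
    interrupt_response: Optional[bool] = None,
) -> str:
    """Build a JSON message overriding only the fields that were passed."""
    turn_detection = {}

    if vad_threshold is not None:
        # The table gives the valid range as 0.0 to 1.0.
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be between 0.0 and 1.0")
        turn_detection["vad_threshold"] = vad_threshold
    if min_silence is not None:
        turn_detection["min_silence"] = min_silence
    if max_silence is not None:
        turn_detection["max_silence"] = max_silence
    if interrupt_response is not None:
        turn_detection["interrupt_response"] = interrupt_response

    message = {
        "type": "session.update",  # assumption: hypothetical message type
        "session": {"input": {"turn_detection": turn_detection}},
    }
    return json.dumps(message)
```

Omitted fields are left out of the payload rather than sent as `null`, so defaults stay in effect. For example, `build_turn_detection_update(interrupt_response=False)` disables barge-in without touching `min_silence` or `max_silence`.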
If the agent keeps interrupting itself, the microphone is picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.