Overview
Traditional streaming models (like Universal-Streaming-English and Universal-Streaming-Multilingual) emit partials word-by-word as audio is processed. Each word can be revised until it’s marked final, after which it is immutable. Universal-3 Pro takes a different approach: an early partial is emitted after 750ms of continuous speech, followed by silence-based partials as the speaker pauses. For long, uninterrupted turns, you can also opt in to continuous_partials for a steady stream of mid-turn partials regardless of silence. Each partial is a stable, fully transcribed segment rather than an incremental word-by-word update. All words in partials are marked word_is_final: false.
While the segments are stable, the final end-of-turn transcript may differ from earlier partials as the model refines its output with full turn context. On the final end-of-turn transcript, all words are marked word_is_final: true.
Universal-3 Pro partials
U3 Pro emits partials in three ways:
Early partial (during continuous speech)
When a speaker is talking continuously without pausing, an early partial is emitted after 750ms of continuous speech by default. This provides a transcript signal for barge-in and speculative inference without waiting for the speaker to pause. If the first attempt returns empty, it retries at 1500ms, 2250ms, and so on until text is produced. Only one early partial is emitted per turn, but additional partials can be produced when the speaker pauses.
You can tune the early partial timing with the interruption_delay connection parameter (range: 0–1000ms, default: 500ms). The server adds a minimum of 300ms on top, so interruption_delay: 0 produces the first partial at ~300ms and interruption_delay: 500 (default) produces it at ~800ms. Lower values give faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. See Tuning early partial timing for full configuration details.
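As a rough sketch of the interruption_delay arithmetic above, the helper below estimates when the first early partial could arrive. The function and constant names are hypothetical; only the 0–1000ms range, the 500ms default, and the 300ms server-side minimum come from the text. Because the server adds at least 300ms, treat the result as a lower-bound estimate rather than a guarantee.

```python
# Hypothetical helper; only the numeric limits come from the text above.
SERVER_MIN_OVERHEAD_MS = 300            # minimum the server adds on top
INTERRUPTION_DELAY_RANGE_MS = (0, 1000)  # allowed range for interruption_delay


def estimate_first_partial_ms(interruption_delay: int = 500) -> int:
    """Return the approximate earliest time (in ms) of the first early partial."""
    low, high = INTERRUPTION_DELAY_RANGE_MS
    if not low <= interruption_delay <= high:
        raise ValueError(f"interruption_delay must be in [{low}, {high}] ms")
    return interruption_delay + SERVER_MIN_OVERHEAD_MS
```

Calling estimate_first_partial_ms(0) and estimate_first_partial_ms(500) reproduces the ~300ms and ~800ms figures quoted above.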
Silence-based partials
U3 Pro uses a punctuation-based turn detection system. When the speaker pauses, the model transcribes the buffered audio and checks for terminal punctuation (. ? !):
- No terminal punctuation: a partial is emitted (end_of_turn: false) and the turn continues waiting until speech continues or max_turn_silence is reached.
- Terminal punctuation found: the turn ends and is emitted as a final transcript (end_of_turn: true).
| Parameter | Default | Description |
|---|---|---|
| min_turn_silence | 100 ms | Silence duration before a speculative end-of-turn (EOT) check fires. |
| max_turn_silence | 1000 ms | Maximum silence before a turn is forced to end. |
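The pause handling described above can be sketched as a small decision function. This is a client-side mental model only, not the server's implementation: the function name and return labels are hypothetical, and the defaults mirror the table.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def classify_pause(transcript: str, silence_ms: int,
                   min_turn_silence: int = 100,
                   max_turn_silence: int = 1000) -> str:
    """Return "wait", "partial", or "final" for a pause of silence_ms."""
    if silence_ms < min_turn_silence:
        return "wait"      # pause too short for an end-of-turn check
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        return "final"     # terminal punctuation: end_of_turn is true
    if silence_ms >= max_turn_silence:
        return "final"     # forced end of turn after max_turn_silence
    return "partial"       # end_of_turn is false, the turn keeps waiting
```

For example, a pause of 200ms after "My number is" would be classified as a partial, while the same pause after "I want to continue." ends the turn.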
Continuous partials
For long, uninterrupted turns, such as a caller reading out a credit card number, address, or giving a detailed explanation, silence-based partials may not fire often enough for your downstream consumers (LLMs, UI, eager inference) to keep up. Enable the continuous_partials connection parameter to receive a steady stream of non-final transcripts approximately every 3 seconds while speech continues, regardless of silence.
Each continuous partial is non-final (end_of_turn: false) and covers the full transcript for the current turn so far. The first early partial at 750ms is unaffected, and the final end-of-turn transcript is emitted as normal once the turn ends.
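Because each partial covers the full transcript of the current turn so far, a consumer should replace its in-progress text rather than append to it. A minimal sketch of that bookkeeping, assuming a hypothetical TranscriptBuffer class that is fed the transcript text and the end_of_turn flag of each event:

```python
class TranscriptBuffer:
    """Keeps committed turns plus the latest text of the turn in progress."""

    def __init__(self) -> None:
        self.committed = []      # final transcripts, one entry per turn
        self.in_progress = ""    # latest partial for the current turn

    def on_turn(self, transcript: str, end_of_turn: bool) -> None:
        if end_of_turn:
            # The final transcript supersedes every earlier partial.
            self.committed.append(transcript)
            self.in_progress = ""
        else:
            # Partials are cumulative: replace, never append.
            self.in_progress = transcript

    def display_text(self) -> str:
        parts = self.committed + ([self.in_progress] if self.in_progress else [])
        return " ".join(parts)
```

Replacing on every partial also means that any refinement the model makes in the final transcript shows up without extra reconciliation logic.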
You can also toggle continuous_partials on or off mid-session via UpdateConfiguration.
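A minimal sketch of building such a message is shown here. It assumes UpdateConfiguration is sent as a JSON object tagged with a "type" field; that message shape and the build_update_configuration helper are assumptions for illustration, so check the API reference for the exact format.

```python
import json


def build_update_configuration(**settings) -> str:
    """Serialize a hypothetical UpdateConfiguration message as JSON text."""
    if not settings:
        raise ValueError("pass at least one setting to update")
    # Assumption: the message is a JSON object with a "type" discriminator.
    return json.dumps({"type": "UpdateConfiguration", **settings})


# Turn continuous partials on before a long read-out, then off again.
enable_msg = build_update_configuration(continuous_partials=True)
disable_msg = build_update_configuration(continuous_partials=False)
```

The resulting strings would be sent over the already-open streaming connection with whichever WebSocket client you use.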
Real-world example
This is an example of what partials might look like in a voice agent scenario where a user is reading out a credit card number:
Speculative inference
When receiving a Universal-3 Pro Turn event, use end_of_turn to determine the transcript’s finality:
If end_of_turn is false (partial):
- Begin speculative (also known as eager or preemptive) LLM inference
- Warm TTS or prepare context
If end_of_turn is true (final):
- Commit to full LLM + TTS generation
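The branching above can be sketched as a small handler that drafts a reply on partials and commits on the final transcript. Everything here is illustrative: SpeculativeRunner is a hypothetical class, generate stands in for your LLM call, and the "transcript" field name on the event is an assumption (only end_of_turn is named in the text).

```python
from typing import Callable, Optional


class SpeculativeRunner:
    """Starts work on partials and reuses it only if the final text matches."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate              # stand-in for an LLM call
        self._draft_input: Optional[str] = None
        self._draft_output: Optional[str] = None

    def on_turn(self, event: dict) -> Optional[str]:
        text = event["transcript"]             # assumed field name
        if not event["end_of_turn"]:
            # Partial: run speculative inference and keep it as a draft.
            self._draft_input = text
            self._draft_output = self._generate(text)
            return None
        # Final: reuse the draft only when the transcript did not change.
        if text == self._draft_input:
            reply = self._draft_output
        else:
            reply = self._generate(text)
        self._draft_input = self._draft_output = None
        return reply
```

Since the final transcript may differ from earlier partials, a real comparison would probably normalize case and punctuation before deciding whether the draft can be reused.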
Advantages over traditional streaming partials
Fewer, higher-quality partials
Traditional streaming models emit a partial on every audio frame, frequently revising previous words. U3 Pro emits an early partial after 750ms of continuous speech, then additional partials during silence periods. Each one is processed by a full speech LLM rather than a lightweight RNN-T. This means fewer partials, but ones that are significantly more accurate. Each partial contains the full cumulative transcription of the turn so far. Earlier words may be refined as more context becomes available, but updates only happen during silence (not on every frame), so the transcript is typically far more stable than with traditional streaming models.
Last word accuracy
Speculative inference based on noisy partials can be counterproductive. The final word of a turn often carries critical semantic weight:
- “I want to cancel.” (word-by-word, wrong)
- “I want to continue.” (full partial after silence, correct)
ms earlier.
Latency performance
After silence detection:
| Metric | Latency |
|---|---|
| P50 inference latency | ~121ms |
| P90 inference latency | ~212ms |
Latency vs. entity splitting trade-off
Setting min_turn_silence too low can split entities like phone numbers and emails for speakers with slow speech patterns. The accuracy is often still high enough for LLMs to piece together the broken entities, but we recommend testing carefully with your use case.
Setting max_turn_silence too low can have the same impact, but entity splitting is less likely since max_turn_silence is typically a greater value than min_turn_silence and a forced end-of-turn only triggers when terminal punctuation is not detected. If you have audio with very long (>1s) pauses and you’d like to keep these utterances as a single turn, you may want to increase max_turn_silence to avoid cutting off the turn too early.
Tuning for your use case
For eager LLM inference on partials, we recommend setting min_turn_silence to 100 (default value).
You can also adjust min_turn_silence (and potentially max_turn_silence for very long pauses) for specific moments mid-stream via UpdateConfiguration. For example, increase it when a caller is about to read out a credit card, ID number, or email, and you’d prefer to wait for a longer silence before checking for an end of turn and potentially emitting a partial.
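One way to organize this is to keep a small table of silence settings per conversation phase and pick from it before the caller's next utterance. In the sketch below the default values come from the parameter table, while the raised values, the constant names, and the silence_settings helper are purely illustrative assumptions, not recommendations.

```python
# Defaults taken from the parameter table in this document.
DEFAULT_SILENCE = {"min_turn_silence": 100, "max_turn_silence": 1000}

# Illustrative values only; tune these against your own audio.
ENTITY_SILENCE = {"min_turn_silence": 400, "max_turn_silence": 2000}


def silence_settings(expecting_entity: bool) -> dict:
    """Pick the silence thresholds to apply before the next utterance."""
    chosen = ENTITY_SILENCE if expecting_entity else DEFAULT_SILENCE
    if chosen["min_turn_silence"] > chosen["max_turn_silence"]:
        raise ValueError("min_turn_silence should not exceed max_turn_silence")
    return dict(chosen)  # copy so callers cannot mutate the presets
```

The returned dictionary would form the body of an UpdateConfiguration message when the agent asks for a card number, ID, or email, and again when it returns to normal conversation.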