For a description of each message field, refer to our Turn
object explanation.
Understanding transcript vs utterance
Before walking through the message sequence, it’s important to understand the difference between the transcript and utterance fields:
- `transcript`: The full transcript of the current turn up to this point in time.
- `utterance`: Only populated on the `end_of_turn: true` message, where it always equals `transcript`. On all other Turn messages, `utterance` is an empty string (`""`).
Key takeaway: For Universal-3 Pro Streaming, you can always use `transcript`. The `utterance` field provides no additional information beyond what `transcript` already contains. This field exists for API consistency with Universal-Streaming, where utterance boundaries can fire independently of turn boundaries, typically for the purposes of eager LLM inference.

- Early partial + silence-based partials: an early partial is emitted after 750ms of continuous speech to provide a fast transcript signal for barge-in and speculative inference. After that, additional partials are emitted when the speaker pauses.
- Formatting is built in: `turn_is_formatted` is `true` on end-of-turn transcripts. There is no separate formatting step.
- Punctuation-based turn detection: turns end when terminal punctuation (`.`, `?`, `!`) is detected, not based on a confidence threshold. `end_of_turn_confidence` is always `1` when triggered by terminal punctuation.
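The relationship between `transcript` and `utterance` described above can be sketched in a few lines of Python. This is an illustration only: the two dictionaries are hand-written samples modeled on this walkthrough, not captured server output, and the helper names `turn_text` and `is_end_of_turn` are hypothetical.

```python
from typing import Any, Dict


def turn_text(message: Dict[str, Any]) -> str:
    """Return the text of a Turn message by reading `transcript` only."""
    return message.get("transcript", "")


def is_end_of_turn(message: Dict[str, Any]) -> bool:
    """True only for the Turn message that closes the turn."""
    return message.get("type") == "Turn" and message.get("end_of_turn") is True


# Hand-written sample payloads (illustrative, not captured output).
partial_turn = {
    "type": "Turn",
    "end_of_turn": False,
    "transcript": "My name is—",
    "utterance": "",
}
final_turn = {
    "type": "Turn",
    "end_of_turn": True,
    "transcript": "My name is Sonny.",
    "utterance": "My name is Sonny.",
}

for msg in (partial_turn, final_turn):
    # `utterance` never carries text that `transcript` lacks.
    print(is_end_of_turn(msg), repr(turn_text(msg)), repr(msg["utterance"]))
```

Reading only `transcript` keeps the handler identical for partial and final messages, with `end_of_turn` as the single switch between them.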
My name is Sonny.
The speaker pauses briefly mid-sentence (after “is”) before 750ms of continuous speech has elapsed, so the first partial is a silence-based partial rather than an early partial. The speaker then finishes the sentence, producing a final end-of-turn transcript.
If the speaker had spoken continuously for 750ms or more without pausing, an early partial would have been emitted first. See Turn Detection and Partials for details on early partials.
Session initialization
When the session begins, you receive a Begin message with the session ID and expiration time.
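A minimal sketch of reading the Begin message follows. It assumes the session ID and expiration time arrive in fields named `id` and `expires_at`, with the expiration given as a Unix timestamp in seconds. Both field names, the timestamp unit, and the sample values are assumptions made for illustration, so check them against the Turn object reference before relying on them.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def parse_begin(raw: str) -> Tuple[str, datetime]:
    """Extract the session ID and expiration time from a Begin message."""
    message = json.loads(raw)
    if message.get("type") != "Begin":
        raise ValueError(f"expected a Begin message, got {message.get('type')!r}")
    # `id` and `expires_at` are assumed field names (see the note above).
    expires = datetime.fromtimestamp(message["expires_at"], tz=timezone.utc)
    return message["id"], expires


# Hypothetical payload for illustration only.
sample_begin = json.dumps(
    {"type": "Begin", "id": "example-session-id", "expires_at": 1700000000}
)
session_id, expires = parse_begin(sample_begin)
print(session_id, expires.isoformat())
```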
Speech detected
Before any Turn messages are sent, the server sends a SpeechStarted message indicating that speech has been detected. The timestamp field indicates when the speech was detected, in milliseconds relative to the beginning of the audio stream. The confidence field is the confidence score that speech has started.
SpeechStarted is only emitted when the model produces a transcript.
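Because `timestamp` is expressed in milliseconds from the start of the audio stream, a client that works in seconds needs a small conversion. The sketch below uses the 1216ms value from this example. The `confidence` value is a made-up placeholder and `speech_start_seconds` is a hypothetical helper name.

```python
from typing import Any, Dict


def speech_start_seconds(message: Dict[str, Any]) -> float:
    """Convert a SpeechStarted timestamp (milliseconds) to seconds."""
    if message.get("type") != "SpeechStarted":
        raise ValueError(f"expected SpeechStarted, got {message.get('type')!r}")
    return message["timestamp"] / 1000


# Hand-written sample; the confidence value is a placeholder.
sample_speech_started = {"type": "SpeechStarted", "timestamp": 1216, "confidence": 0.9}
offset = speech_start_seconds(sample_speech_started)
print(f"speech detected {offset} s into the stream")
```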
Partial transcript
The speaker says “My name is” and pauses briefly. Because the speaker has stopped talking but no terminal punctuation has been detected, Universal-3 Pro Streaming emits a partial transcript. Notice that:

- `end_of_turn` is `false`: the turn has not ended yet.
- `turn_is_formatted` is `false`: this is not a finalized transcript.
- `end_of_turn_confidence` is `0`: no terminal punctuation detected.
- All words have `word_is_final: false`, so the transcript may be revised in the final message.
- The `transcript` ends with an em dash (—), indicating the utterance is incomplete.
- The `utterance` field is an empty string because the turn has not ended. Use `transcript` to access the current partial text.
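The properties listed above can be expressed as a small check, which may be handy in client tests. This is a sketch under assumptions: the per-word list is assumed to live in a `words` field whose entries carry `text` and `word_is_final`, the sample payload is hand-written, and `partial_violations` is a hypothetical helper.

```python
from typing import Any, Dict, List


def partial_violations(message: Dict[str, Any]) -> List[str]:
    """List which documented partial-transcript properties a message breaks."""
    problems = []
    if message.get("end_of_turn") is not False:
        problems.append("end_of_turn should be false")
    if message.get("turn_is_formatted") is not False:
        problems.append("turn_is_formatted should be false")
    if message.get("end_of_turn_confidence") != 0:
        problems.append("end_of_turn_confidence should be 0")
    if message.get("utterance") != "":
        problems.append("utterance should be empty")
    # `words` is an assumed field name for the per-word list.
    if any(word.get("word_is_final") for word in message.get("words", [])):
        problems.append("all words should have word_is_final: false")
    return problems


# Hand-written sample modeled on this walkthrough.
sample_partial = {
    "type": "Turn",
    "end_of_turn": False,
    "turn_is_formatted": False,
    "end_of_turn_confidence": 0,
    "transcript": "My name is—",
    "utterance": "",
    "words": [
        {"text": "My", "word_is_final": False},
        {"text": "name", "word_is_final": False},
        {"text": "is—", "word_is_final": False},
    ],
}
print(partial_violations(sample_partial))
```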
End of turn (Final transcript)
The speaker continues and says “Sonny.”, completing the sentence with a period. Universal-3 Pro Streaming detects the terminal punctuation and ends the turn with a fully formatted final transcript. Notice how the final transcript differs from the partial:

- `end_of_turn` is now `true`: the turn has ended.
- `turn_is_formatted` is `true`: this is a finalized, formatted transcript.
- `end_of_turn_confidence` is `1`: terminal punctuation triggered the end of turn.
- All words now have `word_is_final: true`, so the transcript is final and will not be revised.
- The word timestamps and confidences have been refined compared to the partial.
- The `utterance` field now contains the complete finalized text.
- The incomplete “is—” from the partial has been resolved to “is” and “Sonny.” in the final transcript.
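Since a partial may be revised by the final message, one reasonable client-side pattern is to treat partial text as provisional and replace it once the end-of-turn transcript arrives. The sketch below illustrates that idea. `TurnBuffer` is a hypothetical class, not part of any SDK, and the messages are hand-written samples.

```python
from typing import Any, Dict, List


class TurnBuffer:
    """Keep committed (final) turns separate from the provisional partial."""

    def __init__(self) -> None:
        self.committed: List[str] = []
        self.pending: str = ""

    def on_turn(self, message: Dict[str, Any]) -> None:
        if message.get("end_of_turn"):
            # The final transcript supersedes whatever partial was shown.
            self.committed.append(message["transcript"])
            self.pending = ""
        else:
            self.pending = message["transcript"]

    def display(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


buffer = TurnBuffer()
buffer.on_turn({"type": "Turn", "end_of_turn": False, "transcript": "My name is—"})
after_partial = buffer.display()
buffer.on_turn({"type": "Turn", "end_of_turn": True, "transcript": "My name is Sonny."})
after_final = buffer.display()
print(after_partial, "->", after_final)
```

Replacing the pending text rather than appending to it avoids showing the provisional “is—” alongside the finalized sentence.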
Keep alive
KeepAlive messages are not required. By default, sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached.
KeepAlive is only relevant if you have configured the inactivity_timeout connection parameter, which closes the session after a period of no audio or messages being sent. If you are using inactivity_timeout and want to keep the session open during periods where no audio is being sent, send a KeepAlive message to reset the inactivity timer.
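A minimal sketch of the client side of this follows. It assumes the KeepAlive message is a JSON object containing only a `type` field, by analogy with the other message types on this page; that payload shape is an assumption. The scheduling helper and its `safety_margin` parameter are hypothetical and simply send well before the configured timeout would elapse.

```python
import json


def keepalive_payload() -> str:
    """Serialize a KeepAlive message (payload shape is an assumption)."""
    return json.dumps({"type": "KeepAlive"})


def should_send_keepalive(
    seconds_since_last_send: float,
    inactivity_timeout: float,
    safety_margin: float = 0.5,
) -> bool:
    """Send once a fraction of the inactivity timeout has passed in silence."""
    return seconds_since_last_send >= inactivity_timeout * safety_margin


# With a 10-second timeout, a KeepAlive is due after 5 seconds of silence.
print(should_send_keepalive(3.0, 10.0), should_send_keepalive(5.0, 10.0))
print(keepalive_payload())
```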
Session termination
To end a session, the client must send a Terminate message. The server then responds with a Termination message containing the total audio and session durations, and closes the connection.
Client sends:
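As a minimal sketch, assuming the Terminate message is a JSON object carrying only a `type` field and that the Termination response exposes the two durations as `audio_duration_seconds` and `session_duration_seconds`, the exchange could be handled like this. The field names and the sample values are assumptions for illustration.

```python
import json


def terminate_payload() -> str:
    """Serialize the client's Terminate request (assumed payload shape)."""
    return json.dumps({"type": "Terminate"})


def summarize_termination(raw: str) -> str:
    """Format the durations reported in the server's Termination message."""
    message = json.loads(raw)
    if message.get("type") != "Termination":
        raise ValueError(f"expected Termination, got {message.get('type')!r}")
    # Both duration field names are assumptions (see the note above).
    audio = message["audio_duration_seconds"]
    session = message["session_duration_seconds"]
    return f"audio={audio}s session={session}s"


# Hypothetical server response for illustration only.
sample_termination = json.dumps(
    {"type": "Termination", "audio_duration_seconds": 3, "session_duration_seconds": 5}
)
print(terminate_payload())
print(summarize_termination(sample_termination))
```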
Summary
The complete message flow for this example is:

- Begin: session initialized
- SpeechStarted: speech detected at 1216ms
- Turn (partial): speaker pauses mid-sentence; `end_of_turn: false`, `turn_is_formatted: false`
- Turn (final): speaker finishes with terminal punctuation; `end_of_turn: true`, `turn_is_formatted: true`
- Termination: session ended
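The flow above can be replayed through a small labeling function to show how a client might tell the five messages apart. The payloads are trimmed, hand-written stand-ins for the real messages and `label` is a hypothetical helper. Only `type` and `end_of_turn` are inspected.

```python
from typing import Any, Dict, List


def label(message: Dict[str, Any]) -> str:
    """Name a message, splitting Turn messages into partial and final."""
    kind = message.get("type", "Unknown")
    if kind == "Turn":
        return "Turn (final)" if message.get("end_of_turn") else "Turn (partial)"
    return kind


# Trimmed stand-ins for the messages in this example.
flow: List[Dict[str, Any]] = [
    {"type": "Begin"},
    {"type": "SpeechStarted", "timestamp": 1216},
    {"type": "Turn", "end_of_turn": False, "transcript": "My name is—"},
    {"type": "Turn", "end_of_turn": True, "transcript": "My name is Sonny."},
    {"type": "Termination"},
]

labels = [label(message) for message in flow]
print(labels)
```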
Comparison with Universal Streaming
| Behavior | Universal-3 Pro Streaming | Universal Streaming |
|---|---|---|
| Partial frequency | One early partial after 750ms of continuous speech, plus at most one per silence period | Every audio frame (word-by-word) |
| Formatting | Built in to every end-of-turn transcript | Separate turn_is_formatted message when format_turns=true |
| Turn detection | Punctuation-based (min_turn_silence / max_turn_silence) | Confidence-based (end_of_turn_confidence_threshold) |
| end_of_turn_confidence | Always 1 when triggered by punctuation | Varies based on model confidence |
| Words in partials | All word_is_final: false | Mix of true and false as words are finalized incrementally |