Pipecat
View Pipecat’s AssemblyAI STT plugin documentation.
Overview
This guide covers integrating AssemblyAI’s Universal-3 Pro Streaming (u3-rt-pro) speech-to-text model into Pipecat voice agents. Universal-3 Pro Streaming is optimized for real-time audio utterances, typically under 10 seconds, with efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy, with native multilingual code switching and prompting support, and delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names, all with sub-300ms time-to-complete-transcript latency.
Key features
- Two-mode turn detection: Choose between Pipecat-controlled (VAD + Smart Turn) or AssemblyAI’s built-in turn detection (STT-based)
- Keyterms boosting: Improve recognition of specific words and names
- Dynamic parameter updates: Change configuration mid-conversation without reconnection
- Speaker diarization: Identify and label different speakers in multi-party conversations
Examples
Complete working examples are available in the Pipecat repository:
- voice-assemblyai.py: Voice agent with U3-Pro using Pipecat-based turn detection (VAD + Smart Turn)
- voice-assemblyai-turn-detection.py: Voice agent with U3-Pro using AssemblyAI’s built-in turn detection (STT-based)
.env file:
The vad_force_turn_endpoint parameter controls which turn detection mode is
used. It defaults to True (Pipecat mode), which sends a ForceEndpoint
message to AssemblyAI when the local VAD detects silence. Set it to False to
use AssemblyAI’s built-in turn detection instead. Choosing the right mode is
critical for balancing responsiveness and turn accuracy in your voice agent.
Installation
Install Pipecat with all required dependencies:
- assemblyai: AssemblyAI U3-Pro STT service
- openai: OpenAI LLM service (used in the examples)
- cartesia: Cartesia TTS service (used in the examples)
Authentication
Set your API keys in a .env file:
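As a minimal sketch of checking that the keys are present before the agent starts, the snippet below inspects the process environment once the .env file has been loaded into it. The variable names ASSEMBLYAI_API_KEY, OPENAI_API_KEY, and CARTESIA_API_KEY and the helper missing_keys are assumptions for illustration; use whatever names your .env file actually defines.

```python
import os

# Assumed variable names (not confirmed by this guide); adjust to your .env file.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```

Failing fast on a missing key tends to be easier to debug than a connection error that surfaces later inside the pipeline.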
Two-mode turn detection
Within Pipecat, you have two distinct approaches to turn detection with AssemblyAI’s U3-Pro model.
Pipecat mode (default, recommended)
When to use: Most voice agent applications requiring responsive interruptions.
- VAD + Smart Turn analyzer controls when the user is done speaking
- ForceEndpoint message sent to AssemblyAI on VAD silence detection
- max_turn_silence automatically synchronized with min_turn_silence
- Best for low-latency, responsive voice agents
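The synchronization of the two silence settings can be illustrated with a small standalone sketch. It assumes that "synchronized" means max_turn_silence is set equal to min_turn_silence when Pipecat mode is active. TurnTiming and effective_timing are hypothetical names for illustration, not part of Pipecat's API, and the 1000 ms value is only an example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TurnTiming:
    """Hypothetical container for the two silence settings, in milliseconds."""
    min_turn_silence: int
    max_turn_silence: int


def effective_timing(timing: TurnTiming, vad_force_turn_endpoint: bool = True) -> TurnTiming:
    """Return the timing that would apply in the selected mode."""
    if vad_force_turn_endpoint:
        # Pipecat mode (assumption): max_turn_silence follows min_turn_silence.
        return replace(timing, max_turn_silence=timing.min_turn_silence)
    # AssemblyAI turn detection mode: both values are kept as configured.
    return timing


configured = TurnTiming(min_turn_silence=100, max_turn_silence=1000)
print(effective_timing(configured))         # Pipecat mode
print(effective_timing(configured, False))  # AssemblyAI turn detection mode
```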
AssemblyAI’s built-in turn detection (STT mode)
When to use: When you want AssemblyAI’s built-in turn detection to control turn endings. This mode is configurable within the settings. See Configuring turn detection to understand how it works.
- AssemblyAI’s built-in turn detection controls when the user is done speaking
- All timing parameters are respected as configured
- Emits UserStartedSpeakingFrame/UserStoppedSpeakingFrame
- Uses SpeechStarted events for fast barge-in
- Only available with u3-rt-pro (other models require Pipecat mode)
AssemblyAI’s built-in turn detection uses the STT model’s understanding of
speech patterns to determine turn boundaries, rather than relying on local VAD
silence detection.
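The choice between the two modes, including the restriction that AssemblyAI's built-in turn detection is only available with u3-rt-pro, can be sketched as a small helper. The function name resolve_vad_force_turn_endpoint and the constant below are hypothetical; only the vad_force_turn_endpoint flag and the model names come from this guide.

```python
# Models that support AssemblyAI's built-in turn detection, per this guide.
STT_TURN_DETECTION_MODELS = {"u3-rt-pro"}


def resolve_vad_force_turn_endpoint(speech_model: str, use_assemblyai_turn_detection: bool) -> bool:
    """Return the vad_force_turn_endpoint value for the requested mode."""
    if not use_assemblyai_turn_detection:
        # Pipecat mode (default): local VAD + Smart Turn decide when the turn ends.
        return True
    if speech_model not in STT_TURN_DETECTION_MODELS:
        raise ValueError(
            f"{speech_model} requires Pipecat mode; "
            "AssemblyAI turn detection is only available with u3-rt-pro"
        )
    # AssemblyAI turn detection mode (STT-based).
    return False


print(resolve_vad_force_turn_endpoint("u3-rt-pro", use_assemblyai_turn_detection=True))
```

The returned value is what you would pass as vad_force_turn_endpoint when constructing the STT service.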
Keyterms boosting
Improve recognition of specific words or names:
Dynamic parameter updates
Change configuration mid-conversation without reconnection. See stt-assemblyai.py for a complete working example.
Speaker diarization
Identify different speakers in multi-party conversations.
Basic diarization
Speaker labels ("A", "B", "C") are included in final transcripts and logged.
With custom formatting
Format transcripts with speaker labels for LLM context:
| Style | Format string |
|---|---|
| XML | <{speaker}>{text}</{speaker}> |
| Markdown | **{speaker}**: {text} |
| Bracket | [{speaker}] {text} |
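To see what each format string in the table produces, the sketch below fills the templates with Python's str.format. It assumes the {speaker} and {text} placeholders are substituted the same way inside Pipecat. SPEAKER_FORMATS and format_transcript are illustrative names, not Pipecat API.

```python
# Format strings copied from the table above.
SPEAKER_FORMATS = {
    "xml": "<{speaker}>{text}</{speaker}>",
    "markdown": "**{speaker}**: {text}",
    "bracket": "[{speaker}] {text}",
}


def format_transcript(template: str, speaker: str, text: str) -> str:
    """Fill a speaker format template with a speaker label and transcript text."""
    return template.format(speaker=speaker, text=text)


for style, template in SPEAKER_FORMATS.items():
    print(style, "->", format_transcript(template, "A", "Hello there"))
# xml -> <A>Hello there</A>
# markdown -> **A**: Hello there
# bracket -> [A] Hello there
```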
Daily transport
For production deployments, use the Daily transport for WebRTC-based real-time audio/video communication.
Parameters reference
U3-Pro specific parameters
The speech model to use. Defaults to "u3-rt-pro" (Universal-3 Pro Streaming).
min_turn_silence: Milliseconds of silence before ending a turn when the model is confident. Set to 100 for best latency. (Formerly min_end_of_turn_silence_when_confident, which is deprecated but still supported with a warning.)
max_turn_silence: Maximum silence before a forced turn end. Auto-synced in Pipecat mode; respected in AssemblyAI’s built-in turn detection (STT mode).
keyterms_prompt: List of terms to boost recognition for. Cannot be used with prompt.
Enable speaker diarization.
prompt: Custom transcription instructions. Cannot be used with keyterms_prompt. Prompting is currently a beta feature: see Prompting for more information.
continuous_partials: Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When disabled, only one early partial is emitted near turn start. When enabled (default), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. The first partial (at 750ms) is unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. Defaults to True in Pipecat, but False when using the API directly. See Continuous partials for details.
interruption_delay: How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay=0 → ~300ms effective, interruption_delay=500 → ~800ms effective). See Tuning early partial timing for details.
General parameters
Your AssemblyAI API key.
vad_force_turn_endpoint: True for Pipecat mode; False for AssemblyAI’s built-in turn detection (STT mode).
Template string for formatting speaker labels (e.g., "[{speaker}] {text}").
Running your agent
Development mode (local audio)
Production with Daily
Deploy to Daily.co rooms using the Daily transport. Your agent joins as a participant and handles audio I/O through Daily’s infrastructure.
Speech model comparison
Interested in using a different model?
| Feature | u3-rt-pro | universal-streaming-english | universal-streaming-multilingual |
|---|---|---|---|
| Turn Detection Modes | | | |
| Pipecat mode (VAD + Smart Turn) | ✅ | ✅ | ✅ |
| AssemblyAI turn detection mode | ✅ | ❌ | ❌ |
| Turn Detection Parameters | | | |
| min_turn_silence | ✅ | ✅ | ✅ |
| max_turn_silence | ✅ | ✅ | ✅ |
| end_of_turn_confidence_threshold | ❌ | ✅ (1.0) | ✅ (1.0) |
| continuous_partials | ✅ | ❌ | ❌ |
| interruption_delay | ✅ | ❌ | ❌ |
| Advanced Features | | | |
| Keyterms boosting | ✅ | ✅ | ✅ |
| Custom prompting (beta) | ✅ | ❌ | ❌ |
| Speaker diarization | ✅ | ✅ | ✅ |
| Dynamic parameter updates | ✅ | ✅ | ✅ |
| Language Support | | | |
| Multilingual code switching | ✅ | ❌ | ✅ |
| Language detection | ✅ | ❌ | ✅ |
- ✅ Fully supported and recommended
- ❌ Not supported / Not used
u3-rt-pro is the recommended model for all new voice agent
implementations. The universal-streaming models are maintained for backward
compatibility but lack the optimizations and features specifically designed
for real-time conversational AI.
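When moving a configuration between models, it can help to check it against the comparison table and the parameter notes in this guide. The sketch below covers three rules stated here: settings that only u3-rt-pro supports, the mutual exclusion of prompt and keyterms_prompt, and the 300 ms server minimum added to interruption_delay. The helper names check_config and effective_first_partial_ms are hypothetical, and the settings are modeled as a plain dict rather than Pipecat's own parameter objects.

```python
# Settings the comparison table marks as supported only by u3-rt-pro.
U3_ONLY_SETTINGS = {"prompt", "continuous_partials", "interruption_delay"}
SERVER_MIN_DELAY_MS = 300  # minimum the server adds on top of interruption_delay


def effective_first_partial_ms(interruption_delay: int) -> int:
    """Approximate time to the first partial for a configured interruption_delay."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay + SERVER_MIN_DELAY_MS


def check_config(speech_model: str, settings: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if speech_model != "u3-rt-pro":
        for name in sorted(n for n in U3_ONLY_SETTINGS if n in settings):
            problems.append(f"{name} is only supported by u3-rt-pro")
    if settings.get("prompt") and settings.get("keyterms_prompt"):
        problems.append("prompt and keyterms_prompt cannot be used together")
    return problems


print(effective_first_partial_ms(0))    # 300
print(effective_first_partial_ms(500))  # 800
print(check_config("universal-streaming-english", {"continuous_partials": True}))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a configuration at once.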