Pipecat

View Pipecat’s AssemblyAI STT plugin documentation.

Overview

This guide covers integrating AssemblyAI’s Universal-3 Pro Streaming (u3-rt-pro) speech-to-text model into Pipecat voice agents. Universal-3 Pro Streaming is optimized for real-time audio utterances, typically under 10 seconds, with built-in efficiencies for low-latency turn detection and voice agent workflows. It supports native multilingual code switching and prompting, and delivers high accuracy on entities and alphanumerics, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names, all with sub-300ms time-to-complete-transcript latency.

Key features

  • Two-mode turn detection: Choose between Pipecat-controlled (VAD + Smart Turn) or AssemblyAI’s built-in turn detection (STT-based)
  • Keyterms boosting: Improve recognition of specific words and names
  • Dynamic parameter updates: Change configuration mid-conversation without reconnection
  • Speaker diarization: Identify and label different speakers in multi-party conversations

Examples

Complete working examples are available in the Pipecat repository. You can run any example directly, as long as your API keys are saved in a .env file:
python voice-assemblyai.py
The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.

Installation

Install Pipecat with all required dependencies:
pip install "pipecat-ai[assemblyai,openai,cartesia]"
What’s included:
  • assemblyai: AssemblyAI U3-Pro STT service
  • openai: OpenAI LLM service (used in the examples)
  • cartesia: Cartesia TTS service (used in the examples)
The examples use OpenAI and Cartesia, but you can use any LLM or TTS you want that’s supported by Pipecat. Just swap out the extras in the install command (e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).

Authentication

Set your API keys in a .env file:
ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key
You can obtain an AssemblyAI API key by signing up here.
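A missing key tends to surface later as a connection error, so it can help to check the environment before building the pipeline. The sketch below does this with the standard library only. The helper name `missing_api_keys` is hypothetical and not part of Pipecat, and the sketch assumes the .env values have already been loaded into the process environment (for example by a dotenv loader).

```python
import os
from typing import Mapping, Sequence

# Key names taken from the .env example above.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY")


def missing_api_keys(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_KEYS,
) -> list[str]:
    """Return the required key names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_api_keys({"ASSEMBLYAI_API_KEY": "abc"}))
# ['OPENAI_API_KEY', 'CARTESIA_API_KEY']
```

Calling `missing_api_keys()` with no arguments checks the real environment; an empty list means all three keys are present.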

Two-mode turn detection

Within Pipecat, you have two distinct approaches to turn detection with AssemblyAI’s U3-Pro model.

Pipecat-controlled turn detection (Pipecat mode)

When to use: Most voice agent applications requiring responsive interruptions. This is the default mode.
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        min_turn_silence=100,
        # continuous_partials is True by default: steady ~3s partials during long turns.
        # interruption_delay=0,  # Optional: faster first partial (~300ms effective). Default: 500 (~800ms effective).
    ),
    vad_force_turn_endpoint=True,  # Default (Pipecat mode)
)
How it works:
  • VAD + Smart Turn analyzer controls when the user is done speaking
  • ForceEndpoint message sent to AssemblyAI on VAD silence detection
  • max_turn_silence automatically synchronized with min_turn_silence
  • Best for low-latency, responsive voice agents

AssemblyAI’s built-in turn detection (STT mode)

When to use: When you want AssemblyAI’s built-in turn detection to control turn endings. This mode is configurable within the settings. See Configuring turn detection to understand how it works.
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,  # Now respected
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI's built-in turn detection (STT mode)
)
How it works:
  • AssemblyAI’s built-in turn detection controls when the user is done speaking
  • All timing parameters are respected as configured
  • Emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame
  • Uses SpeechStarted events for fast barge-in
  • Only available with u3-rt-pro (other models require Pipecat mode)
AssemblyAI’s built-in turn detection uses the STT model’s understanding of speech patterns to determine turn boundaries, rather than relying on local VAD silence detection.
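In practice, the two modes differ in which silence settings take effect. As a rough illustration of the behavior described above (not Pipecat's actual implementation), the hypothetical helper below returns the values that would apply in each mode. It assumes that "synchronized" means max_turn_silence is set equal to min_turn_silence in Pipecat mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnSilence:
    min_turn_silence: int  # milliseconds
    max_turn_silence: int  # milliseconds


def effective_turn_silence(
    min_turn_silence: int = 100,
    max_turn_silence: int = 1000,
    vad_force_turn_endpoint: bool = True,
) -> TurnSilence:
    """Illustrate which silence values apply in each turn detection mode."""
    if vad_force_turn_endpoint:
        # Pipecat mode: assumed here to override max so that it matches min.
        return TurnSilence(min_turn_silence, min_turn_silence)
    # STT mode: both values are respected as configured.
    return TurnSilence(min_turn_silence, max_turn_silence)


print(effective_turn_silence(100, 1000, vad_force_turn_endpoint=True))
print(effective_turn_silence(100, 1000, vad_force_turn_endpoint=False))
```

The upshot is that tuning max_turn_silence only changes behavior when vad_force_turn_endpoint is False.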

Keyterms boosting

Improve recognition of specific words or names:
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        min_turn_silence=100,
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)
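Per the parameters reference, keyterms_prompt cannot be combined with prompt. If your application builds settings from user or config input, a small guard like the one below can catch the conflict early. The function name is hypothetical, and this is only a sketch; it is not how Pipecat itself validates settings.

```python
from typing import Optional, Sequence


def check_prompt_options(
    keyterms_prompt: Optional[Sequence[str]] = None,
    prompt: Optional[str] = None,
) -> None:
    """Raise ValueError if both keyterms_prompt and prompt are provided."""
    if keyterms_prompt and prompt:
        raise ValueError("keyterms_prompt cannot be used together with prompt")


# Passes silently: only keyterms are set.
check_prompt_options(keyterms_prompt=["Xiomara", "Saoirse"])
```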

Dynamic parameter updates

Change configuration mid-conversation without reconnection. See stt-assemblyai.py for a complete working example.
from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Update keyterms during conversation
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            keyterms_prompt=["NewName", "NewCompany"]
        )
    )
)

# Update silence thresholds
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            min_turn_silence=200,
            max_turn_silence=3000,  # Only respected in AssemblyAI's built-in turn detection (STT mode)
        )
    )
)

# Tune first-partial timing for faster barge-in
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            interruption_delay=0,  # ~300ms effective TTFT (default: 500 → ~800ms)
        )
    )
)
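Each update above passes a delta that sets only the fields being changed. One way to picture this is a merge that skips unset values. The example above does not show how Pipecat merges deltas internally, so this sketch assumes unset fields are left untouched. `DemoSettings` and `apply_delta` are hypothetical stand-ins, not Pipecat types.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical stand-in for a settings object; None means 'not set'."""

    min_turn_silence: Optional[int] = None
    max_turn_silence: Optional[int] = None
    interruption_delay: Optional[int] = None


def apply_delta(current: DemoSettings, delta: DemoSettings) -> DemoSettings:
    """Return a copy of current with every non-None field of delta applied."""
    changes = {
        f.name: getattr(delta, f.name)
        for f in fields(delta)
        if getattr(delta, f.name) is not None
    }
    return replace(current, **changes)


current = DemoSettings(min_turn_silence=100, max_turn_silence=1000, interruption_delay=500)
updated = apply_delta(current, DemoSettings(interruption_delay=0))
print(updated)
# DemoSettings(min_turn_silence=100, max_turn_silence=1000, interruption_delay=0)
```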

Speaker diarization

Identify different speakers in multi-party conversations.

Basic diarization

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        speaker_labels=True,
    ),
)
Speaker labels (e.g., "A", "B", "C") are included in final transcripts and logged.

With custom formatting

Format transcripts with speaker labels for LLM context:
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        speaker_labels=True,
    ),
    speaker_format="<{speaker}>{text}</{speaker}>",
)
Format options:
| Style | Format string |
| --- | --- |
| XML | <{speaker}>{text}</{speaker}> |
| Markdown | **{speaker}**: {text} |
| Bracket | [{speaker}] {text} |
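To preview what each template produces, you can expand it with Python's str.format. This assumes the {speaker} and {text} placeholders are filled by ordinary string formatting, which the template syntax suggests but this guide does not state. The sample label and text are made up for illustration.

```python
FORMATS = {
    "XML": "<{speaker}>{text}</{speaker}>",
    "Markdown": "**{speaker}**: {text}",
    "Bracket": "[{speaker}] {text}",
}


def format_transcript(template: str, speaker: str, text: str) -> str:
    """Fill the {speaker} and {text} placeholders in a template."""
    return template.format(speaker=speaker, text=text)


for style, template in FORMATS.items():
    print(f"{style}: {format_transcript(template, 'A', 'Hello there.')}")
# XML: <A>Hello there.</A>
# Markdown: **A**: Hello there.
# Bracket: [A] Hello there.
```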

Daily transport

For production deployments, use the Daily transport for WebRTC-based real-time audio/video communication.

Parameters reference

U3-Pro specific parameters

  • model (str, default: "u3-rt-pro"): The speech model to use. Defaults to "u3-rt-pro" (Universal-3 Pro Streaming).
  • min_turn_silence (int, default: 100): Milliseconds of silence before ending a turn when the model is confident. Set to 100 for best latency. (Formerly min_end_of_turn_silence_when_confident, which is deprecated but still supported with a warning.)
  • max_turn_silence (int, default: 1000): Maximum silence, in milliseconds, before a forced turn end. Auto-synced in Pipecat mode; respected in AssemblyAI’s built-in turn detection (STT mode).
  • keyterms_prompt (list[str]): List of terms to boost recognition for. Cannot be used with prompt.
  • speaker_labels (bool, default: False): Enable speaker diarization.
  • prompt (str): Custom transcription instructions. Cannot be used with keyterms_prompt. Prompting is currently a beta feature: see Prompting for more information.
  • continuous_partials (bool, default: True): Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When disabled, only one early partial is emitted near turn start. When enabled (default), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. The first partial (at 750ms) is unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. Defaults to True in Pipecat, but False when using the API directly. See Continuous partials for details.
  • interruption_delay (int, default: 500): How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0 to 1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay=0 → ~300ms effective, interruption_delay=500 → ~800ms effective). See Tuning early partial timing for details.
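The interruption_delay arithmetic can be written out as a small helper. This is only an approximation of the timing described above: it treats the server's 300ms as a fixed addition and assumes a valid range of 0 to 1000 ms. The function and constant names are hypothetical.

```python
SERVER_MINIMUM_MS = 300  # added by the server on top of the configured value


def effective_first_partial_ms(interruption_delay: int = 500) -> int:
    """Approximate time until the first partial transcript, in milliseconds."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay + SERVER_MINIMUM_MS


print(effective_first_partial_ms(0))    # 300
print(effective_first_partial_ms(500))  # 800
```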

General parameters

  • api_key (str, required): Your AssemblyAI API key.
  • vad_force_turn_endpoint (bool, default: True): True for Pipecat mode; False for AssemblyAI’s built-in turn detection (STT mode).
  • speaker_format (str): Template string for formatting speaker labels (e.g., "[{speaker}] {text}").

Running your agent

Development mode (local audio)

python your_agent.py
Speak into your microphone after hearing the greeting.

Production with Daily

Deploy to Daily.co rooms using the Daily transport. Your agent joins as a participant and handles audio I/O through Daily’s infrastructure.

Speech model comparison

Interested in using a different model?
| Feature | u3-rt-pro | universal-streaming-english | universal-streaming-multilingual |
| --- | --- | --- | --- |
| Turn Detection Modes | | | |
| Pipecat mode (VAD + Smart Turn) | ✅ | ✅ | ✅ |
| AssemblyAI turn detection mode | ✅ | ❌ | ❌ |
| Turn Detection Parameters | | | |
| min_turn_silence | | | |
| max_turn_silence | | | |
| end_of_turn_confidence_threshold | ❌ | ✅ (1.0) | ✅ (1.0) |
| continuous_partials | | | |
| interruption_delay | | | |
| Advanced Features | | | |
| Keyterms boosting | | | |
| Custom prompting (beta) | | | |
| Speaker diarization | | | |
| Dynamic parameter updates | | | |
| Language Support | | | |
| Multilingual code switching | | | |
| Language detection | | | |
Legend:
  • ✅ Fully supported and recommended
  • ❌ Not supported / Not used
u3-rt-pro is the recommended model for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.
The end_of_turn_confidence_threshold parameter is not used with u3-rt-pro (it won’t affect behavior). For universal-streaming models, Pipecat automatically sets it to 1.0 in Pipecat mode to disable semantic turn detection and ensure fast responses. You don’t need to configure this parameter manually.
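The model and mode rules described in this guide can be summarized in a short helper. It is a sketch of the documented behavior, not Pipecat's own code, and the function name is hypothetical. It returns None where the parameter is unused, and raises for the one combination this guide says is unavailable (a universal-streaming model with built-in turn detection).

```python
from typing import Optional

U3_PRO = "u3-rt-pro"


def resolve_confidence_threshold(
    model: str, vad_force_turn_endpoint: bool = True
) -> Optional[float]:
    """Return the end_of_turn_confidence_threshold to use, or None if unused."""
    if model == U3_PRO:
        # The parameter has no effect on u3-rt-pro in either mode.
        return None
    if not vad_force_turn_endpoint:
        # Built-in turn detection (STT mode) is only available with u3-rt-pro.
        raise ValueError(f"{model} requires Pipecat mode (vad_force_turn_endpoint=True)")
    # Pipecat mode on universal-streaming models: 1.0 disables semantic turn detection.
    return 1.0


print(resolve_confidence_threshold("u3-rt-pro"))                    # None
print(resolve_confidence_threshold("universal-streaming-english"))  # 1.0
```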