Universal-3 Pro Streaming on Pipecat

Overview

This guide covers integrating AssemblyAI’s Universal-3 Pro Streaming (u3-rt-pro) speech-to-text model into Pipecat voice agents. Universal-3 Pro Streaming is optimized for real-time audio utterances, typically under 10 seconds, with built-in efficiencies for low-latency turn detection and voice agent workflows. It provides the highest accuracy along with native multilingual code switching and prompting support, and it is particularly strong on entities and alphanumerics such as credit card numbers, phone numbers, email addresses, physical addresses, and names, all with sub-300ms time-to-complete-transcript latency.

Key features

Two-mode turn detection: Choose between Pipecat-controlled turn detection (VAD + Smart Turn) and AssemblyAI’s built-in turn detection (STT-based)
Keyterms boosting: Improve recognition of specific words and names
Dynamic parameter updates: Change configuration mid-conversation without reconnection
Speaker diarization: Identify and label different speakers in multi-party conversations

Examples

Complete working examples are available in the Pipecat repository:

voice-assemblyai.py: Voice agent with U3-Pro using Pipecat-based turn detection (VAD + Smart Turn)
voice-assemblyai-turn-detection.py: Voice agent with U3-Pro using AssemblyAI’s built-in turn detection (STT-based)

You can run any example directly as long as your API keys are saved in a .env file:

python voice-assemblyai.py

The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.

Installation

Install Pipecat with all required dependencies:

pip install "pipecat-ai[assemblyai,openai,cartesia]"

What’s included:

assemblyai: AssemblyAI U3-Pro STT service
openai: OpenAI LLM service (used in the examples)
cartesia: Cartesia TTS service (used in the examples)

The examples use OpenAI and Cartesia, but you can use any LLM or TTS service that Pipecat supports. Swap out the extras in the install command accordingly (e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).

Authentication

Set your API keys in a .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key

You can obtain an AssemblyAI API key by signing up for an AssemblyAI account.
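Before starting the agent, it can help to fail fast when a key is missing. The sketch below is a minimal, standard-library-only check. The helper name missing_api_keys is hypothetical, and the sketch assumes the variables from the .env file have already been loaded into the process environment (for example by a loader such as python-dotenv).

```python
import os

# Keys used by the examples in this guide.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {"ASSEMBLYAI_API_KEY": "your_assemblyai_key", "OPENAI_API_KEY": ""}
print(missing_api_keys(example_env))
# ['OPENAI_API_KEY', 'CARTESIA_API_KEY']
```

Calling missing_api_keys() with no arguments checks the real environment, so an agent script could stop early with a clear message when the returned list is not empty.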

Two-mode turn detection

Within Pipecat, you have two distinct approaches to turn detection with AssemblyAI’s U3-Pro model.

Pipecat mode (default, recommended)

When to use: Most voice agent applications requiring responsive interruptions.

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        min_turn_silence=100,
        # continuous_partials is True by default: steady ~3s partials during long turns.
        # interruption_delay=0,  # Optional: faster first partial (~300ms effective). Default: 500 (~800ms effective).
    ),
    vad_force_turn_endpoint=True,  # Default (Pipecat mode)
)

How it works:

The VAD + Smart Turn analyzer decides when the user is done speaking
A ForceEndpoint message is sent to AssemblyAI when the VAD detects silence
max_turn_silence is automatically synchronized with min_turn_silence
Best for low-latency, responsive voice agents

AssemblyAI’s built-in turn detection (STT mode)

When to use: When you want AssemblyAI’s built-in turn detection to control turn endings. This mode is configurable within the settings. See Configuring turn detection to understand how it works.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,  # Respected in STT mode (auto-synced in Pipecat mode)
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI's built-in turn detection (STT mode)
)

How it works:

AssemblyAI’s built-in turn detection controls when the user is done speaking
All timing parameters are respected as configured
Emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame
Uses SpeechStarted events for fast barge-in
Only available with u3-rt-pro (other models require Pipecat mode)

AssemblyAI’s built-in turn detection uses the STT model’s understanding of speech patterns to determine turn boundaries, rather than relying on local VAD silence detection.
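The difference between the two modes can be summarized in a small model. This is an illustrative sketch of the behavior described in this section, not Pipecat's implementation. The TurnConfig class and resolve_turn_config function are hypothetical names, and the sketch assumes that "synchronized" means max_turn_silence is set to the same value as min_turn_silence. Raising an error for unsupported models is likewise a modeling choice made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnConfig:
    """Hypothetical summary of which component ends the turn, and with what timing."""
    controller: str        # "pipecat" (VAD + Smart Turn) or "assemblyai" (STT-based)
    min_turn_silence: int  # milliseconds
    max_turn_silence: int  # milliseconds


def resolve_turn_config(model, min_turn_silence=100, max_turn_silence=1000,
                        vad_force_turn_endpoint=True):
    """Model the effective turn settings described in this section."""
    if vad_force_turn_endpoint:
        # Pipecat mode: max_turn_silence follows min_turn_silence (assumed equal).
        return TurnConfig("pipecat", min_turn_silence, min_turn_silence)
    if model != "u3-rt-pro":
        # Built-in turn detection is only available with u3-rt-pro.
        raise ValueError(f"{model} requires Pipecat mode (vad_force_turn_endpoint=True)")
    # STT mode: timing parameters are respected as configured.
    return TurnConfig("assemblyai", min_turn_silence, max_turn_silence)


print(resolve_turn_config("u3-rt-pro"))
print(resolve_turn_config("u3-rt-pro", vad_force_turn_endpoint=False))
```

The first call models the default Pipecat mode, where the configured max_turn_silence has no independent effect. The second models STT mode, where both thresholds apply as given.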

Keyterms boosting

Improve recognition of specific words or names:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        min_turn_silence=100,
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)
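Note that keyterms_prompt cannot be used together with prompt (see Parameters reference). A small guard before constructing the settings can catch that early. This is a minimal sketch. The check_prompt_options helper is hypothetical and is not part of Pipecat or AssemblyAI's SDK.

```python
def check_prompt_options(keyterms_prompt=None, prompt=None):
    """Raise ValueError if both mutually exclusive options are provided."""
    if keyterms_prompt and prompt:
        raise ValueError("Use either keyterms_prompt or prompt, not both.")
    return {"keyterms_prompt": keyterms_prompt, "prompt": prompt}


options = check_prompt_options(keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"])
print(options)
```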

Dynamic parameter updates

Change configuration mid-conversation without reconnection. See stt-assemblyai.py for a complete working example.

from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Update keyterms during conversation
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            keyterms_prompt=["NewName", "NewCompany"]
        )
    )
)

# Update silence thresholds
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            min_turn_silence=200,
            max_turn_silence=3000,  # Only respected in AssemblyAI's built-in turn detection (STT mode)
        )
    )
)

# Tune first-partial timing for faster barge-in
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            interruption_delay=0,  # ~300ms effective TTFT (default: 500 → ~800ms)
        )
    )
)
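Each frame above carries a delta, so only the fields set on it are meant to change. The sketch below illustrates that idea with plain dataclasses. It assumes unset fields are represented as None, and the SettingsModel class and apply_delta function are hypothetical stand-ins rather than Pipecat's actual merge logic.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Optional


@dataclass(frozen=True)
class SettingsModel:
    """Hypothetical stand-in for a settings object; None means 'not specified'."""
    min_turn_silence: Optional[int] = None
    max_turn_silence: Optional[int] = None
    interruption_delay: Optional[int] = None
    keyterms_prompt: Optional[List[str]] = None


def apply_delta(current, delta):
    """Return a copy of `current` with every field set on `delta` overridden."""
    changes = {
        f.name: getattr(delta, f.name)
        for f in fields(delta)
        # Compare against None so that legitimate values such as 0 are kept.
        if getattr(delta, f.name) is not None
    }
    return replace(current, **changes)


current = SettingsModel(min_turn_silence=100, max_turn_silence=1000, interruption_delay=500)
updated = apply_delta(current, SettingsModel(keyterms_prompt=["NewName", "NewCompany"]))
updated = apply_delta(updated, SettingsModel(interruption_delay=0))
print(updated)
```

The comparison against None rather than truthiness matters here: interruption_delay=0 is a valid setting and would be dropped by a plain truthiness check.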

Speaker diarization

Identify different speakers in multi-party conversations.

Basic diarization

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        speaker_labels=True,
    ),
)

Speaker labels (e.g., "A", "B", "C") are included in final transcripts and logged.

With custom formatting

Format transcripts with speaker labels for LLM context:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="u3-rt-pro",
        speaker_labels=True,
    ),
    speaker_format="<{speaker}>{text}</{speaker}>",
)

Format options:

Style	Format string
XML	`<{speaker}>{text}</{speaker}>`
Markdown	`{speaker}: {text}`
Bracket	`[{speaker}] {text}`
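The format strings above use {speaker} and {text} placeholders. Assuming they are filled in with Python's standard str.format substitution (an assumption, since this guide does not say how Pipecat applies the template), the three styles would render as follows. The format_transcript helper is hypothetical.

```python
FORMATS = {
    "xml": "<{speaker}>{text}</{speaker}>",
    "markdown": "{speaker}: {text}",
    "bracket": "[{speaker}] {text}",
}


def format_transcript(template, speaker, text):
    """Fill a speaker_format-style template (hypothetical helper using str.format)."""
    return template.format(speaker=speaker, text=text)


for style, template in FORMATS.items():
    print(f"{style}: {format_transcript(template, 'A', 'Hello there')}")
# xml: <A>Hello there</A>
# markdown: A: Hello there
# bracket: [A] Hello there
```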

Daily transport

For production deployments, use the Daily transport for WebRTC-based real-time audio/video communication.

Parameters reference

U3-Pro specific parameters

model

str

default: "u3-rt-pro"

The speech model to use. Defaults to "u3-rt-pro" (Universal-3 Pro Streaming).

min_turn_silence

int

default: 100

Milliseconds of silence before ending a turn when the model is confident. Set to 100 for best latency. (Formerly min_end_of_turn_silence_when_confident, which is deprecated but still supported with a warning.)

max_turn_silence

int

default: 1000

Maximum silence before forced turn end. Auto-synced in Pipecat mode; respected in AssemblyAI’s built-in turn detection (STT mode).

keyterms_prompt

list[str]

List of terms to boost recognition for. Cannot be used with prompt.

speaker_labels

bool

default: False

Enable speaker diarization.

prompt

str

Custom transcription instructions. Cannot be used with keyterms_prompt. Prompting is currently a beta feature: see Prompting for more information.

continuous_partials

bool

default: True

Whether to emit additional partial transcripts during long turns. When enabled (the default in Pipecat), partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The timing of the first partial is controlled by interruption_delay and is unaffected by this setting. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. Defaults to True in Pipecat, but False when using the API directly. See Continuous partials for details.

interruption_delay

int

default: 500

How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay=0 → ~300ms effective, interruption_delay=500 → ~800ms effective). See Tuning early partial timing for details.
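The relationship between the configured value and the effective first-partial timing is simple arithmetic. The sketch below assumes the server-side addition is exactly the 300ms minimum stated above (it may be larger in practice). The effective_first_partial_ms helper is hypothetical.

```python
SERVER_MIN_OVERHEAD_MS = 300  # minimum the server adds, per the description above


def effective_first_partial_ms(interruption_delay=500):
    """Approximate lower bound on time to first partial, in milliseconds."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay + SERVER_MIN_OVERHEAD_MS


for delay in (0, 500, 1000):
    print(f"interruption_delay={delay} -> ~{effective_first_partial_ms(delay)}ms")
# interruption_delay=0 -> ~300ms
# interruption_delay=500 -> ~800ms
# interruption_delay=1000 -> ~1300ms
```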

General parameters

api_key

str

required

Your AssemblyAI API key.

vad_force_turn_endpoint

bool

default: True

True for Pipecat mode; False for AssemblyAI’s built-in turn detection (STT mode).

speaker_format

str

Template string for formatting speaker labels (e.g., "[{speaker}] {text}").

Running your agent

Development mode (local audio)

python your_agent.py

Speak into your microphone after hearing the greeting.

Production with Daily

Deploy to Daily.co rooms using the Daily transport. Your agent joins as a participant and handles audio I/O through Daily’s infrastructure.

Speech model comparison

Interested in using a different model?

Feature	u3-rt-pro	universal-streaming-english	universal-streaming-multilingual
Turn Detection Modes
Pipecat mode (VAD + Smart Turn)	✅	✅	✅
AssemblyAI turn detection mode	✅	❌	❌
Turn Detection Parameters
`min_turn_silence`	✅	✅	✅
`max_turn_silence`	✅	✅	✅
`end_of_turn_confidence_threshold`	❌	✅ (1.0)	✅ (1.0)
`continuous_partials`	✅	❌	❌
`interruption_delay`	✅	❌	❌
Advanced Features
Keyterms boosting	✅	✅	✅
Custom prompting (beta)	✅	❌	❌
Speaker diarization	✅	✅	✅
Dynamic parameter updates	✅	✅	✅
Language Support
Multilingual code switching	✅	❌	✅
Language detection	✅	❌	✅

Legend:

✅ Fully supported and recommended
❌ Not supported / Not used

u3-rt-pro is the recommended model for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.

The end_of_turn_confidence_threshold parameter is not used with u3-rt-pro (it won’t affect behavior). For universal-streaming models, Pipecat automatically sets it to 1.0 in Pipecat mode to disable semantic turn detection and ensure fast responses. You don’t need to configure this parameter manually.
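The comparison table can also be encoded as data, for example to pick a turn detection mode based on the model. This is a sketch covering only a few rows of the table. MODEL_FEATURES and pick_vad_force_turn_endpoint are hypothetical names, not part of Pipecat.

```python
# A few rows from the comparison table, keyed by model name.
MODEL_FEATURES = {
    "u3-rt-pro": {
        "assemblyai_turn_detection": True,
        "custom_prompting": True,
        "multilingual_code_switching": True,
    },
    "universal-streaming-english": {
        "assemblyai_turn_detection": False,
        "custom_prompting": False,
        "multilingual_code_switching": False,
    },
    "universal-streaming-multilingual": {
        "assemblyai_turn_detection": False,
        "custom_prompting": False,
        "multilingual_code_switching": True,
    },
}


def pick_vad_force_turn_endpoint(model, prefer_stt_mode=False):
    """Return the vad_force_turn_endpoint value to pass for `model`.

    STT mode (False) is returned only when it is both requested and supported;
    otherwise the function falls back to Pipecat mode (True).
    """
    supports_stt_mode = MODEL_FEATURES[model]["assemblyai_turn_detection"]
    return not (prefer_stt_mode and supports_stt_mode)


print(pick_vad_force_turn_endpoint("u3-rt-pro", prefer_stt_mode=True))
print(pick_vad_force_turn_endpoint("universal-streaming-english", prefer_stt_mode=True))
# False
# True
```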
