What is the best speech-to-text API for multilingual voice agents?

AssemblyAI is the leading speech-to-text API for multilingual voice agents. Universal-3 Pro Streaming delivers top-tier accuracy across 6 core languages (English, Spanish, French, German, Italian, Portuguese) with native code-switching support, and automatic model routing extends coverage to 99 languages in total. The Voice Agent API ($4.50/hr) manages the full pipeline (STT, LLM, and TTS) through a single WebSocket, with language detection running automatically. Universal-3 Pro Streaming ($0.45/hr, unlimited concurrency) is the right choice if you bring your own LLM and TTS: the language_detection parameter returns the detected language with a confidence score, so your agent can respond in the caller's language without IVR menus or manual routing logic.

How does automatic language detection work in real-time voice agents?

Set language_detection=True on Universal-3 Pro Streaming and the API returns a detected language code plus a confidence score on each turn, with no IVR language menu or upfront language hint required. The model detects the 6 core code-switching languages natively; for broader coverage, use Whisper Streaming (whisper-rt, 99 languages) with the same language_detection flag. You can also pass an expected_languages list to constrain detection to a known subset for higher accuracy. Pipe the detected language code into your LLM prompt so the model knows which language to respond in, and pass it to your TTS provider so the voice output matches the caller's language.
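A minimal sketch of that last step, assuming your code has already pulled a language code and a confidence score out of each turn. The `DetectedLanguage` type, the `LANGUAGE_NAMES` table, the 0.8 threshold, and both helper functions are hypothetical names chosen for illustration, not part of the AssemblyAI API.

```python
from dataclasses import dataclass

# Hypothetical lookup table for the 6 core languages; extend it for your markets.
LANGUAGE_NAMES = {
    "en": "English", "es": "Spanish", "fr": "French",
    "de": "German", "it": "Italian", "pt": "Portuguese",
}

@dataclass
class DetectedLanguage:
    code: str          # language code reported for the turn
    confidence: float  # detection confidence between 0 and 1

def choose_reply_language(detected: DetectedLanguage,
                          current: str = "en",
                          min_confidence: float = 0.8) -> str:
    """Switch languages only when detection is confident and the code is known."""
    if detected.confidence >= min_confidence and detected.code in LANGUAGE_NAMES:
        return detected.code
    return current  # otherwise keep the language already in use

def build_system_prompt(base_prompt: str, language_code: str) -> str:
    """Append an explicit reply-language instruction to the LLM system prompt."""
    name = LANGUAGE_NAMES.get(language_code, "English")
    return f"{base_prompt} Respond only in {name}."

reply_lang = choose_reply_language(DetectedLanguage("es", 0.97))
prompt = build_system_prompt("You are a support agent.", reply_lang)
```

Keeping the current language on a low-confidence turn avoids flipping the agent's voice on short or noisy utterances. The same `reply_lang` value can be handed to your TTS provider to select a matching voice.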

Can speech-to-text handle code-switching when callers mix languages?

Yes. Universal-3 Pro Streaming natively handles code-switching across its 6 core languages — English, Spanish, French, German, Italian, and Portuguese. When a caller switches mid-sentence ("I need help with mi cuenta"), the model transcribes each segment in the correct language rather than forcing the whole utterance into one language code. For languages outside the core 6, Whisper Streaming (whisper-rt) provides automatic language detection across 99 languages but treats each turn as a single language. For the cleanest multilingual experience, use u3-rt-pro for the core 6 with language_detection enabled.

How do I build a voice agent that supports more than 6 languages?

Use Whisper Streaming (the whisper-rt model), which supports 99 languages with automatic language detection out of the box. For voice agents that need to reach beyond the 6 core code-switching languages, the recommended pattern is to start with u3-rt-pro for callers in core regions (highest accuracy plus code-switching) and fall back to whisper-rt for callers in less-served languages. AssemblyAI handles the routing automatically when you enable language_detection. Pair it with AssemblyAI's LLM Gateway (25+ models across Claude, GPT, and Gemini): most modern frontier LLMs natively respond in dozens of languages without prompt tweaks, so the LLM layer needs no per-language configuration.
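The routing is described above as automatic, so most teams will not need their own logic. If you prefer to pin the model per market yourself, the decision can be sketched as below. The function name and the `CORE_LANGUAGES` set are hypothetical; only the two model identifiers come from the text.

```python
# The 6 core code-switching languages named above, as lowercase language codes.
CORE_LANGUAGES = frozenset({"en", "es", "fr", "de", "it", "pt"})

def pick_streaming_model(expected_languages: list[str]) -> str:
    """Return "u3-rt-pro" when every expected language is a core language.

    Falls back to "whisper-rt" when any language is outside the core set,
    or when nothing is known about the caller's languages.
    """
    codes = {code.strip().lower() for code in expected_languages}
    if codes and codes <= CORE_LANGUAGES:
        return "u3-rt-pro"
    return "whisper-rt"

model_for_latam = pick_streaming_model(["es", "pt", "en"])
model_for_apac = pick_streaming_model(["en", "ja", "ko"])
```

An empty list falls through to whisper-rt on purpose: with no prior knowledge of the caller, the broader 99-language model is the safer default in this sketch.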

Does AssemblyAI support regional dialects like Mexican Spanish or Brazilian Portuguese?

Yes. Universal-3 Pro Streaming goes beyond ISO language codes with a deeper understanding of regional dialects, including Quebecois French, Mexican Spanish, Brazilian Portuguese, Castilian Spanish, and European Portuguese, and it captures colloquial expressions and accent-specific pronunciation. For specialized terminology in a particular dialect (regional product names, slang, brand terms), pass the terms in the keyterms_prompt parameter to boost recognition further. Whisper Streaming (whisper-rt) covers the broader long tail of regional language variants across its 99-language set.

What is the latency for multilingual real-time speech recognition?

Universal-3 Pro Streaming delivers ~150 ms median (P50) streaming latency across its 6 core languages, the same latency you get on English-only streams. For a fully managed multilingual conversational agent, the Voice Agent API combines STT, LLM, and TTS into end-to-end round trips of under one second for core languages. Whisper Streaming (whisper-rt, for the long tail of 99 languages) has slightly higher latency than u3-rt-pro but stays well within real-time bounds for voice agent workflows.
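For a bring-your-own-stack build, these figures imply a simple latency budget. The sketch below works through the arithmetic; only the 150 ms STT median and the one-second target come from the text, while the LLM and TTS numbers are placeholder assumptions to replace with your own measurements.

```python
def remaining_budget_ms(target_ms: int, stage_latencies_ms: dict[str, int]) -> int:
    """Return how many milliseconds of the target remain after the listed stages."""
    return target_ms - sum(stage_latencies_ms.values())

# STT median from the text; the LLM and TTS values are hypothetical placeholders.
stages = {"stt": 150, "llm_first_token": 400, "tts_first_audio": 250}

headroom = remaining_budget_ms(1000, stages)
print(f"Headroom for network and orchestration: {headroom} ms")
```

A negative result means the chosen LLM or TTS settings cannot meet the one-second target and one of the stages needs a faster configuration.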

Solutions

Voice agents for multilingual global support

Build voice agents that understand, detect, and respond in 99 languages with automatic language detection and real-time code-switching — no per-language pipeline required.

Language detection (live)

Spanish detected, 99.4% confidence: "Buenos días, necesito ayuda con mi cuenta..."

French detected, 98.7% confidence: "Bonjour, je voudrais vérifier mon solde..."

German detected, 97.9% confidence: "Ich brauche Hilfe mit meiner Bestellung..."

Portuguese detected, 98.2% confidence: "Olá, preciso cancelar minha assinatura..."

The problem

One language per agent doesn't scale globally

Global companies spend millions operating separate support queues per language: hiring bilingual agents, routing through IVR menus, and accepting degraded accuracy from English-centric ASR. When a customer code-switches mid-sentence, most systems break. Longer handle times, higher staffing costs, and a fragmented experience for non-English speakers all follow. AssemblyAI's multilingual pipeline removes the per-language bottleneck with a single pipeline that detects and transcribes 99 languages, including real-time code-switching across the 6 core languages.

Built for global-scale voice operations

Languages 99

Total languages supported with automatic detection.

Code-switching 6

Core languages with native code-switching on Universal-3 Pro Streaming.

Multilingual WER 4.58%

Mean word error rate across the FLEURS multilingual benchmark.

Uptime 99.9%

SLA with SOC 2 Type 2 certification.

Two ways to build

Pick the API that fits your multilingual stack

Ship a working multilingual voice agent in an afternoon, or drop industry-leading STT into the orchestration stack you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Build multilingual support agents with automatic language detection and localized voice response — zero infra to manage.

Best for

Multilingual support across 6 core code-switching languages
Built-in language detection and model routing
Teams shipping fast — working multilingual agent in an afternoon
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Free tier available · No credit card required

Bring Your Own Stack

Universal-3 Pro Streaming STT API

The multilingual STT layer for your voice agent. Works natively with LiveKit, Pipecat, Vapi, and Twilio — native code-switching across 6 core languages, automatic routing for 99 total.

Best for

Teams using LiveKit, Pipecat, Vapi, or Twilio as their orchestrator
6 core languages with native code-switching on Universal-3 Pro Streaming
Automatic routing for 99-language coverage
Real-time language detection with confidence scores
Keyterm prompting for domain-specific vocabulary

$0.45/hr — transcription only, unlimited streams

No concurrency caps · Autoscaling included

One pipeline handles every language your customers speak

Automatic language detection

The model identifies the dominant language in real time with confidence scores. Supports constrained detection with expected language lists — no IVR language menu required.

Multilingual transcription with code-switching

Universal-3 Pro Streaming natively handles code-switching across 6 core languages. Automatic model routing extends coverage to 99 languages.

Language-aware LLM processing

Route transcripts through the LLM Gateway with the detected language code passed downstream, so the LLM responds in the caller's language without explicit prompting.

Localized voice response

The Voice Agent API generates TTS in the detected language with natural prosody. The full STT → LLM → TTS round trip completes in under one second for core languages.

Pipeline

Ingest multilingual audio

↓

Detect language + transcribe

↓

Route to language-aware LLM

↓

Respond in caller's language

Quickstart

Build a multilingual voice agent in minutes

Voice Agent API — recommended

# Voice Agent API: multilingual support agent
import asyncio
import json

import websockets  # pip install websockets (additional_headers needs a recent release)

API_KEY = "YOUR_API_KEY"

def handle(event):
    # Placeholder dispatcher: branch on the event type here
    # (transcript.user, reply.audio, tool.call, ...).
    print(event.get("type"))

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the session: system prompt, greeting, keyterms, and voice.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a multilingual support agent. Detect the caller's "
                    "language and respond in that language. Handle account "
                    "inquiries, billing, and technical support."
                ),
                "greeting": "Hello! Hola! Bonjour! How can I help today?",
                "input": {"keyterms": ["cuenta", "Abonnement", "Konto", "fatura"]},
                "output": {"voice": "ivy"},
            },
        }))
        # Each incoming message is a JSON event; hand it to the dispatcher.
        async for msg in ws:
            handle(json.loads(msg))

if __name__ == "__main__":
    asyncio.run(run_agent())

Universal-3 Pro Streaming + LiveKit — BYO stack

# LiveKit + AssemblyAI STT in a multilingual voice agent pipeline
from livekit.agents import Agent, AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, cartesia, openai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

class MultilingualAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a multilingual support agent. Respond in the "
                "caller's detected language. Handle global customer "
                "inquiries across all product lines."
            ),
        )

async def entrypoint(ctx):
    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            language_detection=True,                  # include language metadata in turns
            vad_threshold=0.3,                        # match Silero activation_threshold
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(activation_threshold=0.3),
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),       # multilingual turn detection model
            endpointing={"min_delay": 0.5, "max_delay": 3.0},
        ),
    )
    await session.start(room=ctx.room, agent=MultilingualAgent())

99-language coverage with automatic routing

Universal-3 Pro Streaming handles 6 core languages at the highest accuracy. Automatic model routing extends coverage to 99 languages — one API call covers every market.

Real-time code-switching detection

When callers switch languages mid-sentence — "I need help with mi cuenta, s'il vous plaît" — Universal-3 Pro Streaming transcribes each segment in the correct language without miscategorizing the call.

Language-specific dialect recognition

Universal-3 Pro Streaming goes beyond standard language codes with deep understanding of regional dialects — Quebecois French, Mexican Spanish, Brazilian Portuguese — capturing colloquial expressions and accent-specific pronunciation.

Teams shipping multilingual voice agents on AssemblyAI

30% cost savings on real-time transcription

We require a leading edge speech-to-text provider that can meet our specialized needs: fast, accurate, targeted, and multilingual.

Super

Build your multilingual voice agent today

Free tier, no credit card. From language detection to localized voice response in an afternoon.
