What is the best API for building a customer support voice agent?

The AssemblyAI Voice Agent API is purpose-built for support and contact-center voice agents — one WebSocket replaces separate STT, LLM, and TTS providers at a flat $4.50/hr per session. Universal-3 Pro Streaming powers transcription with industry-leading entity accuracy on noisy phone audio, so callers don't have to repeat order numbers, account IDs, or product names. Teams that already run LiveKit or Pipecat can also use Universal-3 Pro Streaming STT standalone at $0.45/hr.

Can AI voice agents replace legacy IVR systems?

Yes. AI voice agents resolve calls that touch-tone IVR can't — natural-language intent classification, account lookups, and conditional routing happen inside a single conversation instead of nested menu trees. The Voice Agent API runs the full stack (STT → LLM → TTS) so one agent handles intent routing, status updates, and escalation without forcing callers to navigate a directory. Most contact-center teams pilot AI agents on high-volume tier-1 intents first, then expand coverage from there.

Can a voice agent handle tier-1 ticket deflection and escalate to a human when needed?

Tier-1 deflection is one of the highest-ROI use cases for AI voice agents — order status checks, password resets, appointment scheduling, and account lookups can all run end-to-end without a human. The Voice Agent API supports tool calls into your CRM and preserves conversation context, so when the agent transfers to a person, the live agent receives the full transcript and detected intent. Transfers can happen over Twilio, SIP, or any orchestrator your contact center already runs.
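As a rough illustration of the handoff described above, the sketch below bundles the transcript and detected intent into a payload a live agent could receive. The `Turn` and `build_handoff` names and every payload field are hypothetical, not part of the Voice Agent API; adapt them to whatever your CRM or telephony layer expects.

```python
from dataclasses import dataclass, asdict


@dataclass
class Turn:
    speaker: str  # "caller" or "agent"
    text: str


def build_handoff(turns, intent, reason):
    """Bundle conversation context for the human agent (hypothetical schema)."""
    # Most recent thing the caller said, so the live agent can pick up mid-thread
    last_caller = next(
        (t.text for t in reversed(turns) if t.speaker == "caller"), ""
    )
    return {
        "intent": intent,
        "escalation_reason": reason,
        "last_caller_utterance": last_caller,
        "transcript": [asdict(t) for t in turns],
    }


turns = [
    Turn("caller", "I was charged twice for order A1B2."),
    Turn("agent", "Sorry about that. Let me connect you with billing."),
]
payload = build_handoff(
    turns, intent="billing_dispute", reason="refund_requires_human"
)
```

The resulting dictionary can be serialized to JSON and attached to the transfer, for example as call metadata in the system that routes the call.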

Should I use the Voice Agent API or Universal-3 Pro Streaming STT for contact-center calls?

Use the Voice Agent API if you want a working support agent in an afternoon — one WebSocket handles STT, LLM, and TTS at a flat $4.50/hr per session with no orchestrator to manage. Use Universal-3 Pro Streaming STT ($0.45/hr) when you already run LiveKit, Pipecat, or another cascading pipeline and want to bring your own LLM and TTS. Both share the same Universal-3 Pro speech foundation, so transcription quality stays consistent no matter which path you pick.
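To make the pricing trade-off concrete, here is a small back-of-the-envelope estimator using the two hourly rates quoted above. The call volume and duration are made-up inputs, and `extra_per_hour` is a hypothetical placeholder for the LLM and TTS costs you would pay separately on the bring-your-own path.

```python
VOICE_AGENT_RATE = 4.50  # USD per session-hour: STT, LLM, and TTS included
STT_ONLY_RATE = 0.45     # USD per hour: transcription only


def monthly_cost(calls_per_month, avg_call_minutes, rate_per_hour, extra_per_hour=0.0):
    """Estimate monthly spend from call volume and an hourly rate."""
    hours = calls_per_month * avg_call_minutes / 60
    return round(hours * (rate_per_hour + extra_per_hour), 2)


# Illustrative volume: 10,000 calls a month at 6 minutes each = 1,000 hours
managed = monthly_cost(10_000, 6, VOICE_AGENT_RATE)
byo_stt = monthly_cost(10_000, 6, STT_ONLY_RATE)
```

The `byo_stt` figure covers transcription alone, so a fair comparison needs your own LLM and TTS spend passed in through `extra_per_hour`.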

How does AssemblyAI handle PII redaction and HIPAA compliance for customer calls?

PII redaction runs in-stream — names, card numbers, addresses, and account IDs are masked before transcripts leave the pipeline, so sensitive fields never reach your CRM, QA tooling, or analytics warehouse. AssemblyAI supports HIPAA-ready deployments with BAAs available for healthcare workloads, plus SOC 2 Type 2 certification, AES-128/256 encryption at rest, and TLS 1.2+ in transit. Redaction and retention policies can be configured per session, so different call types run under different rules.

What languages does AssemblyAI support for live contact-center calls?

AssemblyAI's streaming models support English by default, with multilingual streaming covering Spanish, French, German, Italian, and Portuguese (beta) — addressing the majority of North American and European call volume. The pre-recorded API supports a substantially wider language set for asynchronous workloads like voicemail, post-call analysis, and QA scoring. Additional streaming languages are added throughout the year — check the changelog for the latest coverage.

Solutions

Voice agents for customer support & contact centers

Replace legacy IVR with AI voice agents powered by the fastest, most accurate speech-to-text. Build end-to-end with our Voice Agent API, or drop Universal-3 Pro Streaming into your existing stack.

The problem

Legacy IVR is costing you customers

Touch-tone menus and brittle keyword bots strand callers in loops that end in a hang-up or an escalation. Modern voice agents — built on accurate streaming STT, a managed LLM, and natural TTS — resolve more calls before a human ever picks up.

Built for contact center performance

Latency ~150ms

Median (P50) streaming latency for Universal-3 Pro Streaming.

Entity 43%

Better alphanumeric accuracy than other providers.

Uptime 99.9%

SLA with SOC 2 Type 2 certification.

Scale 40TB+

Audio processed daily in production.

Two ways to build

Pick the API that fits your support stack

Ship a working support agent in an afternoon, or drop industry-leading STT into the orchestrator you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Connect, stream audio in, get audio back — we handle the rest.

Best for

Best-in-class voice agents — the preferred way to build with AssemblyAI
Customer support agents, AI companions, clinical intake, language learning
Teams shipping fast — working agent in an afternoon, no infra to manage
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3 Pro Streaming STT API

The STT layer for your cascading voice agent architecture. Works natively with your preferred orchestrator.

Best for

Teams already using LiveKit, Pipecat, or Vapi as their orchestration layer
Teams running cascading architectures (STT → LLM → TTS)
High-scale deployments where margin and full control matter
Complex workflows with RAG, custom tooling, or proprietary LLMs
HIPAA, SOC 2 — bring your own compliance infrastructure

$0.45/hr — transcription only, unlimited concurrent streams

View integration docs

No concurrency caps · Autoscaling included

Your support agent pipeline

Ingest caller audio

Voice Agent API: a single WebSocket. For a BYO stack: Twilio Media Streams → Universal-3 Pro Streaming.

Real-time transcription

Punctuation-based turn detection at ~150ms P50. Keyterm boosting for your product vocabulary.

LLM reasoning

Intent classification, KB lookup, and response generation. Managed (Voice Agent API) or BYO.

Voice response

TTS audio streamed back to caller. Full round-trip under 1 second.

Quickstart

Get a working agent in minutes

Voice Agent API — recommended

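# Voice Agent API: one WebSocket, full pipeline
import asyncio
import json

import websockets

API_KEY = "YOUR_API_KEY"

def handle(event):
    # Placeholder handler: replace with your own logic per event type
    # (transcript.user, audio.delta, tool.call, ...)
    print(event.get("type"))

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        # extra_headers is the legacy-client name; websockets 14+ expects
        # additional_headers instead
        extra_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the session: prompt, greeting, and output voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful support agent for Acme Corp.",
                "greeting": "Hi, this is Acme support. How can I help?",
                "output": {"voice": "ivy"},
            },
        }))
        # Stream audio in, get audio + transcript back
        async for msg in ws:
            handle(json.loads(msg))

if __name__ == "__main__":
    asyncio.run(run_agent())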
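One way the per-event handling in the quickstart could be fleshed out is a small dispatcher keyed on the event type. This is only a sketch: the event type names come from the comment in the quickstart, but the field names (`text`, `audio`, `name`, `arguments`), the base64 audio encoding, and the `lookup_order` tool are all assumptions for illustration, not the documented message schema.

```python
import base64


def lookup_order(order_id):
    # Hypothetical tool: replace with a real CRM or order-system call
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"lookup_order": lookup_order}


def dispatch(event, audio_out, transcript):
    """Route one decoded server event. Field names are assumptions."""
    kind = event.get("type")
    if kind == "transcript.user":
        # Collect what the caller said
        transcript.append(event.get("text", ""))
    elif kind == "audio.delta":
        # Assumes agent audio arrives base64-encoded
        audio_out.extend(base64.b64decode(event.get("audio", "")))
    elif kind == "tool.call":
        # Look up the requested tool and run it with the supplied arguments
        tool = TOOLS.get(event.get("name"))
        if tool is None:
            return {"error": f"unknown tool: {event.get('name')}"}
        return tool(**event.get("arguments", {}))
    return None
```

Keeping the tools in a dictionary means adding a new capability is a one-line registration, and unknown tool names fail with an error result instead of raising inside the receive loop.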

Universal-3 Pro Streaming + LiveKit — BYO stack

# LiveKit + AssemblyAI STT in a cascading pipeline
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import assemblyai, cartesia, openai, silero

class SupportAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a support agent for Acme Corp. Be concise.",
        )

async def entrypoint(ctx: agents.JobContext):
    # Cascading pipeline: VAD -> STT -> LLM -> TTS
    session = AgentSession(
        stt=assemblyai.STT(
            model="universal-streaming-english",
            # Boost recognition of product-specific vocabulary
            keyterms_prompt=["Acme Pro", "tier-2", "premium plan"],
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(room=ctx.room, agent=SupportAgent())

if __name__ == "__main__":
    # Register the entrypoint with the LiveKit worker CLI
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

Resolution-grade accuracy

Universal-3 Pro Streaming reaches 94%+ transcription accuracy on noisy contact-center audio, which can be the difference between a deflected ticket and an angry escalation.

PII redaction by default

Names, card numbers, addresses, and account IDs masked before transcripts hit your CRM, data warehouse, or QA stack.

Real-time intelligence

Topic detection, sentiment, and call outcomes available on the live stream — coach agents in the moment, not the next day.

Teams shipping support agents on AssemblyAI

25% reduction in handling time

Edgetier's customer CarTrawler reduced chat handling time by 25% through enhanced insights and agent optimization.

Edgetier

Build your support voice agent today

Free tier, no credit card. Get a working agent on real call audio in an afternoon.

Get started free