The AssemblyAI Self-Hosted Streaming Solution provides a secure, low-latency real-time transcription service that can be deployed within your own infrastructure. Audio, transcripts, and PII never leave your network; only license validation and usage metadata are transmitted back to AssemblyAI.
Self-hosted streaming requires an upfront commercial commitment of $20,000. Contact our sales team to discuss your needs and learn more about our self-hosted offering.
The stack itself lives in the `streaming-self-hosting-stack` repository. This page covers what self-hosted streaming is, what you need to run it, and how the stack is shaped. Go to the repo for the actual setup steps.
What you can self-host
Self-hosted streaming ships as two separate stacks. Each stack serves one model family, runs from its own Docker Compose file, and uses its own GPU. Pick the stack that matches the model you want to serve; the two are not designed to run side by side.

| Stack | Model(s) served | Compose file | Best for |
|---|---|---|---|
| Universal Streaming | Universal Streaming English + Multilingual | docker-compose.yml | English and multilingual transcription workloads, telephony, captioning |
| Universal-3 Pro Streaming | Universal-3 Pro | docker-compose.u3pro.yml | Voice agents — short utterances, low end-of-turn latency, continuous partials |
Core principle
- Complete data isolation. Audio, transcripts, and PII stay inside your infrastructure. The only outbound traffic is license validation and (for usage-based contracts) usage metadata to https://usage-tracker.assemblyai.com.
System requirements
Hardware
- Universal Streaming. NVIDIA T4 or newer per ASR container. We recommend at least 4 CPU and 16 GB RAM per ASR container.
- Universal-3 Pro Streaming. NVIDIA L4, A10, A100, L40S, H100, or equivalent with at least 24 GB VRAM. The container also bundles ~14 GB of model weights, so plan disk accordingly. T4 GPUs are not sufficient for U3 Pro.
Software
- Operating system. Linux
- Container runtime. Docker and Docker Compose v2 (the `docker compose` command, not `docker-compose`)
- NVIDIA Container Toolkit. Required for Docker to access the GPU
- AWS credentials. AssemblyAI provisions a scoped AWS access key for your team so you can pull container images from our private ECR registry
Architecture
Both stacks share the same gateway, load balancer, and license proxy; they differ only in the ASR backend.
Shared services (both stacks)
- `streaming-api`: Gateway WebSocket service that clients connect to. Handles session lifecycle, audio framing, and routing to the ASR backend.
- `license-and-usage-proxy`: Validates the license file at startup and reports usage metadata (for usage-based contracts).
- `streaming-asr-lb`: `nginx:alpine` load balancer that routes ASR gRPC requests to the right backend based on the `X-Model-Version` header.
Universal Streaming stack
Adds two ASR backends:
- `streaming-asr-english`: English speech recognition.
- `streaming-asr-multilang`: Multilingual speech recognition.
Universal-3 Pro Streaming stack
Adds a single ASR backend:
- `streaming-asr-u3pro`: Universal-3 Pro speech recognition. Available as of v0.6.0.
Connection flow
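Clients open a WebSocket connection to `streaming-api`, which forwards audio to `streaming-asr-lb`; the load balancer picks an ASR backend from the `X-Model-Version` header. The Universal Streaming stack has routes for `en-default` and `ml-default`; the Universal-3 Pro stack has routes for `u3-pro`.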
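The header-based routing can be pictured as a per-stack lookup table. The sketch below is illustrative only: the real routing lives in the nginx configuration shipped with the repo, the stack labels and the `resolve_backend` helper are hypothetical, and the pairing of each header value with a backend is an assumption inferred from the service names.

```python
# Illustrative model of X-Model-Version routing, not the shipped nginx config.
# Assumption: each header value maps to the backend with the matching name.
ROUTES = {
    "universal-streaming": {
        "en-default": "streaming-asr-english",
        "ml-default": "streaming-asr-multilang",
    },
    "universal-3-pro": {
        "u3-pro": "streaming-asr-u3pro",
    },
}


def resolve_backend(stack: str, model_version: str) -> str:
    """Return the backend a request with this X-Model-Version would reach."""
    routes = ROUTES[stack]
    try:
        return routes[model_version]
    except KeyError:
        # A value the running stack does not route has no backend to land on.
        raise ValueError(
            f"{model_version!r} is not routed by the {stack} stack; "
            f"known values: {sorted(routes)}"
        ) from None
```

The point of the table is that a header value only resolves inside the stack that deploys its backend, which is why a `u3-pro` request has nowhere to go on the Universal Streaming stack.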
Getting started
Follow the upstream repo’s README for the actual setup steps. At a high level:
- Get credentials and a license file from your AssemblyAI representative: an AWS access key scoped to ECR, and a `license.jwt` file. The same license file works for both stacks.
- Install Docker, Docker Compose, and the NVIDIA Container Toolkit. See the README’s setup section for verification commands.
- Authenticate to ECR with the provided AWS credentials.
- Pick a stack and configure `.env` with the image references from the repo’s `.env.example`.
- Start the stack with `docker compose up -d` (Universal Streaming) or `docker compose -f docker-compose.u3pro.yml up -d` (Universal-3 Pro Streaming).
Universal Streaming ASR containers take roughly 2 minutes to become ready and log `Ready to serve!`. The Universal-3 Pro Streaming ASR container takes roughly 5 minutes and logs `U3Pro ASR Server ready!`. Health checks may report unhealthy during startup; that is expected.
Running a test client
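Before pointing a client at the stack, it can help to confirm that the ASR container has logged its readiness line instead of relying on health checks during startup. A minimal sketch, assuming you already have the container's log lines as strings; the stack labels and the `is_ready` helper are hypothetical, and only the two marker strings come from this page.

```python
from typing import Iterable

# Readiness lines quoted on this page; the dictionary keys are hypothetical labels.
READY_MARKERS = {
    "universal-streaming": "Ready to serve!",
    "universal-3-pro": "U3Pro ASR Server ready!",
}


def is_ready(stack: str, log_lines: Iterable[str]) -> bool:
    """Return True once any log line contains the stack's readiness marker."""
    marker = READY_MARKERS[stack]
    return any(marker in line for line in log_lines)
```

One way to feed it is to capture the output of `docker compose logs streaming-asr-english` and pass the lines in, polling until the function returns `True`.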
The repo ships an example Python client under `streaming_example` that streams a pre-recorded WAV file to the WebSocket endpoint. It supports all three speech models via the `--speech-model` flag:
- `universal-streaming-english`: Universal Streaming, English
- `universal-streaming-multilingual`: Universal Streaming, multilingual
- `u3-rt-pro`: Universal-3 Pro Streaming

Requests are routed to a backend by the `X-Model-Version` header. Make sure the value you pass matches a backend deployed in the stack you started.
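The match between a speech model and the running stack can be checked before launching the client. This is a sketch, not part of the example client: the `model_served_by` helper is hypothetical, and the table simply restates the `--speech-model` values and Compose files listed on this page.

```python
# --speech-model values and the Compose file of the stack that serves each one,
# as listed on this page.
COMPOSE_FILE_FOR_MODEL = {
    "universal-streaming-english": "docker-compose.yml",
    "universal-streaming-multilingual": "docker-compose.yml",
    "u3-rt-pro": "docker-compose.u3pro.yml",
}


def model_served_by(speech_model: str, running_compose_file: str) -> bool:
    """Return True if the running stack deploys a backend for speech_model."""
    try:
        required = COMPOSE_FILE_FOR_MODEL[speech_model]
    except KeyError:
        raise ValueError(f"unknown speech model: {speech_model!r}") from None
    return required == running_compose_file
```

For example, `model_served_by("u3-rt-pro", "docker-compose.yml")` is `False`, which flags a client aimed at Universal-3 Pro while the Universal Streaming stack is the one running.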
Switching between stacks
The two stacks listen on the same ports (streaming-api on 8080, ASR load balancer on the gRPC backend), so they cannot run simultaneously. To switch:
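The repo documents the authoritative steps. As a rough sketch, assuming that switching amounts to stopping the running Compose project and then starting the other one from the repo directory, the command sequence can be built like this. The stack labels and the `switch_commands` helper are hypothetical; the Compose file names are the ones from the table above, and `docker compose -f docker-compose.yml` is the explicit form of the default invocation.

```python
from typing import List

# Compose file per stack, as listed on this page; the keys are hypothetical labels.
COMPOSE_FILES = {
    "universal-streaming": "docker-compose.yml",
    "universal-3-pro": "docker-compose.u3pro.yml",
}


def switch_commands(current: str, target: str) -> List[List[str]]:
    """Build the argv lists to stop the current stack and start the target."""
    if current == target:
        return []  # nothing to switch
    return [
        # Stop the running stack first so its ports are released.
        ["docker", "compose", "-f", COMPOSE_FILES[current], "down"],
        # Then start the other stack in the background.
        ["docker", "compose", "-f", COMPOSE_FILES[target], "up", "-d"],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, so a failed `down` stops the sequence before the second stack tries to bind the same ports.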
Production deployment
Per-service deployment strategy, resource sizing, autoscaling thresholds, health-check tuning, and the `license-and-usage-proxy` `/v1/status` endpoint reference all live in the repo’s Production Deployment Recommendations section.