Connect a browser to the Voice Agent API in two steps:
- Your server calls `GET /v1/token` with your API key to mint a short-lived temporary token.
- Your browser opens the WebSocket with `?token=<token>`, so no API key is exposed.
1. Generate a token on your server
Call `GET /v1/token` with your API key in the `Authorization` header. Pick an `expires_in_seconds` short enough to limit replay risk (60–300 seconds is a good default) and, optionally, a `max_session_duration_seconds` to cap the session length.
These two parameters control different things and are easy to confuse:
- `expires_in_seconds` is the token redemption window: how long the client has to use this token to open a WebSocket. If the window elapses before the WebSocket is opened, the server returns a `session.error` with code `unauthorized` on the first frame instead of `session.ready`. Once a `session.ready` has been received, this value no longer applies.
- `max_session_duration_seconds` is the session duration cap: how long the resulting voice agent session is allowed to run after the WebSocket is open.
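As a rough illustration of the server-side call, the sketch below builds the token request and checks both parameters against the ranges documented in this section (1 to 600 and 60 to 10800). It assumes the parameters are sent as query-string values and that the API key goes into the `Authorization` header as-is. The names `buildTokenRequest`, `mintToken`, and `baseUrl` are hypothetical; take the real host from the endpoint reference.

```javascript
// Sketch only. ASSUMPTIONS: parameters travel as query-string values, and
// baseUrl is a placeholder for the real API host.
function buildTokenRequest({ baseUrl, apiKey, expiresInSeconds = 60, maxSessionDurationSeconds }) {
  if (!Number.isFinite(expiresInSeconds) || expiresInSeconds < 1 || expiresInSeconds > 600) {
    throw new RangeError("expires_in_seconds must be between 1 and 600");
  }
  const url = new URL("/v1/token", baseUrl);
  url.searchParams.set("expires_in_seconds", String(expiresInSeconds));

  // Optional: when omitted, the server default (10800) applies.
  if (maxSessionDurationSeconds !== undefined) {
    if (!Number.isFinite(maxSessionDurationSeconds) ||
        maxSessionDurationSeconds < 60 || maxSessionDurationSeconds > 10800) {
      throw new RangeError("max_session_duration_seconds must be between 60 and 10800");
    }
    url.searchParams.set("max_session_duration_seconds", String(maxSessionDurationSeconds));
  }
  return { url: url.toString(), headers: { Authorization: apiKey } };
}

// Server-side usage with a global fetch (for example Node 18+).
async function mintToken(options) {
  const { url, headers } = buildTokenRequest(options);
  const res = await fetch(url, { headers });
  if (!res.ok) throw new Error(`Token request failed with status ${res.status}`);
  return res.json(); // ASSUMPTION: the body is JSON; its field names are not shown here.
}
```

Keeping the request builder separate from the network call makes the range checks easy to unit test without touching the API.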
`expires_in_seconds` must be between 1 and 600. `max_session_duration_seconds` must be between 60 and 10800 (defaults to 10800, the 3-hour maximum session duration).
Session end at max_session_duration_seconds
When the session reaches its server-side duration limit, the WebSocket closes. There is no separate “closing soon” warning event before this — if you need to finalize gracefully (e.g. play a wrap-up message, save state), run a client-side timer using the value you passed for max_session_duration_seconds and start your wrap-up a few seconds before it elapses.
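A client-side timer of this kind could look like the sketch below. The helper names and the 5-second default lead time are illustrative choices, not part of the API, and the sketch assumes you start the timer at the moment the session begins.

```javascript
// Milliseconds to wait before starting the wrap-up, never negative.
function wrapUpDelayMs(maxSessionDurationSeconds, leadSeconds = 5) {
  return Math.max(0, (maxSessionDurationSeconds - leadSeconds) * 1000);
}

// Schedules onWrapUp shortly before the session cap and returns a cancel
// function to call if the session ends early.
function scheduleWrapUp(maxSessionDurationSeconds, onWrapUp, leadSeconds = 5) {
  const timer = setTimeout(onWrapUp, wrapUpDelayMs(maxSessionDurationSeconds, leadSeconds));
  return () => clearTimeout(timer);
}
```

For example, `scheduleWrapUp(600, playGoodbye)` would fire a hypothetical `playGoodbye` callback 595 seconds after the call. If you did not pass `max_session_duration_seconds` when minting the token, use the default of 10800.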
Token expiry and failure modes
If a token is missing, expired, or invalid, the server rejects the handshake with an `UNAUTHORIZED` error (close code 1008). In browsers, this may surface as a `close` event with code 1006 and no body, and you won’t receive a `session.error` event. Always fetch a fresh token immediately before each connection attempt.
If the WebSocket drops mid-session and you need to reconnect with `session.resume`, you’ll need a new token for the new WebSocket; the original token can’t be reused.
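One way to act on these failure modes is to classify each `close` event by its code and by whether `session.ready` had already arrived. The sketch below does that; the function name and the returned labels are hypothetical, and the mapping is only as precise as the close codes described above.

```javascript
// Returns a coarse, illustrative label for a WebSocket close event.
// code: CloseEvent.code; sawSessionReady: true once session.ready was received.
function classifyClose(code, sawSessionReady) {
  if (code === 1000) return "closed";                 // normal closure
  if (sawSessionReady) return "dropped-mid-session";  // reconnect needs a NEW token
  if (code === 1008) return "unauthorized";           // token missing, expired, or invalid
  if (code === 1006) return "possibly-unauthorized";  // browsers may hide the 1008
  return "other";
}
```

In every case except `"closed"`, a retry should begin by fetching a fresh token, because tokens cannot be reused.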
2. Connect from the browser with the token
Fetch the token from your server, then open the WebSocket with `?token=<token>`. No `Authorization` header is needed.
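A minimal browser-side sketch of this step follows. It assumes your own server exposes an endpoint (here the hypothetical `tokenEndpoint`) that replies with JSON of the form `{ "token": "..." }`, and that `wsBaseUrl` is the Voice Agent WebSocket URL taken from the API reference. Neither the response shape nor any URL shown here is confirmed by this page.

```javascript
// Appends the temporary token as the ?token= query parameter.
function buildSocketUrl(wsBaseUrl, token) {
  const url = new URL(wsBaseUrl);
  url.searchParams.set("token", token);
  return url.toString();
}

// Fetches a fresh token from YOUR server, then opens the WebSocket.
// ASSUMPTION: the server responds with { "token": "..." }.
async function connectWithFreshToken(tokenEndpoint, wsBaseUrl) {
  const res = await fetch(tokenEndpoint);
  if (!res.ok) throw new Error(`Token fetch failed with status ${res.status}`);
  const { token } = await res.json();
  return new WebSocket(buildSocketUrl(wsBaseUrl, token));
}
```

Calling `connectWithFreshToken` again on every reconnect keeps the fetch and the connection attempt together, so a stale token is never reused.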
Fetch a fresh token for every new WebSocket connection. Tokens are single-use, so a dropped connection needs a new token to reconnect (including when using `session.resume`).
3. Browser quickstart
A complete working example that captures microphone audio, streams it to the Voice Agent API, and plays back the agent’s response. This requires two files: an HTML page and an AudioWorklet processor.
AudioWorklet processors must be loaded from a URL (`audioContext.audioWorklet.addModule(url)`), so you need at least two files. This example won’t work in a single-file environment like CodePen or JSFiddle without modifications. Use a local server (`npx serve .`) or a framework with static file support.
Save the AudioWorklet processor as `pcm-processor.js` in the same directory as your HTML file.
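For orientation, here is a minimal sketch of what a `pcm-processor.js` might contain. It assumes mono capture and that the main thread receives each posted PCM16 buffer and forwards it to the API; the processor name `"pcm-processor"` and the helper `floatToPcm16` are illustrative, and how the audio is framed on the WebSocket is not covered here.

```javascript
// Converts Web Audio float samples in [-1, 1] to signed 16-bit PCM.
function floatToPcm16(float32) {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    pcm[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return pcm;
}

// The guard lets this file load outside an AudioWorkletGlobalScope (e.g. in tests).
if (typeof AudioWorkletProcessor !== "undefined") {
  class PcmProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0][0]; // first input, first channel
      if (channel) {
        const pcm = floatToPcm16(channel);
        // Transfer the buffer to the main thread without copying.
        this.port.postMessage(pcm.buffer, [pcm.buffer]);
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm-processor", PcmProcessor);
}
```

On the main thread, the node would be created with `new AudioWorkletNode(audioContext, "pcm-processor")` after `addModule("pcm-processor.js")` resolves, and the buffers arrive through `node.port.onmessage`.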
4. Browser compatibility
The quickstart above works as-is on Chromium-based browsers (Chrome, Edge, Brave, Arc) and Firefox. Safari has a known quirk that produces silently garbled audio if you don’t account for it.

| Browser | `AudioContext({ sampleRate })` honored | Recommended pipeline |
|---|---|---|
| Chrome / Edge | Yes | Use the quickstart as-is. |
| Firefox | Yes | Use the quickstart as-is. |
| Safari (desktop, iOS) | No — runs at hardware rate (typically 48 kHz) | Let AudioContext use its default rate and resample to/from 24 kHz inside the worklet (capture) and before playback. |
Safari: resample inside the worklet
Safari ignores the `sampleRate` constructor option, so an `AudioContext({ sampleRate: 24000 })` will silently run at 48 kHz on most Macs. Sending those samples to the Voice Agent API as if they were 24 kHz produces audio that sounds chipmunked or garbled.
Detect the actual context rate at runtime, send it into the worklet, and resample there:
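A linear-interpolation resampler along these lines is sketched below. The function name `resampleLinear` is illustrative, and the sketch treats each block independently, which is exact for integer ratios such as 48 kHz to 24 kHz but can introduce small discontinuities at block boundaries for ratios such as 44.1 kHz to 24 kHz.

```javascript
// Resamples mono float samples from fromRate to toRate by linear interpolation.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return Float32Array.from(input);
  const ratio = fromRate / toRate;                     // input samples per output sample
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;                             // fractional position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);     // clamp at the last sample
    const frac = pos - i0;
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}
```

To get the real rate into the worklet, one option is to pass `audioCtx.sampleRate` through the standard `processorOptions` field, for example `new AudioWorkletNode(audioCtx, "pcm-processor", { processorOptions: { inputRate: audioCtx.sampleRate } })`, and read it in the processor's constructor. The key `inputRate` is a made-up name for this sketch.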
For playback, either create the `AudioBuffer` at 24 kHz and let the context resample on output, or resample the decoded PCM16 to `audioCtx.sampleRate` before scheduling. The simplest version (`createBuffer(1, length, 24000)`) works on all current browsers.
Linear interpolation is good enough for speech at 24 kHz. If you want higher fidelity, use a windowed-sinc resampler such as `libsamplerate` compiled to WASM, or push the PCM16 through an `OfflineAudioContext` at the target rate.
Cross-browser checklist
- User gesture required. All major browsers gate `getUserMedia` and `AudioContext` startup behind a user gesture (Safari is strictest). Start audio inside a `click` or `touchstart` handler and call `await audioCtx.resume()` before connecting nodes.
- HTTPS or `localhost`. `getUserMedia` only works on secure origins.
- Echo cancellation. Pass `echoCancellation: true` to `getUserMedia` so the agent’s TTS playing through the speakers doesn’t get re-captured by the mic.
- Audio output sink. On iOS Safari, set the `<audio playsinline>` attribute or route through an `AudioContext` destination, since autoplay and full-screen behavior differ from desktop.
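The first three checklist items can be combined into a small startup routine, sketched below. The `#start` button id and the function names are hypothetical, and `channelCount: 1` is an added assumption for mono capture rather than something this page requires.

```javascript
// Microphone constraints with echo cancellation enabled.
function micConstraints() {
  return { audio: { echoCancellation: true, channelCount: 1 } };
}

// Resumes the context, requests the mic, and returns the source node.
// Must be called from inside a user-gesture handler on a secure origin.
async function startAudio(audioCtx, mediaDevices) {
  await audioCtx.resume(); // resume before connecting any nodes
  const stream = await mediaDevices.getUserMedia(micConstraints());
  return audioCtx.createMediaStreamSource(stream);
}

// Browser-only wiring; skipped when no DOM is available.
if (typeof document !== "undefined") {
  const button = document.querySelector("#start"); // hypothetical start button
  if (button) {
    button.addEventListener("click", async () => {
      const audioCtx = new AudioContext();
      const source = await startAudio(audioCtx, navigator.mediaDevices);
      // Connect `source` to the capture worklet node here.
    });
  }
}
```

Creating the `AudioContext` inside the click handler avoids the suspended-at-creation state that Safari applies to contexts created before any user gesture.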