How to build a voice agent with Python in 5 minutes
Python voice agent tutorial: build a real-time app with AssemblyAI speech-to-text, GPT-4, and ElevenLabs voice output in 5 minutes with clear Python code examples.



This tutorial shows you how to build a complete voice agent that listens, thinks, and responds naturally using Python. You’ll create a streaming application that processes speech in real-time, generates intelligent responses, and speaks back to users—all in under 100 lines of code.
The voice agent combines three APIs: AssemblyAI’s Universal-3 Pro Streaming model for speech-to-text, OpenAI’s GPT-4 for conversational AI, and ElevenLabs for natural voice synthesis. Each component streams data to minimize response delays and create smooth, human-like conversations.
What you’ll need to get started
You need Python 3.9 or higher, three API keys, and a computer with a microphone and speakers. The setup takes about 2 minutes once you have everything ready.
Install Python dependencies
Open your terminal and run this command to install everything you need:
pip install "assemblyai>=1.0.0" openai "elevenlabs>=1.0.0" pyaudio python-dotenv

Here’s what each package does for your voice agent:
- assemblyai: Handles real-time speech recognition with Universal-3 Pro Streaming
- openai: Connects to GPT models for smart responses
- elevenlabs: Creates natural-sounding voices
- pyaudio: Provides access to your microphone
- python-dotenv: Loads your API keys from a .env file
Configure your API keys
Create a .env file in your project directory with your API keys:
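For example, a .env file with placeholder values (replace each one with your own key; the variable names match what the code reads with os.getenv):

```
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
```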
Never share this file or commit it to version control. Add .env to your .gitignore file to protect your API keys.
What are the components of a voice agent?
A voice agent is a program that talks to you like a human using three connected parts. These parts work together to create conversations: speech-to-text converts your voice into text, a language model thinks about what you said and creates a response, and text-to-speech turns that response back into spoken words.
This pipeline needs to work in real-time to feel natural. When you speak to Siri or Alexa, you expect quick responses—not awkward pauses that break the conversational flow.
The difference between good and bad voice agents comes down to speed. Batch processing—where each step waits for the previous one to finish completely—creates those robotic pauses that make conversations feel unnatural.
AssemblyAI’s Universal-3 Pro Streaming model solves this by processing speech as it happens. You get accurate transcription with minimal delay, making conversations feel smooth and responsive.
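The pipeline can be pictured as three stages chained together. Here’s a minimal conceptual sketch—the three functions are hypothetical stand-ins, not the real APIs you’ll wire up in the rest of this tutorial:

```python
def speech_to_text(audio_chunk: bytes) -> str:
    # Stand-in: a real STT service streams partial transcripts as you speak.
    return audio_chunk.decode("utf-8")

def language_model(text: str) -> str:
    # Stand-in: a real LLM streams response tokens as they are generated.
    return f"You said: {text}"

def text_to_speech(text: str) -> bytes:
    # Stand-in: a real TTS service streams audio for immediate playback.
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes) -> bytes:
    text = speech_to_text(audio_chunk)   # 1. transcribe the user's speech
    reply = language_model(text)         # 2. generate a response
    return text_to_speech(reply)         # 3. synthesize the spoken reply

print(handle_turn(b"hello").decode("utf-8"))  # You said: hello
```

In the real agent, each stage streams into the next instead of waiting for the previous one to finish—that overlap is what eliminates the robotic pauses.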
Set up speech-to-text with AssemblyAI
Speech recognition forms the foundation of your voice agent. AssemblyAI’s Universal-3 Pro Streaming API listens to your microphone and converts speech to text in real-time. The SDK handles all WebSocket complexity automatically—no manual connection management required.
Create a new file called voice_agent.py and add this code:
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from dotenv import load_dotenv
import os

load_dotenv()

class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
        self.is_processing = False

    def on_begin(self, event: BeginEvent):
        print("Listening... Start speaking!")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
        if turn.end_of_turn:
            print(f"You said: {turn.transcript}")
            # AI processing added in next section
        else:
            print(f"Hearing: {turn.transcript}", end="\r")

    def on_error(self, error: StreamingError):
        print(f"Error: {error}")

    def on_terminated(self, event: TerminationEvent):
        print("Connection closed")
This code creates a real-time transcription system that gives you two types of output. Partial transcripts (where end_of_turn is False) show you what the system is hearing as you speak, and final transcripts (where end_of_turn is True) provide the complete sentence when you pause.
Universal-3 Pro uses punctuation-based turn detection—it ends a turn when it detects terminal punctuation (. ? !) after a natural pause. This means you don’t need to press buttons or give special commands—just speak naturally and pause.
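The idea behind punctuation-based turn detection can be illustrated with a toy heuristic—this is a simplified sketch for intuition only, not the SDK’s actual detector, which also weighs pause timing:

```python
import re

def looks_like_end_of_turn(transcript: str) -> bool:
    # Toy heuristic: treat terminal punctuation (. ? !) at the end of the
    # transcript as the signal that the speaker has finished their turn.
    return bool(re.search(r"[.?!]$", transcript.strip()))

print(looks_like_end_of_turn("What time is it?"))   # True
print(looks_like_end_of_turn("So I was thinking"))  # False
```

The SDK runs its equivalent of this check for you and reports the result as the end_of_turn flag on each TurnEvent.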
Connect the language model
The language model is the brain of your voice agent—it understands what you said and decides how to respond. OpenAI’s GPT-4 streams responses token by token, so ElevenLabs can begin speaking before the full response is generated.
Add this OpenAI integration to your VoiceAgent class:
from openai import OpenAI

class VoiceAgent:
    def __init__(self):
        # Previous code...
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
            Keep responses short and conversational.
            Talk like you're having a normal conversation with someone."""}
        ]

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
        response_text = ""
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )
        print("Assistant: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
        print()
        self.conversation.append({"role": "assistant", "content": response_text})
        self.speak(response_text)
The conversation list keeps the full chat history, so your agent remembers context across turns. The system prompt instructs GPT-4 to keep responses short and conversational—long answers feel awkward when spoken aloud.
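One caveat: because the conversation list grows every turn, a long session eventually sends a large prompt on each request. A simple refinement (not part of the tutorial code; the function name here is illustrative) is to cap the history while always preserving the system prompt:

```python
def trim_history(conversation, max_turns=10):
    # Keep the system prompt (first entry) plus the most recent
    # user/assistant exchanges, dropping the oldest ones.
    system = conversation[:1]
    recent = conversation[1:][-max_turns * 2:]  # 2 messages per turn
    return system + recent

# Build a synthetic 15-turn history to demonstrate the trimming.
history = [{"role": "system", "content": "Be brief."}]
for i in range(15):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=10)
print(len(trimmed))  # 21: the system prompt plus the last 10 exchanges
```

You could call a helper like this on self.conversation before each API request to keep latency and token costs flat over long sessions.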
Add text-to-speech output
Text-to-speech completes your voice agent by converting AI responses into natural-sounding speech. ElevenLabs provides high-quality voice synthesis that starts playing audio before the entire response finishes generating.
Add voice synthesis to your VoiceAgent class:
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
import threading

class VoiceAgent:
    def __init__(self):
        # Previous code...
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"  # Sarah voice

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        thread = threading.Thread(target=generate_and_play, daemon=True)
        thread.start()
ElevenLabs offers different voices with distinct personalities:
- Sarah (EXAVITQu4vr4xnSDxMaL): Clear, professional female voice (used in this example)
- Josh (TxGEqnHWrfWFTfGW9XjX): Warm, friendly male voice
- Elli (MF3mGyEYCl7XYWbV9V6O): Young, energetic female voice
The background thread prevents voice synthesis from freezing your program. While audio generates and plays, your voice agent continues listening for the next input.
Build the complete voice agent
Here’s your complete voice_agent.py file:
import assemblyai as aai
import os
import sys
import threading
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.is_processing = False
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
            Keep responses short and conversational.
            Talk like you're having a normal conversation with someone."""}
        ]

    def on_begin(self, event: BeginEvent):
        print("\nVoice Agent Ready! Start speaking...\n")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
        if turn.end_of_turn:
            print("\r" + " " * 50 + "\r", end="")
            print(f"You: {turn.transcript}")
            if not self.is_processing:
                self.is_processing = True
                self.process_with_llm(turn.transcript)
                self.is_processing = False
        else:
            print(f"Listening: {turn.transcript}...", end="\r")

    def on_error(self, error: StreamingError):
        print(f"\nError: {error}\n")

    def on_terminated(self, event: TerminationEvent):
        print("\nVoice Agent stopped\n")

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
        response_text = ""
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )
        print("Agent: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
        print()
        self.conversation.append({"role": "assistant", "content": response_text})
        self.speak(response_text)

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        voice_thread = threading.Thread(target=generate_and_play, daemon=True)
        voice_thread.start()

    def start(self):
        self.client.connect(
            StreamingParameters(
                sample_rate=16000,
                speech_model="u3-rt-pro",
            )
        )
        try:
            self.client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
        except KeyboardInterrupt:
            self.stop()

    def stop(self):
        print("\nStopping voice agent...")
        self.client.disconnect(terminate=True)
        sys.exit(0)

if __name__ == "__main__":
    agent = VoiceAgent()
    agent.start()
This complete implementation includes error handling, conversation memory, and clean shutdown. The agent remembers what you’ve talked about during each session, enabling natural back-and-forth conversations.
Run your voice agent
Start your voice agent with this command:
python voice_agent.py
When you see "Voice Agent Ready! Start speaking..." the agent is listening. Speak normally into your microphone and the agent responds with both text and voice.
Try these conversation starters to test your agent:
- "What’s the weather like today?"
- "Tell me a quick joke"
- "Help me plan dinner"
- "Explain how WiFi works simply"
Common problems and solutions:
- No microphone input: Check system permissions and microphone settings
- Slow responses: Test your internet connection and consider using gpt-3.5-turbo for faster processing
- Voice cuts off: Add a small delay after TTS playback or check your ElevenLabs API quota
Final words
You’ve built a complete streaming voice agent that processes speech in real-time and responds with natural conversation. This implementation combines speech recognition, AI processing, and voice synthesis into a single program that demonstrates the power of modern Voice AI models.
AssemblyAI’s Universal-3 Pro Streaming model makes this possible by providing the accuracy and speed that voice agents require. The SDK handles complex WebSocket connections and audio processing, letting you focus on building your application instead of managing low-level networking code.
To go further, explore the Universal-3 Pro Streaming docs for advanced features like keyterm prompting, speaker diarization, and real-time configuration updates—all without restarting your agent.
Frequently asked questions
Do I need WebSocket knowledge to build this voice agent?
No. The AssemblyAI Python SDK handles the WebSocket connection, reconnection logic, and audio streaming protocol automatically. You write event handlers and the SDK takes care of the rest.
How much does running this voice agent cost per hour?
This voice agent costs approximately $0.50–$1.00 per hour of conversation across all three services. AssemblyAI charges about $0.45/hr for Universal-3 Pro Streaming transcription, OpenAI costs roughly $0.30/hour for GPT-4 responses, and ElevenLabs runs about $0.20/hour for voice synthesis.
Can I replace AssemblyAI with a different speech-to-text service?
While technically possible, switching providers requires implementing WebSocket handling, audio streaming protocols, turn detection logic, and connection management yourself. You’d lose AssemblyAI’s built-in punctuation-based turn detection and the SDK simplicity that enables 5-minute implementation.
Can I use this pattern inside a framework like Pipecat or LiveKit?
Yes — AssemblyAI has first-party integrations for Pipecat, LiveKit, Vapi, and Twilio. These frameworks handle telephony, orchestration, and turn-taking so you can focus on your agent’s logic.
Does this work with languages other than English?
Yes. You can configure the speech model for other languages using StreamingParameters, for example: StreamingParameters(speech_model="u3-rt-pro", sample_rate=16_000, prompt="Transcribe Spanish."). For a full list of supported languages, see the Supported Languages page.