How to build a voice agent with Python in 5 minutes
Python voice agent tutorial: build a real-time app with AssemblyAI speech-to-text, GPT-4, and ElevenLabs voice output in 5 minutes with clear Python code examples.



This tutorial shows you how to build a complete voice agent that listens, thinks, and responds naturally using Python. You’ll create a streaming application that processes speech in real-time, generates intelligent responses, and speaks back to users—all in under 100 lines of code.
The voice agent combines three APIs: AssemblyAI’s Universal-3 Pro Streaming model for speech-to-text, OpenAI’s GPT-4 for conversational AI, and ElevenLabs for natural voice synthesis. Each component streams data to minimize response delays and create smooth, human-like conversations.
What you’ll need to get started
You need Python 3.9 or higher, three API keys, and a computer with a microphone and speakers. The setup takes about 2 minutes once you have everything ready.
Install Python dependencies
Open your terminal and run this command to install everything you need:
pip install "assemblyai>=1.0.0" openai "elevenlabs>=1.0.0" pyaudio python-dotenv

Here’s what each package does for your voice agent:
- assemblyai: Handles real-time speech recognition with Universal-3 Pro Streaming
- openai: Connects to GPT models for smart responses
- elevenlabs: Creates natural-sounding voices
- pyaudio: Provides access to your microphone
- python-dotenv: Loads your API keys from a .env file
Configure your API keys
Create a .env file in your project directory with your API keys:
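For example, a .env file with placeholder values (replace each one with your own key; the variable names match what the code reads with os.getenv):

```
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
```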
Never share this file or commit it to version control. Add .env to your .gitignore file to protect your API keys.
What are the components of a voice agent?
A voice agent is a program that talks to you like a human using three connected parts. These parts work together to create conversations: speech-to-text converts your voice into text, a language model thinks about what you said and creates a response, and text-to-speech turns that response back into spoken words.
This pipeline needs to work in real-time to feel natural. When you speak to Siri or Alexa, you expect quick responses—not awkward pauses that break the conversational flow.
The difference between good and bad voice agents comes down to speed. Batch processing—where each step waits for the previous one to finish completely—creates those robotic pauses that make conversations feel unnatural.
AssemblyAI’s Universal-3 Pro Streaming model solves this by processing speech as it happens. You get accurate transcription with minimal delay, making conversations feel smooth and responsive.
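The pipeline can be pictured as three stages chained together. Here’s a minimal conceptual sketch—the three functions are hypothetical stand-ins, not the real APIs you’ll wire up in the rest of this tutorial:

```python
def speech_to_text(audio_chunk: bytes) -> str:
    # Stand-in: a real STT service streams partial transcripts as you speak.
    return audio_chunk.decode("utf-8")

def language_model(text: str) -> str:
    # Stand-in: a real LLM streams response tokens as they are generated.
    return f"You said: {text}"

def text_to_speech(text: str) -> bytes:
    # Stand-in: a real TTS service streams audio for immediate playback.
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes) -> bytes:
    text = speech_to_text(audio_chunk)   # 1. transcribe the user's speech
    reply = language_model(text)         # 2. generate a response
    return text_to_speech(reply)         # 3. synthesize the spoken reply

print(handle_turn(b"hello").decode("utf-8"))  # You said: hello
```

In the real agent, each stage streams into the next instead of waiting for the previous one to finish—that overlap is what eliminates the robotic pauses.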
Set up speech-to-text with AssemblyAI
Speech recognition forms the foundation of your voice agent. AssemblyAI’s Universal-3 Pro Streaming API listens to your microphone and converts speech to text in real-time. The SDK handles all WebSocket complexity automatically—no manual connection management required.
Create a new file called voice_agent.py and add this code:
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from dotenv import load_dotenv
import os

load_dotenv()

class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
        self.is_processing = False

    def on_begin(self, event: BeginEvent):
        print("Listening... Start speaking!")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
        if turn.end_of_turn:
            print(f"You said: {turn.transcript}")
            # AI processing added in next section
        else:
            print(f"Hearing: {turn.transcript}", end="\r")

    def on_error(self, error: StreamingError):
        print(f"Error: {error}")

    def on_terminated(self, event: TerminationEvent):
        print("Connection closed")
This code creates a real-time transcription system that gives you two types of output. Partial transcripts (where end_of_turn is False) show you what the system is hearing as you speak, and final transcripts (where end_of_turn is True) provide the complete sentence when you pause.
Universal-3 Pro uses punctuation-based turn detection—it ends a turn when it detects terminal punctuation (. ? !) after a natural pause. This means you don’t need to press buttons or give special commands—just speak naturally and pause.
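The idea behind punctuation-based turn detection can be illustrated with a toy heuristic—this is a simplified sketch for intuition only, not the SDK’s actual detector, which also weighs pause timing:

```python
import re

def looks_like_end_of_turn(transcript: str) -> bool:
    # Toy heuristic: treat terminal punctuation (. ? !) at the end of the
    # transcript as the signal that the speaker has finished their turn.
    return bool(re.search(r"[.?!]$", transcript.strip()))

print(looks_like_end_of_turn("What time is it?"))   # True
print(looks_like_end_of_turn("So I was thinking"))  # False
```

The SDK runs its equivalent of this check for you and reports the result as the end_of_turn flag on each TurnEvent.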
Connect the language model
The language model is the brain of your voice agent—it understands what you said and decides how to respond. OpenAI’s GPT-4 streams responses token by token, so ElevenLabs can begin speaking before the full response is generated.
Add this OpenAI integration to your VoiceAgent class:
from openai import OpenAI

class VoiceAgent:
    def __init__(self):
        # Previous code...
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
            Keep responses short and conversational.
            Talk like you're having a normal conversation with someone."""}
        ]

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
        response_text = ""
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )
        print("Assistant: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
        print()
        self.conversation.append({"role": "assistant", "content": response_text})
        self.speak(response_text)
The conversation list keeps the full chat history, so your agent remembers context across turns. The system prompt instructs GPT-4 to keep responses short and conversational—long answers feel awkward when spoken aloud.
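One caveat: because the conversation list grows every turn, a long session eventually sends a large prompt on each request. A simple refinement (not part of the tutorial code; the function name here is illustrative) is to cap the history while always preserving the system prompt:

```python
def trim_history(conversation, max_turns=10):
    # Keep the system prompt (first entry) plus the most recent
    # user/assistant exchanges, dropping the oldest ones.
    system = conversation[:1]
    recent = conversation[1:][-max_turns * 2:]  # 2 messages per turn
    return system + recent

# Build a synthetic 15-turn history to demonstrate the trimming.
history = [{"role": "system", "content": "Be brief."}]
for i in range(15):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=10)
print(len(trimmed))  # 21: the system prompt plus the last 10 exchanges
```

You could call a helper like this on self.conversation before each API request to keep latency and token costs flat over long sessions.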
Add text-to-speech output
Text-to-speech completes your voice agent by converting AI responses into natural-sounding speech. ElevenLabs provides high-quality voice synthesis that starts playing audio before the entire response finishes generating.
Add voice synthesis to your VoiceAgent class:
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
import threading

class VoiceAgent:
    def __init__(self):
        # Previous code...
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"  # Sarah voice

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        thread = threading.Thread(target=generate_and_play, daemon=True)
        thread.start()
ElevenLabs offers different voices with distinct personalities:
- Sarah (EXAVITQu4vr4xnSDxMaL): Clear, professional female voice (used in this example)
- Josh (TxGEqnHWrfWFTfGW9XjX): Warm, friendly male voice
- Elli (MF3mGyEYCl7XYWbV9V6O): Young, energetic female voice
The background thread prevents voice synthesis from freezing your program. While audio generates and plays, your voice agent continues listening for the next input.
Build the complete voice agent
Here’s your complete voice_agent.py file:
import assemblyai as aai
import os
import sys
import threading
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.is_processing = False
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
            Keep responses short and conversational.
            Talk like you're having a normal conversation with someone."""}
        ]

    def on_begin(self, event: BeginEvent):
        print("\nVoice Agent Ready! Start speaking...\n")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
        if turn.end_of_turn:
            print("\r" + " " * 50 + "\r", end="")
            print(f"You: {turn.transcript}")
            if not self.is_processing:
                self.is_processing = True
                self.process_with_llm(turn.transcript)
                self.is_processing = False
        else:
            print(f"Listening: {turn.transcript}...", end="\r")

    def on_error(self, error: StreamingError):
        print(f"\nError: {error}\n")

    def on_terminated(self, event: TerminationEvent):
        print("\nVoice Agent stopped\n")

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
        response_text = ""
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )
        print("Agent: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
        print()
        self.conversation.append({"role": "assistant", "content": response_text})
        self.speak(response_text)

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        voice_thread = threading.Thread(target=generate_and_play, daemon=True)
        voice_thread.start()

    def start(self):
        self.client.connect(
            StreamingParameters(
                sample_rate=16000,
                speech_model="u3-rt-pro",
            )
        )
        try:
            self.client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
        except KeyboardInterrupt:
            self.stop()

    def stop(self):
        print("\nStopping voice agent...")
        self.client.disconnect(terminate=True)
        sys.exit(0)

if __name__ == "__main__":
    agent = VoiceAgent()
    agent.start()
This complete implementation includes error handling, conversation memory, and clean shutdown. The agent remembers what you’ve talked about during each session, enabling natural back-and-forth conversations.
Run your voice agent
Start your voice agent with this command:
python voice_agent.py
When you see "Voice Agent Ready! Start speaking..." the agent is listening. Speak normally into your microphone and the agent responds with both text and voice.
Try these conversation starters to test your agent:
- "What’s the weather like today?"
- "Tell me a quick joke"
- "Help me plan dinner"
- "Explain how WiFi works simply"
Common problems and solutions:
- No microphone input: Check system permissions and microphone settings
- Slow responses: Test your internet connection and consider using gpt-3.5-turbo for faster processing
- Voice cuts off: Add a small delay after TTS playback or check your ElevenLabs API quota
Final words
You’ve built a complete streaming voice agent that processes speech in real-time and responds with natural conversation. This implementation combines speech recognition, AI processing, and voice synthesis into a single program that demonstrates the power of modern Voice AI models.
AssemblyAI’s Universal-3 Pro Streaming model makes this possible by providing the accuracy and speed that voice agents require. The SDK handles complex WebSocket connections and audio processing, letting you focus on building your application instead of managing low-level networking code.
To go further, explore the Universal-3 Pro Streaming docs for advanced features like keyterm prompting, speaker diarization, and real-time configuration updates—all without restarting your agent.
Frequently asked questions
Do I need WebSocket knowledge to build this voice agent?
No. The AssemblyAI Python SDK handles the WebSocket connection, reconnection logic, and audio streaming protocol automatically. You write event handlers and the SDK takes care of the rest.
How much does running this voice agent cost per hour?
This voice agent costs approximately $0.50–$1.00 per hour of conversation across all three services. AssemblyAI charges about $0.45/hr for Universal-3 Pro Streaming transcription, OpenAI costs roughly $0.30/hour for GPT-4 responses, and ElevenLabs runs about $0.20/hour for voice synthesis.
Can I replace AssemblyAI with a different speech-to-text service?
While technically possible, switching providers requires implementing WebSocket handling, audio streaming protocols, turn detection logic, and connection management yourself. You’d lose AssemblyAI’s built-in punctuation-based turn detection and the SDK simplicity that enables 5-minute implementation.
Can I use this pattern inside a framework like Pipecat or LiveKit?
Yes — AssemblyAI has first-party integrations for Pipecat, LiveKit, Vapi, and Twilio. These frameworks handle telephony, orchestration, and turn-taking so you can focus on your agent’s logic.
Does this work with languages other than English?
Yes. You can configure the speech model for other languages using StreamingParameters, for example: StreamingParameters(speech_model="u3-rt-pro", sample_rate=16_000, prompt="Transcribe Spanish."). For a full list of supported languages, see the Supported Languages page.