Evaluate streaming transcription accuracy with WER
Learn how to evaluate the accuracy of your AssemblyAI streaming transcripts using Word Error Rate (WER), the industry-standard metric for measuring speech-to-text performance. This guide walks you through setting up a complete benchmarking workflow to measure how well your streaming implementation performs against a reference transcript.
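As a quick refresher before diving in: WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference. The jiwer library computes this for you later in this guide; the minimal sketch below only illustrates the arithmetic behind the metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, substitution)
    return d[-1][-1] / len(ref)

# One substitution ("fox" -> "box") across four reference words = 25% WER
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```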
Quickstart
```python
# pip install websocket-client jiwer whisper-normalizer
import websocket
import json
import os
import threading
import time
import wave
from urllib.parse import urlencode
import jiwer
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer

# --- Configuration ---
ASSEMBLYAI_API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
AUDIO_FILE = "audio.wav"  # Path to your audio file
SAMPLE_RATE = 48000  # Change to match the sample rate of your audio file

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "sample_rate": SAMPLE_RATE,
}
API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

# Global variables
ws_app = None
audio_thread = None
stop_event = threading.Event()
assembly_streaming_transcript = ""

# --- WebSocket Event Handlers ---

def on_open(ws):
    """Called when the WebSocket connection is established."""
    print("WebSocket connection opened.")

    def stream_file():
        chunk_duration = 0.1

        with wave.open(AUDIO_FILE, 'rb') as wav_file:
            if wav_file.getnchannels() != 1:
                raise ValueError("Only mono audio is supported")

            file_sample_rate = wav_file.getframerate()
            if file_sample_rate != SAMPLE_RATE:
                print(f"Warning: File sample rate ({file_sample_rate}) doesn't match expected rate ({SAMPLE_RATE})")

            frames_per_chunk = int(file_sample_rate * chunk_duration)

            while not stop_event.is_set():
                frames = wav_file.readframes(frames_per_chunk)
                if not frames:
                    break
                ws.send(frames, websocket.ABNF.OPCODE_BINARY)

        print("File streaming complete. Waiting for final transcripts...")
        try:
            ws.send(json.dumps({"type": "Terminate"}))
        except Exception:
            pass

    global audio_thread
    audio_thread = threading.Thread(target=stream_file)
    audio_thread.daemon = True
    audio_thread.start()


def on_message(ws, message):
    global assembly_streaming_transcript
    try:
        data = json.loads(message)
        msg_type = data.get('type')

        if msg_type == "Begin":
            print(f"Session ID: {data.get('id')}")
        elif msg_type == "Turn":
            transcript = data.get('transcript', '')
            if data.get('end_of_turn'):
                assembly_streaming_transcript += transcript + " "
                print(transcript)
        elif msg_type == "Termination":
            print(f"Session terminated: {data.get('audio_duration_seconds', 0)} seconds of audio processed")
    except Exception as e:
        print(f"Error handling message: {e}")


def on_error(ws, error):
    """Called when a WebSocket error occurs."""
    print(f"\nWebSocket Error: {error}")
    stop_event.set()


def on_close(ws, close_status_code, close_msg):
    """Called when the WebSocket connection is closed."""
    print(f"\nWebSocket Disconnected: Status={close_status_code}")
    stop_event.set()
    if audio_thread and audio_thread.is_alive():
        audio_thread.join(timeout=1.0)


# --- Main Execution ---

ws_app = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": ASSEMBLYAI_API_KEY},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)

ws_thread = threading.Thread(target=ws_app.run_forever)
ws_thread.daemon = True
ws_thread.start()

try:
    while ws_thread.is_alive():
        time.sleep(0.1)
except KeyboardInterrupt:
    print("\nStopping...")
    stop_event.set()
    if ws_app:
        ws_app.close()
    ws_thread.join(timeout=2.0)

# --- Evaluate collected transcripts ---

reference_transcript = "AssemblyAI is a deep learning company that builds powerful APIs to help you transcribe and understand audio. The most common use case for the API is to automatically convert prerecorded audio and video files as well as real time audio streams into text transcriptions. Our APIs convert audio and video into text using powerful deep learning models that we research and develop end to end in house. Millions of podcasts, zoom recordings, phone calls or video files are being transcribed with Assembly AI every single day. But where Assembly AI really excels is with helping you understand your data. So let's say we transcribe Joe Biden's State of the Union using Assembly AI's API. With our Auto Chapters feature, you can generate time coded summaries of the key moments of your audio file. For example, with the State of the Union address we get chapter summaries like this. Auto Chapters automatically segments your audio or video files into chapters and provides a summary for each of these chapters. With Sentiment Analysis, we can classify what's being spoken in your audio files as either positive, negative or neutral. So for example, in the State of the Union address we see that this sentence was classified as positive, whereas this sentence was classified as negative. Content Safety Detection can flag sensitive content as it is spoken like hate speech, profanity, violence or weapons. For example, in Biden's State of the Union address, content safety detection flagged parts of his speech as being about weapons. This feature is especially useful for automatic content moderation and brand safety use cases. With Auto Highlights, you can automatically identify important words and phrases that are being spoken in your data owned by the State of the Union address. AssemblyAI's API detected these words and phrases as being important. Lastly, with entity detection you can identify entities that are spoken in your audio like organization names or person names. In Biden's speech, these were the entities that were detected. This is just a preview of the most popular features of AssemblyAI's API. If you want a full list of features, go check out our API documentation linked in the description below. And if you ever need some support, our team of developers is here to help. Everyday developers are using these features to build really exciting applications. From meeting summarizers to brand safety or contextual targeting platforms to full blown conversational intelligence tools. We can't wait to see what you build with AssemblyAI."

# Initialize normalizer
normalizer = EnglishTextNormalizer()
# For Spanish and other languages
# normalizer = BasicTextNormalizer()

def calculate_wer(reference, hypothesis, language='en'):
    # Normalize both texts
    normalized_reference = normalizer(reference)
    print("Reference: " + reference)
    print("Normalized Reference: " + normalized_reference + "\n")

    normalized_hypothesis = normalizer(hypothesis)
    print("Hypothesis: " + hypothesis)
    print("Normalized Hypothesis: " + normalized_hypothesis + "\n")

    # Calculate WER
    wer = jiwer.wer(normalized_reference, normalized_hypothesis)

    return wer * 100  # Return as percentage

wer_score = calculate_wer(reference_transcript, assembly_streaming_transcript.strip())
print(f"Final WER: {wer_score:.2f}%")
```
Step-by-step implementation
- Install the required dependencies
```bash
pip install websocket-client jiwer whisper-normalizer
```
- Import the necessary libraries
```python
# pip install websocket-client jiwer whisper-normalizer
import websocket
import json
import os
import threading
import time
import wave
from urllib.parse import urlencode
import jiwer
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer
```
- Set up configuration and transcript collection
Configure your API key, audio file settings, and create a global variable to store streaming transcripts. Your streaming session will append to this variable as it processes audio, and you’ll use it for WER analysis.
```python
ASSEMBLYAI_API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
AUDIO_FILE = "audio.wav"  # Path to your audio file
SAMPLE_RATE = 48000  # Change to match the sample rate of your audio file

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "sample_rate": SAMPLE_RATE,
}
API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

# Global variables
ws_app = None
audio_thread = None
stop_event = threading.Event()
assembly_streaming_transcript = ""
```
- Configure streaming audio processing
Stream your audio file to the AssemblyAI endpoint. The on_message function captures final transcripts and appends them to your collection variable.
```python
def on_open(ws):
    """Called when the WebSocket connection is established."""
    print("WebSocket connection opened.")

    def stream_file():
        chunk_duration = 0.1

        with wave.open(AUDIO_FILE, 'rb') as wav_file:
            if wav_file.getnchannels() != 1:
                raise ValueError("Only mono audio is supported")

            file_sample_rate = wav_file.getframerate()
            if file_sample_rate != SAMPLE_RATE:
                print(f"Warning: File sample rate ({file_sample_rate}) doesn't match expected rate ({SAMPLE_RATE})")

            frames_per_chunk = int(file_sample_rate * chunk_duration)

            while not stop_event.is_set():
                frames = wav_file.readframes(frames_per_chunk)
                if not frames:
                    break
                ws.send(frames, websocket.ABNF.OPCODE_BINARY)

        print("File streaming complete. Waiting for final transcripts...")
        try:
            ws.send(json.dumps({"type": "Terminate"}))
        except Exception:
            pass

    global audio_thread
    audio_thread = threading.Thread(target=stream_file)
    audio_thread.daemon = True
    audio_thread.start()


def on_message(ws, message):
    global assembly_streaming_transcript
    try:
        data = json.loads(message)
        msg_type = data.get('type')

        if msg_type == "Begin":
            print(f"Session ID: {data.get('id')}")
        elif msg_type == "Turn":
            transcript = data.get('transcript', '')
            if data.get('end_of_turn'):
                assembly_streaming_transcript += transcript + " "
                print(transcript)
        elif msg_type == "Termination":
            print(f"Session terminated: {data.get('audio_duration_seconds', 0)} seconds of audio processed")
    except Exception as e:
        print(f"Error handling message: {e}")


def on_error(ws, error):
    """Called when a WebSocket error occurs."""
    print(f"\nWebSocket Error: {error}")
    stop_event.set()


def on_close(ws, close_status_code, close_msg):
    """Called when the WebSocket connection is closed."""
    print(f"\nWebSocket Disconnected: Status={close_status_code}")
    stop_event.set()
    if audio_thread and audio_thread.is_alive():
        audio_thread.join(timeout=1.0)


# Connect and stream
ws_app = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": ASSEMBLYAI_API_KEY},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)

ws_thread = threading.Thread(target=ws_app.run_forever)
ws_thread.daemon = True
ws_thread.start()

try:
    while ws_thread.is_alive():
        time.sleep(0.1)
except KeyboardInterrupt:
    print("\nStopping...")
    stop_event.set()
    if ws_app:
        ws_app.close()
    ws_thread.join(timeout=2.0)
```
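To see how the transcript collection works, here is the Turn-handling logic from on_message exercised against sample messages shaped like the fields the handler reads (the payloads below are illustrative, not captured from a live session):

```python
import json

assembly_streaming_transcript = ""

def handle(message):
    """Minimal replica of the Turn-handling branch from on_message above."""
    global assembly_streaming_transcript
    data = json.loads(message)
    if data.get('type') == "Turn" and data.get('end_of_turn'):
        assembly_streaming_transcript += data.get('transcript', '') + " "

# Partial turns are ignored; only end-of-turn transcripts are collected
handle(json.dumps({"type": "Turn", "transcript": "hello", "end_of_turn": False}))
handle(json.dumps({"type": "Turn", "transcript": "hello world", "end_of_turn": True}))
handle(json.dumps({"type": "Turn", "transcript": "how are you", "end_of_turn": True}))

print(assembly_streaming_transcript.strip())  # hello world how are you
```

Because only end-of-turn transcripts are appended, the collected text is the final formatted output for each turn rather than the intermediate partials.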
- Prepare your reference transcript
Define the ground truth transcript for comparison. This serves as your accuracy benchmark for the WER calculation.
Pro tip: Create a high-quality reference transcript by first transcribing your audio file with AssemblyAI’s Universal-3 Pro model, then manually reviewing and correcting any errors to achieve 100% accuracy.
Ground truth quality directly affects WER results. Human transcriptions often contain systematic errors — missing filler words, incorrect proper nouns, and simplified speech patterns. If your reference transcript has errors, your WER score will be misleading. For detailed guidance on auditing ground truth files, see the streaming evaluation guide.
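To make the effect concrete, here is a hypothetical illustration (using a small self-contained WER helper rather than jiwer, and made-up sentences) of how a single misheard proper noun in the reference inflates the score of an otherwise perfect hypothesis:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Minimal word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[-1][-1] / len(r)

hypothesis = "thanks for joining the kubernetes meetup"  # what the model returned

correct_reference = "thanks for joining the kubernetes meetup"
flawed_reference = "thanks for joining the cooper netties meetup"  # human mishearing

print(wer(correct_reference, hypothesis))  # 0.0 -- the model was right
print(wer(flawed_reference, hypothesis))   # ~0.29 -- the reference error is charged to the model
```

Two words of reference error turn a perfect transcript into a reported ~29% WER, which is why auditing ground truth matters before drawing conclusions.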
```python
# Evaluate collected transcripts
reference_transcript = "AssemblyAI is a deep learning company that builds powerful APIs to help you transcribe and understand audio. The most common use case for the API is to automatically convert prerecorded audio and video files as well as real time audio streams into text transcriptions. Our APIs convert audio and video into text using powerful deep learning models that we research and develop end to end in house. Millions of podcasts, zoom recordings, phone calls or video files are being transcribed with Assembly AI every single day. But where Assembly AI really excels is with helping you understand your data. So let's say we transcribe Joe Biden's State of the Union using Assembly AI's API. With our Auto Chapters feature, you can generate time coded summaries of the key moments of your audio file. For example, with the State of the Union address we get chapter summaries like this. Auto Chapters automatically segments your audio or video files into chapters and provides a summary for each of these chapters. With Sentiment Analysis, we can classify what's being spoken in your audio files as either positive, negative or neutral. So for example, in the State of the Union address we see that this sentence was classified as positive, whereas this sentence was classified as negative. Content Safety Detection can flag sensitive content as it is spoken like hate speech, profanity, violence or weapons. For example, in Biden's State of the Union address, content safety detection flagged parts of his speech as being about weapons. This feature is especially useful for automatic content moderation and brand safety use cases. With Auto Highlights, you can automatically identify important words and phrases that are being spoken in your data owned by the State of the Union address. AssemblyAI's API detected these words and phrases as being important. Lastly, with entity detection you can identify entities that are spoken in your audio like organization names or person names. In Biden's speech, these were the entities that were detected. This is just a preview of the most popular features of AssemblyAI's API. If you want a full list of features, go check out our API documentation linked in the description below. And if you ever need some support, our team of developers is here to help. Everyday developers are using these features to build really exciting applications. From meeting summarizers to brand safety or contextual targeting platforms to full blown conversational intelligence tools. We can't wait to see what you build with AssemblyAI."
```
- Initialize text normalization
Set up the normalizer and create your WER calculation function to ensure consistent text formatting before comparison.
```python
# Initialize normalizers
normalizer = EnglishTextNormalizer()
# For Spanish and other languages
# normalizer = BasicTextNormalizer()

def calculate_wer(reference, hypothesis, language='en'):
    # Normalize both texts
    normalized_reference = normalizer(reference)
    print("Reference: " + reference)
    print("Normalized Reference: " + normalized_reference + "\n")

    normalized_hypothesis = normalizer(hypothesis)
    print("Hypothesis: " + hypothesis)
    print("Normalized Hypothesis: " + normalized_hypothesis + "\n")

    # Calculate WER
    wer = jiwer.wer(normalized_reference, normalized_hypothesis)

    return wer * 100  # Return as percentage
```
- Calculate your WER score
Run the final calculation to measure transcription accuracy.
```python
wer_score = calculate_wer(reference_transcript, assembly_streaming_transcript.strip())
print(f"Final WER: {wer_score:.2f}%")
```
Next steps
WER is a useful starting point, but it treats all errors equally — trivial formatting differences are penalized the same as critical errors like wrong names or hallucinated words. Consider complementing your WER analysis with Semantic WER, which normalizes equivalent words and phrases before calculating WER so that differences like dr. vs doctor or 1300 vs thirteen hundred aren’t counted as errors. For a complete evaluation framework, see the streaming evaluation guide.
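As a rough sketch of that idea (the equivalence table below is hypothetical and far smaller than what a real semantic normalizer covers), mapping equivalent surface forms onto one canonical spelling before scoring removes formatting-only "errors":

```python
# Hypothetical equivalence table -- a real semantic normalizer is far broader
EQUIVALENTS = {
    "dr.": "doctor",
    "1300": "thirteen hundred",
}

def semantic_normalize(text: str) -> str:
    """Map equivalent surface forms to one canonical spelling before WER."""
    return " ".join(EQUIVALENTS.get(word, word) for word in text.lower().split())

reference = "Dr. Smith arrived at 1300 hours"
hypothesis = "doctor smith arrived at thirteen hundred hours"

# After normalization the two strings match, so WER over them would be zero
print(semantic_normalize(reference))
print(semantic_normalize(hypothesis))
print(semantic_normalize(reference) == semantic_normalize(hypothesis))  # True
```

Plain WER would charge two errors here even though the transcripts say the same thing; normalizing both sides first keeps the metric focused on genuine recognition mistakes.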