Introduction
The high-level objective of a streaming STT model evaluation is to answer the question: Which streaming Speech-to-text model is the best for my voice agent? For voice agents, the STT component is the “ears” of the system: transcription errors propagate into the LLM and response logic, so even small accuracy gaps compound in impact. At scale, these differences have real financial consequences: a 10% difference in entity accuracy (for example, credit card numbers) across millions of calls can translate to significant revenue loss. A rigorous evaluation ensures you choose the model that drives the best outcomes. This guide provides a step-by-step framework for evaluating and benchmarking streaming Speech-to-text models to help you select the best fit for your voice agent.
Need help evaluating our Speech-to-text products? Contact our Sales team to request an evaluation.
Common evaluation metrics
Time to First Token (TTFT) / Time to First Byte (TTFB)
This measures the time from when the audio stream begins (including model startup/initialization) to when the very first token is returned by the model.
For Universal-3 Pro Streaming, the interruption_delay parameter directly controls TTFT by configuring how soon the first partial transcript is emitted. Lower values (e.g. 0) produce faster TTFT (~300ms effective) at the cost of less audio context; higher values (e.g. 500–1000) are slower but more confident. See Tuning early partial timing for configuration details.
Time to Complete Transcript (TTCT) / Transcription Delay
This measures the latency between when a user finishes speaking (end of speech detected) and when the complete transcription for that utterance is received from the STT model.
TTCT is measured on a per-utterance basis, and a single user turn may contain multiple utterances. This metric is crucial for minimizing overall voice agent latency, since it determines how soon you can send STT output downstream to the LLM.
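As a minimal sketch, the two latency metrics above can be computed from timestamps that your own test harness logs. The function names and the four timestamps (`stream_start`, `first_token`, `speech_end`, `final_transcript`) are hypothetical and not part of any provider SDK; the sketch assumes all timestamps come from the same monotonic clock, in seconds, and that end of speech is taken from reference labels or a voice activity detector.

```python
import math
from statistics import median


def ttft_ms(stream_start: float, first_token: float) -> float:
    """Time from opening the audio stream to the first returned token."""
    return (first_token - stream_start) * 1000.0


def ttct_ms(speech_end: float, final_transcript: float) -> float:
    """Time from end of speech to the complete transcript for that utterance."""
    return (final_transcript - speech_end) * 1000.0


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Median and nearest-rank 95th percentile over many runs or utterances."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"median": median(ordered), "p95": ordered[p95_index]}
```

Reporting a tail percentile alongside the median is a design choice of this sketch: occasional slow responses tend to be what users notice in a live conversation.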
End-of-Turn Finalization Latency / Endpointing Latency
This measures the time from when the user actually finishes speaking (end of speech detected) to when the system recognizes and signals the end of their conversational turn. This includes both speech detection latency and any additional processing to determine turn completion.
End-of-Turn Detection Accuracy
This measures how accurately the model detects when a user has finished speaking, considering true positives (correct detections), false positives (premature cutoffs), and false negatives (missed endpoints).
Word Error Rate (WER)
WER = (S + D + I) / N
This formula takes the number of Substitutions (S), Deletions (D), and Insertions (I), and divides their sum by the Total Number of Words in the ground truth transcript (N).
For streaming models, it’s important to measure both partial WER (accuracy of interim results) and final WER (accuracy after all corrections). The delta between these indicates the model’s self-correction capability. While the WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data.
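The sketch below shows one way to compute WER with a word-level edit distance, and to compare a partial against a final transcript. It assumes whitespace tokenization and lowercasing only; the helper `self_correction_delta` and the choice to score the last partial against the full reference are assumptions made for illustration, not a prescribed method.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def self_correction_delta(reference: str, last_partial: str, final: str) -> float:
    """Partial WER minus final WER; larger values mean more was corrected."""
    return wer(reference, last_partial) - wer(reference, final)
```

In practice you would apply the same text normalization to both transcripts before calling `wer`, so that casing and punctuation differences are not counted as errors.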
Semantic WER
Traditional WER treats every difference between the model output and a reference transcript as an error, even when the difference is semantically equivalent. Semantic WER corrects this by normalizing equivalent words and phrases before calculating WER, so that differences like dr. vs doctor or 1300 vs thirteen hundred aren’t counted as errors.
Rule-based normalization
At its simplest, Semantic WER is a preprocessing step. Before running standard WER, apply find-and-replace rules to both the reference and hypothesis transcripts:
- Number formats: 1300 → thirteen hundred, $5 → five dollars
- Abbreviations and titles: dr. → doctor, mr. → mister, govt → government
- Contractions: gonna → going to, can't → cannot
- Variant spellings: grey → gray, cancelled → canceled
- Filler words: Remove um, uh, you know from both sides (or keep both; just be consistent)
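The rules above can be sketched as a small normalization function applied to both transcripts before computing WER. The rule table contains only the examples from the list and is illustrative rather than complete; the names `REPLACEMENTS`, `FILLERS`, and `normalize` are hypothetical, and stripping the remaining punctuation is an extra assumption of this sketch.

```python
import re

# Illustrative rules only, taken from the examples in the list above.
REPLACEMENTS = {
    "1300": "thirteen hundred",
    "$5": "five dollars",
    "dr.": "doctor",
    "mr.": "mister",
    "govt": "government",
    "gonna": "going to",
    "can't": "cannot",
    "grey": "gray",
    "cancelled": "canceled",
}
FILLERS = ["you know", "um", "uh"]


def _whole_token(pattern: str) -> str:
    # Lookarounds keep a rule from firing inside a longer word or number.
    return rf"(?<!\w){re.escape(pattern)}(?!\w)"


def normalize(text: str) -> str:
    """Lowercase, apply replacement rules, drop fillers, strip punctuation."""
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    for source, target in REPLACEMENTS.items():
        text = re.sub(_whole_token(source), target, text)
    for filler in FILLERS:
        text = re.sub(_whole_token(filler), " ", text)
    text = re.sub(r"[^\w\s']", " ", text)  # remaining punctuation
    return " ".join(text.split())
```

For example, `normalize("Um, Dr. Smith can't pay $5")` and `normalize("doctor smith cannot pay five dollars")` produce the same string, so the pair contributes no errors to WER.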
LLM-based scoring
For cases where simple rules can’t capture the nuance (was an omission meaningful? Is a proper noun misspelling close enough?), an LLM can perform word-level alignment and classify each difference by severity:
- No penalty: Semantically equivalent forms (number formats, contractions, variant spellings)
- Minor penalty: Single-character misspellings, minor grammatical markers
- Major penalty: Incorrect substitutions, meaning-altering errors, significant omissions or additions of content words
LASER score (LLM-based ASR Evaluation Rubric)
LASER is a published LLM-based evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses an LLM prompt with detailed examples to classify ASR errors and compute a score. The LLM aligns each word in the ASR output against the reference transcription and assigns a penalty per word pair:
- No penalty (0): Acceptable variations including numerical format differences, abbreviations, compound word splits, transliterations, alternate spellings, proper noun variants, and colloquial terms
- Minor penalty (0.5): Small spelling errors (single character) or minor grammatical errors (gender, tense, number markers) that preserve sentence meaning
- Major penalty (1.0): Incorrect word substitutions, significant omissions or additions, and reordering that changes meaning
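To illustrate how the penalty tiers above turn into a number, the sketch below sums per-word-pair penalties and divides by the reference word count. The labels would come from an LLM in a real setup and are passed in here as plain strings. The aggregation (total penalty divided by reference length) and the names `PENALTIES` and `penalty_rate` are assumptions for illustration and should not be read as the published LASER formula.

```python
# Penalty weights from the three tiers listed above.
PENALTIES = {"none": 0.0, "minor": 0.5, "major": 1.0}


def penalty_rate(labels: list[str], reference_word_count: int) -> float:
    """Average penalty per reference word (assumed aggregation, lower is better)."""
    if reference_word_count <= 0:
        raise ValueError("reference_word_count must be positive")
    unknown = set(labels) - PENALTIES.keys()
    if unknown:
        raise ValueError(f"unknown penalty labels: {sorted(unknown)}")
    total = sum(PENALTIES[label] for label in labels)
    return total / reference_word_count
```

With four reference words labeled none, minor, major, and none, the total penalty is 1.5 and the rate is 0.375, whereas plain WER would count two errors out of four words.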
Accuracy metrics like WER, Semantic WER, and LASER work the same way for streaming as they do for pre-recorded audio. In both cases, you’re comparing a final transcript against ground truth. The difference is only in how the transcript was generated (real-time vs batch), not in how you evaluate it. For detailed evaluation methodology and tooling, see the pre-recorded evaluation guide.
Ground truth quality
The quality of your ground truth data directly affects the reliability of your evaluation. Modern STT models frequently outperform human transcribers, which means errors in your ground truth files will show up as false negatives in your metrics: the model is penalized for output that is actually correct.
Common issues with ground truth data:
- Missing filler words: Human transcribers often omit um, uh, like, and other disfluencies
- Incorrect proper nouns: Rare names, technical terms, and domain vocabulary are often misspelled
- Simplified speech patterns: Human transcribers tend to “clean up” speech, missing repetitions, false starts, and self-corrections
- Code-switching errors: Multilingual segments are frequently translated to English rather than transcribed as spoken
The evaluation process
This section is the core of the guide: a step-by-step walkthrough of how to run an evaluation. For the results to be meaningful, the evaluation should closely match your production environment, including the streaming conditions you intend to transcribe, the model you intend to use, and the settings applied to that model.
Step 1: Set up your benchmarking environment
When benchmarking a voice agent, you can benchmark audio files directly against the provider’s API, set up a live testing environment, or both. Benchmarks with real files against the API are best for measuring overall accuracy and latency metrics, like WER and TTFB. These will give you a good idea of the model’s performance. See our section on Pre-recorded evaluation benchmarks for how to do this part.
Voice agents are complex, and other metrics like TTCT and End-of-Turn Finalization Latency often depend on additional factors like the end user’s audio device and environment. We highly recommend running live side-by-side evals with your voice agent hooked up to multiple streaming STT providers, so you can experience for yourself what your users will feel.
Step 2: Run your test scenarios
If you have out-of-the-box test scenarios, you can run these through the API and capture the metrics above. If you don’t have scenarios, you can create them based on expected customer behaviors and measure the side-by-side results across different providers. For example, if you are building a drive-through ordering system, create simulated test scenarios that represent different user orders, pacing, tonality, background noise, accents, etc. If you are unsure how to proceed here, it might be worth checking out BlueJay, Coval, or Hamming, all of which help with evaluating and measuring the performance of voice agents.
Step 3: Compare the results
It is highly unlikely you will find a single streaming STT model that wins on all of the metrics outlined above. Ultimately, your goal should be to determine which of these metrics matter most for helping your agent drive the best end-user outcome in your use case. For example, you might consider:
- Are you replacing humans with your voice agent? It is likely TTCT and Endpointing Latency matter most, since these metrics best simulate human behavior.
- Are you working with domain-specific vocabulary, such as medical terms? While WER is important, it’s most important that the LLM in your voice agent understands the user. This requires simulating full test scenarios rather than relying on metrics alone.
- Are you showing transcript text to your end users (like subtitles)? Perhaps WER is most important to end user perception of quality and accuracy.
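One way to make the trade-offs in the questions above explicit is to weight each metric by how much it matters for your use case and rank providers on the weighted result. The sketch below is an assumption-laden illustration, not a recommended scoring method: the function `rank_providers`, the metric keys, and the weighting scheme are all hypothetical, and it assumes every metric is positive and lower-is-better.

```python
def rank_providers(
    metrics: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> list[tuple[str, float]]:
    """Rank providers by a weighted sum of ratios to the best value per metric.

    A ratio of 1.0 means the provider is the best on that metric, so a lower
    total score is better. All metrics are assumed positive, lower-is-better.
    """
    scores: dict[str, float] = {}
    for provider, values in metrics.items():
        score = 0.0
        for metric, weight in weights.items():
            best = min(other[metric] for other in metrics.values())
            if best <= 0:
                raise ValueError(f"metric {metric!r} must be positive")
            score += weight * (values[metric] / best)
        scores[provider] = score
    return sorted(scores.items(), key=lambda item: item[1])
```

With placeholder inputs where a hypothetical `provider_a` has lower TTCT and `provider_b` has lower WER, weighting latency heavily ranks `provider_a` first, while weighting WER heavily ranks `provider_b` first. A single score should complement, not replace, the full-scenario testing described above.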
See this article to learn more about why your word error rate (WER) benchmark might be lying to you.