Evaluating Streaming STT models for Voice Agents
Introduction
The high-level objective of a streaming STT model evaluation is to answer one question: which streaming Speech-to-text model is the best for my voice agent?
For voice agents, the STT component is the “ears” of the system — transcription errors propagate into the LLM and response logic, so even small accuracy gaps compound in impact. At scale, these differences have real financial consequences: a 10% difference in entity accuracy (for example, credit card numbers) across millions of calls can translate to significant revenue loss. A rigorous evaluation ensures you choose the model that drives the best outcomes.
This guide will provide a step-by-step framework for evaluating and benchmarking streaming Speech-to-text models to help you select the best fit for your voice agent.
Need help evaluating our Speech-to-text products? Contact our Sales team to request an evaluation.
Common evaluation metrics
Time to First Token (TTFT) / Time to First Byte (TTFB)
This measures the time from when the audio stream begins (including model startup/initialization) to when the very first token is returned by the model.
TTFT is borrowed from LLM evaluation and is less informative for streaming STT than other metrics. Because some providers emit tokens before any audio is actually spoken to game this metric, TTFT can be misleading. For streaming STT, prefer emission latency (the time from when a word is spoken to when that word is returned, best for use cases that consume partial transcriptions in real time) or TTCT (for voice agents that act on finalized turns).
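As an illustration, emission latency can be computed from per-word timestamps, assuming you have logged when each word was spoken (e.g. from a forced alignment of the ground truth audio) and when it was returned by the provider. The tuple layout and function names here are illustrative assumptions, not any provider's API:

```python
# Sketch: per-word emission latency from logged (word, spoken_at, received_at)
# tuples. Timestamps are seconds from stream start; the schema is an assumption.
def emission_latencies(events):
    return [received - spoken for _word, spoken, received in events]

def summarize(latencies):
    ordered = sorted(latencies)
    return {
        "p50": ordered[len(ordered) // 2],
        "p95": ordered[int(len(ordered) * 0.95)],
    }

events = [("hello", 0.30, 0.62), ("world", 0.75, 1.01)]
print(summarize(emission_latencies(events)))
```

Reporting percentiles rather than a mean matters here, because emission latency distributions tend to have long tails that a mean hides.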
Time to Complete Transcript (TTCT) / Transcription Delay
This measures the latency between when a user finishes speaking (end of speech detected) and when the complete transcription for that utterance is received from the STT model.
TTCT is measured on a per-utterance basis, and a single user turn may contain multiple utterances. This metric is crucial for minimizing overall voice agent latency, since it determines how soon you can send STT output downstream to the LLM.
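As a sketch (the field names are assumptions, not a provider schema), TTCT per utterance is simply the gap between the detected end of speech and the arrival of the final transcript:

```python
# Sketch: per-utterance TTCT. speech_end_s would come from VAD on the input
# audio; final_transcript_s is when the complete transcript message arrived.
from dataclasses import dataclass

@dataclass
class Utterance:
    speech_end_s: float
    final_transcript_s: float

def ttct(utterances):
    return [u.final_transcript_s - u.speech_end_s for u in utterances]

turns = [Utterance(2.10, 2.45), Utterance(5.80, 6.02)]
print(ttct(turns))
```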
End-of-Turn Finalization Latency / Endpointing Latency
This measures the time from when the user actually finishes speaking (end of speech detected) to when the system recognizes and signals the end of their conversational turn. This includes both speech detection latency and any additional processing to determine turn completion.
End-of-Turn Detection Accuracy
This measures how accurately the model detects when a user has finished speaking, considering true positives (correct detections), false positives (premature cutoffs), and false negatives (missed endpoints).
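One way to score this is to match detected endpoints to labeled turn boundaries within a tolerance window and compute precision and recall. This is a sketch; the 300 ms tolerance is an assumption you should tune for your agent:

```python
# Sketch: precision/recall for end-of-turn detection. A detection within
# tolerance_s of an unmatched true boundary counts as a true positive.
def endpoint_scores(true_ends, detected_ends, tolerance_s=0.3):
    matched = set()
    tp = 0
    for det in detected_ends:
        for i, true_end in enumerate(true_ends):
            if i not in matched and abs(det - true_end) <= tolerance_s:
                matched.add(i)
                tp += 1
                break
    fp = len(detected_ends) - tp  # premature or spurious cutoffs
    fn = len(true_ends) - tp      # missed endpoints
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

print(endpoint_scores(true_ends=[1.2, 4.8], detected_ends=[1.3, 3.0, 4.9]))
```

Low precision corresponds to premature cutoffs (the agent interrupts the user); low recall corresponds to missed endpoints (the agent hangs waiting).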
Word Error Rate (WER)
WER is calculated as WER = (S + D + I) / N: the number of Substitutions (S), Deletions (D), and Insertions (I), divided by the Total Number of Words in the ground truth transcript (N).
For streaming models, it’s important to measure both partial WER (accuracy of interim results) and final WER (accuracy after all corrections). The delta between these indicates the model’s self-correction capability. While WER calculation may seem simple, it requires a methodical granular approach and reliable reference data.
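For reference, WER can be computed with a word-level edit distance in a few lines. This is a minimal sketch; production evaluations typically use an established toolkit such as jiwer:

```python
# Minimal WER: word-level Levenshtein distance (substitutions, deletions,
# insertions) divided by the number of reference words N.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

To compare partial versus final WER, run the same function on the last interim transcript and on the finalized transcript for each utterance.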
Semantic WER
Traditional WER treats every difference between the model output and a reference transcript as an error—even when the difference is semantically equivalent. Semantic WER corrects this by normalizing equivalent words and phrases before calculating WER, so that differences like dr. vs doctor or 1300 vs thirteen hundred aren’t counted as errors.
Rule-based normalization
At its simplest, Semantic WER is a preprocessing step. Before running standard WER, apply find-and-replace rules to both the reference and hypothesis transcripts:
- Number formats: 1300 → thirteen hundred, $5 → five dollars
- Abbreviations and titles: dr. → doctor, mr. → mister, govt → government
- Contractions: gonna → going to, can't → cannot
- Variant spellings: grey → gray, cancelled → canceled
- Filler words: remove um, uh, and you know from both sides (or keep them on both sides; just be consistent)
This alone eliminates a significant portion of false errors and can be implemented in a few lines of Python. No model inference required.
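A minimal sketch of that preprocessing step; the rule table here is a toy example, and you should build yours from your own domain vocabulary:

```python
# Sketch: rule-based normalization applied to both reference and hypothesis
# before computing WER. The RULES list is illustrative, not exhaustive.
import re

RULES = [
    (r"\bdr\.?\b", "doctor"),
    (r"\bmr\.?\b", "mister"),
    (r"\b1300\b", "thirteen hundred"),
    (r"\$5\b", "five dollars"),
    (r"\bgonna\b", "going to"),
    (r"\bcan't\b", "cannot"),
    (r"\bgrey\b", "gray"),
    (r"\b(um|uh)\b", ""),          # drop fillers (apply to both sides)
]

def normalize(text):
    text = text.lower()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    text = re.sub(r"[^\w\s']", " ", text)   # strip punctuation, keep apostrophes
    return " ".join(text.split())           # collapse extra whitespace

print(normalize("Um, I'd like 1300 units, Dr. Smith"))
# → "i'd like thirteen hundred units doctor smith"
```

Run both the reference and the hypothesis through the same function, then compute standard WER on the normalized strings.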
LLM-based scoring
For cases where simple rules can’t capture the nuance—was an omission meaningful? Is a proper noun misspelling close enough?—an LLM can perform word-level alignment and classify each difference by severity:
- No penalty: Semantically equivalent forms (number formats, contractions, variant spellings)
- Minor penalty: Single-character misspellings, minor grammatical markers
- Major penalty: Incorrect substitutions, meaning-altering errors, significant omissions or additions of content words
These approaches are particularly valuable for voice agents because traditional WER penalizes transcription variants that would never trip up an LLM. If a user says “I’d like thirteen hundred units” and the model transcribes “1300 units”, traditional WER counts that as an error — but the LLM processes both identically.
For implementations of Semantic WER, see prompt-seeker and aai-cli in the pre-recorded evaluation guide.
LASER score (LLM-based ASR Evaluation Rubric)
LASER is a published LLM-based evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses an LLM prompt with detailed examples to classify ASR errors and compute a score:
The LLM aligns each word in the ASR output against the reference transcription and assigns a penalty per word pair:
- No penalty (0): Acceptable variations including numerical format differences, abbreviations, compound word splits, transliterations, alternate spellings, proper noun variants, and colloquial terms
- Minor penalty (0.5): Small spelling errors (single character) or minor grammatical errors (gender, tense, number markers) that preserve sentence meaning
- Major penalty (1.0): Incorrect word substitutions, significant omissions or additions, and reordering that changes meaning
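The paper defines the exact aggregation, but as a rough sketch, the per-word-pair penalties can be averaged over the reference length to yield a WER-like score. This averaging scheme is an assumption for illustration, not the published formula:

```python
# Sketch: aggregating per-word-pair penalties (0 / 0.5 / 1.0, e.g. assigned
# by an LLM judge) into a single WER-like score over N reference words.
def penalty_score(penalties, n_ref_words):
    return sum(penalties) / n_ref_words

# two acceptable variations, one minor error, one major error, 20-word reference
print(penalty_score([0, 0, 0.5, 1.0], 20))  # 0.075
```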
LASER provides structured per-error feedback alongside the score, making it useful for understanding why a transcript scored poorly, not just how much error there was. For an implementation of LASER scoring, see aai-cli.
Accuracy metrics like WER, Semantic WER, and LASER work the same way for streaming as they do for pre-recorded audio. In both cases, you’re comparing a final transcript against ground truth. The difference is only in how the transcript was generated (real-time vs batch), not in how you evaluate it. For detailed evaluation methodology and tooling, see the pre-recorded evaluation guide.
Ground truth quality
The quality of your ground truth data directly affects the reliability of your evaluation. Modern STT models frequently outperform human transcribers, which means errors in your ground truth files will show up as false negatives in your metrics.
Common issues with ground truth data:
- Missing filler words: Human transcribers often omit um, uh, like, and other disfluencies
- Incorrect proper nouns: Rare names, technical terms, and domain vocabulary are often misspelled
- Simplified speech patterns: Human transcribers tend to “clean up” speech, missing repetitions, false starts, and self-corrections
- Code-switching errors: Multilingual segments are frequently translated to English rather than transcribed as spoken
Before running evaluations, audit a sample of your ground truth files by listening to the audio and comparing it against the transcript. To inspect and correct issues in your ground truth files, use the Truth File Corrector in the AssemblyAI Dashboard (found at the bottom of the left sidebar), which lets you listen back to audio and fix human transcription errors by clicking through differences.
WER is only as good as your ground truth labels. When your STT model transcribes audio more accurately than the human label, those improvements show up as WER errors. Before reporting WER, manually audit at least 20 insertions to determine what percentage are true errors versus ground truth omissions.
If you see an unexpected spike in insertion rates — especially when evaluating a newer or higher-accuracy model — audit your ground truth before concluding the model is worse. High insertion counts often indicate the model is correctly transcribing words that the human transcriber missed.
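To make that audit concrete, you can pull insertion sites out of a word-level alignment and review each one against the audio. This is a sketch using stdlib difflib; a WER toolkit's alignment output works just as well:

```python
# Sketch: extracting insertion sites (hypothesis words with no reference
# match) from a word-level alignment, for manual audit against the audio.
import difflib

def insertions(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    sm = difflib.SequenceMatcher(a=ref, b=hyp)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":
            out.append({"after_ref_word": ref[i1 - 1] if i1 else None,
                        "inserted": hyp[j1:j2]})
    return out

ref = "i want to cancel my subscription"
hyp = "um i want to uh cancel my subscription"
print(insertions(ref, hyp))
```

In this example both "insertions" are filler words the human transcriber likely omitted, i.e. ground truth omissions rather than model errors.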
The evaluation process
This section provides a step-by-step guide on how to run an evaluation. For reliable results, the evaluation process should closely match your production environment, including the streaming conditions you intend to transcribe, the model you intend to use, and the settings you will apply to it.
Step 1: Set up your benchmarking environment
When benchmarking a voice agent, you can decide to benchmark the audio directly against the provider’s API and/or set up a live testing environment.
Benchmarks with real files against the API are best for measuring overall accuracy and latency metrics, like WER and TTFB. These will give you a good idea of the model’s performance. See our section on Pre-recorded evaluation benchmarks for how to do this part.
Voice agents are complex, and other metrics like TTCT and End-of-Turn Finalization Latency often depend on additional factors like the end user's audio device and environment. We highly recommend running live side-by-side evals with your voice agent hooked up to multiple streaming STT providers, so you can experience for yourself what your users will feel.
Step 2: Run your test scenarios
If you have existing test scenarios, you can run these through the API and capture the metrics above.
If you don’t have test scenarios, create them based on expected customer behaviors and measure side-by-side results across different providers. For example, if you are building a drive-through ordering system, create simulated scenarios that represent different user orders, pacing, tonality, background noise, and accents.
If you are unsure how to proceed here, it might be worth checking out BlueJay, Coval, or Hamming, all of which help with evaluating and measuring the performance of voice agents.
Step 3: Compare the results
It is highly unlikely you will find a single streaming STT model that wins on every metric outlined above. Ultimately, your goal should be to determine which of these metrics helps your agent drive the best end-user outcome for your use case.
To weigh the trade-offs, you might consider:
- Are you replacing humans with your voice agent? TTCT and Endpointing Latency likely matter most, since these metrics best approximate human conversational timing.
- Are you working with domain-specific vocabulary, such as medical terminology? While WER is important, what matters most is that the LLM in your voice agent understands the user. This requires simulating full test scenarios rather than relying on metrics alone.
- Are you showing transcript text to your end users (like subtitles)? WER is likely most important to end-user perception of quality and accuracy.
See this article to learn more about why your word error rate (WER) benchmark might be lying to you.
Vibes vs metrics
While metrics provide a useful quantitative evaluation of a streaming Speech-to-text model, voice agents are complex and are not made up solely of STT models. To account for this, we recommend also doing a “vibe-eval”.
Why do a vibe-eval?
Vibe-evals are useful for determining the qualitative differences between STT providers that affect the end outcome of your voice agent. For example, some transcript errors may not trip up the voice agent at all, because the downstream LLM can recover from them.
Vibe-evals are also good for breaking ties when the benchmark metrics don’t clearly favor one model over the other.
Another benefit of vibe-evals is that you don’t need to source truth files, since the Speech-to-text models are compared against each other in a real voice agent.
How to do a vibe-eval?
Side-by-side comparison: Compare formatted transcription outputs from different STT providers. Tools like Diffchecker or any side-by-side interface work well for this. Since you’re comparing models against each other, ground truth files aren’t required.
LLM as a judge: An LLM can automatically identify differences between two transcriptions and pick a winner. However, be cautious: an LLM judge can be misled by outputs that look correct but contain subtle errors (such as translated code-switching segments that read well in English but don’t reflect what was actually spoken). Always pair LLM-based judgments with spot-checking against the actual audio. For a structured LLM-based scoring approach, see LASER above and aai-cli for an implementation.
Live A/B testing: Set up your voice agent with different STT providers using integrations like LiveKit where you can swap out providers. Serve transcripts from different providers and collect feedback — ask users to score agent interactions directly, or track indirect signals like support ticket complaints about the agent misunderstanding users.
Conclusion
We hope that this short guide was helpful in shaping your evaluation methodology.
Have more questions about evaluating our Speech-to-text models? Contact our sales team and we can help.