How to evaluate speech recognition models
Learn how to evaluate speech-to-text models beyond Word Error Rate. This guide covers Semantic WER, Missed Entity Rate, ground truth correction, and practical benchmarking frameworks for 2026.



For years, evaluating speech-to-text models meant running Word Error Rate (WER) against a clean dataset and declaring a winner. Simple. Reproducible. And increasingly unreliable.
When we launched Universal-3 Pro, something unexpected happened. Some customers reported their internal benchmarks showed our new model performing worse than our older models. But our testing told a completely different story. So we dug in. What we found—and what you’re about to learn—changes how you should evaluate transcription accuracy entirely.
The problem isn’t WER itself. The problem is that WER is only one piece of the puzzle. It treats all words equally, ignores ground truth quality issues, and can’t measure what actually matters to AI systems that consume transcripts downstream. As Voice AI applications get more sophisticated, your evaluation methods need to evolve too.
This guide walks you through modern transcription evaluation, from traditional metrics to semantic accuracy, ground truth validation, and the practical tools you need to benchmark models correctly. If you’re starting from scratch, our introductory guide to speech-to-text concepts covers the fundamentals.
Part 1: Understanding core evaluation metrics
Word Error Rate (WER): The foundation
Word Error Rate remains the industry standard, and for good reason. It’s simple, reproducible, and gives you a single number to compare across vendors. But you need to understand what it actually measures. (For a deeper primer on WER as a metric, see Is Word Error Rate Useful?.)
WER calculates the percentage of words that differ between a reference transcript (ground truth) and the AI-generated transcript. It counts three types of errors:
- Insertions (hallucinations): Words the AI added that weren’t in the reference
- Deletions: Words in the reference that the AI missed entirely
- Substitutions: Words where the AI transcribed something different (e.g., “Cadillac” as “cataracts”)
The formula:
WER = (Insertions + Deletions + Substitutions) / Total Words in Reference × 100
A simple example:
- Reference: “The quick brown fox jumps over the lazy dog” (9 words)
- AI output: “A quick brown fox jumped over the lazy dog” (9 words)
- Errors: 2 substitutions (“The” → “A”, “jumps” → “jumped”)
- WER: (2 / 9) × 100 = 22.2%
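The calculation above can be reproduced with a short word-level edit-distance function. This is a minimal sketch for intuition; in practice, use a tested library like jiwer (shown later in this guide):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate(
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumped over the lazy dog",
)
print(f"{wer:.1%}")  # 22.2%
```

Note that both strings here are already lowercase and punctuation-free—normalization, covered next, is what makes that a fair assumption.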
The catch: Before you calculate WER, both transcripts must be normalized. You need to lowercase everything, strip punctuation, convert numbers to words, and standardize contractions. A transcript that says “don’t” will score differently than “do not” if you skip normalization—even though they’re semantically identical.
AssemblyAI recommends using the open-source Whisper Normalizer for English transcription and the Basic normalizer for other languages.
Semantic WER: Beyond word-for-word accuracy
Here’s where things get interesting. Traditional WER assumes every word matters equally. But in modern Voice AI applications, transcripts often feed directly into LLMs, not human readers. An LLM doesn’t care whether the transcript says “cannot” or “can’t”—it understands the meaning either way. Yet traditional WER penalizes both equally.
Semantic WER addresses this. Instead of comparing words, it uses a reasoning model to evaluate whether the meaning of the transcript is preserved. The process:
- An LLM (like Claude Sonnet) receives both the AI transcript and the reference transcript
- The LLM judges whether they convey the same meaning and information
- The system calculates accuracy based on semantic equivalence, not word-for-word matching
Example of semantic equivalence that traditional WER misses:
- Reference: “The patient is on hydrochlorothiazide for hypertension”
- AI output: “The patient is on HCTZ for hypertension”
- Traditional WER: ~14% error (1 substitution out of 7 reference words)
- Semantic WER: 0% error (meaning fully preserved, HCTZ is the standard abbreviation)
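The judging step can be sketched as follows. The prompt wording and helper names here are illustrative assumptions, not an AssemblyAI or Anthropic API—swap in any chat-completion client where the LLM call would go:

```python
# Sketch of the Semantic WER judging loop. Only the prompt construction,
# verdict parsing, and aggregation are shown; the actual LLM call is left
# to whatever chat-completion client you use.

JUDGE_PROMPT = """You are evaluating speech-to-text output.
Reference transcript: {reference}
Model transcript: {hypothesis}
Do the two transcripts convey the same meaning and information?
Answer with exactly one word: EQUIVALENT or DIFFERENT."""

def build_judge_prompt(reference: str, hypothesis: str) -> str:
    return JUDGE_PROMPT.format(reference=reference, hypothesis=hypothesis)

def parse_verdict(llm_response: str) -> bool:
    """True if the judge found the transcripts semantically equivalent."""
    return llm_response.strip().upper().startswith("EQUIVALENT")

def semantic_error_rate(verdicts: list[bool]) -> float:
    """Fraction of transcript pairs judged NOT semantically equivalent."""
    return verdicts.count(False) / len(verdicts)
```

With this shape, the HCTZ example above would come back EQUIVALENT and contribute nothing to the error rate, even though traditional WER penalizes it.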
For voice agents and LLM-based applications, Semantic WER is increasingly the more relevant metric because it measures what actually impacts downstream performance. Our post on how accurate speech-to-text is in 2026 covers how Semantic WER benchmarks compare across providers.
Missed Entity Rate (MER): What really matters in specialized domains
WER treats all words equally, but not all words are equal. In medical transcription, getting “hydrochlorothiazide” wrong is catastrophic. Missing a filler word like “um” is harmless.
Missed Entity Rate (MER) fixes this. It measures accuracy specifically on high-value words like drug names and dosages, proper nouns and company names, medical procedures and diagnoses, numbers, dates, and addresses, and domain-specific terminology.
Instead of averaging errors across an entire transcript, MER focuses on the words that matter most for your application.
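As a rough sketch, MER can be computed by checking which high-value entities from the reference survive into the hypothesis. This substring-based version is illustrative only; production implementations typically use entity recognition and fuzzy matching:

```python
def missed_entity_rate(reference: str, hypothesis: str,
                       entities: list[str]) -> float:
    """Fraction of high-value entities present in the reference transcript
    but missing from the hypothesis transcript."""
    ref, hyp = reference.lower(), hypothesis.lower()
    relevant = [e for e in entities if e.lower() in ref]
    if not relevant:
        return 0.0
    missed = [e for e in relevant if e.lower() not in hyp]
    return len(missed) / len(relevant)

mer = missed_entity_rate(
    "The patient is on hydrochlorothiazide for hypertension",
    "The patient is on hydrochloride for hypertension",
    entities=["hydrochlorothiazide", "hypertension", "penicillin"],
)
print(f"{mer:.0%}")  # 50%: one of the two relevant entities was missed
```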
Real-world example — Medical Mode:
AssemblyAI’s Medical Mode—available as a $0.15/hour add-on—enhances accuracy on clinical terminology. In our internal benchmarks, it improves Missed Entity Rate on clinical terms by 2.35 percentage points over the general model. That difference might not sound large until you realize it means 30% fewer missed drug names in clinical transcripts. In healthcare, this is the difference between an acceptable documentation system and one clinicians actually trust. For a full breakdown of how Medical Mode handles clinical terminology, see Medical transcription that actually works.
LLM-as-a-Judge evaluation
For the highest-confidence evaluation, use a reasoning model (like Claude Sonnet 4.6 with extended thinking) as a judge. This works especially well for comparing transcripts where subjective quality matters.
The process:
- Present two anonymized transcripts side-by-side—one from your preferred model, one from a competitor
- Include the reference transcript as ground truth
- Ask the LLM to evaluate which transcript is higher quality for your specific use case
- Calculate win rates across a sample of audio
This approach works especially well for medical transcription (where clinical accuracy matters more than grammar), legal transcription (where precise phrasing has legal implications), and accessibility transcripts (where clarity and formatting affect usability).
The advantage: you’re evaluating real-world usefulness rather than benchmark gaming. The tradeoff: it’s slower and more expensive than automated metrics.
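Once you have a per-file verdict from the judge, the win-rate tally at the end of the process is a simple count (a minimal sketch; 'A' and 'B' are the anonymized model labels shown to the judge):

```python
def win_rate(judgments: list[str]) -> dict[str, float]:
    """judgments: one verdict per audio file, each 'A', 'B', or 'tie',
    where A and B are the anonymized model labels."""
    n = len(judgments)
    return {label: judgments.count(label) / n for label in ("A", "B", "tie")}

rates = win_rate(["A", "A", "B", "tie", "A", "B"])
print(rates["A"])  # 0.5 — model A wins half of the comparisons
```

Anonymizing the labels matters: a judge that knows which transcript came from which vendor can pick up a brand bias.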
Part 2: The ground truth problem
Here’s what nobody talks about: the biggest source of evaluation error isn’t the speech-to-text model. It’s the ground truth.
When customers complained that Universal-3 Pro scored worse than our older models, we investigated their ground truth files. What we found was shocking. The human-transcribed references contained systematic errors and inconsistencies that penalized our newer model for being more accurate than the reference. We published the full investigation in Why your WER benchmark might be lying to you—it’s essential reading before you run any benchmark.
Why ground truth quality matters
Imagine a customer transcribes audio and creates a reference file. They have a human transcriber listen and type what they hear. This reference becomes the “truth” you’ll compare models against.
But here’s the problem: human transcribers make mistakes too. They might mishear words and transcribe them consistently wrong, normalize speech in inconsistent ways (“alright” vs “all right”), miss medical abbreviations or specialized terminology, include filler words inconsistently, or disagree on formatting (capitalization, punctuation, numbers).
When your reference contains these errors, a more accurate model can score worse because it transcribes what was actually said, not what a fallible human thought they heard.
The Universal-3 Pro case study:
When we launched Universal-3 Pro, some customers saw their WER increase by 2–3 percentage points. Investigation revealed their ground truth files contained systematic errors. One example:
- Reference: “Patient allergic to penicillin”
- Ground truth (human): “Patient allergic to penicillium” (transcriber misheard)
- Universal-3 Pro output: “Patient allergic to penicillin” (correct)
- WER calculation: Model marked as wrong despite being more accurate
Semantic equivalence issues
Beyond outright errors, ground truth quality degrades when transcribers make subjective formatting choices: “healthcare” vs “health care,” “alright” vs “all right,” “can’t” vs “cannot,” “12” vs “twelve.”
These inconsistencies can swing WER by several percentage points while contributing zero to actual transcription quality. Our guide on handling transcript errors covers how to identify and correct these patterns systematically.
How to fix ground truth quality
This is where the Truth File Corrector comes in. Released in the AssemblyAI dashboard, this tool automatically:
- Detects semantic equivalence issues across your reference file
- Flags potential transcriber errors using language models
- Suggests corrections based on domain knowledge
- Allows you to review and approve changes before re-running evaluation
The result: your benchmark reflects model quality, not transcriber inconsistency.
The Truth File Corrector is live in your AssemblyAI dashboard. The benchmarking SDK is on GitHub. Both are free.
Part 3: Practical benchmarking frameworks
Step 1: Prepare high-quality reference audio
Select 10–20 audio files that represent your real use cases: a mix of speaker accents and dialects, variety of audio qualities (clean studio, conference calls, noisy environments), real domain-specific terminology, and different speaker counts and overlapping speech.
Avoid using only clean, curated audio. Your models will perform worse on real production audio, and you want benchmarks that predict that reality.
Step 2: Create ground truth with quality controls
Option 1 (recommended): Use Universal-3 Pro to generate transcripts, then have domain experts review and correct only clear errors. This is faster than human transcription from scratch and you know the baseline quality is high.
Option 2: Hire professional transcribers. Specify that they should transcribe what’s actually said (not “clean up” grammar), and use industry-standard formatting for your domain.
Quick test before you commit: Upload a sample audio file to the AssemblyAI Playground to see real-time transcripts, confidence scores, and formatting—before writing a single line of code.
Step 3: Normalize both transcripts
Before calculating metrics, normalize both the reference and model output using the Whisper Normalizer:
```python
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Normalize both sides so formatting differences (case, punctuation,
# contractions, numerals) don't register as word errors
reference_normalized = normalizer(reference_transcript)
ai_normalized = normalizer(ai_transcript)
```
Step 4: Calculate multiple metrics
Don’t stop at WER. Run at least three metrics:
```python
import jiwer

# Traditional WER
wer = jiwer.wer(reference_normalized, ai_normalized)

# Error breakdown (jiwer >= 3.0 replaces compute_measures with process_words)
output = jiwer.process_words(reference_normalized, ai_normalized)
substitutions = output.substitutions
deletions = output.deletions
insertions = output.insertions
```
Step 5: Use the Truth File Corrector
Before finalizing your benchmark:
- Upload your reference file to the AssemblyAI dashboard
- Run it through the Truth File Corrector
- Review flagged inconsistencies
- Approve or modify corrections
- Re-run your evaluation with the cleaned reference file
This single step often reveals that 5–15% of your ground truth contains unnecessary inconsistencies.
Step 6: Run streaming vs. async evaluation separately
If you’re evaluating both streaming and pre-recorded models, benchmark them independently. They’re optimized for different tradeoffs.
Async (pre-recorded) models:
Higher accuracy (more time to process), no latency constraints. Ideal for batch processing, podcasts, recorded meetings. Evaluate with traditional WER on full transcripts.
Streaming models:
Lower latency (respond quickly). Ideal for voice agents, real-time transcription, live calls. Evaluate metrics like Time to First Token (TTFT) and Time to Complete Turn (TTCT) alongside accuracy.
Part 4: Comparison table of evaluation metrics

| Metric | What it measures | Best suited for |
| --- | --- | --- |
| WER | Word-level insertions, deletions, and substitutions against a reference | General transcription; vendor comparison |
| Semantic WER | Whether meaning is preserved, judged by a reasoning model | Transcripts that feed into LLMs and voice agents |
| Missed Entity Rate (MER) | Accuracy on high-value terms: drug names, proper nouns, numbers, dates | Specialized domains like medical and legal |
| LLM-as-a-Judge | Head-to-head transcript quality for a specific use case | Regulated industries where subjective quality matters |

Our recommendation: For most applications, calculate WER + MER on domain-specific terms as your primary metrics. Use Semantic WER if your transcripts feed into LLMs. Add LLM-as-a-Judge evaluation for regulated industries (medical, legal) where subjective quality matters.
See the benchmarks: Compare Universal-3 Pro’s WER across standard datasets on our Benchmarks page—updated with every model release.
Part 5: Real-world evaluation patterns
Enterprise medical transcription
Use MER on clinical terminology as your north star. Primary metric: MER on drug names, procedures, diagnoses. Secondary metrics: traditional WER, Semantic WER. Ground truth: use the Truth File Corrector for consistency. Separate streaming and async benchmarks (medical ambient scribes need streaming; batch documentation needs async).
AssemblyAI’s Medical Mode delivers 4.97% MER on clinical entities—meaningfully better than general models—because it’s specifically trained to catch the terminology that healthcare workflows depend on. For implementation details, see our guide to building AI medical ambient scribes.
Voice agent transcription
For voice agents passing transcripts to LLMs, focus on Semantic WER. Primary metric: Semantic WER (what the LLM understands). Secondary metric: deletion rate (hangs are worse than minor substitutions). Benchmark scenario: real customer conversations with overlapping speech. Evaluate Time to First Token for agent responsiveness.
A 1% traditional WER increase might represent zero impact if it’s just “cannot” vs “can’t.” But a 1% deletion rate increase means more conversations stall. See choosing an STT API for voice agents for a framework on evaluating these tradeoffs.
Contact center analytics
Balance accuracy with sentiment detection. Primary metric: WER on critical information (customer names, issues, account numbers). Secondary metrics: sentiment analysis accuracy, emotional tone preservation. Run on actual calls with background noise and accent diversity.
Part 6: The benchmarking SDK and GitHub tools
AssemblyAI provides open-source tools to automate evaluation:
GitHub Repository: AssemblyAI Benchmarking SDK (free, open source)
The SDK includes automated WER calculation with Whisper Normalizer, MER calculation for domain-specific terms, visualization of error breakdowns, batch evaluation across multiple models, and export results to CSV/JSON for analysis.
```bash
pip install assemblyai-benchmark-sdk

python -m aai_benchmark \
  --reference truth.txt \
  --hypothesis model_output.txt \
  --language en \
  --metrics wer,mer,semantic_wer
```
The Truth File Corrector is built into the AssemblyAI dashboard—no setup required. It’s free for all users.
Part 7: Streaming vs. async benchmarking
Pre-recorded (async) evaluation
Async models have time to process fully, so optimize purely for accuracy. Calculate WER on full final transcripts, use normalized ground truth for fair comparison, benchmark against curated datasets, and separate audio scenarios (clean, conference calls, noisy).
Async models like Universal-3 Pro achieve mean WER of 5.6% (median 4.9%) across English benchmarks because they can apply multiple passes and correction layers.
Streaming evaluation
Streaming models balance accuracy against latency, so evaluate both dimensions:
- TTFT (Time to First Token): How quickly the model returns the first word
- TTCT (Time to Complete Turn): How long to finalize a complete utterance
- Accuracy on final transcripts: WER on completed turns (not partial hypotheses)
- Deletion rate: Especially important—a missed word causes the conversation to hang
Streaming models like Universal-3 Pro Streaming maintain high accuracy (similar WER to async) while delivering sub-300ms latency for voice agents. For a full comparison of streaming providers, see our top APIs and models for real-time speech recognition.
When comparing streaming vs. async, don’t use pure accuracy—evaluate accuracy at the latency level you’ll actually deploy at. An async model is meaningless if you need real-time responses. For a deep dive on latency optimization, see Building a production-ready voice agent and our guide to achieving sub-300ms latency.
Final words: Evaluation is a continuous process
Evaluation doesn’t end when you choose a model. It’s ongoing. Your use case evolves. Audio patterns change. Ground truth quality matters more than you’d expect. And metrics like Semantic WER and Missed Entity Rate tell a different story than traditional WER.
The benchmarking frameworks and tools in this guide—the Truth File Corrector, the open-source SDK, LLM-as-a-Judge evaluation—exist because companies we work with discovered these issues the hard way. We’re sharing what we learned so you don’t have to repeat it.
Quick action steps:
- Run your evaluation on at least 10 representative audio files with multiple models
- Calculate WER and domain-specific metrics (MER for medical, deletion rate for voice agents)
- Use the Truth File Corrector to validate ground truth quality before finalizing results
- Separate streaming and async evaluation—they optimize for different things
- Compare Semantic WER if your transcripts feed into LLMs
Frequently asked questions
What is Word Error Rate (WER) and why is it important?
Word Error Rate is the industry-standard metric for transcription accuracy. It measures the percentage of words that differ between a reference transcript and an AI-generated transcript. Lower WER means better accuracy.
WER is important because it’s reproducible—two people evaluating the same audio should get the same number. But it’s not the only metric that matters. For voice agents feeding transcripts to LLMs, Semantic WER is often more meaningful. For medical transcription, Missed Entity Rate on drug names is critical.
What is WER and how do I reduce it?
WER = (Insertions + Deletions + Substitutions) / Total Words in Reference × 100
To reduce WER:
- Improve audio quality: Use better microphones, record in quieter environments, minimize compression
- Match your audio to your model’s training data: Universal-3 Pro handles diverse accents and noisy audio better than older models
- Use domain-specific features: Medical Mode reduces WER on clinical terminology
- Fix your ground truth: Use the Truth File Corrector to remove transcriber inconsistencies that artificially inflate WER
- Benchmark fairly: Normalize both transcripts before comparing, use the same dataset for all models
What metrics measure transcription accuracy and quality?
Beyond WER, consider: Semantic WER for AI applications where meaning matters more than exact words; Missed Entity Rate (MER) for domain-specific accuracy on names, numbers, terminology; Deletion rate for voice agents (hangs are worse than minor errors); and LLM-as-a-Judge for regulated industries or human-reviewed use cases.
The right metric depends on how you’ll use the transcript. General transcription: WER. LLM input: Semantic WER. Medical: MER. Voice agent: deletion rate + Semantic WER. For a broader guide on selecting the right provider, see How to choose the best speech-to-text API.
What factors affect accuracy of speech-to-text transcripts?
Audio factors: Background noise, microphone quality, audio compression, echo and reverberation.
Speaker factors: Accent and dialect, speaking pace, clarity, voice characteristics.
Content factors: Vocabulary complexity, proper nouns, numbers and dates, language mixing.
Model selection matters. Universal-3 Pro handles diverse accents, background noise, and technical terminology far better than older models—but only if you’re actually using it.
Does real-time transcription sacrifice accuracy for speed?
Streaming transcription adds latency constraints, but modern models like Universal-3 Pro Streaming maintain accuracy comparable to async models while delivering sub-300ms latency. The tradeoff isn’t accuracy vs. speed anymore—it’s accuracy at different latency points.
Don’t assume streaming models are less accurate. Test them on your actual use case at your actual latency requirements.
How do I validate transcription accuracy automatically?
Before production: Use the AssemblyAI Playground to test your audio on multiple models. Run at least 10 representative audio files through each model. Create high-quality reference transcripts. Calculate WER using the Truth File Corrector for consistent ground truth.
In production: Sample 10–20 transcripts per week. Have domain experts review them (can be automated with LLM-as-a-Judge). Track WER, MER, and semantic accuracy trends over time. Alert if accuracy drops below your threshold.
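The alerting step above can be a simple threshold check on a rolling window (a minimal sketch; the 2-percentage-point tolerance is an illustrative default, not a recommendation—tune it to your use case):

```python
def should_alert(recent_wers: list[float], baseline_wer: float,
                 tolerance: float = 0.02) -> bool:
    """Alert when the rolling mean WER drifts above baseline + tolerance.
    WER values are fractions (0.06 == 6%)."""
    return sum(recent_wers) / len(recent_wers) > baseline_wer + tolerance

alert = should_alert([0.08, 0.09, 0.10], baseline_wer=0.06)
print(alert)  # True: mean 9% exceeds the 6% baseline plus 2% tolerance
```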
Related resources
- Benchmarks page — See Universal-3 Pro’s latest WER numbers across datasets
- Why your WER benchmark might be lying to you — The companion blog covering ground truth issues in depth
- How accurate is speech-to-text — Detailed breakdown of accuracy factors
- Word Error Rate explained — Deep dive into WER calculation and normalization
- Speech-to-text product page — Explore Universal-3 Pro for pre-recorded audio
- Streaming STT product page — Build voice agents with Universal-3 Pro Streaming
- Pricing — Transparent, per-minute pricing with no minimums





