How to evaluate speech recognition models
Learn how to evaluate speech-to-text models beyond Word Error Rate. This guide covers Semantic WER, Missed Entity Rate, ground truth correction, and practical benchmarking frameworks for 2026.



For years, evaluating speech-to-text models meant running Word Error Rate (WER) against a clean dataset and declaring a winner. Simple. Reproducible. And increasingly unreliable.
When we launched Universal-3 Pro, something unexpected happened. Some customers reported their internal benchmarks showed our new model performing worse than our older models. But our testing told a completely different story. So we dug in. What we found—and what you’re about to learn—changes how you should evaluate transcription accuracy entirely.
The problem isn’t WER itself. The problem is that WER is only one piece of the puzzle. It treats all words equally, ignores ground truth quality issues, and can’t measure what actually matters to AI systems that consume transcripts downstream. As Voice AI applications get more sophisticated, your evaluation methods need to evolve too.
This guide walks you through modern transcription evaluation, from traditional metrics to semantic accuracy, ground truth validation, and the practical tools you need to benchmark models correctly. If you’re starting from scratch, our introductory guide to speech-to-text concepts covers the fundamentals.
Part 1: Understanding core evaluation metrics
Word Error Rate (WER): The foundation
Word Error Rate remains the industry standard, and for good reason. It’s simple, reproducible, and gives you a single number to compare across vendors. But you need to understand what it actually measures. (For a deeper primer on WER as a metric, see Is Word Error Rate Useful?.)
WER calculates the percentage of words that differ between a reference transcript (ground truth) and the AI-generated transcript. It counts three types of errors:
- Insertions (hallucinations): Words the AI added that weren’t in the reference
- Deletions: Words in the reference that the AI missed entirely
- Substitutions: Words where the AI transcribed something different (e.g., “Cadillac” as “cataracts”)
The formula:
WER = (Insertions + Deletions + Substitutions) / Total Words in Reference × 100
A simple example:
- Reference: “The quick brown fox jumps over the lazy dog” (9 words)
- AI output: “A quick brown fox jumped over the lazy dog” (9 words)
- Errors: 2 substitutions (“The” → “A”, “jumps” → “jumped”)
- WER: (2 / 9) × 100 = 22.2%
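The calculation above can be reproduced with a short word-level edit-distance function. This is a minimal sketch for intuition; in practice, use a tested library like jiwer (shown later in this guide):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate(
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumped over the lazy dog",
)
print(f"{wer:.1%}")  # 22.2%
```

Note that both strings here are already lowercase and punctuation-free—normalization, covered next, is what makes that a fair assumption.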
The catch: Before you calculate WER, both transcripts must be normalized. You need to lowercase everything, strip punctuation, convert numbers to words, and standardize contractions. A transcript that says “don’t” will score differently than “do not” if you skip normalization—even though they’re semantically identical.
AssemblyAI recommends using the open-source Whisper Normalizer for English transcription and the Basic normalizer for other languages.
Semantic WER: Beyond word-for-word accuracy
Here’s where things get interesting. Traditional WER assumes every word matters equally. But in modern Voice AI applications, transcripts often feed directly into LLMs, not human readers. An LLM doesn’t care whether the transcript says “cannot” or “can’t”—it understands the meaning either way. Yet traditional WER penalizes both equally.
Semantic WER addresses this. Instead of comparing words, it uses a reasoning model to evaluate whether the meaning of the transcript is preserved. The process:
- An LLM (like Claude Sonnet) receives both the AI transcript and the reference transcript
- The LLM judges whether they convey the same meaning and information
- The system calculates accuracy based on semantic equivalence, not word-for-word matching
Example of semantic equivalence that traditional WER misses:
- Reference: “The patient is on hydrochlorothiazide for hypertension”
- AI output: “The patient is on HCTZ for hypertension”
- Traditional WER: ~14% error (1 substitution out of 7 reference words)
- Semantic WER: 0% error (meaning fully preserved, HCTZ is the standard abbreviation)
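The judging step can be sketched as follows. The prompt wording and helper names here are illustrative assumptions, not an AssemblyAI or Anthropic API—swap in any chat-completion client where the LLM call would go:

```python
# Sketch of the Semantic WER judging loop. Only the prompt construction,
# verdict parsing, and aggregation are shown; the actual LLM call is left
# to whatever chat-completion client you use.

JUDGE_PROMPT = """You are evaluating speech-to-text output.
Reference transcript: {reference}
Model transcript: {hypothesis}
Do the two transcripts convey the same meaning and information?
Answer with exactly one word: EQUIVALENT or DIFFERENT."""

def build_judge_prompt(reference: str, hypothesis: str) -> str:
    return JUDGE_PROMPT.format(reference=reference, hypothesis=hypothesis)

def parse_verdict(llm_response: str) -> bool:
    """True if the judge found the transcripts semantically equivalent."""
    return llm_response.strip().upper().startswith("EQUIVALENT")

def semantic_error_rate(verdicts: list[bool]) -> float:
    """Fraction of transcript pairs judged NOT semantically equivalent."""
    return verdicts.count(False) / len(verdicts)
```

With this shape, the HCTZ example above would come back EQUIVALENT and contribute nothing to the error rate, even though traditional WER penalizes it.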
For voice agents and LLM-based applications, Semantic WER is increasingly the more relevant metric because it measures what actually impacts downstream performance. Our post on how accurate speech-to-text is in 2026 covers how Semantic WER benchmarks compare across providers.
Missed Entity Rate (MER): What really matters in specialized domains
WER treats all words equally, but not all words are equal. In medical transcription, getting “hydrochlorothiazide” wrong is catastrophic. Missing a filler word like “um” is harmless.
Missed Entity Rate (MER) fixes this. It measures accuracy specifically on high-value words like drug names and dosages, proper nouns and company names, medical procedures and diagnoses, numbers, dates, and addresses, and domain-specific terminology.
Instead of averaging errors across an entire transcript, MER focuses on the words that matter most for your application.
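As a rough sketch, MER can be computed by checking which high-value entities from the reference survive into the hypothesis. This substring-based version is illustrative only; production implementations typically use entity recognition and fuzzy matching:

```python
def missed_entity_rate(reference: str, hypothesis: str,
                       entities: list[str]) -> float:
    """Fraction of high-value entities present in the reference transcript
    but missing from the hypothesis transcript."""
    ref, hyp = reference.lower(), hypothesis.lower()
    relevant = [e for e in entities if e.lower() in ref]
    if not relevant:
        return 0.0
    missed = [e for e in relevant if e.lower() not in hyp]
    return len(missed) / len(relevant)

mer = missed_entity_rate(
    "The patient is on hydrochlorothiazide for hypertension",
    "The patient is on hydrochloride for hypertension",
    entities=["hydrochlorothiazide", "hypertension", "penicillin"],
)
print(f"{mer:.0%}")  # 50%: one of the two relevant entities was missed
```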
Real-world example — Medical Mode:
AssemblyAI’s Medical Mode—available as a $0.15/hour add-on—enhances accuracy on clinical terminology. In our internal benchmarks, it improves Missed Entity Rate on clinical terms by 2.35 percentage points over the general model. That difference might not sound large until you realize it means 30% fewer missed drug names in clinical transcripts. In healthcare, this is the difference between an acceptable documentation system and one clinicians actually trust. For a full breakdown of how Medical Mode handles clinical terminology, see Medical transcription that actually works.
LLM-as-a-Judge evaluation
For the highest-confidence evaluation, use a reasoning model (like Claude Sonnet 4.6 with extended thinking) as a judge. This works especially well for comparing transcripts where subjective quality matters.
The process:
- Present two anonymized transcripts side-by-side—one from your preferred model, one from a competitor
- Include the reference transcript as ground truth
- Ask the LLM to evaluate which transcript is higher quality for your specific use case
- Calculate win rates across a sample of audio
This approach works especially well for medical transcription (where clinical accuracy matters more than grammar), legal transcription (where precise phrasing has legal implications), and accessibility transcripts (where clarity and formatting affect usability).
The advantage: you’re evaluating real-world usefulness rather than benchmark gaming. The tradeoff: it’s slower and more expensive than automated metrics.
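Once you have a per-file verdict from the judge, the win-rate tally at the end of the process is a simple count (a minimal sketch; 'A' and 'B' are the anonymized model labels shown to the judge):

```python
def win_rate(judgments: list[str]) -> dict[str, float]:
    """judgments: one verdict per audio file, each 'A', 'B', or 'tie',
    where A and B are the anonymized model labels."""
    n = len(judgments)
    return {label: judgments.count(label) / n for label in ("A", "B", "tie")}

rates = win_rate(["A", "A", "B", "tie", "A", "B"])
print(rates["A"])  # 0.5 — model A wins half of the comparisons
```

Anonymizing the labels matters: a judge that knows which transcript came from which vendor can pick up a brand bias.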
Part 2: The ground truth problem
Here’s what nobody talks about: the biggest source of evaluation error isn’t the speech-to-text model. It’s the ground truth.
When customers complained that Universal-3 Pro scored worse than our older models, we investigated their ground truth files. What we found was shocking. The human-transcribed references contained systematic errors and inconsistencies that penalized our newer model for being more accurate than the reference. We published the full investigation in Why your WER benchmark might be lying to you—it’s essential reading before you run any benchmark.
Why ground truth quality matters
Imagine a customer transcribes audio and creates a reference file. They have a human transcriber listen and type what they hear. This reference becomes the “truth” you’ll compare models against.
But here’s the problem: human transcribers make mistakes too. They might mishear words and transcribe them consistently wrong, normalize speech in inconsistent ways (“alright” vs “all right”), miss medical abbreviations or specialized terminology, include filler words inconsistently, or disagree on formatting (capitalization, punctuation, numbers).
When your reference contains these errors, a more accurate model can score worse because it transcribes what was actually said, not what a fallible human thought they heard.
The Universal-3 Pro case study:
When we launched Universal-3 Pro, some customers saw their WER increase by 2–3 percentage points. Investigation revealed their ground truth files contained systematic errors. One example:
- Reference: “Patient allergic to penicillin”
- Ground truth (human): “Patient allergic to penicillium” (transcriber misheard)
- Universal-3 Pro output: “Patient allergic to penicillin” (correct)
- WER calculation: Model marked as wrong despite being more accurate
Semantic equivalence issues
Beyond outright errors, ground truth quality degrades when transcribers make subjective formatting choices: “healthcare” vs “health care,” “alright” vs “all right,” “can’t” vs “cannot,” “12” vs “twelve.”
These inconsistencies can swing WER by several percentage points while contributing zero to actual transcription quality. Our guide on handling transcript errors covers how to identify and correct these patterns systematically.
How to fix ground truth quality
This is where the Truth File Corrector comes in. Released in the AssemblyAI dashboard, this tool automatically:
- Detects semantic equivalence issues across your reference file
- Flags potential transcriber errors using language models
- Suggests corrections based on domain knowledge
- Allows you to review and approve changes before re-running evaluation
The result: your benchmark reflects model quality, not transcriber inconsistency.
The Truth File Corrector is live in your AssemblyAI dashboard. The benchmarking SDK is on GitHub. Both are free.
Part 3: Practical benchmarking frameworks
Step 1: Prepare high-quality reference audio
Select 10–20 audio files that represent your real use cases: a mix of speaker accents and dialects, variety of audio qualities (clean studio, conference calls, noisy environments), real domain-specific terminology, and different speaker counts and overlapping speech.
Avoid using only clean, curated audio. Your models will perform worse on real production audio, and you want benchmarks that predict that reality.
Step 2: Create ground truth with quality controls
Option 1 (recommended): Use Universal-3 Pro to generate transcripts, then have domain experts review and correct only clear errors. This is faster than human transcription from scratch and you know the baseline quality is high.
Option 2: Hire professional transcribers. Specify that they should transcribe what’s actually said (not “clean up” grammar), and use industry-standard formatting for your domain.
Quick test before you commit: Upload a sample audio file to the AssemblyAI Playground to see real-time transcripts, confidence scores, and formatting—before writing a single line of code.
Step 3: Normalize both transcripts
Before calculating metrics, normalize both the reference and model output using the Whisper Normalizer:
```python
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Normalize both sides so formatting differences (case, punctuation,
# contractions, numerals) don't register as word errors
reference_normalized = normalizer(reference_transcript)
ai_normalized = normalizer(ai_transcript)
```
Step 4: Calculate multiple metrics
Don’t stop at WER. Run at least three metrics:
```python
import jiwer

# Traditional WER
wer = jiwer.wer(reference_normalized, ai_normalized)

# Error breakdown (jiwer >= 3.0 replaces compute_measures with process_words)
output = jiwer.process_words(reference_normalized, ai_normalized)
substitutions = output.substitutions
deletions = output.deletions
insertions = output.insertions
```
Step 5: Use the Truth File Corrector
Before finalizing your benchmark:
- Upload your reference file to the AssemblyAI dashboard
- Run it through the Truth File Corrector
- Review flagged inconsistencies
- Approve or modify corrections
- Re-run your evaluation with the cleaned reference file
This single step often reveals that 5–15% of your ground truth contains unnecessary inconsistencies.
Step 6: Run streaming vs. async evaluation separately
If you’re evaluating both streaming and pre-recorded models, benchmark them independently. They’re optimized for different tradeoffs.
Async (pre-recorded) models:
Higher accuracy (more time to process), no latency constraints. Ideal for batch processing, podcasts, recorded meetings. Evaluate with traditional WER on full transcripts.
Streaming models:
Lower latency (respond quickly). Ideal for voice agents, real-time transcription, live calls. Evaluate metrics like Time to First Token (TTFT) and Time to Complete Turn (TTCT) alongside accuracy.
Part 4: Comparison table of evaluation metrics

| Metric | What it measures | Best suited for |
| --- | --- | --- |
| WER | Word-level insertions, deletions, and substitutions against a reference | General transcription; vendor comparison |
| Semantic WER | Whether meaning is preserved, judged by a reasoning model | Transcripts that feed into LLMs and voice agents |
| Missed Entity Rate (MER) | Accuracy on high-value terms: drug names, proper nouns, numbers, dates | Specialized domains like medical and legal |
| LLM-as-a-Judge | Head-to-head transcript quality for a specific use case | Regulated industries where subjective quality matters |

Our recommendation: For most applications, calculate WER + MER on domain-specific terms as your primary metrics. Use Semantic WER if your transcripts feed into LLMs. Add LLM-as-a-Judge evaluation for regulated industries (medical, legal) where subjective quality matters.
See the benchmarks: Compare Universal-3 Pro’s WER across standard datasets on our Benchmarks page—updated with every model release.
Part 5: Real-world evaluation patterns
Enterprise medical transcription
Use MER on clinical terminology as your north star. Primary metric: MER on drug names, procedures, diagnoses. Secondary metrics: traditional WER, Semantic WER. Ground truth: use the Truth File Corrector for consistency. Separate streaming and async benchmarks (medical ambient scribes need streaming; batch documentation needs async).
AssemblyAI’s Medical Mode delivers 4.97% MER on clinical entities—meaningfully better than general models—because it’s specifically trained to catch the terminology that healthcare workflows depend on. For implementation details, see our guide to building AI medical ambient scribes.
Voice agent transcription
For voice agents passing transcripts to LLMs, focus on Semantic WER. Primary metric: Semantic WER (what the LLM understands). Secondary metric: deletion rate (hangs are worse than minor substitutions). Benchmark scenario: real customer conversations with overlapping speech. Evaluate Time to First Token for agent responsiveness.
A 1% traditional WER increase might represent zero impact if it’s just “cannot” vs “can’t.” But a 1% deletion rate increase means more conversations stall. See choosing an STT API for voice agents for a framework on evaluating these tradeoffs.
Contact center analytics
Balance accuracy with sentiment detection. Primary metric: WER on critical information (customer names, issues, account numbers). Secondary metrics: sentiment analysis accuracy, emotional tone preservation. Run on actual calls with background noise and accent diversity.
Part 6: The benchmarking SDK and GitHub tools
AssemblyAI provides open-source tools to automate evaluation:
GitHub Repository: AssemblyAI Benchmarking SDK (free, open source)
The SDK includes automated WER calculation with Whisper Normalizer, MER calculation for domain-specific terms, visualization of error breakdowns, batch evaluation across multiple models, and export results to CSV/JSON for analysis.
```bash
pip install assemblyai-benchmark-sdk

python -m aai_benchmark \
  --reference truth.txt \
  --hypothesis model_output.txt \
  --language en \
  --metrics wer,mer,semantic_wer
```
The Truth File Corrector is built into the AssemblyAI dashboard—no setup required. It’s free for all users.
Part 7: Streaming vs. async benchmarking
Pre-recorded (async) evaluation
Async models have time to process fully, so optimize purely for accuracy. Calculate WER on full final transcripts, use normalized ground truth for fair comparison, benchmark against curated datasets, and separate audio scenarios (clean, conference calls, noisy).
Async models like Universal-3 Pro achieve mean WER of 5.6% (median 4.9%) across English benchmarks because they can apply multiple passes and correction layers.
Streaming evaluation
Streaming models balance accuracy against latency, so evaluate both dimensions:
- TTFT (Time to First Token): How quickly the model returns the first word
- TTCT (Time to Complete Turn): How long to finalize a complete utterance
- Accuracy on final transcripts: WER on completed turns (not partial hypotheses)
- Deletion rate: Especially important—a missed word causes the conversation to hang
Streaming models like Universal-3 Pro Streaming maintain high accuracy (similar WER to async) while delivering sub-300ms latency for voice agents. For a full comparison of streaming providers, see our top APIs and models for real-time speech recognition.
When comparing streaming vs. async, don’t use pure accuracy—evaluate accuracy at the latency level you’ll actually deploy at. An async model is meaningless if you need real-time responses. For a deep dive on latency optimization, see Building a production-ready voice agent and our guide to achieving sub-300ms latency.
Final words: Evaluation is a continuous process
Evaluation doesn’t end when you choose a model. It’s ongoing. Your use case evolves. Audio patterns change. Ground truth quality matters more than you’d expect. And metrics like Semantic WER and Missed Entity Rate tell a different story than traditional WER.
The benchmarking frameworks and tools in this guide—the Truth File Corrector, the open-source SDK, LLM-as-a-Judge evaluation—exist because companies we work with discovered these issues the hard way. We’re sharing what we learned so you don’t have to repeat it.
Quick action steps:
- Run your evaluation on at least 10 representative audio files with multiple models
- Calculate WER and domain-specific metrics (MER for medical, deletion rate for voice agents)
- Use the Truth File Corrector to validate ground truth quality before finalizing results
- Separate streaming and async evaluation—they optimize for different things
- Compare Semantic WER if your transcripts feed into LLMs
Frequently asked questions
What is Word Error Rate (WER) and why is it important?
Word Error Rate is the industry-standard metric for transcription accuracy. It measures the percentage of words that differ between a reference transcript and an AI-generated transcript. Lower WER means better accuracy.
WER is important because it’s reproducible—two people evaluating the same audio should get the same number. But it’s not the only metric that matters. For voice agents feeding transcripts to LLMs, Semantic WER is often more meaningful. For medical transcription, Missed Entity Rate on drug names is critical.
What is WER and how do I reduce it?
WER = (Insertions + Deletions + Substitutions) / Total Words in Reference × 100
To reduce WER:
- Improve audio quality: Use better microphones, record in quieter environments, minimize compression
- Match your audio to your model’s training data: Universal-3 Pro handles diverse accents and noisy audio better than older models
- Use domain-specific features: Medical Mode reduces WER on clinical terminology
- Fix your ground truth: Use the Truth File Corrector to remove transcriber inconsistencies that artificially inflate WER
- Benchmark fairly: Normalize both transcripts before comparing, use the same dataset for all models
What metrics measure transcription accuracy and quality?
Beyond WER, consider: Semantic WER for AI applications where meaning matters more than exact words; Missed Entity Rate (MER) for domain-specific accuracy on names, numbers, terminology; Deletion rate for voice agents (hangs are worse than minor errors); and LLM-as-a-Judge for regulated industries or human-reviewed use cases.
The right metric depends on how you’ll use the transcript. General transcription: WER. LLM input: Semantic WER. Medical: MER. Voice agent: deletion rate + Semantic WER. For a broader guide on selecting the right provider, see How to choose the best speech-to-text API.
What factors affect accuracy of speech-to-text transcripts?
Audio factors: Background noise, microphone quality, audio compression, echo and reverberation.
Speaker factors: Accent and dialect, speaking pace, clarity, voice characteristics.
Content factors: Vocabulary complexity, proper nouns, numbers and dates, language mixing.
Model selection matters. Universal-3 Pro handles diverse accents, background noise, and technical terminology far better than older models—but only if you’re actually using it.
Does real-time transcription sacrifice accuracy for speed?
Streaming transcription adds latency constraints, but modern models like Universal-3 Pro Streaming maintain accuracy comparable to async models while delivering sub-300ms latency. The tradeoff isn’t accuracy vs. speed anymore—it’s accuracy at different latency points.
Don’t assume streaming models are less accurate. Test them on your actual use case at your actual latency requirements.
How do I validate transcription accuracy automatically?
Before production: Use the AssemblyAI Playground to test your audio on multiple models. Run at least 10 representative audio files through each model. Create high-quality reference transcripts. Calculate WER using the Truth File Corrector for consistent ground truth.
In production: Sample 10–20 transcripts per week. Have domain experts review them (can be automated with LLM-as-a-Judge). Track WER, MER, and semantic accuracy trends over time. Alert if accuracy drops below your threshold.
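The alerting step above can be a simple threshold check on a rolling window (a minimal sketch; the 2-percentage-point tolerance is an illustrative default, not a recommendation—tune it to your use case):

```python
def should_alert(recent_wers: list[float], baseline_wer: float,
                 tolerance: float = 0.02) -> bool:
    """Alert when the rolling mean WER drifts above baseline + tolerance.
    WER values are fractions (0.06 == 6%)."""
    return sum(recent_wers) / len(recent_wers) > baseline_wer + tolerance

alert = should_alert([0.08, 0.09, 0.10], baseline_wer=0.06)
print(alert)  # True: mean 9% exceeds the 6% baseline plus 2% tolerance
```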
Related resources
- Benchmarks page — See Universal-3 Pro’s latest WER numbers across datasets
- Why your WER benchmark might be lying to you — The companion blog covering ground truth issues in depth
- How accurate is speech-to-text — Detailed breakdown of accuracy factors
- Word Error Rate explained — Deep dive into WER calculation and normalization
- Speech-to-text product page — Explore Universal-3 Pro for pre-recorded audio
- Streaming STT product page — Build voice agents with Universal-3 Pro Streaming
- Pricing — Transparent, per-minute pricing with no minimums





