Benchmarks are an important first step before running your own evaluation. Below are the current benchmarks for our pre-recorded models so you can assess performance across accuracy, latency, and error rates.
Public benchmarks can be misleading due to overfitting and benchmark gaming. We strongly recommend running your own evaluation on your audio data to identify the best model for your use case.
Word error rate (WER)
Word Error Rate (WER) is the classical metric for speech-to-text accuracy. It counts the substitutions, deletions, and insertions in a transcript relative to a reference transcript and divides that count by the total number of words in the reference (the ground truth). AssemblyAI Universal-3 Pro achieves a mean WER of 5.6% (median 4.9%) on English benchmarks. WER weights every word equally, so a misrecognized filler word counts the same as a misrecognized email address or medication name. For production voice workflows, we recommend pairing WER with missed entity rate, which measures accuracy on the high-stakes entities (names, emails, phone numbers, and medical terms) that actually drive end-user outcomes.
English benchmarks
Most recent update: January 2026.

| Dataset | Universal-3 Pro WER (%) | Universal-2 WER (%) | Relative gain vs Universal-2 |
|---|---|---|---|
| Overall Performance | Mean: 5.6%, Median: 4.9% | Mean: 6.1%, Median: 6.5% | Mean: 8.2%, Median: 24.6% |
| commonvoice | 4.87% | 6.48% | 24.8% |
| earnings21 | 8.80% | 9.37% | 6.1% |
| librispeech_test_clean | 1.52% | 1.68% | 9.5% |
| librispeech_test_other | 2.69% | 3.00% | 10.3% |
| meanwhile | 4.22% | 4.41% | 4.3% |
| tedlium | 6.77% | 7.30% | 7.3% |
| rev16 | 10.29% | 10.32% | 0.3% |
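
The WER definition used in this section (substitutions, deletions, and insertions divided by the reference word count) can be sketched in a few lines of Python. This is a minimal illustration, not the scoring pipeline behind the table: the function name `word_error_rate` is hypothetical, and the normalization (lowercasing and whitespace splitting) is an assumption. Real evaluations usually also normalize punctuation, numbers, and spelling variants before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping one row of the table at a time.
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("jane" -> "jan") and one deletion ("dot") over 7 words.
wer = word_error_rate(
    "please email jane at example dot com",
    "please email jan at example com",
)
print(f"{wer:.1%}")  # 28.6%
```

Note that both errors in this example fall on the email address, which is the kind of case where WER alone understates the impact on the end user.
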
Multilingual benchmarks
Most recent update: January 2026. Dataset: FLEURS.

| Language Code | Language | Universal-3 Pro WER (%) | Universal-2 WER (%) | Relative gain vs Universal-2 |
|---|---|---|---|---|
| Average | All | 4.58% | 7.42% | 38.3% |
| de | German | 4.88% | 6.22% | 21.5% |
| en | English | - | 4.38% | - |
| es | Spanish | 3.98% | 4.56% | 12.7% |
| fi | Finnish | - | 10.10% | - |
| fr | French | 4.98% | 7.56% | 34.1% |
| hi | Hindi | - | 7.38% | - |
| it | Italian | 3.69% | 4.75% | 22.3% |
| ja | Japanese | - | 7.79% | - |
| ko | Korean | - | 14.54% | - |
| nl | Dutch | - | 7.79% | - |
| pl | Polish | - | 6.63% | - |
| pt | Portuguese | 5.39% | 5.98% | 9.9% |
| ru | Russian | - | 5.80% | - |
| tr | Turkish | - | 8.12% | - |
| uk | Ukrainian | - | 7.42% | - |
| vi | Vietnamese | - | 9.75% | - |
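
The per-language values in the "Relative gain vs Universal-2" column are consistent with the usual relative error reduction: the drop in error rate divided by the Universal-2 error rate. A small sketch, assuming that formula (the helper name `relative_gain` is hypothetical), reproduces the German row:

```python
def relative_gain(baseline_pct: float, new_pct: float) -> float:
    """Relative error reduction of new_pct versus baseline_pct, in percent."""
    if baseline_pct <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_pct - new_pct) / baseline_pct * 100


# German row: Universal-2 WER 6.22%, Universal-3 Pro WER 4.88%.
print(round(relative_gain(6.22, 4.88), 1))  # 21.5
```

The same helper can be applied to your own evaluation results to compare two models on equal footing.
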
Missed entity rate
For production voice workflows, the words that matter most are entities: names, organizations, emails, phone numbers, and medical terms. The Missed Entity Rate (MER) measures how often a model fails to correctly transcribe these high-stakes terms. See Missed Entity Rate for the full definition. Universal-3 Pro delivers relative gains over Universal-2 across every entity category we track for voice workflows, with the largest improvements on emails, locations, and medical terms.

| Entity type | Universal-3 Pro MER (%) | Universal-2 MER (%) | Relative gain vs Universal-2 |
|---|---|---|---|
| Medical terms | 13.15% | 18.43% | 28.6% |
| Locations | 8.61% | 12.40% | 30.6% |
| Job titles | 9.03% | 9.86% | 8.4% |
| Organization names | 17.02% | 20.96% | 18.8% |
| Email addresses | 33.76% | 53.81% | 37.3% |
| Phone numbers | 13.14% | 14.69% | 10.6% |
| Credit card numbers | 21.83% | 25.07% | 12.9% |
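
To get a feel for entity-level scoring on your own data, the sketch below counts how many reference entities are missing from a transcript. It is only an illustration under stated assumptions, not the official MER definition: the function name `missed_entity_rate`, the exact-match rule, and the normalization (lowercasing and stripping punctuation at token edges) are all hypothetical choices made for this example.

```python
import string


def _normalize(text: str) -> str:
    """Lowercase and strip punctuation from the edges of each token."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return " ".join(tok for tok in tokens if tok)


def missed_entity_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Fraction of reference entities that do not appear verbatim in the hypothesis."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    # Pad with spaces so entities only match on whole-token boundaries.
    hyp = f" {_normalize(hypothesis)} "
    missed = sum(
        1 for entity in reference_entities
        if f" {_normalize(entity)} " not in hyp
    )
    return missed / len(reference_entities)


# The email address is misrecognized; the location and medical term are correct.
entities = ["jane.doe@example.com", "Boston", "metformin"]
transcript = "Email jane.do@example.com from Boston about metformin."
print(f"{missed_entity_rate(entities, transcript):.1%}")  # 33.3%
```

A single wrong character is enough to count the whole entity as missed here, which is one reason long, character-sensitive entities such as email addresses can be much harder to get right than single-word ones.
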
Hallucinations and consecutive errors
Hallucinations are a critical concern in production STT systems. AssemblyAI reduces hallucinations by 30% compared to Whisper, across three error categories:

- Fabrications: words inserted that were never spoken
- Omissions: spoken words that are missing from the transcript
- Hallucinations: extended sequences of fabricated content