How accurate is your Streaming transcription compared to Async transcription?

Streaming Speech-to-Text models are built for low latency, so they can use only limited historical context when making predictions. Our Async models, by contrast, have access to the entire audio file and can use both past and future context. As a result, Streaming models are usually a few percentage points less accurate than Async models (roughly 2-3% absolute). Overall, Streaming results are still quite good!
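To make the "absolute" wording concrete, here is a minimal sketch of how you might measure the gap on your own audio. It assumes accuracy is measured as word error rate (WER), which this FAQ does not state explicitly. The function names and the sample transcripts are hypothetical, and a one-sentence sample exaggerates the gap, so it is not representative of real results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_gap(async_wer: float, streaming_wer: float) -> float:
    """Difference in error rate, in fractional points (0.02 means 2% absolute)."""
    return streaming_wer - async_wer


# Hypothetical transcripts of the same audio, for illustration only.
reference = "please schedule the meeting for nine thirty on friday morning"
async_text = "please schedule a meeting for nine thirty on friday morning"
streaming_text = "please schedule a meeting for nine thirty on friday"

async_wer = word_error_rate(reference, async_text)
streaming_wer = word_error_rate(reference, streaming_text)
gap = absolute_gap(async_wer, streaming_wer)

print(f"Async WER: {async_wer:.1%}")
print(f"Streaming WER: {streaming_wer:.1%}")
print(f"Absolute gap: {gap:.1%}")
```

An absolute gap is the plain difference between the two error rates, not a ratio of one to the other. Running the same comparison over a larger, representative set of your own recordings gives a more meaningful number than a single sentence.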

Can I use speaker diarization with Streaming Speech-to-Text?

How does automatically scaling concurrency for Streaming STT work?
