What is the best speech-to-text API for field service voice agents?

AssemblyAI's Universal-3 Pro Streaming is the leading speech-to-text API for field service voice agents: ~150ms median (P50) latency, 28% better consecutive number recognition for part SKUs and asset IDs than competing real-time STT providers, native speaker diarization, and dynamic keyterm prompting (up to 100 terms per session) for parts catalogs, tool names, and customer-specific vocabulary. The model is trained on noisy real-world audio so it stays accurate around HVAC units, generators, and road traffic. Universal-3 Pro Streaming runs at $0.45/hr with unlimited concurrency; the Voice Agent API ($4.50/hr) bundles STT, LLM, and TTS for a fully managed hands-free agent.
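As a rough illustration of the two price points, the sketch below estimates monthly spend for a crew. Only the hourly rates come from the pricing above; the crew size, audio minutes per shift, shift count, and the function name are made-up inputs for the example.

```python
# Hourly rates quoted above; everything else is an illustrative assumption.
STT_RATE_PER_HOUR = 0.45          # Universal-3 Pro Streaming
VOICE_AGENT_RATE_PER_HOUR = 4.50  # Voice Agent API (STT + LLM + TTS)


def monthly_cost(rate_per_hour, technicians, audio_minutes_per_shift, shifts_per_month):
    """Return estimated monthly spend in dollars for streamed audio."""
    hours = technicians * shifts_per_month * audio_minutes_per_shift / 60
    return round(hours * rate_per_hour, 2)


# Hypothetical crew: 25 technicians, 40 minutes of audio per shift, 22 shifts a month.
stt_only = monthly_cost(STT_RATE_PER_HOUR, 25, 40, 22)
full_agent = monthly_cost(VOICE_AGENT_RATE_PER_HOUR, 25, 40, 22)
print(f"STT only: ${stt_only:.2f}  Voice Agent API: ${full_agent:.2f}")
```

Billing granularity and free-tier allowances are not modeled here, so treat the result as an order-of-magnitude estimate.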

How do I build a hands-free voice agent for field technicians?

Stream audio from a Bluetooth headset, mobile phone, or truck mic into the Voice Agent API via a single WebSocket — the API handles STT, LLM, and TTS end-to-end in under one second. Configure the system prompt for field-service workflows ("capture part numbers, log work status, confirm before saving"), add part SKUs and customer asset IDs as keyterms, and register tools like update_work_order or order_part for write-backs to your FSM platform. For deeper customization, use Universal-3 Pro Streaming directly with your own LLM and TTS — drop it into LiveKit, Pipecat, or any standard WebSocket client.

Can speech-to-text capture part numbers and asset IDs accurately?

Yes. Universal-3 Pro Streaming delivers 28% better consecutive number recognition than competing real-time STT providers, with native handling of alphanumeric sequences like "3BAK-0601" or "FS-20260518-4471". Boost recognition further with keyterm prompting: pass your part catalog, customer asset IDs, and model numbers as keyterms (up to 100 per session) so the model is biased toward your exact vocabulary. For service workflows that involve spelling out characters ("P as in Peter"), Universal-3 Pro Streaming handles letter-by-letter dictation cleanly.
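Because the keyterm list is capped at 100 per session, it helps to decide which terms make the cut. The sketch below is one possible way to merge several term sources in priority order; the helper name and the sample lists are illustrative and not part of any AssemblyAI SDK.

```python
MAX_KEYTERMS = 100  # per-session limit stated above


def build_keyterms(*sources, limit=MAX_KEYTERMS):
    """Merge term lists in priority order, dropping blanks and case-insensitive duplicates."""
    seen = set()
    merged = []
    for source in sources:
        for term in source:
            cleaned = " ".join(term.split())  # collapse stray whitespace
            key = cleaned.casefold()
            if cleaned and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged[:limit]  # earlier sources win when the list is too long


# Hypothetical inputs: today's work-order parts first, then the wider catalog.
todays_parts = ["3BAK-0601", "scroll assembly", "Carrier 50XC"]
catalog = ["carrier 50xc", "Trane XR", "  refrigerant   leak ", ""]
keyterms = build_keyterms(todays_parts, catalog)
print(keyterms)
```

Putting the parts tied to the technician's current work orders first means they survive truncation when the full catalog exceeds the limit.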

How does AssemblyAI handle noisy environments like HVAC units or job sites?

Universal-3 Pro Streaming is trained on a broad mix of real-world noisy audio — call center, in-car, mobile, and industrial environments — so accuracy holds up around HVAC compressors, generators, road traffic, wind, and machinery. For the cleanest results in the loudest environments, pair Universal-3 Pro Streaming with a noise-cancelling Bluetooth headset and run keyterm prompting on domain vocabulary. Streaming PII redaction is also available for customer-facing field workflows that involve personal data.

What integrations work with field service management platforms like ServiceTitan or Jobber?

AssemblyAI's Universal-3 Pro Streaming integrates with standard WebSocket clients, so it drops directly into any field service management (FSM) platform that exposes a custom API or webhook layer — ServiceTitan, FieldEdge, HousecallPro, Jobber, Salesforce Field Service, and custom builds. Pipe finalized transcripts to the LLM Gateway (25+ models across Claude, GPT, and Gemini) for structured-field extraction (part number, asset, work status), then push those fields to your FSM platform's API. The Voice Agent API exposes the same WebSocket pattern with built-in tool calls for write-back workflows.
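Before pushing extracted fields to an FSM platform, it is worth validating what the LLM returned. The sketch below assumes the model was prompted to reply with a JSON object; the field names, the status values, and the `WorkOrderFields` type are hypothetical and should be mapped to your own FSM schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

ALLOWED_STATUSES = {"in_progress", "completed", "awaiting_parts"}  # hypothetical values


@dataclass
class WorkOrderFields:
    wo_id: str
    status: str
    part_number: Optional[str] = None
    asset: Optional[str] = None


def parse_llm_fields(llm_json: str) -> WorkOrderFields:
    """Validate the model's JSON reply before anything is written to the FSM."""
    data = json.loads(llm_json)
    missing = [k for k in ("wo_id", "status") if not data.get(k)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if data["status"] not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {data['status']!r}")
    return WorkOrderFields(
        wo_id=data["wo_id"],
        status=data["status"],
        part_number=data.get("part_number"),
        asset=data.get("asset"),
    )


reply = '{"wo_id": "FS-20260518-4471", "status": "in_progress", "part_number": "3BAK-0601"}'
payload = asdict(parse_llm_fields(reply))  # plain dict, ready to send to your FSM API
print(payload)
```

Rejecting incomplete or malformed replies at this step keeps a misheard note from overwriting a work order.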

Does AssemblyAI support multilingual field technicians?

Yes. Universal-3 Pro Streaming natively handles code-switching across 6 core languages (English, Spanish, French, German, Italian, Portuguese) with automatic detection, ideal for bilingual field crews who switch between languages mid-job. Whisper Streaming (whisper-rt) extends coverage to 99 total languages. Enable language_detection on the streaming connection and the API returns a detected language code with confidence on each turn, so your downstream LLM responds and your FSM system stores transcripts in the right language.
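A per-turn language code can drive which language the agent replies in. The sketch below is a minimal example under a stated assumption: it reads the code and confidence from keys named `language_code` and `language_confidence`, which are guesses at the event shape. Check the streaming docs for the exact field names; the confidence threshold is also arbitrary.

```python
SUPPORTED_REPLY_LANGUAGES = {"en", "es", "fr", "de", "it", "pt"}
DEFAULT_LANGUAGE = "en"
MIN_CONFIDENCE = 0.6  # illustrative threshold, tune on your own audio


def reply_language(turn: dict, previous: str = DEFAULT_LANGUAGE) -> str:
    """Pick the language for the next reply from one turn event.

    ASSUMPTION: the turn carries "language_code" and "language_confidence".
    """
    code = turn.get("language_code")
    confidence = turn.get("language_confidence", 0.0)
    if code in SUPPORTED_REPLY_LANGUAGES and confidence >= MIN_CONFIDENCE:
        return code
    return previous  # low confidence or unknown code: keep the current language


current = reply_language({"language_code": "es", "language_confidence": 0.93})
print(current)
```

Falling back to the previous language on low-confidence turns avoids flipping languages on a short or noisy utterance.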

Solutions

Voice agents for field service operations

Build hands-free voice agents that let technicians pull up manuals, log work completed, order parts, and update work orders — all through voice while their hands stay on the job.


Work order (Live)

WO #: FS-20260518-4471
Asset: Carrier 50XC 15T RTU
Task: Compressor replacement
Status: In progress
Parts ordered: Pending

Voice note captured

"Compressor seized — replacing scroll assembly. Need part #3BAK-0601, ordering now via voice."

The problem

Field techs lose hours every shift to paperwork

Most field technicians spend 30–60 minutes per shift tapping through FSM screens — on ladders, in crawl spaces, on rooftops, with greasy or wet hands. Voice would fix it, but consumer ASR breaks on noisy job sites and butchers part numbers. The result: late work orders, missing parts, and revenue stuck in unbilled hours. AssemblyAI's purpose-built voice AI handles the real noise, the real vocabulary, and the real workflows of field service.

Built for the real conditions of field work

Latency ~150ms

Median streaming latency for hands-free voice prompts and confirmations.

Entity accuracy 28%

Better consecutive number recognition for part SKUs, model numbers, and asset IDs.

Languages 99

Total languages supported for multilingual field technician workforces.

Keyterms 100

Domain-specific terms per session — boost recognition of parts, tools, and procedures.

Two ways to build

Pick the API that fits your field service stack

Ship a working hands-free agent in an afternoon, or drop best-in-class streaming STT into the FSM platform you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Run a hands-free agent that captures voice notes, confirms back via TTS, and writes updates to your FSM platform — zero infra to manage.

Best for

Hands-free voice capture and read-back confirmation
Tool calls for FSM write-back (ServiceTitan, FieldEdge, Jobber, custom)
Built-in keyterm prompting for parts catalogs and asset IDs
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3 Pro Streaming STT API

The live transcription layer for your FSM platform. Works natively with LiveKit, Pipecat, Vapi, and Twilio — entity-accurate, noise-robust, and multilingual out of the box.

Best for

Teams running their own LLM and FSM integrations
~150ms P50 latency for real-time voice prompts
28% better consecutive number recognition for SKUs and asset IDs
99-language coverage via automatic model routing
Trained on real-world noisy audio — job sites work

$0.45/hr — transcription only, unlimited streams

View integration docs

No concurrency caps · Autoscaling included

One pipeline turns voice into structured field-service data

Capture hands-free voice

Stream audio from a Bluetooth headset, phone speaker, or work-truck mic. No tapping, no swiping — technicians keep both hands on the job.

Transcribe with noise robustness

Universal-3 Pro handles loud HVAC units, generators, road traffic, and wind. Speaker labels separate technician from customer when on-site.

Extract structured work-order data

Finalized turns feed the LLM Gateway (25+ models across Claude, GPT, and Gemini) to extract part numbers, asset IDs, work status, and parts requests as structured fields.

Confirm and write back to FSM

Read captured fields back to the technician for confirmation, then push to ServiceTitan, FieldEdge, HousecallPro, Jobber, or your custom backend via tool calls or webhooks.
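The read-back step can be sketched as two small helpers: one builds the confirmation sentence, the other decides whether the technician's reply counts as a yes. Both function names and the list of accepted words are illustrative choices, not API features.

```python
def confirmation_prompt(fields: dict) -> str:
    """Build the sentence the agent reads back before saving."""
    parts = [f"work order {fields['wo_id']}", f"status {fields['status']}"]
    if fields.get("part_number"):
        # Spell the part number character by character so a TTS voice
        # does not read "0601" as a single number.
        spelled = " ".join(ch for ch in fields["part_number"] if ch.isalnum())
        parts.append(f"part {spelled}")
    return "I have " + ", ".join(parts) + ". Save it?"


def is_confirmed(reply: str) -> bool:
    """Treat only a clear yes as confirmation; anything else means ask again."""
    words = reply.lower().replace(",", " ").replace(".", " ").split()
    return bool(words) and words[0] in {"yes", "yep", "yeah", "correct", "confirm"}


prompt = confirmation_prompt(
    {"wo_id": "FS-20260518-4471", "status": "in progress", "part_number": "3BAK-0601"}
)
print(prompt)
```

Only writing to the FSM platform after `is_confirmed` returns true keeps a misheard part number out of the work order.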


Field service pipeline: capture hands-free voice notes → transcribe (noise-robust, multilingual) → extract structured work-order fields → confirm and push to FSM platform

Quickstart

Build a hands-free field service voice agent in minutes

Voice Agent API — hands-free agent with FSM write-back

# Voice Agent API: hands-free field service voice agent
import asyncio, json, websockets

API_KEY = "YOUR_API_KEY"

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a hands-free assistant for an HVAC field technician. "
                    "Capture part numbers, asset IDs, and work status. Always "
                    "confirm captured fields back to the tech before calling "
                    "update_work_order. Keep responses under 2 sentences."
                ),
                "greeting": "Ready when you are — what's the update?",
                "input": {"keyterms": ["Carrier 50XC", "Trane XR", "scroll assembly", "compressor"]},
                "output": {"voice": "ivy"},
                "tools": [{
                    "type": "function",
                    "name": "update_work_order",
                    "description": "Push captured fields to the FSM platform.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "wo_id": {"type": "string"},
                            "part_number": {"type": "string"},
                            "status": {"type": "string"},
                        },
                        "required": ["wo_id", "status"],
                    },
                }],
            },
        }))
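        async for msg in ws:
            handle(json.loads(msg))  # transcript.user, reply.audio, tool.call, ...


def handle(event):
    # Placeholder dispatcher (your own code, not part of the API):
    # route on event["type"] to play audio, log transcripts, or run tools.
    print(event.get("type"))


if __name__ == "__main__":
    asyncio.run(run_agent())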
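When the agent asks for a tool such as update_work_order, your code still has to check the arguments and do the write. The sketch below shows one way to do that, reusing the required fields from the tool schema in the quickstart. How tool calls arrive on the socket and how results are sent back is API-specific and deliberately left out; `run_tool`, `saved`, and the handler are hypothetical stand-ins.

```python
REQUIRED_ARGS = {"update_work_order": ("wo_id", "status")}  # mirrors the quickstart tool schema


def run_tool(name: str, args: dict, handlers: dict) -> dict:
    """Validate a tool call and run the matching handler.

    Returns a result dict that your code can report back to the agent.
    """
    if name not in handlers:
        return {"ok": False, "error": f"unknown tool: {name}"}
    missing = [a for a in REQUIRED_ARGS.get(name, ()) if not args.get(a)]
    if missing:
        return {"ok": False, "error": f"missing arguments: {', '.join(missing)}"}
    return {"ok": True, "result": handlers[name](**args)}


saved = []  # stand-in for a real FSM client


def update_work_order(wo_id, status, part_number=None):
    """Record the update; replace the body with a call to your FSM platform."""
    saved.append({"wo_id": wo_id, "status": status, "part_number": part_number})
    return f"{wo_id} updated"


handlers = {"update_work_order": update_work_order}
print(run_tool("update_work_order", {"wo_id": "FS-20260518-4471", "status": "in_progress"}, handlers))
```

Returning an error dict instead of raising lets the agent tell the technician what is missing and ask again.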

Universal-3 Pro Streaming — voice notes to structured fields

# Universal-3 Pro Streaming: voice notes → structured work order
import asyncio, json, websockets
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"

params = urlencode({
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "language_detection": "true",              # tag each turn with detected language
    "keyterms_prompt": json.dumps([
        "Carrier 50XC", "Trane XR", "scroll assembly",
        "compressor seized", "refrigerant leak",
        "3BAK-0601", "FS-20260518",
    ]),
    "format_turns": "true",
    "speaker_labels": "true",                  # tech vs. customer on-site
})

async def stream_field_notes(audio_iter, send_to_fsm):
    url = f"wss://streaming.assemblyai.com/v3/ws?{params}"
    async with websockets.connect(
        url, additional_headers={"Authorization": API_KEY},
    ) as ws:
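        async def send_audio():
            # Forward raw audio chunks to the socket as they arrive.
            async for chunk in audio_iter:
                await ws.send(chunk)
        # Keep a reference so the task is not garbage-collected mid-stream.
        sender = asyncio.create_task(send_audio())
        try:
            async for raw in ws:
                evt = json.loads(raw)
                if evt.get("type") == "Turn" and evt.get("end_of_turn"):
                    # finalized turn → LLM Gateway extracts {wo_id, part, status}
                    # extract_work_order_fields is your own helper (not defined here)
                    fields = extract_work_order_fields(evt["transcript"])
                    send_to_fsm(fields)
        finally:
            sender.cancel()  # stop sending audio once the socket closes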
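The quickstart leaves extract_work_order_fields undefined because the extraction is meant to run through the LLM Gateway. For local testing without an LLM, a rules-based stand-in like the sketch below may be enough. The ID patterns are inferred from the two sample IDs on this page, and the status phrases and status names are invented for the example.

```python
import re

# Patterns inferred from the sample IDs on this page; adjust to your own formats.
WO_ID = re.compile(r"\bFS-\d{8}-\d{4}\b")
PART_NUMBER = re.compile(r"\b\d[A-Z]{3}-\d{4}\b")
STATUS_PHRASES = {  # hypothetical status names and trigger phrases
    "completed": ("job complete", "work complete", "all done"),
    "awaiting_parts": ("need part", "ordering", "on order"),
}


def extract_work_order_fields(transcript: str) -> dict:
    """Pull a work-order ID, part number, and coarse status out of one transcript."""
    wo = WO_ID.search(transcript)
    part = PART_NUMBER.search(transcript)
    lowered = transcript.lower()
    status = next(
        (name for name, phrases in STATUS_PHRASES.items() if any(p in lowered for p in phrases)),
        "in_progress",  # default when no phrase matches
    )
    return {
        "wo_id": wo.group(0) if wo else None,
        "part_number": part.group(0) if part else None,
        "status": status,
    }


note = "Work order FS-20260518-4471. Compressor seized, replacing scroll assembly. Need part 3BAK-0601, ordering now."
print(extract_work_order_fields(note))
```

A regex pass like this only works when IDs are transcribed in their written form, so it is a test fixture rather than a replacement for LLM extraction.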


Part SKUs, model numbers, and asset IDs captured cleanly

Universal-3 Pro Streaming delivers 28% better consecutive number recognition for alphanumeric sequences. Add part catalogs and customer asset terms via keyterm prompting (up to 100 per session) for near-perfect domain accuracy.

Built for the real noise of a job site

Universal-3 Pro Streaming is trained on noisy real-world audio — HVAC compressors, generators, road traffic, wind. The model stays accurate where consumer ASR breaks down, so field-truck dictation works the first time.

Multilingual workforce out of the box

Universal-3 Pro Streaming handles 6 core languages with native code-switching at the highest accuracy. Automatic model routing extends coverage to 99 languages — field technicians dictate work-order notes in their preferred language and your FSM system receives clean transcripts every time.

Voice AI builders at scale on AssemblyAI

80% increase in customer satisfaction

Calabrio's enterprise workforce intelligence platform runs on AssemblyAI for real-time transcription accuracy across multilingual call recordings — the same audio fundamentals that power hands-free field workflows.

Calabrio

40% increase in sales growth

It's one microphone picking up a bunch of different voices.

Jake Cronin, Co-founder & CEO — Siro


Build hands-free voice for your field crews today

Free tier, no credit card. From voice notes to structured work orders in an afternoon.

Get started free