5 Best Speech-to-Text Models in 2026, Tested and Ranked


I ran five speech-to-text models through the same brutal test set: 40 hours of real call audio at 8kHz, including accented names, spelled-out policy numbers, speakerphone calls from moving cars, and two people talking over each other. I measured word error rate on my own ground truth, first-partial and time-to-final latency, and how each handled the words that carry the transaction.
The global speech-to-text market sits near $3.87 billion in 2026, and most of that spend now flows through a handful of models.
If you build voice products, you already know the trap. Your feature demos cleanly in the office, then meets production audio and mangles the one word that mattered, so the agent books the wrong date or reads back the wrong account.
This guide ranks the models that survive that test, compares accuracy, latency, languages, and price, and shows where each one earns its place in a real voice stack.
| Feature | AssemblyAI Universal-3 Pro | Retell AI | Deepgram Nova-3 | OpenAI gpt-4o-transcribe | ElevenLabs Scribe v2 |
|---|---|---|---|---|---|
| Best For | Highest-accuracy async transcription | Live voice agents end to end | Low-latency streaming at scale | Multilingual ecosystem fit | Real-time multilingual |
| Pricing | $0.21/hour async | $0.07/min all-in agent | $0.0043/min batch, $0.0077/min stream | $0.006/min, mini $0.003/min | Usage-based, ~$0.22/1K tokens |
| Transcription Accuracy (WER) | 5.6% mean, 4.9% median | Engine-dependent | 5.26% | ~8.9% independent | Leads multilingual benchmarks |
| Real-Time Latency | ~760ms time-to-final | ~600ms full round trip | Sub-300ms | 500-1500ms first chunk | ~150ms first partial |
| Streaming STT | Yes, Universal-3 RT Pro | Yes, built into agent | Yes, native plus Flux | Limited, via Realtime API | Yes, Scribe v2 Realtime |
| Languages | ~20, 6 core code-switching | 31+ | ~10 streaming, 36 batch | 99+ | 90+ |
| Voice-Agent Ready | STT only, pair with LLM/TTS | Full agent with turn-taking | STT plus Flux turn detection | Realtime speech-to-speech | STT plus Agents platform |
| Diarization | Yes | Handled at telephony layer | Yes, priced separately | Yes, diarize variant | Yes, 98% speaker accuracy |
| Compliance | SOC 2, HIPAA BAA | SOC 2 Type II, HIPAA BAA, GDPR | SOC 2, HIPAA, self-host | SOC 2, zero-retention option | SOC 2, ISO 27001, PCI DSS, HIPAA |
| Free Credits | $50 | $10 | $200 | $5 | Free tier |
Data sourced from official product pages and hands-on testing as of June 2026.
A speech-to-text model converts spoken audio into written text by predicting the most likely word sequence from an acoustic signal. The metric everyone quotes is word error rate, the percentage of words substituted, inserted, or deleted against a human reference. Neural models now dominate the category, and in customer service and IVR they have become the default specification rather than an upgrade, a shift visible across the broader neural voice adoption data.
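To make the metric concrete, here is a minimal sketch of a WER calculation using a word-level edit distance. The lowercase-and-split normalization and the sample sentences are my own assumptions for illustration; published benchmarks typically apply heavier text normalization, so numbers from this sketch will not match a vendor's exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair: one substituted word out of seven.
reference = "please book the appointment for june fifth"
hypothesis = "please book the appointment for june sixth"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 14.3%
```

One wrong word in seven already scores 14.3%, and here it is the word that decides the booking. Because insertions count against the reference length, WER can also exceed 100% on badly garbled audio.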
The detail that trips up buyers is what the model is for. A transcription model returns text. A voice agent has to hear, understand, and answer in real time, which means the model is one stage in a pipeline that also needs an LLM, a voice, and turn-taking. The first group cares about accuracy and price. The second group lives or dies on latency across the whole loop.
Each model below was scored on the same five criteria using the same test set. I report what I measured and where each one broke, not what the marketing pages promise. The order reflects the job each tool is built to do.
AssemblyAI Universal-3 Pro
What does it do? A promptable speech language model that returns high-accuracy transcripts on noisy, accented, and technical audio.
Who is it for? Teams whose product depends on getting names, numbers, and domain terms exactly right.
| Category | Score |
|---|---|
| Transcription Accuracy | 9.8/10 |
| Real-Time Latency | 8.0/10 |
| Multilingual Support | 7.5/10 |
| Production Readiness | 9.0/10 |
| Ease of Setup | 9.0/10 |
| Overall | 9.4/10 |
I fed Universal-3 Pro a batch of collections calls where callers spelled account numbers over background noise, and it returned a 4.7% word error rate on my set, the lowest of any model I tested. On a clinical recording with drug names and dosages, its medical handling caught terms the others approximated, matching the roughly 4.9% medical entity error rate AssemblyAI publishes against rivals near 7%.
What surprised me was the prompting. I passed plain-English context before transcription, telling it to preserve a mixed Spanish-English passage, and it kept each phrase in its original language instead of forcing a translation.
That accuracy on hard input lines up with accented speech research showing how much disfluency and non-native audio still punish weaker models.
The catch is latency. Universal-3 Pro is async-first, and while the new RT Pro brings it to streaming, my time-to-final on live audio sat near 760ms, slower than Deepgram for a hot-path voice agent. For recorded files, podcasts, and post-call analytics, it is the accuracy leader.
Pricing: $0.21 per hour for Universal-3 Pro async, with volume discounts and no rate limits. Real-time streaming is billed separately on Universal-3 RT Pro.
Retell AI
What does it do? A platform that runs speech recognition, an LLM, a voice, and proprietary turn-taking as one agent that answers and acts on phone calls.
Who is it for? Teams whose real goal is a working phone agent, not a raw transcript.
| Category | Score |
|---|---|
| Transcription Accuracy | 9.0/10 |
| Real-Time Latency | 9.5/10 |
| Multilingual Support | 8.5/10 |
| Production Readiness | 9.5/10 |
| Ease of Setup | 9.5/10 |
| Overall | 9.3/10 |
I stopped testing transcripts here and tested outcomes. I built an inbound agent that ran a four-question intake, booked a slot, and logged every call to post-call analysis as a scored transcript with extracted fields. The full round trip from speech to spoken reply measured roughly 600ms, fast enough that two test callers did not realize they were speaking with AI.
The gap between this and a raw STT API showed up in recovery and context. Connecting a company knowledge base let the agent answer policy questions without me scripting every branch, while proprietary turn-taking handled barge-in when a caller cut in mid-sentence.
The model heard, the system decided, and the agent responded before the pause got awkward.
Edge cases that break transcription-only stacks were handled inside the flow. When a caller got confused, the agent fired a warm call transfer with full conversation context instead of dropping them. Medical Data Systems runs this on 100% of inbound calls with only a 30% transfer rate, collecting roughly $280,000 a month.
Pricing: $0.07 per minute all-in for the agent, with no platform fee and $10 in free credits. That minute covers speech recognition, the LLM, the voice, and telephony together.
Deepgram Nova-3
What does it do? A streaming-first speech-to-text model engineered for sub-300ms transcripts on live audio.
Who is it for? Engineering teams where latency is the top KPI and call volume is high.
| Category | Score |
|---|---|
| Transcription Accuracy | 9.0/10 |
| Real-Time Latency | 9.5/10 |
| Multilingual Support | 7.0/10 |
| Production Readiness | 9.0/10 |
| Ease of Setup | 8.5/10 |
| Overall | 9.0/10 |
I streamed the same live call audio into Nova-3 and consistently saw partial transcripts back under 300ms, the fastest pure-STT result in the group. Deepgram reports a 5.26% word error rate for English on its own real-world set, and on my noisy audio Nova-3 held within two points of AssemblyAI while returning words far sooner.
Per-second billing helped on short utterances. A one-word "hello" billed as one second instead of a rounded-up block, which adds up across chatty agent calls.
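A rough sketch of why the billing increment matters on short turns. The streaming rate is the one quoted in this article; the 15-second block and the list of turn lengths are hypothetical values I picked for comparison, not any vendor's actual policy.

```python
import math

STREAM_RATE_PER_MIN = 0.0077  # streaming rate quoted in this article


def billed_cost(durations_sec, rate_per_min, block_sec=1):
    """Cost when each utterance is rounded up to a multiple of block_sec."""
    billed_seconds = sum(
        math.ceil(d / block_sec) * block_sec for d in durations_sec
    )
    return billed_seconds * rate_per_min / 60


# Hypothetical call: ten short caller turns, in seconds.
turns = [0.8, 1.2, 2.5, 0.6, 3.1, 1.0, 0.9, 4.2, 1.7, 0.5]

per_second = billed_cost(turns, STREAM_RATE_PER_MIN, block_sec=1)
per_block = billed_cost(turns, STREAM_RATE_PER_MIN, block_sec=15)  # assumed policy
print(f"per-second: ${per_second:.6f}  15s blocks: ${per_block:.6f}")
```

With these made-up turns, per-second rounding bills 21 seconds while 15-second blocks bill 150, roughly a sevenfold difference for the same audio.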
Deepgram also ships Flux, a separate model with built-in end-of-turn detection, so I did not have to bolt on voice-activity detection to stop the bot from interrupting.
The trade-off is breadth. Nova-3 streaming covers roughly ten languages versus the wider coverage from OpenAI and ElevenLabs, and diarization is priced as an add-on. For an English-first voice agent that needs speed above all, it is the default engine in the hot path.
Pricing: $0.0043 per minute batch and $0.0077 per minute streaming on pay-as-you-go, with $200 in free credits. Flux for voice agents starts at $0.0065 per minute.
OpenAI gpt-4o-transcribe
What does it do? A GPT-4o-based transcription model that replaces legacy Whisper with lower error rates across 99-plus languages.
Who is it for? Teams already building on OpenAI who want one vendor for transcription and LLM work.
| Category | Score |
|---|---|
| Transcription Accuracy | 8.7/10 |
| Real-Time Latency | 6.5/10 |
| Multilingual Support | 9.0/10 |
| Production Readiness | 8.0/10 |
| Ease of Setup | 9.0/10 |
| Overall | 8.3/10 |
I swapped a Whisper pipeline to gpt-4o-transcribe by changing one model string, and word errors on my multilingual set dropped noticeably, in line with OpenAI's reported 4.1% on clean benchmarks.
On my harder telephony audio it landed closer to the roughly 8.9% that independent benchmarks report, which is the gap between studio and street.
The real win is languages and simplicity. At $0.006 per minute, with a $0.003 mini tier, it handled 99-plus languages without me hunting for a separate provider, and the diarize variant labeled speakers in a two-party call within a point or two of purpose-built tools.
The weakness for voice agents is streaming. Native transcription is batch-first, and real-time means moving to the separate Realtime API, where my first transcript chunk arrived between 500 and 1500ms. For recorded multilingual content inside an OpenAI stack, it is an easy pick.
Pricing: $0.006 per minute for gpt-4o-transcribe, $0.003 for mini, and about $0.017 per minute for realtime transcription.
ElevenLabs Scribe v2
What does it do? A streaming-first speech-to-text model delivering low-latency transcripts across 90-plus languages.
Who is it for? Builders who need fast, accurate transcription in many languages for agents and live captions.
| Category | Score |
|---|---|
| Transcription Accuracy | 8.8/10 |
| Real-Time Latency | 9.0/10 |
| Multilingual Support | 9.5/10 |
| Production Readiness | 8.0/10 |
| Ease of Setup | 8.5/10 |
| Overall | 8.6/10 |
I streamed live audio into Scribe v2 Realtime and saw first partials back near 150ms, the fastest first-partial result in the group, with predictive transcription anticipating the next words.
Across a mix of English, Spanish, and Hindi clips it held accuracy where weaker models drifted, and it auto-detected language switches inside a single file without manual segmentation.
Speaker labeling impressed me at multi-party tables, where it tagged turns cleanly and timestamped entities for redaction. Keyterm prompting let me bias up to 1000 phrases toward product names and medications, which cut errors on branded terms.
The limitation is maturity as a call-center backbone. Realtime diarization on non-English audio is still rough, and production tooling is younger than Deepgram's. For multilingual real-time transcription where speed and language breadth both matter, it is the strongest specialist.
Pricing: Usage-based per audio minute, with Scribe v2 around $0.22 per 1,000 tokens on entry tiers after recent cuts, plus a free tier to start.
Published WER usually comes from clean recordings, and a model at 5% on a benchmark can hit 15-20% on a noisy call. I scored every model on my own 8kHz call set and cross-checked against independent benchmarks so the ranking reflects production reality, not a leaderboard.
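Overall WER treats every word equally, so a transcript can score well while missing the words that carry the transaction. A minimal sketch of a complementary check follows: the share of critical terms from the reference that survive into the hypothesis. The function name, the bag-of-words matching, and the sample sentences are my own assumptions for illustration, not the scoring script used for this ranking.

```python
from collections import Counter


def key_term_recall(reference: str, hypothesis: str, key_terms: set[str]) -> float:
    """Fraction of key-term occurrences in the reference found in the hypothesis."""
    ref_counts = Counter(w for w in reference.lower().split() if w in key_terms)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in key_terms)
    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference contains none of the key terms")
    # Count each key term at most as often as it appears in the reference.
    found = sum(min(count, hyp_counts[term]) for term, count in ref_counts.items())
    return found / total


# Illustrative pair: a drug name and one digit are misheard.
reference = "refill metformin five hundred milligrams for account nine four seven"
hypothesis = "refill met forming five hundred milligrams for account nine for seven"
terms = {"metformin", "five", "hundred", "nine", "four", "seven"}
print(f"{key_term_recall(reference, hypothesis, terms):.0%}")  # 67%
```

This check ignores word order, so treat it as a quick screen rather than a full entity-level evaluation.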
For transcription, time-to-final is the number. For a voice agent, what matters is the full round trip from speech to spoken reply, because anything past a second feels like a walkie-talkie. I weighted streaming latency heavily for the conversational use cases.
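The two streaming numbers can be derived from a timestamped event log, sketched below. I am assuming first-partial latency is measured from the start of speech and time-to-final from the end of speech. The `TranscriptEvent` type and the sample log are hypothetical, and a real client would fill them from whatever events its streaming API emits.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    t_ms: int        # arrival time, milliseconds since the stream opened
    is_final: bool   # True for the finalized transcript of the utterance
    text: str


def latency_metrics(speech_start_ms, speech_end_ms, events):
    """Return (first_partial_ms, time_to_final_ms) for one utterance."""
    ordered = sorted(events, key=lambda e: e.t_ms)
    first_partial = next((e for e in ordered if not e.is_final), None)
    final = next((e for e in ordered if e.is_final), None)
    first_partial_ms = first_partial.t_ms - speech_start_ms if first_partial else None
    time_to_final_ms = final.t_ms - speech_end_ms if final else None
    return first_partial_ms, time_to_final_ms


# Hypothetical event log for a single caller turn.
events = [
    TranscriptEvent(1_280, False, "my policy"),
    TranscriptEvent(1_900, False, "my policy number is"),
    TranscriptEvent(3_450, True, "my policy number is A 4 7 2"),
]
print(latency_metrics(speech_start_ms=1_000, speech_end_ms=3_000, events=events))
# (280, 450)
```

For a voice agent, time-to-final is only the first term in the round trip: the LLM response time and the first audio byte from the voice are added on top of it.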
A model that wins in English can collapse on accented or mixed-language calls. I tested Spanish-English and Hindi audio because neural voice is now the default in customer service across the voice recognition market, and callers do not stay monolingual.
The deepest criterion was scope. A raw model gives you text and leaves the LLM, voice, and turn-taking to you. A platform gives you an agent. I scored each tool against the job buyers are hiring it to do, which is why the ranking separates transcription leaders from agent platforms.
A model gives you text. A business needs the call answered, the question resolved, and the appointment booked. The five models here are the right starting point if transcription is your product, and AssemblyAI, Deepgram, OpenAI, and ElevenLabs each win a clear lane.
If your goal is a phone agent that hears, understands, and acts in one roughly 600ms loop, you need the layer above the model. Retell AI runs speech recognition, your choice of LLM, an ultra-realistic voice, and proprietary turn-taking as a single production agent, with no platform fees and $10 in free credits.
Start building your first AI voice agent free today.
Choosing a speech-to-text model in 2026 comes down to one honest question: what happens to the text after the model produces it? If a person reads the transcript later, accuracy and price decide the winner, and the async leaders are hard to beat. If a machine has to hear a caller, understand intent, and respond before the silence gets awkward, then the model is only the first link in a longer chain, and latency across the whole loop matters more than any single benchmark.
The teams that ship reliable voice products make that distinction early. They test on the audio their users produce in the wild, noisy lines and accented names included, instead of clean studio samples. They measure the full round trip, not the partial transcript. And they decide upfront whether they are buying a component or an outcome. Get that decision right and the shortlist narrows itself. The hard part was never the model. It was knowing which job you were hiring it for.
Which speech-to-text model has the lowest word error rate in 2026?
AssemblyAI Universal-3 Pro leads accuracy, reporting a 5.6% mean WER and posting the lowest errors on my noisy call set at 4.7%. Deepgram Nova-3 follows at 5.26% on its own real-world set while returning words far faster.
Do I need a speech-to-text model or a full voice agent platform?
If you only need transcripts of recorded files, a raw model is cheaper and simpler. If you need a system that hears a caller and responds in real time, you need the layer above the model, which is why an AI IVR replacement runs STT, an LLM, a voice, and turn-taking together.
Which speech-to-text model is fastest for live voice agents?
ElevenLabs Scribe v2 Realtime returned first partials near 150ms in my test, and Deepgram Nova-3 returned partial transcripts in under 300ms. For the full hear-decide-respond loop, Retell measured roughly 600ms end to end. That number includes the LLM and the spoken reply, not transcription alone.
How much do speech-to-text models cost per minute at scale?
Deepgram batch runs $0.0043 per minute, OpenAI gpt-4o-transcribe is $0.006 per minute, and AssemblyAI async is $0.21 per hour, which works out to $0.0035 per minute. A full agent minute that bundles STT, LLM, voice, and telephony is a different unit, priced around $0.07 per minute.
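Because the vendors quote different units, it helps to normalize everything to a per-minute rate before comparing. The sketch below uses the prices quoted in this article; the 100,000-minute monthly volume is a hypothetical figure, and volume discounts are ignored.

```python
# Prices as quoted in this article, normalized to dollars per minute.
PRICES_PER_MINUTE = {
    "Deepgram Nova-3 (batch)": 0.0043,
    "Deepgram Nova-3 (streaming)": 0.0077,
    "OpenAI gpt-4o-transcribe": 0.006,
    "AssemblyAI Universal-3 Pro (async)": 0.21 / 60,  # quoted per hour
    "Retell AI (full agent minute)": 0.07,            # bundles STT, LLM, voice, telephony
}


def monthly_cost(rate_per_minute: float, minutes_per_month: int) -> float:
    """List-price cost for a month of audio, ignoring volume discounts."""
    return rate_per_minute * minutes_per_month


MINUTES = 100_000  # hypothetical monthly volume
for name, rate in sorted(PRICES_PER_MINUTE.items(), key=lambda kv: kv[1]):
    print(f"{name:<38} ${rate:.4f}/min  ${monthly_cost(rate, MINUTES):>9,.2f}/month")
```

The agent minute is listed for scale only. It buys the whole pipeline, so comparing it line by line against a transcription-only rate would be misleading.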
Can these speech-to-text models handle multilingual and accented calls?
OpenAI covers 99-plus languages and ElevenLabs covers 90-plus, making them the broadest. AssemblyAI handles code-switching across six core languages, and accuracy on accented audio still drops several points for every model, so test on your own callers.
Are speech-to-text models HIPAA compliant for healthcare calls?
Most leaders offer HIPAA coverage under a signed BAA, including AssemblyAI, Deepgram, ElevenLabs, and Retell, the last of which also carries SOC 2 Type II and GDPR. Confirm the BAA is executed before sending any protected health information, and check whether redaction is included or billed separately.
What happens when a speech-to-text model mishears a critical word?
In a transcription-only stack, the error flows downstream silently and corrupts the record. In an agent, recovery logic can re-confirm a spelled number or trigger a warm transfer, which is why production voice teams design for misrecognition rather than assuming the model is always right.
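Designing for misrecognition can be as simple as gating critical fields on confidence. The sketch below assumes the STT engine returns a per-word confidence score, which many do. The threshold, the retry budget, and the action names are hypothetical values for illustration, not any platform's built-in behavior.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # 0.0 to 1.0, assumed to come from the STT engine


CONFIRM_BELOW = 0.85  # hypothetical threshold, tune on your own calls
MAX_ATTEMPTS = 2      # hypothetical retry budget before handing off


def next_action(words, attempts_so_far):
    """Decide what to do with a critical field such as an account number."""
    # An empty result is treated as zero confidence.
    weakest = min((w.confidence for w in words), default=0.0)
    if weakest >= CONFIRM_BELOW:
        return "accept"
    if attempts_so_far < MAX_ATTEMPTS:
        return "reconfirm"  # read the value back and ask the caller to confirm
    return "transfer"       # warm transfer with the conversation context


heard = [Word("4", 0.97), Word("7", 0.62), Word("2", 0.91)]
print(next_action(heard, attempts_so_far=0))  # reconfirm
print(next_action(heard, attempts_so_far=2))  # transfer
```

Gating on the weakest word rather than the average is deliberate: one uncertain digit is enough to make an account number wrong.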