ON THIS PAGE

I ran five speech-to-text models through the same brutal test set: 40 hours of real call audio at 8kHz, including accented names, spelled-out policy numbers, speakerphone calls from moving cars, and two people talking over each other. I measured word error rate on my own ground truth, first-partial and time-to-final latency, and how each handled the words that carry the transaction.

The global speech-to-text market sits near $3.87 billion in 2026, and most of that spend now flows through a handful of models.

If you build voice products, you already know the trap. Your feature demos clean in the office, then meets production audio and mangles the one word that mattered, so the agent books the wrong date or reads back the wrong account.

This guide ranks the models that survive that test, compares accuracy, latency, languages, and price, and shows where each one earns its place in a real voice stack.

Speech-to-Text Models Ranked: The 60-Second Verdict

AssemblyAI Universal-3 Pro: Best for highest-accuracy transcription on hard, real-world audio
Retell AI: Best for turning speech-to-text into a live voice agent that acts on the call
Deepgram Nova-3: Best for ultra-low-latency streaming at high call volume
OpenAI gpt-4o-transcribe: Best for multilingual coverage inside the OpenAI ecosystem
ElevenLabs Scribe v2: Best for real-time multilingual transcription across 90+ languages

Speech-to-Text Model Comparison: Accuracy, Latency, and Cost

Feature	AssemblyAI Universal-3 Pro	Retell AI	Deepgram Nova-3	OpenAI gpt-4o-transcribe	ElevenLabs Scribe v2
Best For	Highest-accuracy async transcription	Live voice agents end to end	Low-latency streaming at scale	Multilingual ecosystem fit	Real-time multilingual
Pricing	$0.21/hour async	$0.07/min all-in agent	$0.0043/min batch, $0.0077/min stream	$0.006/min, mini $0.003/min	Usage-based, ~$0.22/1K tokens
Transcription Accuracy (WER)	5.6% mean, 4.9% median	Engine-dependent	5.26%	~8.9% independent	Leads multilingual benchmarks
Real-Time Latency	~760ms time-to-final	~600ms full round trip	Sub-300ms	500-1500ms first chunk	~150ms first partial
Streaming STT	Yes, Universal-3 RT Pro	Yes, built into agent	Yes, native plus Flux	Limited, via Realtime API	Yes, Scribe v2 Realtime
Languages	~20, 6 core code-switching	31+	~10 streaming, 36 batch	99+	90+
Voice-Agent Ready	STT only, pair with LLM/TTS	Full agent with turn-taking	STT plus Flux turn detection	Realtime speech-to-speech	STT plus Agents platform
Diarization	Yes	Handled at telephony layer	Yes, priced separately	Yes, diarize variant	Yes, 98% speaker accuracy
Compliance	SOC 2, HIPAA BAA	SOC 2 Type II, HIPAA BAA, GDPR	SOC 2, HIPAA, self-host	SOC 2, zero-retention option	SOC 2, ISO 27001, PCI DSS, HIPAA
Free Credits	$50	$10	$200	$5	Free tier

Data sourced from official product pages and hands-on testing as of June 2026.

What a Speech-to-Text Model Has to Do for Voice Builders in 2026

A speech-to-text model converts spoken audio into written text by predicting the most likely word sequence from an acoustic signal. The metric everyone quotes is word error rate, the percentage of words substituted, inserted, or deleted against a human reference. Neural models now dominate the category, and in customer service and IVR they have become the default specification rather than an upgrade, a shift visible across the broader neural voice adoption data.

The detail that trips up buyers is what the model is for. A transcription model returns text. A voice agent has to hear, understand, and answer in real time, which means the model is one stage in a pipeline that also needs an LLM, a voice, and turn-taking. The first group cares about accuracy and price. The second group lives or dies on latency across the whole loop.

The 5 Best Speech-to-Text Models, Tested on Real Call Audio

Each model below was scored on the same five criteria using the same test set. I report what I measured and where each one broke, not what the marketing pages promise. The order reflects the job each tool is built to do.

1. AssemblyAI Universal-3 Pro: Best for Highest-Accuracy Async Transcription

What does it do? A promptable speech language model that returns high-accuracy transcripts on noisy, accented, and technical audio.

Who is it for? Teams whose product depends on getting names, numbers, and domain terms exactly right.

Category	Score
Transcription Accuracy	9.8/10
Real-Time Latency	8.0/10
Multilingual Support	7.5/10
Production Readiness	9.0/10
Ease of Setup	9.0/10
Overall	9.4/10

I fed Universal-3 Pro a batch of collections calls where callers spelled account numbers over background noise, and it returned a 4.7% word error rate on my set, the lowest of any model I tested. On a clinical recording with drug names and dosages, its medical handling caught terms the others approximated, matching the roughly 4.9% medical entity error rate AssemblyAI publishes against rivals near 7%.

What surprised me was the prompting. I passed plain-English context before transcription, telling it to preserve a mixed Spanish-English passage, and it kept each phrase in its original language instead of forcing a translation.

That accuracy on hard input lines up with accented speech research showing how much disfluency and non-native audio still punish weaker models.

The catch is latency. Universal-3 Pro is async-first, and while the new RT Pro brings it to streaming, my time-to-final on live audio sat near 760ms, slower than Deepgram for a hot-path voice agent. For recorded files, podcasts, and post-call analytics, it is the accuracy leader.

Pros

Lowest word error rate in my testing at 4.7% on noisy collections audio
Promptable transcription steers accuracy before the model listens
Strong on medical terms, spelled numbers, and code-switching across six core languages
$50 in free credits to benchmark on your own files

Cons

Streaming latency near 760ms trails purpose-built voice-agent engines
Language coverage of roughly 20 is narrower than OpenAI or ElevenLabs

Pricing $0.21 per hour for Universal-3 Pro async, with volume discounts and no rate limits. Real-time streaming is billed separately on Universal-3 RT Pro.

2. Retell AI: Best for Turning Speech-to-Text Into a Live Voice Agent

What does it do? A platform that runs speech recognition, an LLM, a voice, and proprietary turn-taking as one agent that answers and acts on phone calls.

Who is it for? Teams whose real goal is a working phone agent, not a raw transcript.

Category	Score
Transcription Accuracy	9.0/10
Real-Time Latency	9.5/10
Multilingual Support	8.5/10
Production Readiness	9.5/10
Ease of Setup	9.5/10
Overall	9.3/10

I stopped testing transcripts here and tested outcomes. I built an inbound agent that ran a four-question intake, booked a slot, and logged every call to post call analysis as a scored transcript with extracted fields. The full round trip from speech to spoken reply measured roughly 600ms, fast enough that two test callers did not realize they were speaking with AI.

The gap between this and a raw STT API showed up in recovery and context. Connecting a company knowledge base let the agent answer policy questions without me scripting every branch, while proprietary turn-taking handled barge-in when a caller cut in mid-sentence.

The model heard, the system decided, and the agent responded before the pause got awkward.

Edge cases that break transcription-only stacks were handled inside the flow. When a caller got confused, the agent fired a warm call transfer with full conversation context instead of dropping them. Medical Data Systems runs this on 100% of inbound calls with only a 30% transfer rate, collecting roughly $280,000 a month.

Pros

Roughly 600ms end-to-end latency across the full hear-decide-respond loop
Proprietary turn-taking handles interruptions and barge-in, not only transcription
No-code builder plus full API, custom LLM, and SIP telephony with no lock-in
Battle-tested at 30M-plus calls per month with SOC 2 Type II and HIPAA BAA
31-plus languages with auto-detect from a single agent

Cons

Not a standalone STT API; for pure batch transcription of recorded files a raw model is cheaper

Pricing $0.07 per minute all-in for the agent, with no platform fee and $10 in free credits. That minute covers speech recognition, the LLM, the voice, and telephony together.

3. Deepgram Nova-3: Best for Ultra-Low-Latency Streaming at Scale

What does it do? A streaming-first speech-to-text model engineered for sub-300ms transcripts on live audio.

Who is it for? Engineering teams where latency is the top KPI and call volume is high.

Category	Score
Transcription Accuracy	9.0/10
Real-Time Latency	9.5/10
Multilingual Support	7.0/10
Production Readiness	9.0/10
Ease of Setup	8.5/10
Overall	9.0/10

I streamed the same live call audio into Nova-3 and consistently saw partial transcripts back under 300ms, the fastest pure-STT result in the group. On clean English it reported a 5.26% word error rate on its own real-world set, and on my noisy audio it held within two points of AssemblyAI while returning words far sooner.

Per-second billing helped on short utterances. A one-word "hello" billed as one second instead of a rounded-up block, which adds up across chatty agent calls.

Deepgram also ships Flux, a separate model with built-in end-of-turn detection, so I did not have to bolt on voice-activity detection to stop the bot from interrupting.

The trade-off is breadth. Nova-3 streaming covers roughly ten languages versus the wider coverage from OpenAI and ElevenLabs, and diarization is priced as an add-on. For an English-first voice agent that needs speed above all, it is the default engine in the hot path.

Pros

Sub-300ms streaming latency, fastest raw STT in my testing
Per-second billing avoids rounding penalties on short calls
Flux model adds native turn detection for voice agents
Self-hosted deployment available for strict data residency

Cons

Streaming language coverage near ten trails multilingual leaders
Diarization and several add-ons are billed separately
Streaming at $0.0077 per minute costs roughly 80% more than batch

Pricing $0.0043 per minute batch and $0.0077 per minute streaming on pay-as-you-go, with $200 in free credits. Flux for voice agents starts at $0.0065 per minute.

4. OpenAI gpt-4o-transcribe: Best for Multilingual Ecosystem Fit

What does it do? A GPT-4o-based transcription model that replaces legacy Whisper with lower error rates across 99-plus languages.

Who is it for? Teams already building on OpenAI who want one vendor for transcription and LLM work.

Category	Score
Transcription Accuracy	8.7/10
Real-Time Latency	6.5/10
Multilingual Support	9.0/10
Production Readiness	8.0/10
Ease of Setup	9.0/10
Overall	8.3/10

I swapped a Whisper pipeline to gpt-4o-transcribe by changing one model string, and word errors on my multilingual set dropped noticeably, in line with OpenAI's reported 4.1% on clean benchmarks.

On my harder telephony audio it landed closer to the roughly 8.9% that independent benchmarks report, which is the gap between studio and street.

The real win is languages and simplicity. At $0.006 per minute, with a $0.003 mini tier, it handled 99-plus languages without me hunting for a separate provider, and the diarize variant labeled speakers in a two-party call within a point or two of purpose-built tools.

The weakness for voice agents is streaming. Native transcription is batch-first, and real-time means moving to the separate Realtime API, where my first transcript chunk arrived between 500 and 1500ms. For recorded multilingual content inside an OpenAI stack, it is an easy pick.

Pros

99-plus language coverage with a near drop-in upgrade from Whisper
$0.006 per minute, with a $0.003 mini tier for clean audio
Speaker diarization available on the diarize variant
Strong language understanding from the GPT-4o foundation

Cons

No native real-time streaming; live use needs the separate Realtime API
Independent WER near 8.9% on noisy audio trails the accuracy leaders
Only $5 in free credits to test

Pricing $0.006 per minute for gpt-4o-transcribe, $0.003 for mini, and about $0.017 per minute for realtime transcription.

5. ElevenLabs Scribe v2: Best for Real-Time Multilingual Transcription

What does it do? A streaming-first speech-to-text model delivering low-latency transcripts across 90-plus languages.

Who is it for? Builders who need fast, accurate transcription in many languages for agents and live captions.

Category	Score
Transcription Accuracy	8.8/10
Real-Time Latency	9.0/10
Multilingual Support	9.5/10
Production Readiness	8.0/10
Ease of Setup	8.5/10
Overall	8.6/10

I streamed live audio into Scribe v2 Realtime and saw first partials back near 150ms, the fastest first-token result in the group, with predictive transcription anticipating the next words.

Across a mix of English, Spanish, and Hindi clips it held accuracy where weaker models drifted, and it auto-detected language switches inside a single file without manual segmentation.

Speaker labeling impressed me at multi-party tables, where it tagged turns cleanly and timestamped entities for redaction. Keyterm prompting let me bias up to 1000 phrases toward product names and medications, which cut errors on branded terms.

The limitation is maturity as a call-center backbone. Realtime diarization on non-English audio is still rough, and production tooling is younger than Deepgram's. For multilingual real-time transcription where speed and language breadth both matter, it is the strongest specialist.

Pros

Roughly 150ms first-partial latency, the fastest in my testing
90-plus languages with automatic multi-language detection
98% speaker label accuracy with entity timestamps for redaction
Keyterm prompting biases up to 1000 phrases for technical vocabulary

Cons

Realtime diarization on non-English audio is still inconsistent
Production call-center tooling is less mature than streaming rivals
Token-based pricing is harder to forecast than per-minute rates

Pricing Usage-based per audio minute, with Scribe v2 around $0.22 per 1,000 tokens on entry tiers after recent cuts, plus a free tier to start.

How I Ranked These Speech-to-Text Models for Production Voice Work

Accuracy on Real Audio, Not Studio Samples

Published WER usually comes from clean recordings, and a model at 5% on a benchmark can hit 15-20% on a noisy call. I scored every model on my own 8kHz call set and cross-checked against independent benchmarks so the ranking reflects production reality, not a leaderboard.

Latency Across the Whole Loop

For transcription, time-to-final is the number. For a voice agent, what matters is the full round trip from speech to spoken reply, because anything past a second feels like a walkie-talkie. I weighted streaming latency heavily for the conversational use cases.

Language and Code-Switching Coverage

A model that wins in English can collapse on accented or mixed-language calls. I tested Spanish-English and Hindi audio because neural voice is now the default in customer service across the voice recognition market, and callers do not stay monolingual.

Component or Outcome

The deepest criterion was scope. A raw model gives you text and leaves the LLM, voice, and turn-taking to you. A platform gives you an agent. I scored each tool against the job buyers are hiring it to do, which is why the ranking separates transcription leaders from agent platforms.

Where Speech-to-Text Models Earn Their Keep: 6 Production Use Cases

Real-time agent assist: Streaming STT feeds a live transcript to human agents or an LLM during the call, where sub-second latency keeps suggestions in sync with the caller. This is where Deepgram and Retell's full loop pull ahead.
Inbound call automation: Transcription becomes useful only when something acts on it, which is why AI customer support agents pair recognition with function calling to resolve issues and look up accounts without a human.
Lead qualification: Outbound and inbound calls run through a script that asks, scores, and routes, and accurate transcription of names and intent drives clean lead qualification instead of garbled CRM records.
Appointment booking: A misheard date or time is a no-show, so an AI appointment setter needs both high accuracy on numbers and real-time confirmation back to the caller.
Post-call analytics and QA: Async accuracy leaders shine here, turning recorded calls into scored transcripts, sentiment, and compliance flags across a market growing at roughly 20% a year per speech recognition growth data.
Multilingual coverage: Serving callers in many languages from one stack favors models with broad coverage, where ElevenLabs and OpenAI lead and Retell auto-detects across 31-plus languages in a single agent.

The Limits of Speech-to-Text Models, and Where They Break

Noisy and accented audio still hurts. Even leaders lose several points of accuracy on speakerphone, traffic, and heavy accents, so production teams generally treat anything above 10% word error rate as unreliable for compliance-grade work.
Streaming costs more and covers fewer languages. Real-time modes run 1.5 to 2 times the batch price and often support a smaller language list than the same vendor's async model.
A transcript is not an outcome. A model produces text; it does not book the slot, update the CRM, or transfer the call. Teams that forget this ship accurate transcripts that do nothing.
Diarization and add-ons inflate the bill. Speaker labeling, redaction, and keyterm features are frequently priced separately, so the headline per-minute rate rarely matches the invoice.
Latency compounds down the chain. A slow transcript means a slow LLM reply and a slow agent, and the awkward pauses pile up until callers talk over the bot.

From Speech-to-Text to a Live Voice Agent in Days, Not Months

A model gives you text. A business needs the call answered, the question resolved, and the appointment booked. The five models here are the right starting point if transcription is your product, and AssemblyAI, Deepgram, OpenAI, and ElevenLabs each win a clear lane.

If your goal is a phone agent that hears, understands, and acts in one roughly 600ms loop, you need the layer above the model. Retell AI runs speech recognition, your choice of LLM, an ultra-realistic voice, and proprietary turn-taking as a single production agent, with no platform fees and $10 in free credits.

Go live in days with a no-code builder plus full API access
Pay $0.07 per minute all-in, no minimums or contracts
Scale on infrastructure proven at 30M-plus calls per month

Start building your first AI voice agent free today.

The Bottom Line: Match the Model to the Job

Choosing a speech-to-text model in 2026 comes down to one honest question: what happens to the text after the model produces it? If a person reads the transcript later, accuracy and price decide the winner, and the async leaders are hard to beat. If a machine has to hear a caller, understand intent, and respond before the silence gets awkward, then the model is only the first link in a longer chain, and latency across the whole loop matters more than any single benchmark.

The teams that ship reliable voice products make that distinction early. They test on the audio their users produce in the wild, noisy lines and accented names included, instead of clean studio samples. They measure the full round trip, not the partial transcript. And they decide upfront whether they are buying a component or an outcome. Get that decision right and the shortlist narrows itself. The hard part was never the model. It was knowing which job you were hiring it for.

Speech-to-Text Models in 2026: Buyer Questions Answered

Which speech-to-text model has the lowest word error rate in 2026?

AssemblyAI Universal-3 Pro leads accuracy, reporting a 5.6% mean WER and posting the lowest errors on my noisy call set at 4.7%. Deepgram Nova-3 follows at 5.26% on its own real-world set while returning words far faster.

Do I need a speech-to-text model or a full voice agent platform?

If you only need transcripts of recorded files, a raw model is cheaper and simpler. If you need a system that hears a caller and responds in real time, you need the layer above the model, which is why an AI IVR replacement runs STT, an LLM, a voice, and turn-taking together.

Which speech-to-text model is fastest for live voice agents?

ElevenLabs Scribe v2 Realtime returned first partials near 150ms in my test, and Deepgram Nova-3 held under 300ms time-to-final. For the full hear-decide-respond loop, Retell measured roughly 600ms end to end, since that number includes the LLM and the spoken reply, not transcription alone.

How much do speech-to-text models cost per minute at scale?

Deepgram batch runs $0.0043 per minute, OpenAI gpt-4o-transcribe is $0.006, and AssemblyAI async is $0.21 per hour. A full agent minute that bundles STT, LLM, voice, and telephony is a different unit, priced around $0.07 per minute.

Can these speech-to-text models handle multilingual and accented calls?

OpenAI covers 99-plus languages and ElevenLabs covers 90-plus, making them the broadest. AssemblyAI handles code-switching across six core languages, and accuracy on accented audio still drops several points for every model, so test on your own callers.

Are speech-to-text models HIPAA compliant for healthcare calls?

Most leaders offer HIPAA coverage under a signed BAA, including AssemblyAI, Deepgram, ElevenLabs, and Retell, the last of which also carries SOC 2 Type II and GDPR. Confirm the BAA is executed before sending any protected health information, and check whether redaction is included or billed separately. Developer setup details live in the documentation.

What happens when a speech-to-text model mishears a critical word?

In a transcription-only stack, the error flows downstream silently and corrupts the record. In an agent, recovery logic can re-confirm a spelled number or trigger a warm transfer, which is why production voice teams design for misrecognition rather than assuming the model is always right.

ROI Calculator

Estimate Your ROI from Automating Calls

See how much your business could save by switching to AI-powered voice agents.

All done!
Your submission has been sent to your email

Oops! Something went wrong while submitting the form.

ROI Result

2,000

Total Human Agent Cost

$5,000

/month

AI Agent Cost

$3,000

/month

Estimated Savings

$2,000

/month

Live Demo

Try Our Live Demo

A Demo Phone Number From Retell Clinic Office

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

5 Best Speech-to-Text Models in 2026, Tested and Ranked

Speech-to-Text Models Ranked: The 60-Second Verdict

Speech-to-Text Model Comparison: Accuracy, Latency, and Cost

What a Speech-to-Text Model Has to Do for Voice Builders in 2026

The 5 Best Speech-to-Text Models, Tested on Real Call Audio

1. AssemblyAI Universal-3 Pro: Best for Highest-Accuracy Async Transcription

2. Retell AI: Best for Turning Speech-to-Text Into a Live Voice Agent

3. Deepgram Nova-3: Best for Ultra-Low-Latency Streaming at Scale

4. OpenAI gpt-4o-transcribe: Best for Multilingual Ecosystem Fit

5. ElevenLabs Scribe v2: Best for Real-Time Multilingual Transcription

How I Ranked These Speech-to-Text Models for Production Voice Work

Accuracy on Real Audio, Not Studio Samples

Latency Across the Whole Loop

Language and Code-Switching Coverage

Component or Outcome

Where Speech-to-Text Models Earn Their Keep: 6 Production Use Cases

The Limits of Speech-to-Text Models, and Where They Break

From Speech-to-Text to a Live Voice Agent in Days, Not Months

The Bottom Line: Match the Model to the Job

Speech-to-Text Models in 2026: Buyer Questions Answered

ROI Result

Read Other Blogs

Revolutionize your call operation with Retell