How Real-Time Voice AI Actually Works (STT → LLM → TTS, Explained)


What happens between "hello" and the agent's reply, in plain English. No jargon, no hand-waving.

TL;DR

  • Real-time voice AI is a three-stage pipeline with two pieces of orchestration around it. Audio comes in → speech-to-text turns it into words → an LLM decides what to do → text-to-speech turns the reply back into audio. Wrapped around all of it: turn-taking (when has the caller stopped?) and barge-in handling (what if they cut us off?). That's the whole thing.

  • The whole pipeline has to finish in under ~700ms or it stops feeling human. Above that threshold callers get awkward, repeat themselves, and hang up. Below it, they forget they're talking to AI. Retell's stack runs around 600ms end-to-end. That's not luck. It's the result of every stage streaming into the next instead of waiting for it to finish.

  • Most of the latency hides where you don't expect it. Not in STT, not in TTS — in turn-taking decisions and LLM time-to-first-token. If your build feels slow, those are the two places to look first.

  • Streaming is the trick. STT emits partial transcripts every ~50ms instead of waiting for a complete sentence. The LLM streams tokens as they're generated. TTS streams audio chunks before the full reply exists. None of this works if any one stage waits for the previous to "finish."

  • The 2026 stacks all look similar architecturally. What separates "production-grade" from "demo" is orchestration quality — VAD tuning, turn-taking models, interruption handling, function-call latency. That's where the actual engineering investment goes.

How Real-Time Voice AI Actually Works

Strip the marketing off and a voice agent is a pipeline. Audio comes in over the phone. Software turns it into text. A language model reads that text, decides what to say or what to do, and generates a reply. More software turns the reply back into audio. Audio goes back out the phone. The caller hears it, says something, and the loop runs again. That's it. That's the whole product.

The reason that simple loop took years to make work is that all of it has to happen in less than a second. Every part of the pipeline has to be streaming. Every transition between stages has to be near-instant. Two stages have to make hard real-time decisions — turn-taking ("has the caller stopped talking yet?") and barge-in handling ("the caller just started talking over me, what do I do?"). Get any of it wrong and the conversation falls apart in a way callers notice immediately, even if they can't articulate why.

This article is the no-marketing version of how a voice agent actually works in 2026. We'll walk a single conversational turn end to end, look at where the latency hides, talk about the orchestration that separates production stacks from demos, and clean up a few common misconceptions about what's actually happening under the hood.

If you're a PM trying to understand what your engineers are building, this is for you. If you're an engineer evaluating a platform vs. building it yourself, also for you. Either way: by the end you'll know what's happening every time someone says "hello" and an AI says "hi" back.

Sixty Seconds to the Pipeline

Here's the elevator version.

A voice agent is three things in a row with two things wrapped around them. The three in a row: STT → LLM → TTS. Speech-to-text turns the caller's audio into words. A large language model reads those words (plus your system prompt, the conversation so far, and a description of any tools the agent can call), and decides whether to speak or call a function. Text-to-speech turns the model's reply back into audio.

The two things wrapped around that pipeline: turn-taking and barge-in. Turn-taking is the system that decides when the caller has finished a thought so the agent can respond — way harder than it sounds, because humans pause mid-sentence all the time. Barge-in is the system that handles a caller cutting the agent off mid-reply — also harder than it sounds, because you have to stop TTS instantly, drop whatever the model was about to say, and start listening again.

Why this is hard: every stage has to be streaming, and every stage has a latency budget you can't blow. Get the whole loop under ~700ms and the conversation feels human. Go over and it doesn't. That's the whole job.

What Happens in 600 Milliseconds: The Seven Stages of a Single Turn

Let's walk through one conversational turn end to end. The caller says "Hi, I'd like to book a cleaning for next Tuesday afternoon" — and 600ms later, the agent replies. Here's everything that happens in between.

1. Audio comes in over the phone

The call hits your telephony layer first — a SIP trunk if you're using your existing carrier, or a WebRTC stream if you're using a Retell number. Either way, the caller's audio shows up as a stream of small packets, usually 20ms each. From the moment the caller starts speaking, those packets are flowing into your stack at line speed. Network round-trip is the first piece of the latency budget you can't cheat: typically 30–80ms depending on geography and carrier, before any AI work has happened.
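
To make the framing concrete, here's a minimal sketch of what those 20ms packets look like to your code, assuming an 8 kHz μ-law SIP leg (WebRTC legs usually run a wider-band codec). The frame source is a hypothetical stand-in for whatever your telephony library hands you.

```python
# Sketch: what "a stream of small packets, usually 20ms each" looks like to
# your code. Assumes 8 kHz mu-law audio (standard on SIP trunks); frame_source
# is a stand-in for whatever your telephony library provides.

SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
BYTES_PER_SAMPLE = 1   # mu-law encodes each sample in one byte
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE   # 160 bytes

def handle_call(frame_source):
    """frame_source yields raw 20 ms frames as they arrive off the wire."""
    for frame in frame_source:
        assert len(frame) == FRAME_BYTES
        process_frame(frame)   # hand off immediately -- nothing waits for the
                               # caller to finish before downstream work starts

def process_frame(frame: bytes) -> None:
    pass   # placeholder: VAD and STT consume each frame in the next stages

# Example: 50 frames = one second of caller audio
handle_call(frame_source=[b"\x00" * FRAME_BYTES for _ in range(50)])
```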

2. Voice activity detection (VAD)

VAD is the lightweight model that decides whether the audio coming in is speech or silence. It runs on every incoming chunk, in milliseconds. Why bother? Two reasons. One: you don't want to send silence to your STT — it wastes compute and confuses turn-taking. Two: VAD is the first signal turn-taking uses to decide when the caller has stopped speaking. Bad VAD is one of the quiet killers of voice AI. Tune it too tight and you cut the caller off mid-word. Tune it too loose and the agent feels sluggish. Production-grade stacks use a small neural net trained specifically on phone-call audio for this, not a generic energy threshold.
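
Here's roughly what that tuning tradeoff looks like in code: a minimal sketch with a probability threshold and a "hangover" window of tolerated silence. The speech_probability model and the specific numbers are illustrative, not Retell's actual settings.

```python
# Sketch: VAD gating with a probability threshold plus a "hangover" window --
# the two knobs that trade cutting callers off against feeling sluggish.
# speech_probability() is a stand-in for a small neural VAD, not a real library.

SPEECH_THRESHOLD = 0.6   # frame counts as speech above this probability
HANGOVER_MS = 240        # silence tolerated before reporting end of speech
FRAME_MS = 20

class SimpleVAD:
    def __init__(self) -> None:
        self.in_speech = False
        self.silence_ms = 0

    def update(self, frame: bytes) -> str:
        if speech_probability(frame) >= SPEECH_THRESHOLD:
            self.in_speech = True
            self.silence_ms = 0
            return "speech"                 # keep feeding STT
        if self.in_speech:
            self.silence_ms += FRAME_MS
            if self.silence_ms >= HANGOVER_MS:
                self.in_speech = False
                return "end_of_speech"      # first signal for turn-taking
        return "silence"                    # nothing worth sending downstream

def speech_probability(frame: bytes) -> float:
    """Stand-in for a phone-audio-trained neural VAD."""
    return 0.0
```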

3. Speech-to-text streams partial transcripts

As soon as VAD says "this is speech," audio gets piped into a streaming STT engine. The key word is streaming. The STT doesn't wait for the caller to finish. It emits partial transcripts every ~50ms — incomplete guesses that get revised as more audio arrives. So partway through the sentence, the transcript might read "Hi I'd like to book a." A beat later, "Hi I'd like to book a cleaning for." And by the time the caller stops speaking, the full sentence. Modern STT also handles diarization (who's speaking — useful when there's more than one person on the line), interim correction (revising "two" to "two-thirty" once more context arrives), and noise robustness for callers on speakerphone or in airports.

If you're wondering where most homemade builds quietly fail, this is one of the places. Recognition accuracy is fine in 2026. The hard part is the streaming, the partials, and the end-of-utterance detection — none of which you get from a generic "transcribe this audio file" API.
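
For contrast with a "transcribe this audio file" API, here's a minimal sketch of what consuming a streaming STT feed looks like. The stt_stream client and the Transcript shape are hypothetical; the point is that partials keep getting revised and turn-taking watches every revision.

```python
# Sketch: consuming a streaming STT feed rather than a batch transcription API.
# stt_stream() is a hypothetical stand-in for a websocket-based recognizer.

from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    is_final: bool   # partials get revised; the final is stable

def stt_stream(audio_frames):
    """Stand-in: a real engine emits these over a websocket as audio arrives."""
    yield Transcript("hi i'd like to book a", is_final=False)
    yield Transcript("hi i'd like to book a cleaning for", is_final=False)
    yield Transcript("Hi, I'd like to book a cleaning for next Tuesday afternoon.",
                     is_final=True)

def consume(audio_frames):
    latest = ""
    for t in stt_stream(audio_frames):
        latest = t.text                       # each partial replaces the last
        feed_turn_taking(latest, t.is_final)  # turn-taking watches every revision

def feed_turn_taking(text: str, is_final: bool) -> None:
    pass   # placeholder: the turn-taking model is the next stage

consume(audio_frames=[])
```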

4. Turn-taking decides the caller is done

This is the dark art. Turn-taking is the model that decides when the caller has finished a thought, so the agent can respond. It's not just "wait 500ms after the last word." Humans pause mid-sentence, take breaths, say "um" while they think. A naive timeout will either cut them off ("Hi, I'd like to book—" "OK, what would you like to book?") or feel slow ("...for next Tuesday afternoon." [silence] [silence] "Got it, let me check.").

The 2026 production answer is a small, fast neural turn-taking model that takes the audio stream, the partial transcript, and the conversation context, and gives a probability that the caller has finished their turn. It updates dozens of times per second. When confidence crosses a threshold, the agent's turn starts. Retell's turn-taking model handles backchannels ("mm-hmm," "right"), hesitation pauses, and end-of-utterance detection inside an end-to-end response budget of roughly 600ms. (How our turn-taking works.)

If you take one thing away from this article: most of the difference between "feels human" and "feels robotic" lives in this stage. Latency-budget-wise, turn-taking eats 150–300ms of your total response time. Quality-wise, it's the single biggest factor in whether your callers respect the agent.
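
A minimal sketch of the idea, assuming a hypothetical end_of_turn_probability model (the threshold is illustrative, not a published Retell number): instead of a fixed silence timeout, the decision is a probability that gets re-evaluated constantly.

```python
# Sketch: thresholding an end-of-turn probability instead of a fixed silence
# timeout. end_of_turn_probability() is a stand-in for a trained turn-taking
# model; the threshold value is illustrative.

END_OF_TURN_THRESHOLD = 0.85

def caller_is_done(audio_window, partial_transcript, history) -> bool:
    """Re-evaluated dozens of times per second while the caller might be done."""
    p = end_of_turn_probability(audio_window, partial_transcript, history)
    # A trailing "um" or a mid-sentence breath keeps p low even during silence,
    # which is exactly what a bare timeout misses.
    return p >= END_OF_TURN_THRESHOLD

def end_of_turn_probability(audio_window, partial_transcript, history) -> float:
    """Stand-in for the neural turn-taking model."""
    return 0.0
```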

5. The LLM picks what to do

Once the caller's turn is over, the language model gets called with everything it needs: your system prompt, the full conversation transcript, any retrieved knowledge from your knowledge base, and the list of available functions. The model has two choices on every turn — generate a spoken reply, or call a tool (book the appointment, transfer the call, look up the customer record).

The latency metric that matters here is time to first token (TTFT). Not how long the full reply takes — how long until the first word starts streaming. A good 2026 LLM hits TTFT in 150–300ms for a typical voice-agent prompt. Once tokens start streaming, they keep streaming at 50–100 per second, which is faster than most people speak. So the TTS stage starts before the model has finished thinking. (Pricing details on the LLM tier.)

If the model decides to call a function instead of speaking, you pay a different latency: the round-trip to your webhook (booking the slot in Cal.com, writing the lead into Salesforce). For most preset functions, this is fast — single-digit hundreds of milliseconds. For slow third-party APIs, it can be slower, and the agent typically says something like "one moment while I check that" to fill the gap. (Booking, transferring, knowledge base.)
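
Here's a minimal sketch of how the streaming and the function-call branch fit together. llm_stream, call_webhook, and the PrintTTS stub are hypothetical stand-ins for your model client, your webhook endpoint, and the TTS stage.

```python
# Sketch: draining an LLM token stream into TTS and branching on a tool call.
# llm_stream(), call_webhook(), and PrintTTS are hypothetical stand-ins.

def run_agent_turn(messages, tools, tts):
    buffer = ""
    for event in llm_stream(messages, tools):
        if event["type"] == "tool_call":
            tts.speak_chunk("One moment while I check that.")   # fill the gap
            result = call_webhook(event["name"], event["arguments"])
            messages.append({"role": "tool", "content": result})
            return run_agent_turn(messages, tools, tts)          # model continues
        buffer += event["token"]
        # Flush at natural break points so TTS can start long before the model
        # finishes -- this is where low TTFT turns into low time-to-first-audio.
        if buffer.endswith((".", "?", "!", ",")) or len(buffer) > 60:
            tts.speak_chunk(buffer)
            buffer = ""
    if buffer:
        tts.speak_chunk(buffer)

def llm_stream(messages, tools):
    """Stand-in for a streaming chat-completions client."""
    yield {"type": "token", "token": "Sure -- Tuesday"}
    yield {"type": "token", "token": " afternoon works."}

def call_webhook(name: str, arguments: dict) -> str:
    """Stand-in for the HTTPS round-trip to your booking or CRM endpoint."""
    return "{}"

class PrintTTS:
    def speak_chunk(self, text: str) -> None:
        print("TTS>", text)

run_agent_turn(messages=[], tools=[], tts=PrintTTS())
```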

6. Text-to-speech streams audio back

As soon as the LLM emits the first few tokens, TTS starts. Modern voice agents stream audio out in 200–400ms chunks, so the caller hears the first word before the full reply has even been generated. This is the trick that makes the whole pipeline feel fast — every stage emits output before the previous stage finishes.

The 2026 voice menu has three tiers: Retell platform voices and Cartesia for fast, natural, low-latency at $0.015/min; ElevenLabs for highest-fidelity brand voices at $0.040/min; and a long tail of voice clones for premium use cases. Time to first audio (TTFA) is the metric to watch — production stacks hit 100–200ms. In blind tests with default voices, most callers can't reliably tell them from human. The thing that gives voice AI away in 2026 isn't the voice anymore. It's the timing.
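
A minimal sketch of the output side, with a hypothetical synthesize_stream engine: the caller hears the first chunk while the rest is still being generated, and TTFA is just the clock time to that first chunk.

```python
# Sketch: streaming TTS out and measuring time-to-first-audio (TTFA).
# synthesize_stream() and send_to_caller() are hypothetical stand-ins for a
# streaming TTS engine and the telephony write path.

import time

def speak(text: str) -> float:
    """Streams audio to the caller; returns TTFA in milliseconds."""
    start = time.monotonic()
    ttfa_ms = 0.0
    for i, chunk in enumerate(synthesize_stream(text)):   # 200-400 ms chunks
        if i == 0:
            ttfa_ms = (time.monotonic() - start) * 1000   # caller hears word one
        send_to_caller(chunk)
    return ttfa_ms

def synthesize_stream(text: str):
    """Stand-in for a streaming TTS engine."""
    yield b"\x00" * 3200   # one fake 200 ms chunk of 8 kHz, 16-bit PCM

def send_to_caller(chunk: bytes) -> None:
    pass   # placeholder: write to the outbound telephony stream

print(f"TTFA: {speak('Got it, checking Tuesday afternoon.'):.1f} ms")
```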

7. Barge-in handling for interruptions

The pipeline above works great until the caller does what humans actually do: interrupt. They start talking over the agent. Maybe they realized they meant Wednesday, not Tuesday. Maybe they're irritated. Either way, the agent has to stop talking immediately, drop the rest of its planned reply, and start listening again — fast.

This is barge-in handling, and it's another quiet killer of voice AI. A naive build keeps reading out the rest of the TTS while the caller is talking — the worst feeling on a phone call. A good build cuts TTS within a single audio chunk (sub-100ms), discards whatever the LLM was going to say, and starts a fresh STT stream from the caller's new audio. Bonus points if the model knows what got said before the cutoff so it doesn't repeat itself.
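
One common way to implement this is cancellation: the reply plays as a task that VAD can kill the instant the caller speaks. This sketch uses that pattern; the names are illustrative, not a real SDK.

```python
# Sketch: barge-in via task cancellation. When VAD reports caller speech during
# playback, the playback task is cancelled and the agent keeps a record of what
# it actually said. All names here are illustrative.

import asyncio

class BargeInController:
    def __init__(self) -> None:
        self.playback_task = None
        self.spoken_so_far = ""

    def play_reply(self, chunks_with_text):
        """Called from inside the event loop when the agent starts speaking."""
        async def _play():
            for audio, text in chunks_with_text:
                await send_to_caller(audio)
                self.spoken_so_far += text   # so the model won't repeat itself

        self.playback_task = asyncio.ensure_future(_play())

    def on_caller_speech(self) -> None:
        """Called by VAD the instant the caller starts talking over the agent."""
        if self.playback_task and not self.playback_task.done():
            self.playback_task.cancel()      # cut TTS within ~one audio chunk
        # From here: drop the unplayed tail of the reply, start a fresh STT
        # stream, and put self.spoken_so_far back into the LLM's context.

async def send_to_caller(audio: bytes) -> None:
    await asyncio.sleep(0.02)   # placeholder for a 20 ms frame write
```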

Add up the budget: network (50ms) + VAD/turn-taking (200ms) + LLM TTFT (250ms) + TTS TTFA (100ms) = roughly 600ms. That's how a voice agent feels human. None of those numbers are magical. They're just the result of streaming aggressively and not waiting on anything you don't have to.

What "Real-Time" Looks Like at Production Scale

Three companies running on this exact pipeline today, worth studying.

Pine Park Health. Primary care for senior living communities. Phone tag was eating their schedule. They dropped a Retell voice agent in front of their scheduling line — same STT → LLM → TTS pipeline as everyone else, just orchestrated tightly enough that callers didn't bail. Scheduling NPS went up 38%. Their clinical staff stopped spending half the day on the phone.

SWTCH. EV charging company. When a driver is stranded at a broken charger, "we'll call you back tomorrow" isn't an answer. They put Lucas — a Retell agent — on the line. Lucas picks up in seconds, walks drivers through urgent troubleshooting, and does it 24/7 across the same seven-stage pipeline. Support costs dropped more than 50%.

Medical Data Systems. Debt collection. Regulated, tonally sensitive, unforgiving when conversations go sideways. They put Retell agents on inbound calls and now handle 100% of incoming volume with only 30% of calls transferring to a human, collecting around $280,000 a month. The pipeline is the same one we just walked through. The difference is orchestration discipline and a long tail of small decisions about turn-taking, barge-in, and prompt design. (More customer stories here.)

The common thread across all three: none of them tried to invent the pipeline. They picked a platform that had the orchestration solved, focused their work on the parts that were actually proprietary to their business — the prompt, the knowledge base, the function endpoints — and shipped.

Where the Latency Goes (And Where Most Builds Lose It)

If you remember nothing else from this article, remember this: STT and TTS are not where most of your latency hides. They're fast. The two places latency actually goes are turn-taking and LLM time-to-first-token.

Here's a typical 2026 budget breakdown for one conversational turn on a production stack:

  • Network round-trip: 30–80ms. Mostly geography and your SIP carrier. You can't do much here.

  • VAD + turn-taking decision: 150–300ms. This is the biggest variable. A bad turn-taking model will cost you 500ms+ of perceived latency without ever showing up in a benchmark.

  • STT final transcript: 50–100ms after end of speech. Streaming hides most of this in the previous stage.

  • LLM time-to-first-token: 150–400ms. Heavily dependent on model choice and prompt size.

  • TTS time-to-first-audio: 100–200ms.

  • Function call (if invoked): 100–500ms depending on the API.

A production-grade stack lands the core reply path (everything except an optional function call) at around 600ms total. A mediocre stack lands at 1.2–1.8 seconds. The mediocre stack feels like talking to a chatbot reading lines. The good one feels like a person.
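
If you want to sanity-check your own build against this budget, the arithmetic is just a sum of the ranges above (STT finalization left out of the total because streaming hides it inside the turn-taking window):

```python
# Sketch: the budget above as arithmetic. The ranges are the ones listed; STT
# finalization is excluded from the perceived total because streaming hides it.

BUDGET_MS = {
    "network_rtt":     (30, 80),
    "vad_turn_taking": (150, 300),
    "stt_final":       (50, 100),    # overlapped, not added to the total
    "llm_ttft":        (150, 400),
    "tts_ttfa":        (100, 200),
}

OVERLAPPED = {"stt_final"}
best = sum(lo for k, (lo, _) in BUDGET_MS.items() if k not in OVERLAPPED)
worst = sum(hi for k, (_, hi) in BUDGET_MS.items() if k not in OVERLAPPED)
print(f"perceived response time: {best}-{worst} ms (target: under ~700 ms)")
# -> perceived response time: 430-980 ms (target: under ~700 ms)
```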

The two big levers if you're trying to optimize: pick a fast LLM with low TTFT (GPT 4.1, Claude 4.6 Sonnet, Gemini 3.0 Flash all hit production targets), and use a turn-taking model trained on real conversation data, not a fixed silence threshold. (Why latency matters.)

Common Misconceptions About How This Actually Works

A few things worth flagging.

"It's just three APIs glued together." It is, until you try to make it feel real-time. Then you realize gluing matters more than the APIs. The orchestration layer — VAD tuning, turn-taking model, streaming coordination, barge-in handling, function-call routing — is where production-grade stacks actually live. You can swap STT vendors in a day. You can't swap orchestration without rewriting half the system.

"Bigger LLM = better voice agent." Not really. For most voice use cases, a fast mid-tier model with a good prompt beats a slow flagship. Time-to-first-token matters more than raw reasoning quality, because the caller's perception is shaped almost entirely by latency. Retell lets you swap LLMs with a dropdown specifically because the right answer depends on the use case — heavy reasoning gets Claude 4.6 Sonnet, high-volume cheap gets GPT 5 nano, multilingual gets Gemini 3.0 Flash, the default is GPT 4.1.

"Streaming is a nice-to-have." It's the whole architecture. Without streaming, you wait for the caller to finish, then wait for STT to finish, then wait for the LLM to finish, then wait for TTS to finish, and you've spent 3+ seconds before a single byte of audio goes back. The whole reason 2026 voice agents feel human is that every stage starts emitting output before the previous stage is done.

"You need a custom-trained model to make this work for your use case." Almost always no. The 2026 stack is designed so the model stays generic and your prompt + knowledge base + functions do the customization. Custom-trained models are slower to iterate on, slower at inference, and obsolete the moment a new base model ships. Most teams that "needed a custom model" actually needed a better prompt and a better knowledge base.

"The voice is the hardest part." It's actually one of the easiest parts now. Default TTS voices are functionally indistinguishable from human in blind tests. The hardest parts are turn-taking and barge-in — the things callers don't consciously notice but absolutely feel.

What's Next

Real-time voice AI is a streaming pipeline: audio in → STT → LLM → TTS → audio out, with turn-taking and barge-in wrapping it. Every stage emits output before the previous stage finishes, the whole loop completes in under 700ms, and the orchestration is what separates production from demo. That's the architecture. It's not magic. It's a few specific engineering problems solved well.

Most operators don't need to build this themselves. They need to understand it well enough to know what they're buying, what to ask for, and where the build will fail if they pick the wrong vendor. If this article got you most of the way there, you're in good shape.

If you want to see the pipeline in action, the fastest path is to build something on it. Sign up free at dashboard.retellai.com — new accounts get $10 in credits, around 90 minutes of conversation. Or book a demo and we'll walk through the orchestration in the context of your actual call volume. If you'd rather hear the latency for yourself, call our live demo line and talk to an agent running on the pipeline above.

Frequently Asked Questions

Q: What does STT → LLM → TTS actually mean? A: It's the three core stages of a voice AI pipeline. STT (speech-to-text) turns the caller's audio into text. The LLM (large language model) reads that text plus your system prompt and decides what to say or which function to call. TTS (text-to-speech) turns the reply back into audio. Wrap turn-taking and barge-in handling around it and that's the whole stack.

Q: How fast does real-time voice AI need to be? A: Under ~700ms of end-to-end response time is the threshold where conversation feels human. Above that, callers start interrupting, repeating themselves, and hanging up. Production stacks like Retell run at around 600ms.

Q: Where does the latency actually go? A: Mostly into turn-taking and LLM time-to-first-token, not STT or TTS. A typical budget: network 50ms, VAD/turn-taking 200ms, LLM TTFT 250ms, TTS first audio 100ms. STT runs in parallel with the caller's speech, so it adds almost nothing on top.

Q: What's streaming and why does it matter? A: Every stage of the pipeline emits output before the previous stage finishes. STT emits partial transcripts every ~50ms. The LLM streams tokens as they generate. TTS streams audio in 200–400ms chunks. Without streaming, every stage waits for the last one and you spend 3+ seconds before a single byte of audio goes back to the caller.

Q: What's turn-taking and why is it hard? A: Turn-taking is the system that decides when the caller has finished speaking so the agent can respond. It's hard because humans pause mid-sentence, take breaths, and say "um" while they think. A naive timeout cuts callers off or feels slow. The 2026 answer is a small neural model trained on real conversational audio that updates a probability dozens of times per second.

Q: What's barge-in handling? A: It's what happens when the caller starts talking over the agent. A good stack stops TTS within 100ms, discards the rest of the planned reply, and starts a fresh STT stream from the caller's new audio. A bad stack keeps talking — the worst feeling on a phone call.

Q: Do I need to build the pipeline myself? A: Almost never in 2026. The orchestration — VAD, turn-taking, barge-in, streaming coordination, function-call routing — is the part where serious engineering investment goes. Most teams that try to build it themselves end up with a slower, worse version of what's available off the shelf. Build the parts that are proprietary to your business: prompt, knowledge base, function endpoints, workflows.

Q: Does the choice of LLM matter that much? A: Yes, but mostly for time-to-first-token, not raw quality. A fast mid-tier model with a good prompt beats a slow flagship for most voice use cases. Retell lets you swap LLMs with a dropdown — GPT 4.1 is the default, Claude 4.6 Sonnet for higher reasoning, GPT 5 nano for cheap volume, Gemini 3.0 Flash for multilingual. (Pricing.)

Q: How does function calling fit into the pipeline? A: When the LLM decides to call a function instead of speaking, the platform fires an HTTPS webhook with structured arguments the model extracted from the conversation, then waits for the response. That round-trip adds latency — usually a few hundred milliseconds for fast APIs, more for slow ones. For longer waits, the agent typically says "one moment while I check that" to fill the gap.
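
For a feel of what sits on your side of that round-trip, here's a minimal webhook receiver sketch. The route, payload shape, and field names are assumptions for illustration, not Retell's documented schema.

```python
# Sketch: a function-call webhook receiver. The payload shape and field names
# are illustrative assumptions; the point is that the LLM's extracted arguments
# arrive as structured JSON and whatever you return flows back into the call.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/book-cleaning", methods=["POST"])
def book_cleaning():
    args = request.get_json(force=True)   # e.g. {"date": "...", "window": "afternoon"}
    slot = reserve_slot(args["date"], args["window"])
    # Keep this handler fast: every millisecond spent here is silence (or filler
    # speech) the caller sits through.
    return jsonify({"result": f"Booked for {slot}"})

def reserve_slot(date: str, window: str) -> str:
    """Stand-in for the call into your real scheduling system."""
    return f"{date} ({window})"

if __name__ == "__main__":
    app.run(port=8080)
```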

Q: What's the difference between voice AI and an IVR? A: An IVR is a fixed decision tree (press 1 for billing). Voice AI runs on the pipeline above — open-ended speech in, LLM reasoning in the middle, natural reply out. Callers don't navigate menus. They just talk.
