What Is Bundled in a Voice AI Platform? Telephony, TTS, STT, LLMs, and What You Still Need to Bring

Most voice AI platforms bundle four pieces into one per-minute price: speech-to-text, the language model, text-to-speech, and the orchestration that stitches them into a real-time phone call. Telephony comes in through SIP, but the carrier minutes, the systems your agent talks to, and the conversation logic itself are still yours.

That's the whole answer. The rest of this guide is for buyers who keep getting stuck on the same evaluation question: "Wait, do we need separate subscriptions for telephony, TTS, STT, and LLMs, or is this already part of the platform?"

If you've sat in a vendor call comparing a voice AI platform side-by-side with Twilio, ElevenLabs, or OpenAI and asked which is cheaper, this article is the cleanest answer you'll get. Those vendors don't sit in the same column. Each one sells a single layer of the stack. A platform like Retell AI sells the orchestrated whole.

Knowing which layer your money is buying answers four practical questions:

Which contracts you sign
Which invoices you receive each month
Which engineering hours you spend before going live
Which parts of the stack you can swap when prices or models change

The voice AI stack at a glance

Every voice agent that picks up a phone runs the same chain. Audio comes in. It gets transcribed. A model reasons over the transcript. The response gets spoken back. All of it streams in parallel.

Layer	What it does	Common providers
Telephony	Routes audio between phone and server	Twilio, Telnyx, Vonage, Avaya
Speech-to-text (STT)	Turns audio into text	Deepgram, AssemblyAI, Whisper
Language model (LLM)	Reads, reasons, decides, replies	GPT-5, GPT-4o, Claude 4.5, Gemini 3.0
Text-to-speech (TTS)	Turns text into spoken audio	ElevenLabs, Cartesia, OpenAI, MiniMax
Orchestration	Streams, buffers, handles turn-taking and tool calls	The voice AI platform

The first four are commodities. Everyone uses roughly the same providers. The fifth is where teams quietly burn three to six months of engineering when they try to build it themselves.

The hidden cost of "we'll stitch it together ourselves": Streaming WebSockets. Sentence-boundary buffering. Voice activity detection. Barge-in recovery. Function calling mid-call. Retry logic across four APIs. SIP integration that handles 8 kHz codec quirks. None of it ships out of the box.
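To make one of those items concrete, here is a minimal sketch of sentence-boundary buffering: streamed LLM text chunks are held back until a full sentence is available, so the TTS engine never receives half a sentence. The function name and the simple punctuation rule are illustrative assumptions, not any platform's actual implementation. A production version would also need to handle abbreviations such as "Dr." and time out on long sentences.

```python
import re
from typing import Iterable, Iterator

# Assumed rule: a sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text chunks into whole sentences for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the model stops streaming.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence can be sent to TTS while the model is still generating the next one, which is why the buffering step matters for perceived latency.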

Do I need separate subscriptions for telephony, TTS, STT, and LLMs?

No, not with a platform. You sign one agreement and get one invoice. With raw infrastructure, you sign four.

Three buying paths show up in real evaluations:

1. Bundled platform: One contract, one bill, one dashboard. You pick a voice and an LLM from dropdowns; the platform handles licensing, API calls, and metering. This is what most teams pick for the first 90 days.

2. Raw infrastructure: Twilio + an STT provider + an LLM API + a TTS provider, glued together with your own orchestration code. Maximum control, four sets of credentials, three to six months of engineering before you take a real call.

3. Hybrid bring-your-own: The platform handles orchestration and components, but you bring your own SIP trunk, fine-tuned model, or voice keys. This is where most production deployments end up after the first quarter.

What does the per-minute price include?

A platform's headline rate covers orchestration. The components ride on top, and they're transparent line items, not hidden fees.

Here's how a typical mid-market call breaks down on the pricing page:

Component	Typical rate	Notes
Base orchestration	~$0.07/min	Fixed
Voice (Cartesia)	~$0.05–$0.07/min	ElevenLabs/OpenAI run higher
LLM	$0.003–$0.08/min	Gemini Flash cheapest, GPT-5 highest
Telephony (built-in)	~$0.015/min	Or $0 with your own SIP trunk
All-in (typical)	$0.13–$0.20/min	Mid-tier voice + competent model

Two notes before you model this for finance:

A $0.07 entry point isn't a $0.07 production cost. The headline covers orchestration only.
Add-ons sit outside the per-minute rate: streaming knowledge base (free for the first ten, then ~$0.005/min), concurrency beyond 20 calls, branded caller ID. None are required to ship.
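For a finance model, the arithmetic can be sketched as follows. The rates are the approximate figures from the table above. The `RateCard` type, the `monthly_cost` helper, and the 10,000-minute volume are illustrative assumptions, and add-ons are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Per-minute rates in USD, one field per line item in the table."""
    orchestration: float
    voice: float
    llm: float
    telephony: float

    def per_minute(self) -> float:
        return self.orchestration + self.voice + self.llm + self.telephony


def monthly_cost(card: RateCard, minutes: int) -> float:
    """Usage cost for a month, rounded to cents. Add-ons are excluded."""
    return round(card.per_minute() * minutes, 2)


# Cheapest and most expensive combinations of the table's ranges.
low = RateCard(orchestration=0.07, voice=0.05, llm=0.003, telephony=0.015)
high = RateCard(orchestration=0.07, voice=0.07, llm=0.08, telephony=0.015)

# Hypothetical volume: 10,000 call minutes in a month.
print(monthly_cost(low, 10_000), monthly_cost(high, 10_000))
```

With these figures the cheapest combination sums to about $0.138 per minute and the most expensive to about $0.235. The table's typical range of $0.13–$0.20 is narrower because it assumes a mid-tier voice and model rather than the extremes.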

Is telephony bundled, or do I still need Twilio?

Yes, telephony is bundled, and you can still use Twilio if you want to. It depends on what you already have.

You can buy a phone number directly inside the dashboard and start taking calls in under an hour. You can also bring an existing Twilio, Telnyx, Vonage, or Avaya number through SIP trunking and skip the resold carrier entirely. Both paths use the same orchestration layer underneath.

Quick decision rule:

Starting from zero? Use the built-in number. Faster, cheaper, no carrier setup.
Existing carrier contract or ported numbers? Bring your own SIP trunk. Keep your rates, your STIR/SHAKEN attestation, and your DID portfolio.
Already deep into a Twilio integration? Same answer: point the trunk at the platform and import the number.

You can switch later. Re-importing a number takes minutes, not migration plans.

Do I need to pay ElevenLabs separately?

No. The voice you pick from a dropdown is included in the per-minute price.

This trips up a lot of evaluation calls because ElevenLabs sells two distinct products:

Its own end-to-end Conversational AI product (sold direct, ~$0.10/min plus LLM costs)
Its voice models, licensed to platforms that embed them and resell on a per-minute basis

If you're using a voice AI platform, you're getting the second: the licensed voice models. No separate API key. No separate subscription. No separate invoice. The licensing and metering happen on the backend.

Same logic applies to Cartesia, OpenAI TTS, MiniMax, and the platform's own voice models. The only exception is enterprise voice cloning, which is configured separately for teams that need a brand-specific voice.

Do I need an OpenAI subscription for the LLM?

No. Inference is bundled. You pick the model from a dropdown and the per-minute price adjusts:

Cheapest: Gemini 3.0 Flash, GPT-5 Nano (fractions of a cent)
Mid-tier: GPT-4o, Claude Haiku
Reasoning-heavy: Claude 4.5 Sonnet, GPT-5

No OpenAI API key on your side. No Anthropic account. No Google Cloud project.

If you want to bring your own model (fine-tuned weights, an OpenAI Enterprise contract you've already signed, or a self-hosted Llama deployment), the platform supports a custom LLM WebSocket. The integration steps are in the documentation.
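As a rough illustration of what a custom-LLM bridge does, the sketch below handles one inbound message and produces one reply. The message fields (`type`, `turn_id`, `transcript`, `content`, `done`) are hypothetical placeholders, not the platform's actual schema; the documentation defines the real one. The WebSocket transport is omitted so the example stays self-contained.

```python
import json
from typing import Callable, Optional


def handle_message(raw: str, generate_reply: Callable[[list], str]) -> Optional[str]:
    """Turn one inbound JSON message into one outbound JSON reply.

    All field names here are hypothetical placeholders.
    Returns None when the message needs no spoken response.
    """
    msg = json.loads(raw)
    if msg.get("type") != "response_required":
        return None  # e.g. keep-alives or transcript-only updates
    reply = generate_reply(msg.get("transcript", []))
    return json.dumps({
        "type": "response",
        "turn_id": msg["turn_id"],  # echo the id so the reply matches the turn
        "content": reply,
        "done": True,
    })


def echo_model(transcript: list) -> str:
    """Stand-in for your own model: repeats the caller's last utterance."""
    last = transcript[-1]["content"] if transcript else ""
    return f"You said: {last}"
```

In a real deployment, `echo_model` would be replaced by a call to your fine-tuned or self-hosted model, and `handle_message` would run inside the WebSocket server's receive loop.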

When does bring-your-own-model become the right call?

Strict data residency requirements
Existing LLM commitments you want to draw down
Domain-specific fine-tuning that materially beats general-purpose models on your use case

For everyone else, the bundled path is faster to deploy and easier to swap when a new model lands.

What you still need to bring

Here's the part that catches teams off guard. The platform handles the conversation. You handle the business it serves.

You bring	Platform handles
Knowledge base content (your service catalog, prices, hours, policies)	RAG retrieval and auto-sync
CRM and system of record	Webhook calls during the conversation
Calendar and booking system	Real-time availability checks and event creation
Conversation flow design	The drag-and-drop framework to build it in
Escalation rules and routing	The warm transfer with full context

A few notes worth surfacing:

The knowledge base auto-syncs from your website or document set, but the content of those documents is yours to keep current. Garbage in, garbage out, same as any RAG system.
CRM integration runs through webhooks, Make, n8n, Zapier, or a native HubSpot integration. The agent can read and write to Salesforce or HubSpot during a live call, but the schema and permissions live with you.
For appointment-booking flows, the integration runs against Cal.com, Google Calendar, Calendly, or whichever booking system holds your calendar of record.
Templates exist for common patterns like AI appointment setter, AI answering service, and inbound support, but the actual questions your agent asks and what counts as a "qualified lead" are decisions only you can make.
For call transfer escalations, you define the threshold. Production deployments routinely automate up to 80 percent of inbound calls, but where you draw the line between AI and human is your call.

Bundle vs. build: the honest comparison

The cost gap is small. The time gap is enormous.

	Bundled platform	Bring-your-own build
Vendor relationships	1	4+
Time to first live call	Under an hour	3–6 months
Loaded per-minute cost	$0.13–$0.20	$0.13–$0.31
Engineering required	Prompt + flow design	Streaming pipeline + retry logic + observability + SIP handling
Compliance chain	Single BAA	One per vendor in the stack

A typical bring-your-own build pulls Twilio for telephony, Deepgram or AssemblyAI for transcription, GPT-5 or Claude 4.5 for reasoning, ElevenLabs or Cartesia for voice, and Pipecat or LiveKit for orchestration. The per-minute math lands in roughly the same neighborhood as a bundled platform.

What you pay for when you build is engineering time. Voice activity detection, barge-in handling, function calling that survives mid-call failures, observability that correlates four vendor dashboards into one trace, and SIP integration that gracefully handles the 8 kHz codec quirks of telephony audio. None of it ships out of the box.

Most modern transcription models are trained primarily on 16 kHz audio. Telephony hands you 8 kHz. The accuracy gap on a freshly built pipeline is real and noticeable to callers.
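The reason for the gap is simple arithmetic: a signal sampled at a given rate can only carry frequencies up to half that rate. The sketch below shows the calculation along with a naive 2x upsampler, both written for illustration only. Upsampling makes the sample rate match what a 16 kHz model expects, but it cannot restore the 4–8 kHz band that the phone network never captured. A production pipeline would use a properly filtered resampler rather than linear interpolation.

```python
from typing import List


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


def upsample_2x(samples: List[int]) -> List[int]:
    """Naive 8 kHz -> 16 kHz upsampling by linear interpolation.

    Doubles the number of samples but adds no new frequency content.
    """
    if not samples:
        return []
    out: List[int] = []
    for current, nxt in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + nxt) // 2)  # midpoint between neighbours
    # Repeat the final sample so the output is exactly twice as long.
    out.extend([samples[-1], samples[-1]])
    return out


print(nyquist_hz(8_000), nyquist_hz(16_000))
```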

That gap between months of plumbing and days of prompt engineering is why most production voice agents in 2026 ship on a platform.

Where the boundaries sit

Twilio, ElevenLabs, OpenAI, and Retell AI aren't competing for the same dollar. They're stacked layers in the same architecture:

Twilio, Telnyx, Vonage → telephony. Move audio between phones and servers.
ElevenLabs, Cartesia, OpenAI TTS, MiniMax → voice synthesis. Turn text into speech.
OpenAI, Anthropic, Google → language model inference. The reasoning brain.
Retell AI → orchestration plus the agent framework, post-call analysis, batch calling, simulation testing, and the commercial wrapper that turns four invoices into one.

The right framing is rarely "Retell or Twilio." It's "Retell on top of Twilio." Same with OpenAI and ElevenLabs. The only real overlap sits at the orchestration layer, and the Retell vs ElevenLabs page covers that narrower comparison for teams choosing between a platform and ElevenLabs' own end-to-end product.

What about HIPAA, SOC 2, and GDPR?

Compliance lives at the platform layer. SOC 2 Type II, HIPAA with a self-service BAA, and GDPR are included without per-minute compliance surcharges.

Where this matters most is the unbundled build. Compliance doesn't stop at orchestration in a stitched-together stack. Every vendor in the chain (STT, LLM, TTS, telephony) has its own BAA process, its own data handling terms, and its own audit trail. Procurement teams in healthcare, insurance, and financial services usually pick the bundle for that reason alone.

One vendor approval moves faster than four.

How the bundle performs in production

Three deployments make the tradeoff concrete:

Anker: Voice automation across global support centers handling millions of calls a year across North America, Europe, and Asia. The agents hit 95%+ speech recognition accuracy in English markets, running on top of Anker's existing telephony rather than a four-vendor pipeline.

Medical Data Systems: Collections operations through the platform. The team now handles 100% of inbound calls with only a 30% transfer rate, collecting roughly $280,000 a month. That 30% transfer rate is a business decision, not a platform default.

Matic Insurance: After-hours operator bot handling support, appointment confirmation, and intake. In Q1, the bot handled around 8,000 calls, with answer rates that beat the human-led baseline. The bot sits on top of Matic's existing Twilio relationship.

The pattern across all three:

Platform owns the agent layer
Customer's existing systems own the data and workflows
Carrier relationship is a separate contract

That's the bundle-versus-bring-your-own line in real production deployments.

Frequently asked questions

Do I need an ElevenLabs subscription to use a voice AI platform?

No. Voices are embedded in the per-minute price. The exception is enterprise voice cloning, which is configured separately.

Is Twilio required to run a voice AI agent?

No. Most platforms support Telnyx, Vonage, Avaya, and any SIP-compatible carrier through direct trunking, plus their own built-in numbers. Twilio dominates the conversation only because it's the most common existing carrier.

Can I bring my own LLM?

Yes. Custom LLMs are supported through a WebSocket integration, including OpenAI Enterprise contracts, the Anthropic API, Gemini, and self-hosted models. You pay your model provider for inference and the platform for orchestration.

How does the bundled price compare to a DIY build on Twilio + OpenAI + ElevenLabs?

Loaded all-in costs land in roughly the same range: between $0.13 and $0.31 a minute. Per-minute economics aren't the deciding factor. The deciding factor is engineering time saved on the orchestration layer, which is typically months.

What does the orchestration layer include?

Streaming audio in chunks, sentence-boundary buffering between the language model and TTS, turn-taking and barge-in, function calling during a live call, retry and failover across the underlying APIs, simulation testing, transcripts with sentiment scoring, and a unified observability dashboard.
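Retry and failover is the easiest of these to picture in code. The sketch below is a generic pattern under assumed parameters (two attempts per provider, exponential backoff), not a description of how any particular platform implements it. Each provider is a zero-argument callable, for example a wrapper around a primary and a backup TTS API.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    providers: Sequence[Callable[[], T]],
    retries_per_provider: int = 2,
    backoff_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each provider in order, retrying before moving to the next."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider()
            except Exception as exc:  # a real system would catch narrower errors
                last_error = exc
                # Back off before retrying the same provider, but do not
                # wait before failing over to the next one.
                if attempt < retries_per_provider - 1:
                    sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```

On a live call the backoff has to stay short, since every retry adds to the silence the caller hears.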

Can I use a voice AI platform for outbound calling at scale?

Yes. Batch calling supports outbound campaigns with 20 free concurrent calls on every account. Common use cases include AI telemarketing, lead qualification, debt collection follow-up, and appointment reminders. TCPA and STIR/SHAKEN compliance are configured at the carrier level.

What happens when the agent can't handle a call?

Warm transfer with full conversation context. The human picking up sees the transcript and the structured data the agent collected, so the caller doesn't repeat themselves. Escalation thresholds (failed clarification attempts, sensitive topics, explicit caller requests) are configured per agent.
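The three triggers named above can be expressed as a small rule. This is a sketch with hypothetical names (`CallState`, `should_escalate`) and an assumed default of two failed clarifications; on a platform these would be configuration settings rather than code you write.

```python
from dataclasses import dataclass


@dataclass
class CallState:
    """Signals tracked during a call (hypothetical structure)."""
    failed_clarifications: int = 0
    sensitive_topic: bool = False
    caller_asked_for_human: bool = False


def should_escalate(state: CallState, max_failed_clarifications: int = 2) -> bool:
    """Return True when the call should be warm-transferred to a human."""
    return (
        state.caller_asked_for_human
        or state.sensitive_topic
        or state.failed_clarifications >= max_failed_clarifications
    )
```

Raising or lowering `max_failed_clarifications` is one way to move the line between AI and human handling.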

Is the platform suitable for regulated industries?

Yes. SOC 2 Type II, HIPAA with a self-service BAA, GDPR, and PII redaction are included. On-premise deployment is available for teams with strict data residency requirements.

How long does it take to go from signup to a live agent?

Most teams have a working test call within an hour and a production-ready agent within 1–2 weeks. The platform processes 50+ million calls a month across 3,000+ businesses, so the deployment path is well-trodden.

The buying decision in one paragraph

The four real-time pieces (telephony, transcription, language model, voice synthesis) are commoditizing fast. Anyone can buy them. The orchestration that turns them into a phone call you'd let a customer hear is where the engineering compounds, and it's why a per-minute platform fee is worth paying in 2026.

The fastest way to feel where the line sits is to make a real call. Retell AI's signup includes $10 in usage credits, which is enough to deploy a test agent against your own number and watch where your existing systems plug in versus what the platform handles end-to-end.

From there, the natural next step is whichever workflow matters most:

AI customer support for inbound deflection
An AI IVR replacement for the phone tree your callers already hate
An outbound campaign for the leads sitting in your CRM

Build with the $10 in credits at retellai.com.
