Structure your voice agent's knowledge base as 512-token recursive Markdown chunks tagged with product, region, and audience metadata, retrieved at a 0.65+ similarity threshold with an explicit refusal instruction. This is the four-layer pattern (curation, chunking, metadata scoping, refusal-by-default retrieval) that production voice agents on platforms like Retell AI use to prevent the made-up-step failure mode.
The rest of this guide unpacks each layer with the exact configuration values, a reference structure modeled on enterprise support libraries like Lenovo's, and the testing protocol that catches hallucinations before they reach a caller.
By the end of this tutorial, your knowledge base will be a retrieval architecture that grounds every agent response in your verified content, blocks the model from inventing steps, and stays under the latency budget required for natural phone conversations.
Archive every document that is stale, contradictory, or not the single source of truth for its topic before you index a single page. The largest source of voice agent hallucinations is not the model. It is contradictory or outdated content in the source material.
If two pages disagree about your refund window, the retriever has no way to pick the correct one, and the LLM will confidently read whichever chunk wins the similarity score. Pull every document, FAQ, and help article you plan to index. For each one, check three things: is it current as of this quarter, is it the single source of truth for its topic, and does it match what your senior support reps actually say on calls. If a document shouldn't be used to answer customer questions, it shouldn't be in your knowledge base at all. (ElevenLabs)
You should now have a curated set of documents that you would be comfortable sending verbatim to a customer.
Convert everything to structured Markdown with one H1 per document, a descriptive H2 per resolvable user question, and short paragraphs with explicit subjects. Voice agents retrieve text chunks, not rendered web pages. Markdown is the format that survives the chunking pipeline with the most semantic structure intact.
Retell's own documentation recommends Markdown over .txt because well-structured headings give the retriever clean boundaries to split on. Replace every "click here" or "as described above" with the concrete reference, because the chunk containing "above" may be retrieved without the chunk it points to. Each chunk gets read alone, so each chunk needs to make sense alone.
You should now have a folder of Markdown files where each file covers one product area and each H2 covers one resolvable user question.
Use recursive chunking at 512 tokens with 10-15% overlap, splitting on Markdown headings first, then paragraphs, then sentences. This is the benchmark-validated default for general RAG content, and it holds up well for voice support material specifically.
A well-tuned 512-token recursive splitter with 15% overlap and metadata enrichment outperforms an expensive semantic chunking approach on most real-world document sets. Long chunks pad the LLM context window with noise. Short chunks fragment instructions across multiple retrievals and cause the agent to skip steps mid-explanation. The hard rule: never split a numbered procedure across two chunks. If "Step 3" lives in chunk A and "Step 4" in chunk B, the retriever can return one without the other and your agent will skip an action. (Substack)
You should now have chunks where each unit either contains a complete procedure or contains contextual prose that makes sense without its neighbors.
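The splitting order above (headings first, then paragraphs, then sentences, with overlap) can be sketched in a few lines. This is an illustrative Python sketch, not Retell's internal chunker: it approximates token counts with a word count (swap in a real tokenizer such as tiktoken in production) and falls back to an overlapping word window when a section exceeds the budget.

```python
import re

MAX_TOKENS = 512   # target chunk size from the guide
OVERLAP = 64       # ~12% overlap, inside the recommended 10-15% band

def _size(text):
    # Crude token estimate: one word per token. Replace with a real tokenizer.
    return len(text.split())

def split_markdown(text):
    """Split on H2 headings first; only sections over budget get re-packed."""
    sections = re.split(r"(?m)^(?=## )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if _size(section) <= MAX_TOKENS:
            # Section fits whole, so a procedure inside it stays intact.
            chunks.append(section.strip())
            continue
        # Fall back to overlapping word-window packing for oversized sections.
        words = section.split()
        step = MAX_TOKENS - OVERLAP
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + MAX_TOKENS]))
    return chunks

doc = "## Reset your password\n" + "step " * 40 + "\n\n## Update billing\n" + "info " * 600
chunks = split_markdown(doc)
```

Here the short password section survives as one chunk, while the oversized billing section is packed into overlapping windows, so no chunk exceeds the 512-token budget.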
Tag every chunk with at minimum five fields: product, version, region, audience, and last_verified_date. This is what stops a Texas caller from hearing California return policies and stops a ThinkPad question from pulling ThinkCentre answers.
If a single agent sees KB content for multiple states or locations, retrieval can pull the wrong-state chunk (e.g., California policy for a Texas caller) unless scoping and metadata are carefully designed. The same problem applies to product lines, software versions, customer tiers, and support channels. At runtime, your agent passes the relevant filters with the query so the vector search only considers chunks that match the caller's context. In Retell, you can pass these as dynamic variables collected earlier in the call. (Optimize Smart)
You should now have a metadata schema where any single chunk can be uniquely scoped to one caller context.
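As a concrete sketch of the five-field schema, here is a hypothetical chunk record and a pre-filter helper. The field names follow the guide, but the values and the `matches_caller` helper are illustrative, not a Retell API: they show how caller context narrows the candidate set before similarity ranking ever runs.

```python
from datetime import date

# Hypothetical chunk metadata record with the five-field minimum.
chunk_meta = {
    "product": "thinkpad",
    "version": "2024.2",
    "region": "TX",
    "audience": "end_user",
    "last_verified_date": date(2025, 1, 15),
}

def matches_caller(meta, caller):
    """A chunk is eligible only if every caller-context filter agrees."""
    return all(meta.get(key) == value for key, value in caller.items())

# Dynamic variables collected earlier in the call become the filter set.
texas_caller = {"product": "thinkpad", "region": "TX", "audience": "end_user"}
california_caller = {"product": "thinkpad", "region": "CA", "audience": "end_user"}
```

With this scoping in place, the Texas caller can never be served the California chunk, no matter how well it scores on similarity.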
Set the similarity threshold to 0.65 or higher and limit retrieval to 3-5 chunks. A retrieval system that always returns something is a hallucination machine in disguise.
When a caller asks about a feature you do not document, the retriever will still surface the closest semantic match. The LLM will receive that chunk and fluently weave it into a wrong answer. In Retell's knowledge base settings, two parameters control this behavior. The "Chunks to retrieve" parameter sets how many results feed into the LLM (default 3, recommended max 5 for voice). The "Similarity Threshold" sets the minimum cosine similarity for a chunk to be considered relevant (default 0.6). For software-support use cases where wrong information is worse than no information, raise the threshold to 0.7.
You should now see your retrieval system returning fewer, higher-quality matches and rejecting marginal ones.
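The two parameters act together as a refusal-by-default gate: drop everything below the threshold, then cap what survives at the chunk limit. A minimal sketch, with fabricated similarity scores standing in for what a vector store would return:

```python
SIMILARITY_THRESHOLD = 0.65   # raise to 0.7 for software support
TOP_K = 3                     # "chunks to retrieve"; max 5 for voice

def select_chunks(scored):
    """scored: list of (chunk_text, cosine_similarity) pairs."""
    # Threshold first: marginal matches are rejected outright.
    passing = [item for item in scored if item[1] >= SIMILARITY_THRESHOLD]
    # Then keep only the top-K strongest survivors.
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing[:TOP_K]

results = select_chunks([
    ("refund policy", 0.82),
    ("shipping times", 0.71),
    ("unrelated feature", 0.58),   # below threshold: rejected
    ("warranty terms", 0.66),
    ("press release", 0.61),       # below threshold: rejected
])
```

An empty result list is a valid outcome here, and it is exactly the signal that should trigger the refusal instruction in the next step.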
Add this exact instruction to the agent prompt: "Only answer using the information in ## Related Knowledge Base Contexts. If that section is missing or does not contain relevant information, say there is no related information available and offer to transfer the call." This single instruction is the most effective anti-hallucination control in any voice agent.
Without it, the LLM will reach into its training data when retrieval comes back empty. This is the failure mode behind nearly every public chatbot disaster, including the Air Canada bereavement-fare case where the airline was held legally liable. The refusal pattern flips the failure mode from "confident wrong answer" to "honest let-me-get-a-human," which is exactly what callers using your software actually want when the system hits an edge case.
You should now see your agent saying "I do not have that documented; let me transfer you" on out-of-scope questions instead of inventing steps.
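For the instruction to be enforceable, retrieved chunks must appear under the exact header the instruction names, and the section must be omitted entirely when retrieval returns nothing so the refusal clause fires. Retell handles this injection when you attach a knowledge base; the sketch below just illustrates the mechanics with plain string assembly.

```python
REFUSAL_RULE = (
    "Only answer using the information in ## Related Knowledge Base Contexts. "
    "If that section is missing or does not contain relevant information, say "
    "there is no related information available and offer to transfer the call."
)

def build_system_prompt(base_prompt, chunks):
    parts = [base_prompt, REFUSAL_RULE]
    if chunks:
        # Only add the section when retrieval actually returned something.
        parts.append("## Related Knowledge Base Contexts\n\n" + "\n\n".join(chunks))
    return "\n\n".join(parts)

grounded = build_system_prompt(
    "You are a software-support voice agent.", ["Refund window: 30 days."]
)
empty = build_system_prompt("You are a software-support voice agent.", [])
```

In the empty-retrieval case the prompt contains the refusal rule but no contexts section, so the agent's only compliant move is the honest "I do not have that documented."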
Enable auto-refresh on URL sources so Retell re-fetches every 24 hours, version your Markdown files in Git, and run a quarterly review of any file with a last_verified_date older than 90 days. Stale knowledge is hallucination's quiet partner.
If the knowledge base is outdated, RAG just retrieves the wrong answer faster. The fix is automating freshness instead of relying on someone remembering to re-upload files when product docs change. Pair auto-refresh with auto-crawling for help center subpaths so new articles get indexed automatically without manual intervention. (CX Today)
You should now have a knowledge base that updates itself when your underlying content updates, with no human in the loop.
Run your 50 real caller questions through the retrieval system and inspect what comes back before any LLM generation happens. Three things matter for each query: is the right chunk in the top 3, is the similarity score above your threshold, and would a human reading just the retrieved chunks be able to answer the question.
For any question where retrieval fails, the fix is almost always at the source. Either the relevant content is missing entirely, the chunking split a procedure across boundaries, or the metadata is filtering it out. Resist the urge to fix retrieval failures by adding instructions to the agent prompt. Prompt patches are how knowledge bases drift into unmaintainable mess. Aim for 90%+ retrieval accuracy on the test set before deploying to live calls.
You should now have a measured retrieval accuracy number for your top caller questions and a list of source-document fixes for the questions that failed.
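The pass/fail check is easy to automate. Below is a minimal accuracy harness; the `fake_retrieve` stub stands in for your real retrieval pipeline so the scoring logic itself is runnable, and the question/chunk IDs are invented for illustration.

```python
def score_test_set(test_set, retrieve, top_n=3):
    """test_set: list of (question, expected_chunk_id) pairs.
    Returns (hit rate, list of failed questions)."""
    hits = 0
    failures = []
    for question, expected in test_set:
        top_ids = [chunk_id for chunk_id, _score in retrieve(question)[:top_n]]
        if expected in top_ids:
            hits += 1
        else:
            failures.append(question)
    return hits / len(test_set), failures

# Stubbed retriever for illustration only; replace with your pipeline.
def fake_retrieve(question):
    index = {
        "how do i reset my password": [("kb-12", 0.81), ("kb-40", 0.67)],
        "what is the refund window": [("kb-99", 0.59)],
    }
    return index.get(question.lower(), [])

accuracy, failed = score_test_set(
    [("How do I reset my password", "kb-12"),
     ("What is the refund window", "kb-07")],
    fake_retrieve,
)
```

The failure list is the work queue: every question on it points at a missing document, a bad chunk boundary, or an over-aggressive metadata filter, to be fixed at the source rather than in the prompt.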
Use conversation flow with node-level knowledge bases for any voice agent where caller intent splits into distinct workflows, especially software support, healthcare scheduling, and multi-product environments. A flat knowledge base sitting under a single prompt is the loosest possible architecture.
For software support specifically, the highest-quality pattern is conversation flow where each node only retrieves from the slice of documentation relevant to that part of the call. A typical software-support flow has nodes for triage, account lookup, troubleshooting, escalation, and post-resolution confirmation. The troubleshooting node loads the troubleshooting KB. The account-lookup node does not load any KB at all because it should be calling an API. This structure is more reliable to maintain than one giant prompt with one giant KB. Pair it with built-in post call analysis so you can see which nodes are firing retrieval and which queries are coming back below threshold.
You should now have a deployment where retrieval scope tightens as the conversation narrows, instead of every turn searching every document.
Tier the knowledge base by audience and scope retrieval to the caller's tier. This is the structural pattern Lenovo's enterprise support library uses to keep teacher-facing classroom-management content from colliding with technical engineering content.
Lenovo established three levels of articles: general topics and product information, teacher-specific topics, and technical topics and issues, with focused articles, eliminated redundancies, and standardized naming conventions across all three tiers (Contiem). Apply the same pattern to a voice AI knowledge base:
Tier 1: General product and pricing. Public-facing facts every caller might ask. Tagged audience: all. Index everything.
Tier 2: End-user how-to. Step-by-step procedures for the standard caller. Tagged audience: end_user, scoped by product and region. This tier carries the bulk of retrieval traffic.
Tier 3: Technical and admin. Configuration, integrations, and edge cases. Tagged audience: admin. Only retrieved when the caller has been identified as an administrator earlier in the call.
Internal runbooks, escalation matrices, and engineering notes go in a separate knowledge base entirely, never accessible to the customer-facing agent. The tiering is what prevents an end-user troubleshooting call from accidentally surfacing an internal escalation procedure.
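A short sketch of how the audience tag collected earlier in the call maps to searchable tiers. The tier names and the access mapping are illustrative assumptions, not a Retell feature; the point is that an unidentified caller defaults to the narrowest scope.

```python
# Hypothetical mapping from audience tag to the tiers retrieval may search.
TIER_ACCESS = {
    "all":      {"tier1"},
    "end_user": {"tier1", "tier2"},
    "admin":    {"tier1", "tier2", "tier3"},
}

def allowed_tiers(audience):
    # Unknown or unidentified callers fall back to public Tier 1 content only.
    return TIER_ACCESS.get(audience, {"tier1"})
```

Passing `allowed_tiers(...)` as a metadata filter at query time is what keeps admin configuration content out of an end-user troubleshooting call.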
Behavioral instructions do not belong in the knowledge base. The knowledge base is for supplying supporting information, not agent behavior. If you find yourself uploading a Markdown file titled "How the agent should behave when X happens," that content belongs in the prompt or in a conversation flow node. Mixing them dilutes both: the retriever ranks behavior instructions against factual queries and pulls them at the wrong moments.
Lead with the user goal, not the feature name. "Configure two-factor authentication" becomes "Turn on two-factor login." The retriever matches against the caller's spoken wording, and natural-language questions match natural-language headings far better than they match product terminology.
Each chunk gets retrieved alone. Use full names instead of pronouns, full product names instead of "the platform," and repeat any conditional context in each step instead of saying "if you are using the admin console, then..." three paragraphs later. This single rule eliminates a surprising fraction of hallucinations because it removes the ambiguity that the LLM otherwise tries to resolve by guessing.
Capture the retrieved chunks, similarity scores, and metadata filters on every call alongside the transcript. Logging only the final agent response makes hallucination debugging nearly impossible: you see the wrong answer but not whether the retriever returned the wrong chunk or the right chunk got generated incorrectly. Most "hallucination" tickets turn out to be retrieval ranking problems, which are fixed at the source.
The chunking pipeline cannot preserve the spatial relationships that make tables readable, so a table cell often gets retrieved without its column header. Rewrite critical tables as prose with explicit sentences. "The Pro plan supports 50 users and includes API access" beats a table cell that the retriever splits away from its column header.
A knowledge base with 4,000 chunks where 50 are relevant for any given caller is worse than one with 400 chunks where 50 are relevant, because the retriever has 10x more competing matches to confuse it. Build narrow knowledge bases per workflow and link them at the node level instead.
When the agent says something wrong, the instinct is to add "do not say X" to the prompt. Three of these and the prompt becomes contradictory; ten and it becomes unmanageable. Figure out why the LLM said X. Almost always, a chunk in the KB suggested it, or the absence of a chunk forced the model to fall back on training data. Patch the source.
Each additional chunk adds tokens to the prompt and milliseconds to the response. Setting "chunks to retrieve" to 10 because more context feels safer is a common mistake. Stay at 3 chunks for typical support content, increase to 5 only when caller questions span multiple topics, and never go higher unless you have measured it improving accuracy.
Agents hallucinate because they are allowed to speak without an explicit refusal instruction. The Air Canada chatbot was held liable after generating a nonexistent bereavement policy that contradicted the airline's actual rules, and similar failures keep recurring across vendors who skip the refusal layer. Make the refusal explicit, test that it triggers, and treat any case where the agent invents information as a P0 bug. (CanLII)
Adding a new document changes the retrieval landscape for every existing query. A chunk that ranked first yesterday may rank third today. Keep the 50-question test set automated and re-run it whenever the underlying content changes meaningfully.
SWTCH deployed a Retell-powered voice agent named Lucas to handle EV charger support calls, where callers are typically standing at a dead charger with a low battery and no patience for a wrong instruction. The implementation reduced support costs by more than 50% and significantly improved SaaS margins, with the agent answering in seconds instead of minutes. The reliability bar was set by the use case: a wrong troubleshooting step is the difference between a working charger and a stranded driver.
Anker rolled out Retell across global consumer-electronics support, where callers ask product-specific questions across dozens of SKUs and multiple languages. The case study illustrates why metadata scoping matters at scale. Without product-level filtering on retrieval, a soundbar question can pull in a vacuum cleaner manual, and the agent will confidently combine them. With proper KB structure, the agent stays inside the product context for the entire call.
Retell AI now powers 50M+ real-time AI phone calls every month for thousands of businesses, with no agent reported going off the rails at that volume. The architecture in this guide is the same one running underneath those calls. (Yahoo Finance)
Recursive chunking at 512 tokens with 10-15% overlap is the benchmark-validated default. Smaller chunks (200-300 tokens) work for FAQ-style content; larger chunks (1024 tokens) work for narrative prose. Always split on Markdown headings first, then paragraphs, then sentences.
Add an explicit refusal instruction to the agent prompt: "Only answer using the information in ## Related Knowledge Base Contexts. If that section is missing or does not contain relevant information, respond that there is no related information available." Combined with a similarity threshold of 0.65 or higher, this is the most effective single anti-hallucination control.
Retrieval adds under 100ms per turn on Retell's optimized retrieval pipeline, keeping the agent within the ~600ms total response window callers expect. If you see materially higher latency, check whether you are retrieving more chunks than you need or whether metadata filtering is being applied after retrieval rather than at query time.
Conversation flow wins for software support and any scenario where caller intent splits into distinct workflows. Node-level knowledge bases let each conversation state retrieve from a focused slice of content, which improves accuracy and makes maintenance straightforward. Single prompts work for narrow use cases like a single-product FAQ. Retell's deploy conversational AI guide covers the architectural choice in more detail.
Enable auto-refresh on URL sources so Retell re-fetches every 24 hours. For uploaded documents, run a manual review whenever the underlying product or policy changes and treat anything older than 90 days as needing verification.
Yes, partially. You can use successful call transcripts and senior-rep recordings as source material for the knowledge base. Extract the Q-and-A pairs, convert them to Markdown, and index them alongside your formal documentation. This is especially useful for capturing the specific phrasing your top reps use, which often resolves issues faster than the official help-center copy. It does not replace structured documentation; it supplements it.
At minimum: product, version, region, audience, and last_verified_date. Add topic for granular routing in a conversation flow, and compliance_scope if you have regulated content (HIPAA, financial advice) that should never be retrieved outside specific call contexts.
Build a test set of 50-100 real caller questions from the last 30 days of support tickets. For each, inspect the retrieved chunks before any LLM generation: is the right chunk in the top 3, is the similarity score above threshold, and could a human answer the question from just those chunks. Aim for 90%+ retrieval accuracy on the test set before deploying.
With a refusal instruction in place and a similarity threshold of 0.65 or higher, the agent says it does not have that information documented and either offers to take a message or warm-transfers via call transfer to a human agent with full conversation context. Without these controls, the agent falls back on the underlying LLM's training data, which is exactly the failure mode this guide is designed to prevent.
You now have a knowledge base architecture that grounds every agent response in verified content, scopes retrieval by caller context, refuses to answer when the answer isn't documented, and updates itself as your source material changes. This is the foundation that lets a voice agent handle software support, regulated industries, or any high-stakes call where a wrong step matters more than a fast one.
To extend this further, the same retrieval architecture supports use cases like AI customer support automation, lead qualification with product-specific routing, and AI-powered receptionists for healthcare practices where compliance scoping is non-negotiable. The same patterns also map to healthcare and insurance deployments where the cost of a hallucination is a regulatory issue, not just a customer-experience one.
Start building free with $10 in usage credits at retellai.com.