Structure your voice agent's knowledge base as 512-token recursive Markdown chunks tagged with product, region, and audience metadata, retrieved at a 0.65+ similarity threshold with an explicit refusal instruction. This is the four-layer pattern (curation, chunking, metadata scoping, refusal-by-default retrieval) that production voice agents on platforms like Retell AI use to prevent the made-up-step failure mode.
The rest of this guide unpacks each layer with the exact configuration values, a reference structure modeled on enterprise support libraries like Lenovo's, and the testing protocol that catches hallucinations before they reach a caller.
By the end of this tutorial, your knowledge base will be a retrieval architecture that grounds every agent response in your verified content, blocks the model from inventing steps, and stays under the latency budget required for natural phone conversations.
Archive every document that is stale, contradictory, or not the single source of truth for its topic before you index a single page. The largest source of voice agent hallucinations is not the model. It is contradictory or outdated content in the source material.
If two pages disagree about your refund window, the retriever has no way to pick the correct one, and the LLM will confidently read whichever chunk wins the similarity score. Pull every document, FAQ, and help article you plan to index. For each one, check three things: is it current as of this quarter, is it the single source of truth for its topic, and does it match what your senior support reps actually say on calls. If a document shouldn't be used to answer customer questions, it shouldn't be in your knowledge base at all. (ElevenLabs)
You should now have a curated set of documents that you would be comfortable sending verbatim to a customer.
Convert everything to structured Markdown with one H1 per document, a descriptive H2 per resolvable user question, and short paragraphs with explicit subjects. Voice agents retrieve text chunks, not rendered web pages. Markdown is the format that survives the chunking pipeline with the most semantic structure intact.
Retell's own documentation recommends Markdown over .txt because well-structured headings give the retriever clean boundaries to split on. Replace every "click here" or "as described above" with the concrete reference, because the chunk containing "above" may be retrieved without the chunk it points to. Each chunk gets read alone, so each chunk needs to make sense alone.
You should now have a folder of Markdown files where each file covers one product area and each H2 covers one resolvable user question.
Use recursive chunking at 512 tokens with 10-15% overlap, splitting on Markdown headings first, then paragraphs, then sentences. This is the benchmark-validated default for general RAG content, and it holds up well for voice support material specifically.
A well-tuned 512-token recursive splitter with 15% overlap and metadata enrichment outperforms an expensive semantic chunking approach on most real-world document sets. Long chunks pad the LLM context window with noise. Short chunks fragment instructions across multiple retrievals and cause the agent to skip steps mid-explanation. The hard rule: never split a numbered procedure across two chunks. If "Step 3" lives in chunk A and "Step 4" in chunk B, the retriever can return one without the other and your agent will skip an action. (Substack)
You should now have chunks where each unit either contains a complete procedure or contains contextual prose that makes sense without its neighbors.
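The splitting order above (headings first, then paragraphs, then sentences, with overlap) can be sketched in a few lines. This is an illustrative Python sketch, not Retell's internal chunker: it approximates token counts with a word count (swap in a real tokenizer such as tiktoken in production) and falls back to an overlapping word window when a section exceeds the budget.

```python
import re

MAX_TOKENS = 512   # target chunk size from the guide
OVERLAP = 64       # ~12% overlap, inside the recommended 10-15% band

def _size(text):
    # Crude token estimate: one word per token. Replace with a real tokenizer.
    return len(text.split())

def split_markdown(text):
    """Split on H2 headings first; only sections over budget get re-packed."""
    sections = re.split(r"(?m)^(?=## )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if _size(section) <= MAX_TOKENS:
            # Section fits whole, so a procedure inside it stays intact.
            chunks.append(section.strip())
            continue
        # Fall back to overlapping word-window packing for oversized sections.
        words = section.split()
        step = MAX_TOKENS - OVERLAP
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + MAX_TOKENS]))
    return chunks

doc = "## Reset your password\n" + "step " * 40 + "\n\n## Update billing\n" + "info " * 600
chunks = split_markdown(doc)
```

Here the short password section survives as one chunk, while the oversized billing section is packed into overlapping windows, so no chunk exceeds the 512-token budget.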
Tag every chunk with at minimum five fields: product, version, region, audience, and last_verified_date. This is what stops a Texas caller from hearing California return policies and stops a ThinkPad question from pulling ThinkCentre answers.
If a single agent sees KB content for multiple states or locations, retrieval can pull the wrong-state chunk (e.g., California policy for a Texas caller) unless scoping and metadata are carefully designed. The same problem applies to product lines, software versions, customer tiers, and support channels. At runtime, your agent passes the relevant filters with the query so the vector search only considers chunks that match the caller's context. In Retell, you can pass these as dynamic variables collected earlier in the call. (Optimize Smart)
You should now have a metadata schema where any single chunk can be uniquely scoped to one caller context.
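As a concrete sketch of the five-field schema, here is a hypothetical chunk record and a pre-filter helper. The field names follow the guide, but the values and the `matches_caller` helper are illustrative, not a Retell API: they show how caller context narrows the candidate set before similarity ranking ever runs.

```python
from datetime import date

# Hypothetical chunk metadata record with the five-field minimum.
chunk_meta = {
    "product": "thinkpad",
    "version": "2024.2",
    "region": "TX",
    "audience": "end_user",
    "last_verified_date": date(2025, 1, 15),
}

def matches_caller(meta, caller):
    """A chunk is eligible only if every caller-context filter agrees."""
    return all(meta.get(key) == value for key, value in caller.items())

# Dynamic variables collected earlier in the call become the filter set.
texas_caller = {"product": "thinkpad", "region": "TX", "audience": "end_user"}
california_caller = {"product": "thinkpad", "region": "CA", "audience": "end_user"}
```

With this scoping in place, the Texas caller can never be served the California chunk, no matter how well it scores on similarity.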
Set the similarity threshold to 0.65 or higher and limit retrieval to 3-5 chunks. A retrieval system that always returns something is a hallucination machine in disguise.
When a caller asks about a feature you do not document, the retriever will still surface the closest semantic match. The LLM will receive that chunk and fluently weave it into a wrong answer. In Retell's knowledge base settings, two parameters control this behavior. The "Chunks to retrieve" parameter sets how many results feed into the LLM (default 3, recommended max 5 for voice). The "Similarity Threshold" sets the minimum cosine similarity for a chunk to be considered relevant (default 0.6). For software-support use cases where wrong information is worse than no information, raise the threshold to 0.7.
You should now see your retrieval system returning fewer, higher-quality matches and rejecting marginal ones.
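The two parameters act together as a refusal-by-default gate: drop everything below the threshold, then cap what survives at the chunk limit. A minimal sketch, with fabricated similarity scores standing in for what a vector store would return:

```python
SIMILARITY_THRESHOLD = 0.65   # raise to 0.7 for software support
TOP_K = 3                     # "chunks to retrieve"; max 5 for voice

def select_chunks(scored):
    """scored: list of (chunk_text, cosine_similarity) pairs."""
    # Threshold first: marginal matches are rejected outright.
    passing = [item for item in scored if item[1] >= SIMILARITY_THRESHOLD]
    # Then keep only the top-K strongest survivors.
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing[:TOP_K]

results = select_chunks([
    ("refund policy", 0.82),
    ("shipping times", 0.71),
    ("unrelated feature", 0.58),   # below threshold: rejected
    ("warranty terms", 0.66),
    ("press release", 0.61),       # below threshold: rejected
])
```

An empty result list is a valid outcome here, and it is exactly the signal that should trigger the refusal instruction in the next step.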
Add this exact instruction to the agent prompt: "Only answer using the information in ## Related Knowledge Base Contexts. If that section is missing or does not contain relevant information, say there is no related information available and offer to transfer the call." This single instruction is the most effective anti-hallucination control in any voice agent.
Without it, the LLM will reach into its training data when retrieval comes back empty. This is the failure mode behind nearly every public chatbot disaster, including the Air Canada bereavement-fare case where the airline was held legally liable. The refusal pattern flips the failure mode from "confident wrong answer" to "honest let-me-get-a-human," which is exactly what callers using your software actually want when the system hits an edge case.
You should now see your agent saying "I do not have that documented; let me transfer you" on out-of-scope questions instead of inventing steps.
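For the instruction to be enforceable, retrieved chunks must appear under the exact header the instruction names, and the section must be omitted entirely when retrieval returns nothing so the refusal clause fires. Retell handles this injection when you attach a knowledge base; the sketch below just illustrates the mechanics with plain string assembly.

```python
REFUSAL_RULE = (
    "Only answer using the information in ## Related Knowledge Base Contexts. "
    "If that section is missing or does not contain relevant information, say "
    "there is no related information available and offer to transfer the call."
)

def build_system_prompt(base_prompt, chunks):
    parts = [base_prompt, REFUSAL_RULE]
    if chunks:
        # Only add the section when retrieval actually returned something.
        parts.append("## Related Knowledge Base Contexts\n\n" + "\n\n".join(chunks))
    return "\n\n".join(parts)

grounded = build_system_prompt(
    "You are a software-support voice agent.", ["Refund window: 30 days."]
)
empty = build_system_prompt("You are a software-support voice agent.", [])
```

In the empty-retrieval case the prompt contains the refusal rule but no contexts section, so the agent's only compliant move is the honest "I do not have that documented."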
Enable auto-refresh on URL sources so Retell re-fetches every 24 hours, version your Markdown files in Git, and run a quarterly review of any file with a last_verified_date older than 90 days. Stale knowledge is hallucination's quiet partner.
If the knowledge base is outdated, RAG just retrieves the wrong answer faster. The fix is automating freshness instead of relying on someone remembering to re-upload files when product docs change. Pair auto-refresh with auto-crawling for help center subpaths so new articles get indexed automatically without manual intervention. (CX Today)
You should now have a knowledge base that updates itself when your underlying content updates, with no human in the loop.
Run your 50 real caller questions through the retrieval system and inspect what comes back before any LLM generation happens. Three things matter for each query: is the right chunk in the top 3, is the similarity score above your threshold, and would a human reading just the retrieved chunks be able to answer the question.
For any question where retrieval fails, the fix is almost always at the source. Either the relevant content is missing entirely, the chunking split a procedure across boundaries, or the metadata is filtering it out. Resist the urge to fix retrieval failures by adding instructions to the agent prompt. Prompt patches are how knowledge bases drift into unmaintainable mess. Aim for 90%+ retrieval accuracy on the test set before deploying to live calls.
You should now have a measured retrieval accuracy number for your top caller questions and a list of source-document fixes for the questions that failed.
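The pass/fail check is easy to automate. Below is a minimal accuracy harness; the `fake_retrieve` stub stands in for your real retrieval pipeline so the scoring logic itself is runnable, and the question/chunk IDs are invented for illustration.

```python
def score_test_set(test_set, retrieve, top_n=3):
    """test_set: list of (question, expected_chunk_id) pairs.
    Returns (hit rate, list of failed questions)."""
    hits = 0
    failures = []
    for question, expected in test_set:
        top_ids = [chunk_id for chunk_id, _score in retrieve(question)[:top_n]]
        if expected in top_ids:
            hits += 1
        else:
            failures.append(question)
    return hits / len(test_set), failures

# Stubbed retriever for illustration only; replace with your pipeline.
def fake_retrieve(question):
    index = {
        "how do i reset my password": [("kb-12", 0.81), ("kb-40", 0.67)],
        "what is the refund window": [("kb-99", 0.59)],
    }
    return index.get(question.lower(), [])

accuracy, failed = score_test_set(
    [("How do I reset my password", "kb-12"),
     ("What is the refund window", "kb-07")],
    fake_retrieve,
)
```

The failure list is the work queue: every question on it points at a missing document, a bad chunk boundary, or an over-aggressive metadata filter, to be fixed at the source rather than in the prompt.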
Use conversation flow with node-level knowledge bases for any voice agent where caller intent splits into distinct workflows, especially software support, healthcare scheduling, and multi-product environments. A flat knowledge base sitting under a single prompt is the loosest possible architecture.
For software support specifically, the highest-quality pattern is conversation flow where each node only retrieves from the slice of documentation relevant to that part of the call. A typical software-support flow has nodes for triage, account lookup, troubleshooting, escalation, and post-resolution confirmation. The troubleshooting node loads the troubleshooting KB. The account-lookup node does not load any KB at all because it should be calling an API. This structure is more reliable to maintain than one giant prompt with one giant KB. Pair it with built-in post call analysis so you can see which nodes are firing retrieval and which queries are coming back below threshold.
You should now have a deployment where retrieval scope tightens as the conversation narrows, instead of every turn searching every document.
Tier the knowledge base by audience and scope retrieval to the caller's tier. This is the structural pattern Lenovo's enterprise support library uses to keep teacher-facing classroom-management content from colliding with technical engineering content.
Lenovo established three levels of articles: general topics and product information, teacher-specific topics, and technical topics and issues, with focused articles, eliminated redundancies, and standardized naming conventions across all three tiers (Contiem). Apply the same pattern to a voice AI knowledge base:
Tier 1: General product and pricing. Public-facing facts every caller might ask. Tagged audience: all. Index everything.
Tier 2: End-user how-to. Step-by-step procedures for the standard caller. Tagged audience: end_user, scoped by product and region. This tier carries the bulk of retrieval traffic.
Tier 3: Technical and admin. Configuration, integrations, and edge cases. Tagged audience: admin. Only retrieved when the caller has been identified as an administrator earlier in the call.
Internal runbooks, escalation matrices, and engineering notes go in a separate knowledge base entirely, never accessible to the customer-facing agent. The tiering is what prevents an end-user troubleshooting call from accidentally surfacing an internal escalation procedure.
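A short sketch of how the audience tag collected earlier in the call maps to searchable tiers. The tier names and the access mapping are illustrative assumptions, not a Retell feature; the point is that an unidentified caller defaults to the narrowest scope.

```python
# Hypothetical mapping from audience tag to the tiers retrieval may search.
TIER_ACCESS = {
    "all":      {"tier1"},
    "end_user": {"tier1", "tier2"},
    "admin":    {"tier1", "tier2", "tier3"},
}

def allowed_tiers(audience):
    # Unknown or unidentified callers fall back to public Tier 1 content only.
    return TIER_ACCESS.get(audience, {"tier1"})
```

Passing `allowed_tiers(...)` as a metadata filter at query time is what keeps admin configuration content out of an end-user troubleshooting call.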
Behavioral instructions do not belong in the knowledge base. The knowledge base is for supplying supporting information, not agent behavior. If you find yourself uploading a Markdown file titled "How the agent should behave when X happens," that content belongs in the prompt or in a conversation flow node. Mixing them dilutes both: the retriever ranks behavior instructions against factual queries and pulls them at the wrong moments.
Lead with the user goal, not the feature name. "Configure two-factor authentication" becomes "Turn on two-factor login." The retriever matches against the caller's spoken wording, and natural-language questions match natural-language headings far better than they match product terminology.
Each chunk gets retrieved alone. Use full names instead of pronouns, full product names instead of "the platform," and repeat any conditional context in each step instead of saying "if you are using the admin console, then..." three paragraphs later. This single rule eliminates a surprising fraction of hallucinations because it removes the ambiguity that the LLM otherwise tries to resolve by guessing.
Capture the retrieved chunks, similarity scores, and metadata filters on every call alongside the transcript. Logging only the final agent response makes hallucination debugging nearly impossible: you see the wrong answer but not whether the retriever returned the wrong chunk or the right chunk got generated incorrectly. Most "hallucination" tickets turn out to be retrieval ranking problems, which are fixed at the source.
The chunking pipeline cannot preserve the spatial relationships that make tables readable, so a table cell often gets retrieved without its column header. Rewrite critical tables as prose with explicit sentences. "The Pro plan supports 50 users and includes API access" beats a table cell that the retriever splits away from its column header.
A knowledge base with 4,000 chunks where 50 are relevant for any given caller is worse than one with 400 chunks where 50 are relevant, because the retriever has 10x more competing matches to confuse it. Build narrow knowledge bases per workflow and link them at the node level instead.
When the agent says something wrong, the instinct is to add "do not say X" to the prompt. Three of these and the prompt becomes contradictory; ten and it becomes unmanageable. Figure out why the LLM said X. Almost always, a chunk in the KB suggested it, or the absence of a chunk forced the model to fall back on training data. Patch the source.
Each additional chunk adds tokens to the prompt and milliseconds to the response. Setting "chunks to retrieve" to 10 because more context feels safer is a common mistake. Stay at 3 chunks for typical support content, increase to 5 only when caller questions span multiple topics, and never go higher unless you have measured it improving accuracy.
Agents hallucinate because they are allowed to speak without an explicit refusal instruction. The Air Canada chatbot was held liable after generating a nonexistent bereavement policy that contradicted the airline's actual rules, and similar failures keep recurring across vendors who skip the refusal layer. Make the refusal explicit, test that it triggers, and treat any case where the agent invents information as a P0 bug. (CanLII)
Adding a new document changes the retrieval landscape for every existing query. A chunk that ranked first yesterday may rank third today. Keep the 50-question test set automated and re-run it whenever the underlying content changes meaningfully.
SWTCH deployed a Retell-powered voice agent named Lucas to handle EV charger support calls, where callers are typically standing at a dead charger with a low battery and no patience for a wrong instruction. The implementation reduced support costs by more than 50% and significantly improved SaaS margins, with the agent answering in seconds instead of minutes. The reliability bar was set by the use case: a wrong troubleshooting step is the difference between a working charger and a stranded driver.
Anker rolled out Retell across global consumer-electronics support, where callers ask product-specific questions across dozens of SKUs and multiple languages. The case study illustrates why metadata scoping matters at scale. Without product-level filtering on retrieval, a soundbar question can pull in a vacuum cleaner manual, and the agent will confidently combine them. With proper KB structure, the agent stays inside the product context for the entire call.
Retell AI now powers 50M+ real-time AI phone calls every month for thousands of businesses, with no agent reported going off the rails at that volume. The architecture in this guide is the same one running underneath those calls. (Yahoo Finance)
Recursive chunking at 512 tokens with 10-15% overlap is the benchmark-validated default. Smaller chunks (200-300 tokens) work for FAQ-style content; larger chunks (1024 tokens) work for narrative prose. Always split on Markdown headings first, then paragraphs, then sentences.
Add an explicit refusal instruction to the agent prompt: "Only answer using the information in ## Related Knowledge Base Contexts. If that section is missing or does not contain relevant information, respond that there is no related information available." Combined with a similarity threshold of 0.65 or higher, this is the most effective single anti-hallucination control.
Retrieval adds under 100ms per turn on Retell's optimized retrieval pipeline, keeping the agent within the ~600ms total response window callers expect. If you see materially higher latency, check whether you are retrieving more chunks than you need or whether metadata filtering is being applied after retrieval rather than at query time.
Conversation flow wins for software support and any scenario where caller intent splits into distinct workflows. Node-level knowledge bases let each conversation state retrieve from a focused slice of content, which improves accuracy and makes maintenance straightforward. Single prompts work for narrow use cases like a single-product FAQ. Retell's deploy conversational AI guide covers the architectural choice in more detail.
Enable auto-refresh on URL sources so Retell re-fetches every 24 hours. For uploaded documents, run a manual review whenever the underlying product or policy changes and treat anything older than 90 days as needing verification.
Yes, partially. You can use successful call transcripts and senior-rep recordings as source material for the knowledge base. Extract the Q-and-A pairs, convert them to Markdown, and index them alongside your formal documentation. This is especially useful for capturing the specific phrasing your top reps use, which often resolves issues faster than the official help-center copy. It does not replace structured documentation; it supplements it.
At minimum: product, version, region, audience, and last_verified_date. Add topic for granular routing in a conversation flow, and compliance_scope if you have regulated content (HIPAA, financial advice) that should never be retrieved outside specific call contexts.
Build a test set of 50-100 real caller questions from the last 30 days of support tickets. For each, inspect the retrieved chunks before any LLM generation: is the right chunk in the top 3, is the similarity score above threshold, and could a human answer the question from just those chunks. Aim for 90%+ retrieval accuracy on the test set before deploying.
With a refusal instruction in place and a similarity threshold of 0.65 or higher, the agent says it does not have that information documented and either offers to take a message or warm-transfers via call transfer to a human agent with full conversation context. Without these controls, the agent falls back on the underlying LLM's training data, which is exactly the failure mode this guide is designed to prevent.
You now have a knowledge base architecture that grounds every agent response in verified content, scopes retrieval by caller context, refuses to answer when the answer isn't documented, and updates itself as your source material changes. This is the foundation that lets a voice agent handle software support, regulated industries, or any high-stakes call where a wrong step matters more than a fast one.
To extend this further, the same retrieval architecture supports use cases like AI customer support automation, lead qualification with product-specific routing, and AI-powered receptionists for healthcare practices where compliance scoping is non-negotiable. The same patterns also map to healthcare and insurance deployments where the cost of a hallucination is a regulatory issue, not just a customer-experience one.
Start building free with $10 in usage credits at retellai.com.