The Monday after Christmas, your support queue shows 412 tickets before 9 a.m. Most of them are the same four questions: "Where is my refund?", "Can I exchange this for a larger size?", "How do I start a return?", "Was my package received?" Your two-person team will spend the whole week looking up orders, copying tracking numbers, and issuing shipping labels one at a time. Industry data shows retailers expect 17% of holiday sales to come back.
This guide shows you how to build an AI system that handles refund and exchange requests end-to-end: phone and chat conversations, policy checks, fraud scoring, label generation, and real-time status updates. By the end, you'll have a working setup that deflects the routine 70-80% of return tickets and escalates only the cases that genuinely need a human, built on Retell AI.
You'll build an always-on returns and exchanges system that handles inbound calls and messages, verifies orders against your OMS, applies policy rules, generates labels, and pushes status updates to the customer without a human touching the ticket.
By the end of this tutorial, your setup will:
Before you start, you'll need:
Start with a working agent before you wire in anything complex. Sign up at retellai.com, open the agent builder, and pick a warm, clear voice from the ElevenLabs library. Set the opening line to something your brand would actually say: "Hi, this is Maya from [Brand] — are you calling about a return, an exchange, or something else?"
Make a test call to the temporary number the dashboard assigns. You'll hear the agent answer in roughly 600ms and route based on what you say. At this point it cannot yet look up orders — it only routes intent. That's fine. Confirm the AI voice agent answers cleanly, handles interruption when you talk over it, and identifies "return," "refund," and "exchange" as the three primary intents.
Outcome: a live agent that answers calls and correctly classifies returns, refunds, and exchange requests as separate flows.
Most brands try to build one giant flow and regret it. Build three short ones instead: refund request, exchange request, and "where is my refund" status check. Each flow needs a greeting, an order lookup step, a policy check, a decision branch, and a confirmation message.
For the refund flow: greet → ask for order number or email → look up order → check return window → check condition question → decide approve/decline/escalate → generate label → confirm. For the exchange flow, insert a "suggest same-item different size" step before approving a refund — exchanges preserve revenue that refunds destroy. For the status flow, skip the policy check entirely and pull the refund state from your payments provider.
Use the drag-and-drop agentic framework to wire each branch. Keep the conversation short — returns callers want speed, not small talk. Aim for under 90 seconds from greeting to label delivery on a clean case.
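Before wiring anything in the builder, it can help to write the three flows down as ordered step lists. The sketch below is only a planning aid: the step names are illustrative labels invented here, not identifiers from Retell AI or any other agent builder.

```python
# Step names are illustrative labels, not identifiers from any agent builder.
REFUND_FLOW = [
    "greet",
    "ask_order_number_or_email",
    "look_up_order",
    "check_return_window",
    "ask_condition_question",
    "decide_approve_decline_escalate",
    "generate_label",
    "confirm",
]


def build_exchange_flow(refund_flow):
    """Copy the refund flow and offer a same-item swap before the decision step."""
    flow = list(refund_flow)
    decision_index = flow.index("decide_approve_decline_escalate")
    flow.insert(decision_index, "suggest_same_item_different_size")
    return flow


def build_status_flow():
    """Status checks skip policy logic and read the refund state instead."""
    return [
        "greet",
        "ask_order_number_or_email",
        "look_up_order",
        "pull_refund_state_from_payments_provider",
        "confirm",
    ]


EXCHANGE_FLOW = build_exchange_flow(REFUND_FLOW)
STATUS_FLOW = build_status_flow()
```

Deriving the exchange flow from the refund flow keeps the two in sync: a step added to the refund path shows up in the exchange path without a second edit.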
Outcome: three mapped flows that cover 80% of real return conversations.
The agent needs live order data or it's just a chatbot with a voice. Open the function-calling section of your agent and configure an HTTP request to your store's order API. For Shopify, that's the Admin API's /orders.json endpoint filtered by email or order number. For WooCommerce, the /wp-json/wc/v3/orders endpoint. For custom systems, whatever endpoint your OMS exposes.
Set the function's request timeout to at least 5 seconds, because order APIs can be slow during peak hours. Pass the order number or email the caller provides as query parameters. Map the response fields you need: order_date, fulfillment_status, financial_status, line_items, total_price. Add a fallback branch that asks the caller to spell their email letter by letter if the first lookup fails.
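The lookup logic might look something like the sketch below, which builds the query string, trims the response down to the five fields listed above, and returns the spell-it-out prompt when nothing matches. The query parameter names, the "orders" wrapper key, and the use of created_at as the order date are assumptions made for illustration, so check them against your own store's API before relying on them.

```python
from urllib.parse import urlencode

LOOKUP_TIMEOUT_SECONDS = 5  # the minimum timeout suggested above


def build_lookup_query(order_number=None, email=None):
    """Build the query string from whichever identifier the caller gave."""
    params = {}
    if order_number:
        params["name"] = str(order_number)  # assumed parameter name
    if email:
        params["email"] = email.strip().lower()  # assumed parameter name
    if not params:
        raise ValueError("need an order number or an email")
    return urlencode(params)


def map_order_fields(order):
    """Keep only the fields the agent speaks back or checks against policy."""
    return {
        "order_date": order.get("created_at"),  # assumed source field
        "fulfillment_status": order.get("fulfillment_status"),
        "financial_status": order.get("financial_status"),
        "line_items": [item.get("title") for item in order.get("line_items", [])],
        "total_price": order.get("total_price"),
    }


def first_order_or_fallback(response_json):
    """Return mapped order details, or the prompt for the fallback branch."""
    orders = response_json.get("orders", [])  # assumed wrapper key
    if not orders:
        return {"fallback": "Could you spell your email for me, letter by letter?"}
    return map_order_fields(orders[0])
```

Keeping the mapping in one function means the agent only ever sees the handful of fields it needs, which also limits how much customer data passes through the conversation.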
Test with five real order numbers from your database. The agent should say "Got it, I see your order for the navy hoodie from November 12" within 2-3 seconds of the caller finishing their request.
Outcome: the agent can pull any real customer order and speak the relevant details back during the call.
Generic AI won't enforce your specific policy. You have to spell it out. In the agent's decision logic, add a policy-check node after order lookup. The node evaluates four things in order: purchase date against return window, product category against exceptions, customer return history against your fraud threshold, and item condition based on the caller's answers.
Example rule set for an apparel brand: 30-day window from delivery, final-sale items excluded, customers with more than 6 returns in 90 days flagged for human review, swimwear and underwear non-returnable. Write each rule as a simple conditional the knowledge base can reference. Include the exact policy language callers hear when a request is declined — "This item was purchased 47 days ago and our return window is 30 days, but I can offer you 20% store credit as a one-time courtesy" is better than "Your return is denied."
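As a rough illustration, the example apparel rules can be written as plain conditionals evaluated in the order described above. The dictionary keys are hypothetical, and sending a failed condition check to human review is an assumption of this sketch, since the example rule set does not say what should happen in that case.

```python
from datetime import date

RETURN_WINDOW_DAYS = 30
NON_RETURNABLE_CATEGORIES = {"swimwear", "underwear"}
MAX_RETURNS_IN_90_DAYS = 6


def check_policy(order, customer, today):
    """Apply the example rules in order; return a (decision, reason) pair."""
    # 1. Purchase date against the return window (counted from delivery).
    days_since_delivery = (today - order["delivered_on"]).days
    if days_since_delivery > RETURN_WINDOW_DAYS:
        return ("decline", f"delivered {days_since_delivery} days ago; "
                           f"the return window is {RETURN_WINDOW_DAYS} days")
    # 2. Product category against exceptions.
    if order["final_sale"] or order["category"] in NON_RETURNABLE_CATEGORIES:
        return ("decline", "item is final sale or in a non-returnable category")
    # 3. Return history against the review threshold.
    if customer["returns_last_90_days"] > MAX_RETURNS_IN_90_DAYS:
        return ("escalate", "more than 6 returns in the last 90 days")
    # 4. Item condition from the caller's answers (escalating is an assumption).
    if not order["unworn_with_tags"]:
        return ("escalate", "caller reports the item is not in original condition")
    return ("approve", "inside window, eligible category, clean history")


late_order = {"delivered_on": date(2024, 11, 24), "final_sale": False,
              "category": "hoodies", "unworn_with_tags": True}
decision, reason = check_policy(late_order, {"returns_last_90_days": 1}, date(2025, 1, 10))
```

Returning the reason alongside the decision gives the agent something specific to say on a decline, and gives a human reviewer the triggering rule on an escalation.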
Outcome: the agent approves eligible returns, declines ineligible ones with a specific reason, and offers policy-compliant alternatives like store credit or partial refund.
Return fraud is roughly 9% of all returns according to NRF, and some categories run far higher. Your agent needs to catch the obvious patterns without slowing down legitimate customers. Build a scoring function that runs in parallel with the policy check and flags any of these signals: more than 3 returns in 30 days, return-to-purchase ratio above 50%, orders with a history of "never arrived" claims contradicted by carrier proof of delivery, or first-time customers requesting high-value refunds on multiple items from a single order.
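One way to express these four signals is a function that returns the list of signals a request trips, with an empty list meaning a clean case. This is a sketch only: the field names are hypothetical, and the $200 cutoff for a "high-value" refund is an assumption, since no figure is given above. Running it in parallel with the policy check is left to whatever orchestration your agent platform provides.

```python
HIGH_VALUE_REFUND = 200.00  # assumed cutoff; "high-value" is not defined above


def fraud_signals(customer, request):
    """Return the list of fraud signals this request trips (empty means clean)."""
    signals = []
    if customer["returns_last_30_days"] > 3:
        signals.append("more than 3 returns in 30 days")
    purchased = customer["items_purchased"]
    if purchased and customer["items_returned"] / purchased > 0.5:
        signals.append("return-to-purchase ratio above 50%")
    if customer["never_arrived_claims_contradicted"] > 0:
        signals.append("never-arrived claim contradicted by proof of delivery")
    if (customer["is_first_order"]
            and request["refund_total"] >= HIGH_VALUE_REFUND
            and request["items_in_request"] > 1):
        signals.append("first-time customer, high-value multi-item refund")
    return signals


clean_customer = {"returns_last_30_days": 0, "items_purchased": 12, "items_returned": 1,
                  "never_arrived_claims_contradicted": 0, "is_first_order": False}
clean_flags = fraud_signals(clean_customer, {"refund_total": 59.0, "items_in_request": 1})
```

Returning named signals rather than a single score keeps the evidence readable, so it can be attached to the transfer note as-is.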
Route flagged cases to a call transfer with full context — the human agent picks up with a note like "Caller on order 10847, three returns in last 21 days, requesting refund on item showing delivered and signed for." Don't tell the caller they're flagged. Say "Let me get a supervisor who can look at this with you."
Outcome: obvious fraud attempts reach human review with evidence attached, while clean returns sail through without friction.
Once the policy check passes, the agent needs to act, not just promise action. Wire a second function call to your label provider's API. Shippo, EasyPost, and ShipStation each expose a label-creation endpoint (Shippo's is /transactions; check the other providers' API references for the exact path) that returns a label for a given order address and service level. Configure the agent to trigger this call the moment a return is approved, then SMS or email the label to the customer during the conversation.
In parallel, fire a webhook to your WMS or 3PL so the warehouse knows a return is inbound. Include the order number, expected items, and the reason code so receiving can route damaged-on-arrival items to inspection instead of restock. This is the step that separates real automation from fancy chatbots — the inventory workflow starts while the caller is still on the line.
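A minimal sketch of the inbound-return notice follows. The payload shape, the reason-code strings, and the "route_to" field are all hypothetical; your WMS or 3PL will define its own schema. Treating every reason other than damaged-on-arrival as restock is also a simplifying assumption.

```python
import json


def build_inbound_return_notice(order_number, items, reason_code):
    """Build the notice body sent to the warehouse when a return is approved."""
    # Damaged-on-arrival items go to inspection; everything else is assumed restockable.
    destination = "inspection" if reason_code == "damaged_on_arrival" else "restock"
    return {
        "order_number": order_number,
        "expected_items": [{"sku": sku, "quantity": qty} for sku, qty in items],
        "reason_code": reason_code,
        "route_to": destination,
    }


notice = build_inbound_return_notice("10847", [("HOODIE-NAVY-M", 1)], "damaged_on_arrival")
body = json.dumps(notice)  # serialized request body for the webhook call
```

Deciding the routing at approval time, rather than at the receiving dock, is what lets the warehouse plan for the item before it arrives.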
Outcome: the customer receives a working label within 30 seconds of approval, and the warehouse sees the incoming return in its dashboard before the call ends.
Exchanges are where margin gets protected. When a caller says "I need to return these jeans, they're too big," a refund-first agent loses revenue. An inventory-aware agent says "I can see these in a size smaller at our warehouse. Want me to ship those out today and include a prepaid label for the return?"
Add a function call that checks your inventory API for the same SKU in the requested alternative size or color before branching to the refund path. If stock is available, offer the exchange first and process both sides of the transaction — outbound shipment of the new item, inbound label for the old one — in a single confirmation step. The AI customer support flow handles the follow-up SMS with both tracking numbers.
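The branching described above could be sketched as follows. The reason codes and the stock_lookup callable are hypothetical stand-ins for your own intent labels and inventory API; the example passes in a dictionary-backed fake so the logic can be exercised without a network call.

```python
EXCHANGE_FRIENDLY_REASONS = {"wrong_size", "wrong_color", "fit"}


def choose_path(reason, sku, requested_variant, stock_lookup):
    """Offer an exchange first when the reason fits and the variant is in stock."""
    if reason in EXCHANGE_FRIENDLY_REASONS and requested_variant is not None:
        if stock_lookup(sku, requested_variant) > 0:
            # Both sides of the exchange are handled in one confirmation step.
            return {"path": "offer_exchange",
                    "actions": ["ship_new_item", "send_return_label"]}
    return {"path": "refund", "actions": ["send_return_label"]}


# Fake inventory keyed by (sku, variant); a real lookup would call your inventory API.
FAKE_STOCK = {("JEANS-SLIM", "30"): 4, ("JEANS-SLIM", "28"): 0}


def fake_stock_lookup(sku, variant):
    return FAKE_STOCK.get((sku, variant), 0)


result = choose_path("wrong_size", "JEANS-SLIM", "30", fake_stock_lookup)
```

Because the stock check happens before the offer is spoken, the agent never proposes an exchange it cannot fulfill.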
Outcome: 30-40% of refund requests convert to exchanges, preserving the revenue instead of reversing it.
Before you go live, run simulation tests against 20-30 realistic scenarios: clean return inside window, return outside window, exchange with stock, exchange without stock, serial returner, wardrobing suspect, caller who doesn't know their order number, accented or non-native English speaker, caller who interrupts mid-policy explanation. Fix any failure points you find.
When you flip the switch, turn on post call analysis with custom KPIs: approval rate, exchange conversion rate, transfer rate, average handle time, and fraud-flag precision. Set a weekly review cadence for the first month — read 20 call transcripts, look for knowledge gaps, adjust the policy wording or escalation thresholds based on what you see. Plan for a 2-week tuning period. Most brands land at 70-80% full containment in week one, climbing to 85-90% after the first round of adjustments.
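If you export call records for the weekly review, the five KPIs above reduce to a few ratios. The sketch below assumes a hypothetical record shape (one dictionary per call with boolean outcome fields); it is not the export format of any particular platform.

```python
def compute_kpis(calls):
    """Compute the five weekly KPIs from a list of per-call records."""
    def rate(part, whole):
        return part / whole if whole else 0.0

    refund_intents = [c for c in calls if c["intent"] == "refund"]
    flagged = [c for c in calls if c["fraud_flagged"]]
    return {
        "approval_rate": rate(sum(c["approved"] for c in calls), len(calls)),
        "exchange_conversion_rate": rate(
            sum(c["converted_to_exchange"] for c in refund_intents), len(refund_intents)),
        "transfer_rate": rate(sum(c["transferred"] for c in calls), len(calls)),
        "average_handle_time_seconds": rate(
            sum(c["duration_seconds"] for c in calls), len(calls)),
        # Precision: of the calls flagged for fraud, how many were confirmed.
        "fraud_flag_precision": rate(
            sum(c["confirmed_fraud"] for c in flagged), len(flagged)),
    }


def call(intent, approved, converted, transferred, seconds, flagged=False, confirmed=False):
    """Helper for building sample records in the hypothetical shape."""
    return {"intent": intent, "approved": approved, "converted_to_exchange": converted,
            "transferred": transferred, "duration_seconds": seconds,
            "fraud_flagged": flagged, "confirmed_fraud": confirmed}


sample_calls = [
    call("refund", True, True, False, 80),
    call("refund", True, False, False, 100),
    call("exchange", True, False, False, 90),
    call("refund", False, False, True, 130, flagged=True, confirmed=True),
]
kpis = compute_kpis(sample_calls)
```

Exchange conversion is measured against refund intents only, which matches the definition used later in the measurement guidance: the share of refund intents turned into exchanges.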
Outcome: a live returns agent with measurable KPIs and a repeatable improvement loop.
Every refund is a reversed sale. Every exchange preserves the transaction. When a caller gives a reason that implies an exchange works (wrong size, wrong color, didn't love the fit), have the agent offer the alternative first. Brands that reorder the conversation this way see 30-40% of refund intents convert to exchanges. Don't force it — if the caller says "I just want my money back," respect that and move on.
Written policy reads differently than spoken policy. "Items must be in original condition with tags attached" works on a help page. On a call it sounds stiff. Rewrite each rule in plain spoken English and record yourself saying it — if you stumble, the caller will too. Store these rewritten lines in the knowledge base so the agent uses them verbatim.
Transferring to a human on the first unclear answer kills containment. Letting the agent loop forever frustrates the caller. Set a hard ceiling: three clarification attempts, then either resolve or escalate. Most callers rephrase successfully on attempt two; attempt three is where you learn whether your knowledge base has a real gap or the caller genuinely needs a human.
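The three-attempt ceiling can be sketched as a bounded loop. The keyword classifier here is a deliberately crude, hypothetical stand-in for whatever intent detection your agent actually uses.

```python
MAX_CLARIFICATION_ATTEMPTS = 3


def keyword_classifier(utterance):
    """Toy intent detector: returns the first known intent word found, else None."""
    text = utterance.lower()
    for intent in ("refund", "exchange", "return"):
        if intent in text:
            return intent
    return None


def resolve_or_escalate(utterances, classify):
    """Try to classify up to three caller utterances, then hand off to a human."""
    attempts = 0
    for utterance in utterances[:MAX_CLARIFICATION_ATTEMPTS]:
        attempts += 1
        intent = classify(utterance)
        if intent is not None:
            return {"outcome": "resolved", "intent": intent, "attempts": attempts}
    return {"outcome": "escalate", "intent": None, "attempts": attempts}


second_try = resolve_or_escalate(["um, the thing I bought", "I want to exchange it"],
                                 keyword_classifier)
```

Recording the attempt count on every call makes it easy to see later how often attempt three is reached, which is the signal for a knowledge-base gap.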
Automated returns succeed or fail on edge cases your policy didn't anticipate. Read 15-20 full transcripts every week for the first month. You'll find patterns — a product line driving 30% of returns with the same complaint, a policy rule the agent is interpreting too strictly, a phrase callers keep using that your knowledge base doesn't recognize. Feed the fixes back in weekly and containment climbs visibly.
The agent approves an exchange for a size-6 dress, then the warehouse discovers that size 6 has been out of stock for three weeks. The customer waits, calls back angry, and now you have two tickets instead of one. Fix: require real-time inventory lookup before the agent ever speaks the words "I can ship that out today." If stock is uncertain, the agent promises a replacement order and schedules an update, not a guaranteed ship date.
Teams get excited about edge cases — gift returns, international returns, B2B wholesale returns — and try to build them in week one. The result is a brittle flow that fails on the 80% of simple cases. Fix: launch with domestic, single-item, inside-window returns. Add exceptions one per week after that, measured against transcript data.
"Where is my refund?" isn't a support question, it's a status question. The answer is always in your payments provider: Stripe, Braintree, or Shopify Payments. Don't route it through the agent's policy logic or escalate it to a human. Wire a direct function call to the payments API, pull the refund state, and have the agent read it back. Brands that do this right deflect 90%+ of refund-status calls with zero human involvement.
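A minimal sketch of the readback step: take the refund record returned by the payments API and turn its status into one spoken sentence. The status names are modeled on the values Stripe uses for refunds (pending, requires_action, succeeded, failed, canceled); other providers use different values. The amount_cents field and the wording of each line are assumptions made for this example.

```python
STATUS_LINES = {
    "pending": "Your refund of {amount} has been issued and is still processing.",
    "succeeded": "Your refund of {amount} has gone through to your original payment method.",
    "requires_action": "Your refund of {amount} needs one more step from you before it can complete.",
}
STUCK_LINE = "Something looks stuck with this refund. Can I arrange a callback from our team?"


def refund_status_line(refund):
    """Turn a refund record into the sentence the agent reads back."""
    template = STATUS_LINES.get(refund["status"])
    if template is None:
        # Failed, canceled, or unrecognized states get the callback offer.
        return STUCK_LINE
    amount = f"${refund['amount_cents'] / 100:.2f}"
    return template.format(amount=amount)


line = refund_status_line({"status": "succeeded", "amount_cents": 4999})
```

Mapping unknown states to the callback offer means a new status value from the provider degrades to a handoff rather than a wrong answer.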
Return volume spikes 3-5x above monthly average in January for most apparel and gift-heavy brands. If your agent is configured for normal volume, your transfer rate will explode and human agents will drown. Fix: raise your concurrent-call capacity before December 26, not after, and audit your fraud thresholds — refund abuse also spikes in the same window.
Damaged items disputed by the customer, international customs disputes, B2B partial returns, and any case involving a chargeback are still human work. Don't try to automate them. Tell the caller honestly: "This one needs someone from our team to look at it — I'm connecting you now with the full context of what we've discussed." Callers respect honest handoffs far more than a bot that pretends.
SWTCH deployed an AI voice agent named Lucas to handle high-volume EV driver support calls. The team saw support costs drop by more than 50% while answer times collapsed from minutes to seconds. CEO Carter Li noted the agent improved SaaS margins while keeping service quality intact.
Anker rebuilt global customer support on AI voice agents to handle product questions, warranty claims, and returns across multiple languages. The result was human-quality conversations at enterprise scale, without the staffing cost of running round-the-clock multilingual coverage. The consumer electronics brand now runs the same agent framework across its international markets.
Medical Data Systems scaled inbound call handling to 100% AI coverage with only a 30% transfer rate to humans. The team collects approximately $280,000 per month through AI voice agents while maintaining the compliance standards their industry demands — a model ecommerce brands can adapt for returns work that touches payments and PII.
Most brands go from signup to a live agent handling basic return flows in 3-5 days. The longer work is encoding your policy rules and connecting your OMS, which typically adds another week. Budget two full weeks from signup to production if your policy is straightforward, four weeks if you have complex category exceptions or international flows.
Yes. The agent connects to Shopify's Admin API for order lookup, customer history, and refund initiation, and to your chosen shipping provider for label generation. No Shopify Plus requirement for the basic integration. If you're running a 3PL, you'll add a webhook to their receiving endpoint so returns show up in their dashboard.
Retell AI pricing starts at $0.07 per minute of call time with no platform fees, and every new account gets $10 in free credits, enough for roughly 140 minutes of agent testing. For a brand handling 1,000 return calls per month averaging 2 minutes each, that's about $140/month in agent costs. A human agent at $15-25/hour would cost roughly $500-$830 for the same 33 hours of talk time.
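The arithmetic behind these figures is simple enough to rerun with your own volumes. The sketch below uses the per-minute price and wage range quoted above and compares talk time only; it ignores idle time, training, and overhead on the human side, so treat the result as a rough lower bound on the gap rather than a full cost model.

```python
PRICE_CENTS_PER_MINUTE = 7   # $0.07 per minute, from the pricing above
FREE_CREDIT_CENTS = 1000     # $10 in free credits


def monthly_costs(calls, minutes_per_call, hourly_wages=(15, 25)):
    """Compare AI agent cost with human wages for the same talk time."""
    total_minutes = calls * minutes_per_call
    agent_cost = total_minutes * PRICE_CENTS_PER_MINUTE / 100
    human_low, human_high = (round(total_minutes * wage / 60, 2) for wage in hourly_wages)
    return {
        "total_minutes": total_minutes,
        "agent_cost": agent_cost,
        "human_cost_range": (human_low, human_high),
    }


free_test_minutes = FREE_CREDIT_CENTS // PRICE_CENTS_PER_MINUTE
costs = monthly_costs(1000, 2)
```

Working in whole cents avoids floating-point drift in the per-minute multiplication.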
The agent escalates to a human with full context. The human sees the order details, what the caller asked for, what policy check triggered, and any fraud signals flagged. Transfer takes under 2 seconds and the caller doesn't have to repeat anything. Most teams see transfer rates settle at 15-25% once the agent is tuned, with the remainder resolved end-to-end.
For routine cases inside policy, yes. For high-value, disputed, or flagged cases, you should keep humans in the loop regardless of how capable the AI gets. Configure your agent to auto-approve refunds under a dollar threshold you define (many brands start at $100), and escalate anything above it or anything with a fraud signal. This balance protects margin without creating backlog.
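As a sketch, the routing rule comes down to a single check. The $100 limit is the starting point mentioned above; treating a refund of exactly $100 as an escalation is an assumption of this example, as are the function and argument names.

```python
AUTO_APPROVE_LIMIT = 100.00  # starting threshold mentioned above; tune to your margins


def refund_route(amount, fraud_signals, disputed=False):
    """Auto-approve small, clean refunds; send everything else to a human."""
    if fraud_signals or disputed or amount >= AUTO_APPROVE_LIMIT:
        return "escalate_to_human"
    return "auto_approve"


small_clean = refund_route(42.50, [])
```

Checking fraud signals before the amount means a flagged $20 refund still reaches a human, which matches the rule that anything with a fraud signal escalates regardless of value.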
The agent can flag patterns that correlate with fraud — high return frequency, delivery disputes contradicted by carrier records, orders with wardrobing signals — and route them for human review. It cannot definitively identify fraud, and you shouldn't want it to. The goal is to catch obvious abuse and let humans make the final call on ambiguous cases.
"Where is my refund" is the single highest-volume return-related call in ecommerce. The agent looks up the refund state directly in your payments provider, tells the caller the current status and the expected completion date, and offers a callback if something looks stuck. This one flow alone typically deflects 25-35% of all returns-related call volume.
Most callers can tell within the first 10-15 seconds regardless of how you frame it. The question is whether they care. With ~600ms response latency and natural turn-taking, callers stay in the conversation because the experience is faster than waiting on hold for a human. Be upfront — "I'm an AI assistant" in the greeting actually improves trust versus trying to pass as human.
Yes. The platform supports 31+ languages with native pronunciation through ElevenLabs, and 50+ with OpenAI TTS. Set language detection on the first caller utterance and the agent switches automatically. This matters most for brands selling internationally: a single agent covers English, Spanish, French, and German customers without separate deployments.
Track five numbers weekly: containment rate (percentage of calls resolved without transfer), exchange conversion rate (percentage of refund intents turned into exchanges), average handle time, fraud-flag precision (how often flagged cases were actually fraud), and CSAT from post-call surveys. If containment is below 70% after three weeks, your knowledge base has gaps. If exchange conversion is below 20%, your recommendation logic needs work.
You now have a working returns and exchanges agent that handles eligibility checks, generates labels, pushes exchanges over refunds where possible, and escalates fraud signals and edge cases to humans with full context.
From here, extend the same agent framework to handle outbound refund-confirmation calls, post-exchange satisfaction follow-ups, or abandoned-return reminders for customers who started a return but never shipped the item. You can also deploy the same structure for order tracking, delivery dispute handling, or full AI customer support across every returns-adjacent workflow.
Start building free with $10 in usage credits at retellai.com.