RAG vs fine-tuning: which one do you actually need for customer support?
A practical guide to picking between retrieval-augmented generation and fine-tuning when building an AI for customer support — and why the answer is usually RAG.
If you've spent any time researching how to build a customer support AI, you've run into the same two options: fine-tuning the model on your data, or using RAG (retrieval-augmented generation) to pull from your data at query time. The internet has confidently told you that each is the obvious right answer.
Here's the boring truth: for almost every customer support use case, RAG is the right answer. Fine-tuning is a more specialized tool. This post explains why, with the trade-offs laid out so you can decide for your own case.
What each one actually is
Fine-tuning takes a pre-trained model (like GPT-5 or Llama) and continues training it on your specific data. The model's weights change. After fine-tuning, the model has internalized your data in the same way it internalized everything else from its training.
RAG leaves the model alone. When a question comes in, you search your data, find the most relevant passages, and stuff them into the prompt alongside the question. The model uses that context to answer. The model itself never sees your data outside of any individual request.
Where the two diverge in practice
Updating information
Your refund policy changes. You launch a new product. A pricing tier gets renamed.
- RAG: Update the source document. Re-index. Done in minutes. The next query uses the new info.
- Fine-tuning: Run a new training job. Validate the new weights. Deploy. Hours to days, every time something changes.
For customer support, information changes constantly. RAG handles this without breaking a sweat.
Citing sources
When the agent says "Our refund window is 30 days," your customer (or your legal team) may want to know where that number came from.
- RAG: Trivial — the source passage is right there. The agent can quote it and link to the original page.
- Fine-tuning: The model "knows" the policy the way it knows the capital of France. There's no source to point to. You're trusting that what came out matches the training data, with no way to verify per-response.
For anything where correctness matters (policies, prices, availability), this is a big deal.
Hallucinations
- RAG: The model is grounded in retrieved text. With a good system prompt ("if the context doesn't contain the answer, say so"), hallucinations drop dramatically.
- Fine-tuning: The model now confidently knows your stuff — and confidently makes up adjacent things that it doesn't. Fine-tuning often makes hallucination harder to detect, not easier, because the made-up answers sound more domain-appropriate.
Cost
- RAG: Pay for embeddings (cheap, one-time per document) and a slightly larger prompt at query time. No training infrastructure.
- Fine-tuning: Pay for the training run (significant), then pay a premium per token for serving the fine-tuned model. And pay again every time you update.
For a typical support workload, RAG is an order of magnitude cheaper to run.
Latency
- RAG: Adds a retrieval step (~50–200ms with a decent vector store) plus a slightly longer prompt. Negligible for chat.
- Fine-tuning: Same latency as the base model. Marginally faster if your fine-tune lets you use a shorter system prompt.
Wash, basically.
When fine-tuning actually wins
Fine-tuning isn't pointless. It's the right tool when:
- Style is the product. You need the model to write in a very specific voice — a regulated industry's house style, a strict format, a company-specific tone — that's hard to enforce with prompting alone.
- You have lots of high-quality input/output examples. Thousands of real Q&A pairs from your support history, cleaned and validated.
- The knowledge is stable. You're teaching the model a way of reasoning or a domain vocabulary, not facts that will be stale next quarter.
- Latency or cost at scale matters more than flexibility. A fine-tuned smaller model can sometimes beat a bigger model + RAG on inference cost, if you have the volume to justify it.
For most customer support, none of these apply. Voice can be handled with prompting. Knowledge is constantly changing. Volume is rarely high enough that the cost math flips.
The hybrid case
You can do both. Fine-tune to lock in tone or format, then use RAG for the actual content. This is occasionally the right answer for large, mature teams.
For 95% of customer support builds, this is over-engineering. Start with RAG, get something live, iterate on the system prompt for voice. Most teams never need to add fine-tuning.
How this maps to Fabrile
Fabrile is RAG-first by design. Documents go into a vector store, retrieval happens on every query, and the model (GPT-5 or GPT-5-mini, depending on agent config) is prompted with the relevant context. You don't have to think about it — the embedding pipeline, the chunking strategy, and the retrieval logic are handled.
If you ever do need fine-tuning, the workflow is to do it externally and then point your agent at a custom model. But for the support use case this post is about: don't bother. Add your documents, write a tight system prompt, and ship.
TL;DR
- Default to RAG. It's faster to update, cheaper, more accurate, and citable.
- Reach for fine-tuning when you need a specific style or format that prompting can't pin down, and you have enough examples to teach it.
- Don't over-think it. Most teams that fine-tune for support end up rebuilding on RAG within a year.
Start simple. Ship. Iterate.