
RAG in Production: What the Tutorials Don't Tell You
Retrieval-Augmented Generation has become one of the most important patterns in applied AI. The idea is straightforward: instead of relying solely on a language model's training data, you retrieve relevant documents from your own knowledge base and include them in the context.
Every tutorial makes it look easy. Chunk your documents, embed them, store them in a vector database, retrieve the top-k results, and pass them to the LLM. A working demo takes a weekend.
A working production system takes months.
I have been building RAG systems since 2021, and the gap between prototype and production is where most of the real engineering happens. This post covers the problems that tutorials skip.
The chunking problem is harder than it looks
Every RAG tutorial starts with chunking: splitting your documents into smaller pieces that can be embedded and retrieved individually.
The standard approach is fixed-size chunks with some overlap. Simple, reasonable, and often inadequate.
The problem is that meaningful information does not respect arbitrary character boundaries. A fixed-size chunk might split a table in half, separate a conclusion from its supporting argument, or combine the end of one section with the beginning of an unrelated one.
In production, I have found that chunking strategy has a larger impact on retrieval quality than the choice of embedding model. Here are the approaches that work:
Semantic chunking
Instead of splitting at fixed intervals, split at semantic boundaries: section headings, paragraph breaks, topic shifts. This requires understanding the structure of your documents, which varies by format.
For structured documents (technical docs, legal contracts, financial reports), I use format-aware parsers that respect the document hierarchy. A section with subsections becomes multiple chunks that preserve the parent context.
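As a minimal illustration of splitting at semantic boundaries rather than fixed offsets, here is a sketch that splits markdown-style text at heading lines. A real format-aware parser would handle tables, nested sections, and non-markdown formats; this only shows the core idea.

```python
import re

def semantic_chunks(text: str) -> list[str]:
    """Split text at markdown-style heading boundaries instead of
    fixed character offsets, so each chunk stays within one section."""
    # Split just before any line that starts with 1-6 '#' characters.
    parts = re.split(r"\n(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = """# Setup
Install the package.

## Configuration
Set the API key.

# Usage
Call the client."""

chunks = semantic_chunks(doc)
```

Each chunk now begins at a heading and contains only that section's text, so no chunk straddles two unrelated topics.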
Hierarchical chunking
Store chunks at multiple levels of granularity: paragraphs, sections, and full documents. At retrieval time, you can fetch fine-grained chunks for specific answers and coarse-grained chunks for broader context.
This is more complex to implement but dramatically improves quality for knowledge bases with diverse content.
Metadata-enriched chunks
Every chunk should carry metadata: source document, section title, creation date, author, document type. This metadata enables filtered retrieval (only search within this document category) and helps the LLM understand the context of the retrieved information.
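A minimal sketch of a metadata-enriched chunk and the filtered retrieval it enables (field names are illustrative; a real system would push the filter down into the vector database query):

```python
from dataclasses import dataclass

@dataclass
class EnrichedChunk:
    text: str
    source: str       # source document path or ID
    section: str      # section title within the document
    doc_type: str     # e.g. "contract", "report", "manual"
    created: str      # ISO date

def filter_chunks(chunks: list[EnrichedChunk], **criteria) -> list[EnrichedChunk]:
    """Filtered retrieval: keep only chunks whose metadata matches
    every given criterion (e.g. doc_type='contract')."""
    return [c for c in chunks
            if all(getattr(c, k) == v for k, v in criteria.items())]

corpus = [
    EnrichedChunk("Section 4.2 ...", "nda.pdf", "Term", "contract", "2024-01-10"),
    EnrichedChunk("Q3 revenue ...", "q3.pdf", "Revenue", "report", "2024-10-01"),
]
contracts = filter_chunks(corpus, doc_type="contract")
```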
Evaluation is the hardest unsolved problem
The single biggest gap in most RAG implementations is evaluation.
How do you know if your system is producing good answers? How do you measure whether a change to your chunking strategy, embedding model, or retrieval parameters improved or degraded quality?
In a traditional search system, you can measure precision and recall against a labeled dataset. In a RAG system, you need to evaluate two things:
- Retrieval quality — Did the system retrieve the right documents?
- Generation quality — Did the LLM produce a correct, grounded answer from those documents?
Both are difficult to measure at scale.
Building an evaluation dataset
The first step is to build a golden dataset: a set of questions with known correct answers and the source documents that contain those answers.
This is labor-intensive but essential. I recommend starting with 100-200 question-answer pairs that cover the key use cases of your system. These should be reviewed by domain experts, not generated by an LLM.
Retrieval metrics
For retrieval, I measure:
- Recall@k — Of the relevant documents, how many appeared in the top-k results?
- Mean Reciprocal Rank (MRR) — How high in the result list is the first relevant document?
- Context relevance — Evaluated by an LLM judge: is the retrieved context sufficient to answer the question?
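Recall@k and MRR are straightforward to compute once you have the golden dataset. A minimal sketch, where each query's relevant document IDs come from the golden set:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none appears).
    Average this across all queries to get Mean Reciprocal Rank."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4"]   # system output, best first
relevant = {"d2", "d4"}                # from the golden dataset
```

Context relevance, by contrast, requires an LLM judge or human review and is harder to automate reliably.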
Generation metrics
For generation, I measure:
- Faithfulness — Does the answer contain only information that is supported by the retrieved context? This catches hallucination.
- Answer relevance — Does the answer address the question that was asked?
- Correctness — Compared against the golden answer, is the generated answer factually correct?
I use a combination of automated metrics and LLM-as-judge evaluation, with periodic human review to calibrate the automated assessments.
Continuous evaluation
Evaluation is not a one-time activity. Every change to the system — new documents, updated embeddings, modified prompts, different retrieval parameters — can affect quality. I run evaluation suites as part of the CI/CD pipeline for RAG systems, treating quality regressions the same as test failures.
Embedding drift is real
Embedding models are not static. When you update your embedding model (to a newer version, a fine-tuned variant, or a different provider), the new embeddings are not compatible with the old ones.
This means you need to re-embed your entire corpus whenever you change embedding models. For small datasets this is trivial. For production datasets with millions of documents, this is a significant engineering challenge.
I design RAG systems with embedding versioning from the start:
- every vector in the database includes the embedding model version
- re-embedding can be done incrementally without downtime
- queries use the same embedding model version as the stored vectors
- the system can support multiple embedding versions during a migration
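A sketch of the versioning scheme, using an in-memory list in place of a real vector database (the record fields and version strings are illustrative): every stored vector carries its model version, and queries only match vectors from the same version.

```python
from dataclasses import dataclass

@dataclass
class VectorRecord:
    chunk_id: str
    embedding: list[float]
    model_version: str   # stored alongside every vector

def search_candidates(records: list[VectorRecord],
                      query_model: str) -> list[VectorRecord]:
    """Only compare the query against vectors produced by the same
    embedding model version; cross-version similarity is meaningless.
    Similarity scoring over the candidates would follow; a real system
    delegates this to the vector database's metadata filter."""
    return [r for r in records if r.model_version == query_model]

store = [
    VectorRecord("c1", [0.1, 0.2], "embed-v1"),   # not yet migrated
    VectorRecord("c2", [0.3, 0.1], "embed-v2"),
]
hits = search_candidates(store, "embed-v2")
```

During a migration, both versions coexist in the store and each query is routed to the version its embedding was produced with.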
Retrieval is more important than model choice
Teams spend weeks evaluating LLMs and minutes configuring retrieval. This is backwards.
In my experience, the quality of a RAG system is determined primarily by retrieval quality. If the right context is not in the prompt, even the best LLM will produce a wrong or hallucinated answer. If the right context is present, even a smaller model will produce a useful response.
The factors that most impact retrieval quality:
- Chunking strategy — The most impactful and least discussed factor
- Query reformulation — Rewriting the user's query to improve retrieval (e.g., HyDE, query expansion)
- Hybrid search — Combining vector similarity with keyword search (BM25) to catch both semantic and lexical matches
- Reranking — Using a cross-encoder to rerank the initial retrieval results before passing them to the LLM
- Metadata filtering — Restricting search to relevant document categories, date ranges, or sources
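Hybrid search needs a way to merge the vector and keyword result lists. One common choice is reciprocal rank fusion (RRF), which sums 1 / (k + rank) for each document across the lists; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked result lists (e.g. vector search and BM25)
    by summing 1 / (k + rank) per document. k=60 is the constant from
    the original RRF formulation; it damps the influence of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["d1", "d3", "d5"]   # semantic matches
bm25_results = ["d3", "d2", "d1"]     # lexical matches
fused = reciprocal_rank_fusion([vector_results, bm25_results])
```

Documents that appear in both lists (here d3 and d1) float to the top, which is exactly the behavior you want from hybrid search.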
I typically spend 70% of optimization time on retrieval and 30% on prompt engineering and generation.
Operational realities
Running a RAG system in production involves operational concerns that never appear in tutorials.
Latency budget
A RAG query involves: embedding the query, searching the vector database, optionally reranking, constructing the prompt, and generating the response. Each step adds latency.
In production, I set a latency budget for each component and monitor it. Degradation in any component affects the user experience.
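A minimal sketch of per-stage budget monitoring, with illustrative budget numbers (tune them to your own SLO) and a simple list standing in for a real metrics/alerting system:

```python
import time
from contextlib import contextmanager

# Illustrative per-stage budgets in milliseconds.
BUDGETS_MS = {"embed": 50, "search": 100, "rerank": 150, "generate": 2000}
overruns: list[str] = []

@contextmanager
def stage(name: str):
    """Time a pipeline stage and record it if it exceeds its budget.
    A production system would emit a metric instead of appending to a list."""
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > BUDGETS_MS[name]:
        overruns.append(name)

with stage("embed"):
    pass                 # embedding call would go here; well under budget

with stage("search"):
    time.sleep(0.12)     # simulate a slow vector search (~120 ms > 100 ms)
```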
Cost management
LLM API costs scale with context length. More retrieved chunks mean longer prompts, and longer prompts mean higher costs. There is a direct tradeoff between context quality and cost.
I use techniques like context compression (summarizing retrieved chunks before including them) and dynamic k-selection (retrieving more or fewer chunks based on the query complexity) to manage costs.
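Dynamic k-selection can be as simple as a heuristic on the query; a crude sketch using query length as a stand-in for complexity (a real system might use a classifier or retrieval-score thresholds instead):

```python
def select_k(query: str, k_min: int = 3, k_max: int = 10) -> int:
    """Retrieve fewer chunks for short, simple queries and more for
    longer ones, bounding prompt length and therefore cost."""
    words = len(query.split())
    if words <= 5:
        return k_min
    if words >= 20:
        return k_max
    # Linear interpolation between the bounds.
    return k_min + (words - 5) * (k_max - k_min) // 15
```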
Document freshness
Knowledge bases change. New documents are added, old documents are updated or deprecated. The RAG system must handle ingestion pipelines that keep the vector store synchronized with the source of truth.
I design ingestion as an event-driven pipeline: when a source document changes, it is re-chunked, re-embedded, and the old vectors are replaced. This is more complex than batch re-indexing but ensures the system always reflects the current state of the knowledge base.
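The shape of the event handler, sketched with a plain dict standing in for the vector database and paragraph splitting standing in for the full chunking pipeline (a real handler would also re-embed each chunk and make the delete-then-upsert atomic):

```python
def handle_document_changed(doc_id: str, new_text: str,
                            store: dict[str, str]) -> None:
    """On a source-document change event: remove the document's old
    chunks from the store, then re-chunk and re-insert the new content."""
    # 1. Delete all existing chunks belonging to this document.
    for chunk_id in [cid for cid in store if cid.startswith(doc_id + "/")]:
        del store[chunk_id]
    # 2. Re-chunk (here: by blank-line-separated paragraph) and upsert.
    for i, para in enumerate(new_text.split("\n\n")):
        store[f"{doc_id}/{i}"] = para

store = {"doc1/0": "old text", "doc2/0": "untouched"}
handle_document_changed("doc1", "new first\n\nnew second", store)
```

Idempotency falls out naturally: replaying the same event deletes and rewrites the same keys.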
Hallucination mitigation
Even with good retrieval, LLMs can hallucinate. In production, I implement several mitigation strategies:
- Citation enforcement — The prompt instructs the LLM to cite specific chunks, and the system validates that cited chunks were actually retrieved
- Confidence scoring — When the retrieved context does not clearly answer the question, the system indicates low confidence rather than guessing
- Answer grounding checks — A post-generation step that verifies the answer is supported by the retrieved context
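Citation enforcement is the most mechanical of the three to implement. A sketch, assuming a hypothetical `[chunk:ID]` citation format in the prompt (any unambiguous marker works):

```python
import re

def invalid_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return the IDs of any cited chunks that were NOT actually
    retrieved: a non-empty result flags a likely hallucinated citation."""
    cited = re.findall(r"\[chunk:([^\]]+)\]", answer)
    return [cid for cid in cited if cid not in retrieved_ids]

answer = "The notice period is 30 days [chunk:nda-4] per policy [chunk:pol-9]."
bad = invalid_citations(answer, {"nda-4", "nda-5"})
```

When the check fails, the system can regenerate the answer, drop the unsupported claim, or fall back to a low-confidence response.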
The infrastructure gap
Most tutorials use a single vector database and a single LLM API. Production systems need:
- vector database with high availability and horizontal scaling
- embedding service that can handle batch and real-time workloads
- ingestion pipeline with error handling and idempotency
- evaluation infrastructure for continuous quality monitoring
- monitoring and alerting for retrieval quality, latency, and cost
- versioning for embeddings, prompts, and retrieval configurations
This is real distributed systems engineering, and it requires the same discipline that any production system demands.
Closing thought
RAG is a powerful pattern, but the distance from demo to production is substantial. The tutorials teach you the happy path. Production teaches you everything else: chunking edge cases, evaluation gaps, embedding drift, latency budgets, cost management, and operational reliability.
If you are building a RAG system that needs to work in production, I can help with architecture, evaluation strategy, retrieval optimization, and the operational infrastructure that makes the difference between a prototype and a system your organization can rely on.