Building Production RAG Pipelines That Actually Work
Most RAG demos look impressive. Most RAG production systems disappoint. The gap between them comes down to retrieval quality, not model capability.
- AI Engineering
- LangChain
- RAG
- Azure OpenAI
Retrieval-Augmented Generation has a demo problem. The demos look transformative — ask a question, get a perfect answer sourced from your internal documents. Then you build it in production, and the answers are wrong, hallucinated, or missing context that was clearly in the source material.
The failure mode isn't the language model. The failure mode is the retrieval layer.
Why Standard RAG Fails in Production
The canonical RAG pattern — chunk documents, embed chunks, retrieve top-k by cosine similarity, pass to LLM — works well in demos because demos are curated. The questions match the documents. The documents are clean and well-structured. The answers are verifiable.
Production environments have none of these properties. Documents are inconsistent. Questions are ambiguous. The embedding space doesn't cluster the way you expect when you're working with enterprise content — meeting transcripts, compliance documents, engineering specs — rather than clean Wikipedia articles.
The specific failure modes I see most in production:
Semantic mismatch: A user asks "what's our SLA for P1 incidents?" and the relevant document says "Priority One tickets must be resolved within four hours." The embedding similarity between those two sentences is lower than you'd expect because of vocabulary drift between how questions are posed and how policy documents are written.
Chunk boundary fragmentation: A critical piece of context spans two chunks. Neither chunk alone answers the question. The retriever returns one of them, and the LLM correctly reports uncertainty or — worse — hallucinates a plausible answer.
Stale retrieval: The index was built three weeks ago. The document it retrieves has been superseded. The LLM answers confidently from outdated information.
The Architecture That Works
Having built RAG systems across several enterprise deployments, I've found that the pattern that reliably performs in production looks meaningfully different from the canonical demo approach.
Hybrid Search
Don't rely on dense vector search alone. Combine it with BM25 sparse retrieval using a reciprocal rank fusion strategy. Dense retrieval is strong on semantic similarity; sparse retrieval is strong on exact term matching. For enterprise content where precise terminology matters (product codes, policy identifiers, regulatory references), hybrid search consistently outperforms either approach alone.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import AzureSearch

# Dense retriever backed by Azure AI Search (connection args elided)
dense_retriever = AzureSearch(...).as_retriever(search_kwargs={"k": 10})

# Sparse BM25 retriever built over the same document set
sparse_retriever = BM25Retriever.from_documents(docs, k=10)

# Fuse the two ranked lists; weights lean slightly toward dense retrieval
ensemble = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.6, 0.4],
)
Contextual Chunking
Chunk at semantic boundaries, not fixed token counts. Use a sentence splitter that respects paragraph breaks. Where possible, prefix each chunk with a summary of its parent document — this dramatically improves retrieval quality for documents where context is distributed across sections.
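A minimal sketch of the idea in plain Python — in practice the summary would come from an LLM call or the document's abstract; here it is passed in as a string:

```python
def chunk_with_context(text, doc_summary, max_chars=800):
    """Split text at paragraph boundaries and prefix each chunk with
    a one-line summary of the parent document."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Prefix the parent-document summary so each chunk embeds with context
    return [f"[Document: {doc_summary}]\n\n{c}" for c in chunks]
```

Because the splitter only breaks at paragraph boundaries, no sentence is ever cut mid-thought, and the summary prefix gives the embedding model document-level context even for chunks deep inside a long file.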
For tables and structured data embedded in documents, extract them separately and generate natural-language descriptions alongside the raw content. Embedding natural language descriptions of tabular data significantly outperforms embedding the raw table text.
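One sketch of that pattern, assuming the table has already been parsed into a header row plus data rows (the example SLA values are illustrative, not from any real policy):

```python
def table_to_sentences(headers, rows, subject_col=0):
    """Render each table row as a natural-language sentence so it
    embeds well alongside prose chunks."""
    sentences = []
    for row in rows:
        subject = row[subject_col]
        facts = [
            f"{headers[i].lower()} is {value}"
            for i, value in enumerate(row)
            if i != subject_col
        ]
        sentences.append(f"For {subject}, {'; '.join(facts)}.")
    return sentences

headers = ["Priority", "Response time", "Resolution time"]
rows = [["P1", "15 minutes", "4 hours"], ["P2", "1 hour", "8 hours"]]
sentences = table_to_sentences(headers, rows)
# e.g. "For P1, response time is 15 minutes; resolution time is 4 hours."
```

Index both the generated sentences and the raw table text, linked by metadata, so retrieval hits the prose form while the LLM can still see the original structure.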
Re-ranking
Retrieve more candidates than you'll pass to the LLM (top-20 or top-30), then apply a cross-encoder re-ranker to reorder by relevance before truncating to the final context window. Cross-encoders are computationally more expensive than embedding similarity, but they're dramatically more accurate at relevance ranking.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Cross-encoder scores each (query, chunk) pair jointly, then keeps the top 5
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"),
    top_n=5,
)

# Wrap the hybrid retriever so candidates are re-ranked before reaching the LLM
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=ensemble
)
Freshness-Aware Indexing
Integrate your RAG index with your document management lifecycle. Implement incremental indexing triggered by document change events (via CDC or webhook) rather than scheduled full re-indexing. Track document versions in metadata and include version context in prompts where recency matters.
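The bookkeeping behind this is simple to sketch with an in-memory store — in production the store would be your vector database and the events would arrive via CDC or a webhook; embed here is a stand-in for your real embedding call:

```python
import time

index = {}  # doc_id -> {"version": int, "indexed_at": float, "chunks": list}

def embed(chunks):
    """Placeholder for the real embedding call."""
    return chunks

def on_document_changed(doc_id, version, chunks):
    """Handle a change event: re-index only if this version is newer."""
    current = index.get(doc_id)
    if current and current["version"] >= version:
        return False  # stale or duplicate event; nothing to do
    index[doc_id] = {
        "version": version,
        "indexed_at": time.time(),
        "chunks": embed(chunks),
    }
    return True

def on_document_deleted(doc_id):
    """Remove superseded documents so they can never be retrieved."""
    index.pop(doc_id, None)
```

The version guard makes the handler idempotent, so out-of-order or replayed change events can never roll the index back to stale content.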
Evaluating What You've Built
The hardest part of production RAG is evaluation. You need to know whether your changes improved things before you can confidently iterate.
Build an evaluation harness from day one. Curate a ground-truth Q&A dataset from your actual users' questions. Measure retrieval recall (are the right chunks being retrieved?) separately from answer quality (is the LLM answering correctly from those chunks?). These are different problems with different solutions.
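Retrieval recall on its own is straightforward to compute once each question in your ground-truth set is annotated with the chunk IDs that should be retrieved — a minimal sketch, where retrieve stands in for your actual pipeline:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of ground-truth chunks that appear in the top-k results."""
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

def evaluate_retrieval(dataset, retrieve, k=5):
    """dataset: list of {"question": str, "relevant_ids": [chunk_id, ...]}.
    retrieve(question) returns a ranked list of chunk IDs."""
    scores = [
        recall_at_k(retrieve(ex["question"]), ex["relevant_ids"], k)
        for ex in dataset
    ]
    return sum(scores) / len(scores)
```

Running this metric before and after each retrieval change isolates whether a regression came from retrieval or from the generation step.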
RAGAS is a useful framework for automated RAG evaluation. Combine it with human review for high-stakes use cases.
The gap between a RAG demo and a production RAG system is real — but it's entirely bridgeable with the right retrieval architecture. The model is the easy part.