RAG Pipeline Design: Chunking, Retrieval, Reranking, and Generation
RAG is not a single call to a vector database. Production RAG pipelines have six stages, each with tradeoffs that determine whether the system retrieves the right context.
RAG in one sentence
Retrieval-augmented generation answers user questions by finding relevant documents from a corpus and including them in the model's context window, instead of relying on knowledge baked into model weights.
The six stages
1. Document ingestion and preprocessing
Before chunking, you need clean text. PDFs have headers, footers, page numbers, and multi-column layouts that confuse chunkers. HTML has navigation, ads, and boilerplate. Preprocessing steps:
- Extract text while preserving structural signals (headings, lists, tables).
- Remove boilerplate (navigation, copyright notices, duplicate headers).
- Normalize whitespace and encoding.
- Tag documents with metadata (source URL, date, document type, author) — you will filter on this at retrieval time.
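A minimal sketch of the last two steps, normalization and metadata tagging; the field names (source_url, doc_type, date) are illustrative assumptions, not a fixed schema:

```python
import re
import unicodedata

def preprocess(raw_text: str, source_url: str, doc_type: str, date: str) -> dict:
    """Normalize extracted text and attach the metadata used for filtering later."""
    text = unicodedata.normalize("NFKC", raw_text)   # normalize encoding
    text = re.sub(r"[ \t]+", " ", text)              # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text).strip()   # collapse runs of blank lines
    return {
        "text": text,
        "metadata": {"source_url": source_url, "doc_type": doc_type, "date": date},
    }

doc = preprocess("Widget   API\n\n\n\nSet the timeout in step 3.",
                 "https://example.com/docs/widget", "product_docs", "2025-01-15")
```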
2. Chunking strategy
The chunk is the unit of retrieval. Too large and you send irrelevant content to the model; too small and you lose context that makes the chunk meaningful.
Fixed-size chunking (split every N tokens) is simple and fast. Start here. A common starting point is 512 tokens with a 64-token overlap. The overlap reduces the chance that an answer gets split across a chunk boundary.
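A minimal sketch of fixed-size chunking with overlap; it splits on whitespace for brevity, where a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows of roughly `size` tokens."""
    tokens = text.split()          # stand-in for a real tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks
```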
Semantic chunking splits on sentence or paragraph boundaries. Better for documents with clear structure. Use a sentence boundary detector, not just newlines.
Hierarchical chunking indexes chunks at multiple granularities (paragraph + section + document). Retrieve at the paragraph level but include the section as additional context. Useful for long documents where the answer is a paragraph but requires section-level context to interpret.
Document-aware chunking respects document structure: each FAQ entry is a chunk, each API endpoint description is a chunk. If your documents have known structure, exploit it.
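A sketch of one document-aware variant, assuming markdown-style "## " section headings in the corpus: each paragraph becomes a chunk and carries its section heading, which also helps with the "chunk without context" failure mode discussed below.

```python
def chunk_by_section(markdown: str) -> list[dict]:
    """One chunk per paragraph, prefixed with its section heading."""
    chunks, heading = [], ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if block.startswith("## "):                      # heading convention is an assumption
            heading = block.removeprefix("## ").strip()
        elif block:
            chunks.append({"text": f"{heading}\n{block}" if heading else block,
                           "metadata": {"section": heading}})
    return chunks
```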
3. Embedding and indexing
Each chunk is converted to a dense vector using an embedding model. The embedding model determines the quality of similarity matching — a general-purpose embedding model may not capture domain-specific terminology well.
chunk → embedding_model → vector(dim=1536)
Common embedding models: text-embedding-3-small / text-embedding-3-large (OpenAI), embed-english-v3 (Cohere), bge-large-en (open source, strong for retrieval tasks).
Store vectors in a vector database: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension). Also store the original chunk text and metadata alongside the vector.
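A sketch of embedding and indexing using the OpenAI client, with a plain in-memory list standing in for the vector database; the chunk fields follow the earlier preprocessing sketch:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])   # shape (n, 1536)

chunks = [{"text": "Set the request timeout in the client config.",
           "metadata": {"doc_type": "product_docs", "date": "2025-01-15"}}]
index = list(zip(embed([c["text"] for c in chunks]), chunks))  # (vector, chunk) pairs
```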
4. Retrieval
At query time, embed the user's question using the same embedding model and find the top-K nearest vectors (cosine or dot product similarity).
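Continuing the sketch above (same `embed` function and in-memory `index`), dense retrieval is a cosine-similarity top-K over the stored vectors:

```python
import numpy as np

def retrieve(query: str, index, k: int = 5) -> list[dict]:
    """Return the k chunks whose vectors are most similar to the query."""
    q = embed([query])[0]
    q = q / np.linalg.norm(q)
    scored = [(float((vec / np.linalg.norm(vec)) @ q), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```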
Hybrid retrieval combines dense (vector) and sparse (BM25/TF-IDF keyword) search. Dense retrieval is strong for semantic similarity; sparse retrieval is strong for exact term matching. Combining them with Reciprocal Rank Fusion or a learned combiner outperforms either alone on most benchmarks.
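A sketch of Reciprocal Rank Fusion over two ranked lists of document ids; k=60 is the constant from the original RRF paper and a common default:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse rankings by summing 1 / (k + rank) per document id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d3", "d1", "d7"],    # dense retrieval order
             ["d1", "d9", "d3"]])   # BM25 order
# d1 and d3 come out on top because both retrievers rank them highly
```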
Metadata filtering narrows retrieval before the vector search: only search chunks from documents dated after 2024, only search documents tagged as product documentation, only search chunks from the user's tenant. This reduces noise and improves precision.
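In a prototype this can be as simple as filtering the candidate list before the similarity search; a sketch reusing the in-memory `index` from the indexing sketch (a vector database would apply the same filter server-side):

```python
def filter_index(index, doc_type: str, min_date: str):
    """Keep only (vector, chunk) pairs whose metadata passes the filter."""
    return [(vec, chunk) for vec, chunk in index
            if chunk["metadata"].get("doc_type") == doc_type
            and chunk["metadata"].get("date", "") >= min_date]

candidates = filter_index(index, doc_type="product_docs", min_date="2024-01-01")
# then run retrieve(query, candidates, k=5) over the filtered set
```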
HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the question, embed it, and use that embedding for retrieval. Counterintuitive but effective — the hypothetical answer is often more similar to relevant documents than the raw question.
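A sketch of HyDE on top of the retrieval function above; the chat model and prompt wording are placeholder choices:

```python
def hyde_retrieve(question: str, index, k: int = 5) -> list[dict]:
    """Generate a hypothetical answer, then retrieve with its embedding."""
    hypo = client.chat.completions.create(           # `client` as in the indexing sketch
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content
    return retrieve(hypo, index, k)
```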
5. Reranking
Top-K retrieval returns the most similar chunks, not necessarily the most useful ones. A reranker reorders the results using a more expensive cross-encoder model that reads both the query and the chunk together, rather than comparing embeddings independently.
query + chunk → cross_encoder → relevance_score
Common rerankers: Cohere Rerank, bge-reranker-large, Jina Reranker. Retrieve top-20 with vector search, rerank to top-5, send top-5 to the LLM.
The cost is worth it: reranking consistently improves answer quality by reducing irrelevant context in the model's window.
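A sketch of that retrieve-then-rerank step with an open-source cross-encoder via sentence-transformers; the model choice and the 20-to-5 cut are just the example numbers above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    """Score (query, chunk) pairs jointly and keep the highest-scoring chunks."""
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]

# top20 = retrieve(question, index, k=20); context = rerank(question, top20, top_n=5)
```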
6. Generation
Assemble the context from the top-K reranked chunks and construct the prompt:
System: You are a helpful assistant. Answer based only on the provided context.
If the answer is not in the context, say so.
Context:
[chunk 1]
[chunk 2]
[chunk 3]
User: {question}
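A sketch of that assembly step, again using the OpenAI chat client as a stand-in for whichever generation API you use:

```python
def answer(question: str, chunks: list[dict]) -> str:
    """Build the grounded prompt from reranked chunks and ask the model."""
    context = "\n\n".join(f"[chunk {i + 1}]\n{c['text']}" for i, c in enumerate(chunks))
    system = ("You are a helpful assistant. Answer based only on the provided context. "
              "If the answer is not in the context, say so.\n\nContext:\n" + context)
    resp = client.chat.completions.create(           # `client` as in the indexing sketch
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```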
The system prompt instruction to stay within context (grounding) reduces hallucination. Monitor faithfulness (see LLM Evaluation Frameworks) to measure how often the model drifts outside the provided context.
Production failure modes
Retrieval misses the answer. The answer is in the corpus but is never retrieved. Causes: chunking split the relevant passage, the embedding model is a poor fit for the domain, or the query and the documents use different terminology (use HyDE or query expansion). Diagnose by measuring context recall.
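One way to measure it is retrieval recall@K over a small labeled set: for each question, check whether a chunk known to contain the answer appears in the top K. A sketch, where the gold-label format is an assumption:

```python
def recall_at_k(eval_set: list[dict], retrieve_ids, k: int = 5) -> float:
    """eval_set items look like {"question": ..., "gold_chunk_id": ...};
    retrieve_ids(question, k) returns the ids of the top-k retrieved chunks."""
    hits = sum(1 for ex in eval_set
               if ex["gold_chunk_id"] in retrieve_ids(ex["question"], k))
    return hits / len(eval_set)
```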
Irrelevant context dilutes the answer. Retrieved chunks are topically related but not specifically useful. The model produces a vague answer because the signal is buried in noise. Fix with reranking and tighter metadata filtering.
Chunk without context. A chunk says "the value is set in step 3" but does not say what step 3 is, because that is in the previous chunk. Fix with overlap, hierarchical chunking, or by including the section heading in each chunk.
Stale index. Documents updated but embeddings not reindexed. Implement incremental indexing triggered by document changes, not just nightly full rebuilds.
Start simple, measure, then optimize
Optimization effort tends to pay off in roughly this order:
- Get clean text into the corpus.
- Implement chunking with overlap.
- Add metadata filtering to narrow retrieval scope.
- Add a reranker.
- Try hybrid retrieval.
- Experiment with chunk size and embedding model.
Each step should be validated with an eval dataset. Intuition about what will improve quality is often wrong. Measure first, optimize second.