Module 3 · Phase 2: Knowledge & state · Weeks 6–8

RAG Done Properly

RAG shows up in nearly half of agent take-home assignments, but building a pipeline is table stakes. The senior differentiator is *measuring* it: every stage (ingest → chunk → embed → index → retrieve → rerank → generate with citations) is backed by an evaluation harness that runs from day one.

After this module you can

▸ Explain what embeddings are, why cosine similarity finds meaning, and where dense retrieval structurally fails
▸ Choose and defend a chunking strategy (fixed-size vs. structural, size, overlap) with your own numbers
▸ Stand up Qdrant locally and implement hybrid search: BM25 + dense vectors fused with RRF
▸ Add a cross-encoder reranking stage and explain the bi-encoder/cross-encoder trade-off
▸ Apply query rewriting, decomposition, and HyDE when the user's question is a bad search query
▸ Build a labeled eval set and report precision@k, recall@k, MRR, faithfulness, and answer relevance

Lessons

Why RAG, and the Anatomy of the Pipeline

Models know nothing about your private data and their world knowledge is frozen at training time. Retrieval-augmented generation fixes both — but only as well as its weakest stage. Meet the pipeline and the geometry of embeddings.
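
As a rough illustration of the geometry, cosine similarity compares the direction of two embedding vectors and ignores their length. The sketch below uses plain Python and invented 3-dimensional toy vectors; real embeddings come from a model and have hundreds or thousands of dimensions, and the function name `cosine_similarity` is only an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has no direction.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, twice the length
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))  # approximately 1.0
print(cosine_similarity(query, doc_far))    # 0.0
```

Because only direction matters, `doc_close` scores as a near-perfect match even though its magnitude differs from the query's.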

Chunking: The Highest-Leverage Decision

Chunks are the unit of everything downstream — embedding, retrieval, citation, grounding. Cut them badly and no reranker, no fusion trick, no bigger model can repair the damage.
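
To make the size and overlap knobs concrete, here is a minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words rather than model tokens, which is a simplifying assumption, and the function name and default values are hypothetical. Structural chunking (splitting on headings or sections) would replace the word window with document boundaries.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk repeats the last word of its predecessor, so a sentence that straddles a boundary is still fully present in at least one chunk when the overlap is large enough.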

Vector DBs & Hybrid Search (BM25 + Dense + RRF)

Brute-force search over in-memory NumPy arrays stops scaling quickly; a vector DB gives you ANN search, filters, and persistence. But dense retrieval alone has well-known blind spots, so production systems fuse it with BM25 keyword search using reciprocal rank fusion (RRF).
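
Reciprocal rank fusion needs only the rank positions from each retriever, not their raw scores, which is why it combines BM25 and dense results without any score normalization. The sketch below is a minimal version under stated assumptions: the document IDs are made up, `rrf_fuse` is a hypothetical helper name, and `k=60` is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns doc IDs sorted by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]   # hypothetical keyword-search results
dense_ranking = ["b", "d", "a"]  # hypothetical vector-search results

print(rrf_fuse([bm25_ranking, dense_ranking]))
# ['b', 'a', 'd', 'c']
```

Documents found by both retrievers ("a" and "b") rise above documents found by only one, which is the behavior hybrid search relies on.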

Reranking & Query Rewriting (incl. HyDE)

First-stage retrieval is built for recall: get the answer somewhere in the top 50, cheap. A cross-encoder reranker then buys you precision in the top 5 — the biggest quality win per engineering hour. And when the user's question is a lousy search query, rewrite it before retrieving.

Grounded Generation & Evaluation from Day One

The senior differentiator: citations the reader can check, a labeled eval set, retrieval metrics (precision@k, recall@k, MRR), and generation metrics (faithfulness, answer relevance). If you can't produce a metrics table, you don't know if your RAG works.
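
The retrieval metrics named above are short enough to write by hand. The following is a minimal sketch, assuming each labeled example is a pair of (ranked list of retrieved doc IDs, set of relevant doc IDs); the function names and sample IDs are illustrative. Faithfulness and answer relevance are not shown because they require judging generated text.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled with relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved[:k]) if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(examples):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not examples:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in examples) / len(examples)


# Hypothetical labeled examples for illustration.
examples = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # first hit at rank 2
    (["d9"], {"d1"}),                          # no hit
]
retrieved, relevant = examples[0]
print(precision_at_k(retrieved, relevant, k=2))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 1.0
print(mean_reciprocal_rank(examples))            # 0.25
```

Running these functions once per retrieval configuration over the same labeled set gives the rows of a comparison table.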

12 questions · pass ≥ 80%

Lab: RAG Pipeline with Eval Harness (portfolio)

Build a full RAG pipeline over a real corpus of 100+ documents — structural chunking, hybrid retrieval (BM25 + dense with RRF), cross-encoder reranking, grounded generation with citations — plus a labeled eval set and a committed metrics report comparing every retrieval configuration. This is a portfolio piece: a hiring manager should grasp the architecture and your results in three minutes.

Best external resources

Curated reading, docs, and tools that pair with this module.

RAGAS documentation

The eval framework for Lab 03 — faithfulness, relevance, context metrics.

Qdrant documentation

Local vector DB for the lab; the hybrid-search tutorial is directly relevant.

Eugene Yan — writing

Best practitioner essays on RAG patterns and evaluation design.

SBERT — Cross-encoders

Reranking implementation you'll use in the lab.

Anthropic — Contextual Retrieval

Prepend LLM-generated context to each chunk before indexing: 49% fewer failed retrievals, 67% fewer when combined with reranking. Measured, replicable.

DeepLearning.AI short courses

Free 1–2 hr targeted fills: advanced retrieval, reranking, RAG evaluation.