Module 3: RAG Done Properly · Lesson 2 of 5 · 25 min
Chunking: The Highest-Leverage Decision
Chunks are the unit of everything downstream — embedding, retrieval, citation, grounding. Cut them badly and no reranker, no fusion trick, no bigger model can repair the damage.
A chunk is what gets embedded, what gets retrieved, and what the model reads as evidence. That triple duty creates tension: small chunks embed crisply (one idea per vector) but may lack the surrounding context needed to actually answer; large chunks carry context but their embeddings blur into topic soup and they burn prompt budget. And if a fact straddles a chunk boundary — question in one chunk, answer in the next — neither chunk retrieves well. This is why chunking is the highest-leverage decision in the pipeline: errors here are unrecoverable downstream.
Fixed-size vs. structural
| Strategy | How | Strengths | Weaknesses |
|---|---|---|---|
| Fixed-size | Every N tokens/words, hard cut | Trivial, uniform, predictable budget | Cuts mid-sentence/mid-thought; ignores document structure |
| Fixed-size + overlap | Windows share 10–20% of content | Facts near boundaries appear intact in at least one chunk | Index bloat; near-duplicate retrievals |
| Structural | Split on headings/paragraphs, then size-cap | Chunks align with authors' units of meaning; heading path makes great citation metadata | Needs format-aware parsing; sections vary wildly in size |
| Semantic | Split where embedding similarity between consecutive sentences drops | Adapts to unstructured prose | Slower, fussier, rarely beats structural on well-formatted docs |
def fixed_size_chunks(text: str, size: int = 350, overlap: int = 50) -> list[str]:
"""Split into word windows of ~size words, consecutive windows sharing
'overlap' words so boundary-straddling facts survive in one piece."""
words = text.split()
chunks, start = [], 0
while start < len(words):
end = min(start + size, len(words))
chunks.append(" ".join(words[start:end]))
if end == len(words):
break
start = end - overlap # step back to create the overlap
return chunksWord-based sizing is a fine proxy (a word is roughly 1.3 tokens in English); swap in a real tokenizer when you need exact budgets. The overlap is the load-bearing part: without it, any fact within a few sentences of a cut is fragmented across two chunks and retrieves poorly from both.
import re
def structural_chunks(md: str, doc_id: str,
max_words: int = 350, overlap: int = 50) -> list[dict]:
"""Split on headings first; size-cap oversized sections with the
fixed-size chunker. Keep the heading path for citations, and prepend
it to the text we embed so section context reaches the vector."""
parts = re.split(r"(?m)^(#{1,4}\s.*)$", md)
chunks, heading = [], "(intro)"
for part in parts:
part = part.strip()
if not part:
continue
if re.match(r"^#{1,4}\s", part):
heading = part.lstrip("#").strip()
continue
for i, piece in enumerate(fixed_size_chunks(part, max_words, overlap)):
chunks.append({
"doc_id": doc_id,
"heading": heading,
"position": i,
"text": piece,
"embed_text": f"{heading}\n{piece}", # heading rides into the embedding
})
return chunksTwo tricks worth stealing: (1) the metadata (
doc_id, heading, position) is what makes citations possible later — store it now or regret it; (2) embedding heading + text instead of bare text injects section context into the vector, so a chunk that just says "set this flag to true" still retrieves for queries about the feature its heading names.- Chunk size: start around 250–500 words for technical docs. Smaller for FAQ-like corpora (one Q&A per chunk), larger for narrative prose.
- Overlap: 10–20% of chunk size. More than that mostly buys you duplicate retrievals.
- Never mix units across the corpus without recording which chunker produced each chunk — you can't A/B what you can't attribute.
- Chunking is an eval-set question, not a taste question: re-run retrieval metrics (Lesson 5) for each candidate strategy and let precision@5 decide.
⚠ Bad chunking cannot be fixed downstream
A reranker can only reorder the chunks that exist. Fusion can only merge rankings of the chunks that exist. If the answer was sliced in half at index time, every downstream stage is optimizing over damaged goods. When RAG quality disappoints, look at the actual chunks first — read twenty of them raw before touching any other dial.
Key takeaways
- ▸Chunks serve three masters at once: embedding quality, retrieval unit, and grounding evidence.
- ▸Small chunks embed crisply but lose context; large chunks blur; overlap heals boundary cuts.
- ▸Structural chunking (headings/paragraphs + size cap) beats naive fixed-size on formatted docs.
- ▸Store
doc_id, heading path, and position with every chunk — citations depend on it. - ▸Chunking mistakes are unrecoverable downstream; choose by measured retrieval metrics, not vibes.