Chunking
In one line
Splitting documents into smaller, retrievable pieces before embedding so retrieval returns the right span instead of an entire wrong book.
What it actually means
You can’t just embed a 200-page PDF as one vector — meaning gets averaged into mush and you can’t fit the whole thing back into the prompt. Instead you split into chunks (say 200–800 tokens) with some overlap (10–20%), embed each, and store them with a pointer to the source. Strategies range from naive fixed-size, to sentence-aware, to recursive splitters that respect headings and paragraphs, to semantic chunking that groups by topic shifts. Smarter chunkers cost more but recover more answers per query.
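A structure-respecting recursive splitter can be sketched in a few lines. This is a toy version, not a library API: the separator hierarchy and max_len are illustrative, and real splitters usually measure tokens rather than characters.

```python
def recursive_split(text, max_len=800, seps=("\n\n", "\n", ". ", " ")):
    """Split at the coarsest separator first; recurse to finer ones
    only for pieces that are still too long (illustrative sketch)."""
    if len(text) <= max_len or not seps:
        return [text]
    sep, finer = seps[0], seps[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        if len(part) > max_len:
            # Flush the buffer, then break the oversized part at a finer level.
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(recursive_split(part, max_len, finer))
        elif buf and len(buf) + len(sep) + len(part) > max_len:
            chunks.append(buf)
            buf = part
        else:
            buf = buf + sep + part if buf else part
    if buf:
        chunks.append(buf)
    return chunks
```

Because it tries paragraph breaks before line breaks before sentences, a chunk boundary almost always lands at a structural seam instead of mid-sentence.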
Why it matters
Chunking quietly sets the ceiling on retrieval quality. Chunks that are too big dilute relevance; chunks that are too small fragment the answer across rows you’ll rarely retrieve together. Most “the model can’t find the answer” issues are really “the answer is split between chunk 14 and chunk 15 and we only retrieved 14”. Tuning chunk size, overlap, and respect for document structure pays for itself fast.
Example
def chunk(text, size=600, overlap=100):
    """Fixed-size chunking: windows of `size` tokens, stepping by size - overlap."""
    assert 0 <= overlap < size, "overlap must be smaller than size"
    tokens = tokenize(text)  # tokenize/detokenize come from your tokenizer of choice
    out = []
    i = 0
    while i < len(tokens):
        out.append(detokenize(tokens[i : i + size]))
        i += size - overlap  # step forward, re-covering `overlap` tokens
    return out
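The chunk-14/chunk-15 failure mode can be reproduced in miniature. Word lists stand in for token streams here, and the data is made up for illustration: with zero overlap the key phrase straddles a boundary; with a modest overlap one chunk holds it whole.

```python
def chunk_words(words, size, overlap):
    """Same fixed-size windowing as above, but over a plain word list."""
    out, i = [], 0
    while i < len(words):
        out.append(words[i : i + size])
        i += max(1, size - overlap)
    return out

def contains(chunk, phrase):
    """True if `phrase` appears contiguously inside `chunk`."""
    return any(chunk[j : j + len(phrase)] == phrase for j in range(len(chunk)))

# Synthetic document: filler, then a three-word answer near a chunk boundary.
words = [f"w{i}" for i in range(20)] + ["answer", "is", "42"] + [f"x{i}" for i in range(5)]
phrase = ["answer", "is", "42"]

split = chunk_words(words, size=22, overlap=0)   # phrase straddles chunks
whole = chunk_words(words, size=22, overlap=4)   # phrase intact in one chunk
```

With overlap=0 no chunk contains the full phrase; with overlap=4 the second chunk re-covers the boundary and holds it intact.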
You’ll hear it when
- Building any RAG pipeline.
- Investigating low retrieval recall.
- Indexing PDFs, code, or transcripts.
- Comparing fixed-size vs structural vs semantic chunkers.
- Tuning chunk overlap to balance recall against index size.
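Semantic chunking from the comparison above can be sketched with a toy stand-in for the embedding step: break wherever the similarity between adjacent sentences drops. The bag-of-words "embedding" and the 0.2 threshold are illustrative only; real pipelines use a learned embedding model.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    # Stand-in for a real embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text, threshold=0.2):
    """Start a new chunk wherever adjacent sentences stop looking alike."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], [sents[0]]
    for prev, nxt in zip(sents, sents[1:]):
        if cosine(toy_embed(prev), toy_embed(nxt)) < threshold:
            chunks.append(" ".join(cur))
            cur = []
        cur.append(nxt)
    chunks.append(" ".join(cur))
    return chunks
```

On a passage that switches topic mid-way, the boundary falls at the topic shift rather than at a fixed token count, which is exactly the trade the smarter chunkers make: more compute per document, cleaner retrieval units.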