Agents & Tools · advanced
Semantic Chunking
Semantic chunking embeds each sentence and inserts a chunk boundary wherever consecutive embeddings diverge sharply — producing chunks that respect topic boundaries rather than character counts.
Explanation
Algorithm: embed every sentence, compute cosine distance between adjacent sentences, place chunk breaks where distance exceeds a threshold (e.g. the 95th percentile of all distances). Topical shifts produce big distance jumps; tightly related sentences stay together.
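The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: `semantic_chunks`, `percentile`, and `toy_embed` are hypothetical names, and the hand-written 2-D vectors stand in for whatever sentence-embedding model you would actually call.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in the range 0..100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector],
    pct: float = 95.0,
) -> List[List[str]]:
    """Group sentences, breaking where adjacent distance exceeds the percentile."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    # distances[i] is the distance between sentence i and sentence i + 1.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # sharp jump: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Assumption: toy 2-D "embeddings" so the example runs without a model.
toy_vectors = {
    "Cats purr.": (1.0, 0.0),
    "Cats nap a lot.": (0.9, 0.1),
    "Kittens play.": (0.95, 0.05),
    "GPUs run matrix math.": (0.0, 1.0),
    "TPUs do too.": (0.1, 0.9),
}
sentences = list(toy_vectors)
toy_embed = toy_vectors.__getitem__

# Two chunks: the three cat sentences, then the two hardware sentences.
print(semantic_chunks(sentences, toy_embed))
```

Because the threshold is a percentile of the document's own distances, it is relative rather than absolute: lowering `pct` yields more, smaller chunks, and raising it yields fewer, larger ones.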
Trade-off: more compute up front (one embedding per sentence vs. zero for character splitting), but often better retrieval on heterogeneous corpora such as long docs, mixed-topic FAQs, and transcripts.
Semantic chunking is often combined with a chunk-size cap, so that a long section with no sharp distance jumps does not end up as a single 5000-token chunk.
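One way to apply a size cap is to re-split any oversized chunk at sentence boundaries. The sketch below is an assumption about how this could be done, not a prescribed method: `cap_chunk` and `apply_cap` are hypothetical helpers, and whitespace word count stands in for a real tokenizer.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def cap_chunk(
    chunk: List[str],
    max_tokens: int,
    count: Callable[[str], int] = count_tokens,
) -> List[List[str]]:
    """Greedily pack sentences into pieces of at most max_tokens each.

    A single sentence longer than max_tokens is kept whole in its own piece.
    """
    pieces: List[List[str]] = []
    current: List[str] = []
    used = 0
    for sentence in chunk:
        n = count(sentence)
        if current and used + n > max_tokens:
            pieces.append(current)
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        pieces.append(current)
    return pieces


def apply_cap(chunks: List[List[str]], max_tokens: int) -> List[List[str]]:
    """Apply cap_chunk to every chunk and flatten the result."""
    return [piece for chunk in chunks for piece in cap_chunk(chunk, max_tokens)]


oversized = ["a b c", "d e", "f g h i", "j"]  # 10 "tokens" in one chunk
print(cap_chunk(oversized, max_tokens=5))
# [['a b c', 'd e'], ['f g h i', 'j']]
```

Splitting at sentence boundaries keeps each piece readable; chunks already under the cap pass through unchanged.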
Examples
- A meeting transcript: a semantic chunker breaks at topic changes rather than at arbitrary token windows.
- A wiki where some pages are 200 words and some are 50,000: with a percentile threshold computed per page, semantic chunking can handle both without retuning.
When to use semantic chunking
When documents have variable topic density and recursive chunking is producing low-quality retrievals.
Frequently asked
How is Semantic Chunking related to Chunking?
Semantic chunking is one strategy for chunking; both are Agents & Tools concepts. Chunking is the process of splitting source documents into smaller passages before embedding them for retrieval. Chunk size and boundaries determine how relevant the retrieved passages are.
Is Semantic Chunking considered advanced?
Semantic Chunking is generally considered advanced-level material in the AI and LLM space.