
Chunking

Intermediate [3/5]
Text splitting, Document segmentation, Content partitioning

Definition

Chunking is the process of breaking large documents into smaller, manageable pieces for processing by LLMs. It's a critical step in RAG systems because it determines what unit of text gets embedded, retrieved, and provided as context.

Good chunking preserves meaning while keeping pieces small enough to be relevant and fit in context windows.

Key Concepts

  • Chunk size: Number of tokens or characters per chunk (typically 200-1000 tokens)
  • Overlap: Shared content between adjacent chunks to preserve context
  • Semantic boundaries: Splitting at natural breaks (paragraphs, sections)
  • Metadata: Information attached to chunks (source, page, section)
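
The concepts above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Chunk` class, the `chunk_by_paragraph` function, and the file name `guide.txt` are hypothetical names chosen for the example, and blank lines are assumed to be a good enough semantic boundary.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata needed to trace it back to its source."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_paragraph(document: str, source: str) -> list[Chunk]:
    """Split on blank lines (a simple semantic boundary) and tag each chunk."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [
        Chunk(text=p, metadata={"source": source, "paragraph": i})
        for i, p in enumerate(paragraphs)
    ]


# Hypothetical three-paragraph document.
doc = "Chunking splits documents.\n\nOverlap preserves context.\n\nMetadata records origin."
chunks = chunk_by_paragraph(doc, source="guide.txt")
for c in chunks:
    print(c.metadata, c.text)
```

Keeping metadata next to the text means a retrieved chunk can always be cited or traced back to its source document.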

Examples

Chunking Strategies
Common Approaches
1. FIXED SIZE
   Split every N characters/tokens
   Pro: Simple, predictable
   Con: May split mid-sentence
2. SENTENCE-BASED
   Split at sentence boundaries
   Pro: Complete thoughts
   Con: Variable chunk sizes
3. PARAGRAPH-BASED
   Split at paragraph breaks
   Pro: Natural boundaries
   Con: Paragraphs vary wildly in size
4. SEMANTIC
   Split at topic/section changes
   Pro: Coherent chunks
   Con: Requires more processing
5. RECURSIVE
   Try large splits, then smaller if needed
   Pro: Respects structure
   Con: More complex
Choose your strategy based on document type and retrieval needs.
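
The recursive approach is the least obvious of the five, so here is one possible sketch of it. The function name `recursive_split`, the separator order, and the character-based size limit are all assumptions made for illustration. Production splitters differ in detail, and this version drops the separator wherever it cuts.

```python
def recursive_split(text: str, max_chars: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones if needed."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a plain fixed-size split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


text = ("Intro paragraph.\n\n"
        "First sentence is here. Second sentence is here. Third one.\n\n"
        "End.")
pieces = recursive_split(text, max_chars=40)
for p in pieces:
    print(len(p), repr(p))
```

Short paragraphs survive intact, and only the oversized middle paragraph is broken down further at sentence boundaries.
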
Overlap Explained
Why Chunks Need Overlap
WITHOUT OVERLAP:
  Chunk 1: "The company was founded in 2020."
  Chunk 2: "It grew rapidly to 500 employees."
  Query: "When was the company founded and how big is it?"
  → Might only retrieve Chunk 1, missing growth info

WITH 20% OVERLAP:
  Chunk 1: "The company was founded in 2020. It grew rapidly..."
  Chunk 2: "...founded in 2020. It grew rapidly to 500 employees."
  Query: Same query
  → Either chunk provides context about both facts
Overlap ensures context isn't lost at chunk boundaries.
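
A sliding window is one common way to produce overlapping chunks. The sketch below assumes whitespace-separated words as a stand-in for real tokens, and `chunk_with_overlap` is a hypothetical helper name. With a chunk size of 10 and an overlap of 2, each chunk repeats the last 20% of the previous one.

```python
def chunk_with_overlap(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, advancing by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


text = "The company was founded in 2020. It grew rapidly to 500 employees."
words = text.split()  # crude tokenization, for illustration only
chunks = chunk_with_overlap(words, chunk_size=10, overlap=2)
for c in chunks:
    print(" ".join(c))
```

The second chunk starts with the final two words of the first, so text near the boundary appears in both.
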
Chunk Size Tradeoffs
Finding the Right Size
TOO SMALL (50-100 tokens):
  ✗ Missing context
  ✗ More chunks to search
  ✗ Fragmented information
  ✓ Precise retrieval

TOO LARGE (2000+ tokens):
  ✗ Diluted relevance
  ✗ Wastes context window
  ✗ May include irrelevant info
  ✓ Complete context

OPTIMAL (200-500 tokens):
  ✓ Enough context
  ✓ Focused content
  ✓ Good retrieval precision
  ✓ Efficient context usage

Rule of thumb: Match chunk size to the typical answer size for your use case
The ideal chunk size depends on your specific application.
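
Some quick arithmetic makes the tradeoff concrete. The sketch below assumes a hypothetical 10,000-token document and a 4,000-token budget for retrieved context. Both figures and both helper names are illustrative, not recommendations.

```python
import math


def num_chunks(doc_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Number of sliding-window chunks needed to cover a document."""
    if doc_tokens <= chunk_size:
        return 1 if doc_tokens > 0 else 0
    step = chunk_size - overlap
    return 1 + math.ceil((doc_tokens - chunk_size) / step)


def chunks_that_fit(context_budget: int, chunk_size: int) -> int:
    """How many full-size retrieved chunks fit into the context budget."""
    return context_budget // chunk_size


DOC_TOKENS = 10_000      # assumed document length
CONTEXT_BUDGET = 4_000   # assumed tokens reserved for retrieved context

for size in (100, 400, 2000):
    print(size, num_chunks(DOC_TOKENS, size), chunks_that_fit(CONTEXT_BUDGET, size))
```

Under these assumptions, 100-token chunks produce 100 pieces to index and search, while 2000-token chunks leave room for only two retrieved results per query.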

Interactive Exercise

Design Chunking Strategy

You're building a RAG system for a technical documentation site with:

  • API reference docs (short, structured)
  • Tutorial articles (long-form, sequential)
  • FAQ pages (Q&A format)

How would you chunk each type? What size and overlap?

Pro Tips
  • Always include metadata (source, section, page number)
  • Test retrieval quality with different chunk sizes
  • Consider document type when choosing strategy
  • 10-20% overlap is a good starting point

Related Terms