Chunking | HyperKit.ai

Definition

Chunking is the process of breaking large documents into smaller, manageable pieces for processing by LLMs. It's a critical step in retrieval-augmented generation (RAG) systems because it determines what unit of text gets embedded, retrieved, and provided as context.

Good chunking preserves meaning while keeping pieces small enough to be relevant and fit in context windows.

Key Concepts

Chunk size: Length of each chunk, measured in tokens or characters (typically 200-1000 tokens)
Overlap: Shared content between adjacent chunks to preserve context
Semantic boundaries: Splitting at natural breaks (paragraphs, sections)
Metadata: Information attached to chunks (source, page, section)
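
As a rough illustration of how these concepts fit together, a chunk can be modelled as a small record that carries its text alongside its metadata. This is only a sketch: the Chunk class, its field names, and the file path are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """One retrievable unit of text plus the metadata that travels with it.

    Hypothetical structure for illustration only.
    """
    text: str
    source: str                    # file name or URL of the original document
    section: str = ""              # heading the chunk was taken from, if known
    page: Optional[int] = None     # page number, for paginated sources
    extra: dict = field(default_factory=dict)

    def size_in_chars(self) -> int:
        # Chunk size measured in characters; a tokenizer would be needed for tokens.
        return len(self.text)


chunk = Chunk(
    text="Chunking is the process of breaking large documents into smaller pieces.",
    source="glossary/chunking.txt",  # hypothetical path
    section="Definition",
)
print(chunk.section, chunk.size_in_chars())
```

Keeping metadata on the chunk itself means a retrieved passage can always be traced back to where it came from.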

Examples

Chunking Strategies

Common Approaches

1. FIXED SIZE
   Split every N characters/tokens
   Pro: Simple, predictable
   Con: May split mid-sentence

2. SENTENCE-BASED
   Split at sentence boundaries
   Pro: Complete thoughts
   Con: Variable chunk sizes

3. PARAGRAPH-BASED
   Split at paragraph breaks
   Pro: Natural boundaries
   Con: Paragraphs vary wildly in size

4. SEMANTIC
   Split at topic/section changes
   Pro: Coherent chunks
   Con: Requires more processing

5. RECURSIVE
   Try large splits, then smaller if needed
   Pro: Respects structure
   Con: More complex

Choose your strategy based on document type and retrieval needs.
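
The recursive approach is the least obvious of the five, so a minimal sketch may help. It assumes chunk size is measured in characters and that the separator order (paragraph break, line break, sentence end, space) suits the document; the function name recursive_split is hypothetical. As a simplification, the separator at each cut point is dropped.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard fixed-size cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


doc = (
    "Intro paragraph.\n\n"
    "Second paragraph is a bit longer. It has two sentences.\n\n"
    "End."
)
chunks = recursive_split(doc, max_chars=40)
for c in chunks:
    print(repr(c))
```

Here the first and last paragraphs fit as they are, while the middle paragraph is too long and gets split again at the sentence boundary.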

Overlap Explained

Why Chunks Need Overlap

WITHOUT OVERLAP:
Chunk 1: "The company was founded in 2020."
Chunk 2: "It grew rapidly to 500 employees."

Query: "When was the company founded and how big is it?"
→ Might only retrieve Chunk 1, missing growth info

WITH OVERLAP (exaggerated here for illustration):
Chunk 1: "The company was founded in 2020.
         It grew rapidly..."
Chunk 2: "...founded in 2020. It grew rapidly
         to 500 employees."

Query: Same as above
→ Chunk 2 now covers both facts, and Chunk 1 signals that growth followed

Overlap ensures context isn't lost at chunk boundaries.
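
One way to produce overlapping chunks is a sliding window that advances by less than the chunk size. The sketch below assumes whitespace-separated words stand in for tokens; chunk_with_overlap is a hypothetical helper, not a library function.

```python
def chunk_with_overlap(words, chunk_size, overlap):
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = "The company was founded in 2020. It grew rapidly to 500 employees.".split()
pieces = chunk_with_overlap(words, chunk_size=8, overlap=2)
for piece in pieces:
    print(" ".join(piece))
# The company was founded in 2020. It grew
# It grew rapidly to 500 employees.
```

The two words "It grew" appear in both chunks, so the second chunk still makes sense when retrieved on its own.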

Chunk Size Tradeoffs

Finding the Right Size

TOO SMALL (50-100 tokens):
✗ Missing context
✗ More chunks to search
✗ Fragmented information
✓ Precise retrieval

TOO LARGE (2000+ tokens):
✗ Diluted relevance
✗ Wastes context window
✗ May include irrelevant info
✓ Complete context

OPTIMAL (200-500 tokens):
✓ Enough context
✓ Focused content
✓ Good retrieval precision
✓ Efficient context usage

Rule of thumb: Match chunk size to the typical answer size for your use case.

The ideal chunk size depends on your specific application.
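
To get a feel for the "more chunks to search" cost, the number of chunks a document produces can be estimated from its length, the chunk size, and the overlap. This sketch assumes a simple sliding window; estimate_chunk_count and the 10,000-token document are hypothetical.

```python
def estimate_chunk_count(doc_tokens, chunk_size, overlap=0):
    """Number of sliding-window chunks needed to cover doc_tokens tokens."""
    if doc_tokens <= 0:
        return 0
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    # Integer ceiling of (doc_tokens - overlap) / step, with at least one chunk.
    return max(1, (doc_tokens - overlap + step - 1) // step)


for size in (100, 500, 2000):
    overlap = size // 10  # 10% overlap
    print(size, estimate_chunk_count(10_000, size, overlap))
# 100 111
# 500 23
# 2000 6
```

Going from 500-token to 100-token chunks roughly quintuples the number of entries to embed and search for the same document.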

Interactive Exercise


Design Chunking Strategy

You're building a RAG system for a technical documentation site with:

API reference docs (short, structured)
Tutorial articles (long-form, sequential)
FAQ pages (Q&A format)

How would you chunk each type? What size and overlap?

Pro Tips

Always include metadata (source, section, page number)
Test retrieval quality with different chunk sizes
Consider document type when choosing strategy
10-20% overlap is a good starting point
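
Testing retrieval quality across chunk sizes can start very small. The sketch below is a toy, not a real evaluation: it assumes a crude retriever that ranks chunks by shared lowercase words (a real system would use embeddings), and all function names and the labelled queries are hypothetical.

```python
def split_words(text, chunk_size, overlap):
    """Fixed-size word chunks with overlap, returned as strings."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def top_chunk(query, chunks):
    """Toy retriever: the chunk sharing the most lowercase words with the query."""
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))


def hit_rate(text, labelled_queries, chunk_size, overlap):
    """Fraction of queries whose top chunk contains the expected answer string."""
    chunks = split_words(text, chunk_size, overlap)
    hits = sum(answer in top_chunk(q, chunks) for q, answer in labelled_queries)
    return hits / len(labelled_queries)


text = "The company was founded in 2020. It grew rapidly to 500 employees."
queries = [
    ("When was the company founded?", "2020"),
    ("How many employees?", "500"),
]
for size in (4, 12):
    print(size, hit_rate(text, queries, chunk_size=size, overlap=0))
```

On this tiny example the 4-word chunks separate each question's keywords from its answer, while the 12-word chunk keeps them together, which is the kind of difference a real evaluation set would surface at scale.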
