
Context Management

Tags: Context engineering, Context optimization, Prompt state management

Definition

Context Management is the practice of strategically organizing, prioritizing, and maintaining the information provided to LLMs within their context window. Effective context management maximizes model performance while working within token limits.

This includes decisions about what to include, where to place it, when to summarize or remove information, and how to structure multi-turn conversations.

Key Concepts

  • Context window: The maximum number of tokens the model can process at once (see the token-counting sketch after this list)
  • Recency bias: Models attend more heavily to the most recent content
  • Primacy effect: Content at the very start also receives attention priority
  • Lost in the middle: Content in the middle of the window often receives the least attention
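
A practical corollary: measure token usage before each call so you know which zone of the window you are filling. Below is a minimal sketch using the tiktoken tokenizer; the cl100k_base encoding, the 16K limit, and the 15% reserve are illustrative assumptions, not tied to any particular model.

import tiktoken

CONTEXT_LIMIT = 16_000   # assumed window size
RESPONSE_RESERVE = 0.15  # keep at least 15% free for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(prompt: str) -> bool:
    # Count prompt tokens and compare against the usable budget.
    used = len(enc.encode(prompt))
    budget = int(CONTEXT_LIMIT * (1 - RESPONSE_RESERVE))
    return used <= budget

print(fits_in_window("How do decorators work?"))  # True for short prompts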

Examples

Strategy
Context Window Management Strategies
CONTEXT WINDOW ANATOMY:

┌─────────────────────────────────────────────────┐
│ SYSTEM PROMPT (High attention - first position) │
│ - Role definition                               │
│ - Core instructions                             │
│ - Output format requirements                    │
├─────────────────────────────────────────────────┤
│ REFERENCE MATERIAL (Medium attention)           │
│ - Retrieved documents (RAG)                     │
│ - Examples (few-shot)                           │
│ - Background information                        │
├─────────────────────────────────────────────────┤
│ ⚠️ "LOST IN THE MIDDLE" ZONE                     │
│ - Older conversation history                    │
│ - Less critical context                         │
│ - May receive less attention                    │
├─────────────────────────────────────────────────┤
│ RECENT CONTEXT (High attention - recency)       │
│ - Recent conversation turns                     │
│ - Current task specifics                        │
├─────────────────────────────────────────────────┤
│ CURRENT QUERY (Highest attention)               │
│ - User's current request                        │
└─────────────────────────────────────────────────┘

ATTENTION PRIORITY:
Start   ████████████       High (primacy)
Middle  █████              Lower (lost in middle)
End     ████████████████   Highest (recency)

PLACEMENT STRATEGIES:
1. CRITICAL INFO: Place at start OR end, not middle
2. EXAMPLES: Near the query for format consistency
3. LONG DOCS: Summarize middle, keep key parts at ends
4. CONVERSATION: Summarize old turns, keep recent full
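
These placement rules map directly onto how the prompt string is assembled. The following is a minimal sketch of zone-based assembly; build_prompt and its parameter names are illustrative, not from any library.

def build_prompt(system, critical_instructions, documents,
                 history_summary, recent_turns, query):
    # Zones ordered so critical content sits at the high-attention
    # start and end; bulky reference material absorbs the middle.
    parts = [
        system,                   # start: primacy effect
        critical_instructions,    # critical info at the start...
        "\n".join(documents),     # middle: RAG / background material
        history_summary,          # middle: compressed older turns
        "\n".join(recent_turns),  # end: recency effect
        critical_instructions,    # ...repeated at the end
        query,                    # final position: highest attention
    ]
    return "\n\n".join(p for p in parts if p)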
Implementation
Multi-Turn Conversation Management
CONVERSATION CONTEXT STRATEGIES:

1. SLIDING WINDOW: Keep last N turns, drop oldest
   Turn 1: [Dropped]
   Turn 2: [Dropped]
   Turn 3: [Dropped]
   Turn 4: [Kept] User: "Let's discuss Python"
   Turn 5: [Kept] Assistant: "Sure, what aspect?"
   Turn 6: [Kept] User: "How do decorators work?"
   Turn 7: [Current response]
   Pros: Simple, preserves recent context
   Cons: Loses important early context

2. SUMMARIZATION BUFFER: Summarize old turns, keep recent full
   [Summary of turns 1-10]: "User asked about Python decorators.
   Discussed @property, @staticmethod. User understood basic syntax."
   Turn 11: [Full] User: "What about @classmethod?"
   Turn 12: [Full] Assistant: [response]
   Turn 13: [Current]
   Pros: Preserves key info from history
   Cons: May lose nuance in summary

3. HIERARCHICAL CONTEXT:
   ┌─────────────────────────────────────────┐
   │ Level 1: Session summary (always kept)  │
   │ "Discussing Python OOP concepts"        │
   ├─────────────────────────────────────────┤
   │ Level 2: Topic summaries (compressed)   │
   │ "Covered: classes, inheritance, magic   │
   │ methods"                                │
   ├─────────────────────────────────────────┤
   │ Level 3: Recent turns (full detail)     │
   │ Last 5 turns with complete content      │
   └─────────────────────────────────────────┘

4. SELECTIVE RETENTION: Keep turns based on relevance to current query
   def select_context(history, current_query):
       relevant = []
       for turn in history:
           if is_relevant(turn, current_query):
               relevant.append(turn)
       return summarize_old(relevant[:-5]) + relevant[-5:]

TOKEN BUDGET ALLOCATION:
┌────────────────────┬──────────┬───────────┐
│ Component          │ Priority │ % Budget  │
├────────────────────┼──────────┼───────────┤
│ System prompt      │ High     │ 5-10%     │
│ Current query      │ Critical │ 5-10%     │
│ Recent turns       │ High     │ 20-30%    │
│ Retrieved docs     │ Medium   │ 30-40%    │
│ History summary    │ Low      │ 10-20%    │
│ Response reserve   │ Critical │ 15-25%    │
└────────────────────┴──────────┴───────────┘
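
The select_context pseudocode in strategy 4 leaves is_relevant and summarize_old undefined. Here is a self-contained sketch with a naive keyword-overlap test and a stub summarizer standing in for embedding similarity and an LLM summarizer; all names are assumptions, not part of any library.

def is_relevant(turn: str, query: str) -> bool:
    # Naive keyword overlap; a real system would use embedding similarity.
    return bool(set(turn.lower().split()) & set(query.lower().split()))

def summarize_old(turns: list[str]) -> list[str]:
    # Stub: in practice, ask an LLM to compress these turns into a summary.
    return [f"[Summary of {len(turns)} earlier turns]"] if turns else []

def select_context(history: list[str], query: str,
                   keep_recent: int = 5) -> list[str]:
    # Filter for relevance, summarize all but the last few relevant turns.
    relevant = [t for t in history if is_relevant(t, query)]
    return summarize_old(relevant[:-keep_recent]) + relevant[-keep_recent:]

turns = [
    "User: Let's discuss Python",
    "Assistant: Sure, what aspect?",
    "User: How do decorators work?",
]
print(select_context(turns, "python decorators"))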

Interactive Exercise

Design Context Strategy

You're building a customer support chatbot with a 16K token context window. Conversations can last 50+ turns. How would you manage context to maintain quality throughout long conversations?

Pro Tips
  • Always reserve tokens for the model's response, at least 15% of the window (see the budget sketch after these tips)
  • Put critical instructions at the start AND repeat them at the end
  • Use structured formats (JSON, bullets) to pack information more densely
  • Monitor long conversations for signs of context degradation
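
One way to enforce the reserve rule is to turn the token budget table above into hard per-component allowances before assembling each request. A minimal sketch follows; the shares track the allocation table, and the 16K limit is an assumed example value.

BUDGET_SHARES = {
    "system_prompt": 0.10,
    "current_query": 0.10,
    "recent_turns": 0.30,
    "retrieved_docs": 0.30,
    "history_summary": 0.05,
    "response_reserve": 0.15,  # never hand these tokens to the prompt
}

def allocate(context_limit: int) -> dict[str, int]:
    # Convert fractional shares into absolute token allowances.
    return {part: int(context_limit * share)
            for part, share in BUDGET_SHARES.items()}

print(allocate(16_000))
# {'system_prompt': 1600, 'current_query': 1600, 'recent_turns': 4800, ...}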

Related Terms