Working Memory | HyperKit.ai

Definition

Working memory in LLM systems refers to the information actively held in the context window during a conversation or task. Like human working memory, it has limited capacity but provides immediate access to recent information.

This includes the conversation history, any retrieved documents, and intermediate reasoning steps—everything the model can "see" when generating its next response.
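As a rough illustration of what the model can "see", the sketch below assembles one request's working memory from those three sources. The function name build_context and the role/content message layout are assumptions made for illustration; they are not taken from this page or from any specific vendor API.

```python
def build_context(system_prompt, history, retrieved_docs=(), reasoning_steps=()):
    """Assemble everything the model can see for its next response.

    Hypothetical helper: the role/content layout mirrors common chat
    formats but is an assumption, not a specific vendor's API.
    """
    context = [{"role": "system", "content": system_prompt}]

    # Retrieved documents are injected as extra system-level context
    for doc in retrieved_docs:
        context.append({"role": "system", "content": f"Reference: {doc}"})

    # Conversation history, oldest first
    context.extend(history)

    # Intermediate reasoning carried over from earlier steps of the task
    for step in reasoning_steps:
        context.append({"role": "assistant", "content": f"Scratchpad: {step}"})

    return context


context = build_context(
    "You are a helpful assistant",
    history=[{"role": "user", "content": "My name is Alice"}],
    retrieved_docs=["Japan travel guide excerpt"],
)
print(len(context))
```

Everything in the returned list counts against the context window, so retrieved documents and scratchpad notes compete with conversation history for the same budget.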

Key Concepts

Context window: The token limit of working memory
Recency bias: Recent information often gets more attention
Capacity management: Strategies for what to keep vs. discard
Session-scoped: Cleared when the conversation ends

Examples

Concept

Working Memory in Practice

Context Window (Working Memory):
┌─────────────────────────────────────────┐
│ System: You are a helpful assistant     │
│                                         │
│ User: My name is Alice                  │
│ Assistant: Nice to meet you, Alice!     │
│                                         │
│ User: I'm planning a trip to Japan      │
│ Assistant: Japan is wonderful! When...  │
│                                         │
│ User: What's my name?                   │  ← Can recall
│ Assistant: Your name is Alice.          │     from working
│                                         │     memory
│ [More conversation...]                  │
│                                         │
│ ~~~~~~~ CONTEXT LIMIT ~~~~~~~           │
│ [Older messages get truncated]          │  ← Information
└─────────────────────────────────────────┘     lost
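The truncation shown in the diagram can be sketched as a sliding window that drops the oldest turns first. This is a minimal sketch under stated assumptions: count_tokens is a hypothetical word counter standing in for a real tokenizer, and the first message is assumed to be the system prompt.

```python
def count_tokens(message):
    """Hypothetical stand-in for a tokenizer: counts whitespace-separated words."""
    return len(message["content"].split())


def truncate_oldest(messages, max_tokens):
    """Drop the oldest non-system messages until the rest fits the budget."""
    system, history = messages[0], list(messages[1:])
    budget = max_tokens - count_tokens(system)

    kept = []
    # Walk backwards so the most recent turns are kept first
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this point is lost
        kept.append(message)
        budget -= cost

    return [system, *reversed(kept)]


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "What's my name?"},
]
trimmed = truncate_oldest(messages, max_tokens=12)
print([m["content"] for m in trimmed])
```

With this deliberately tiny budget, only the system prompt and the final question survive, so the turn where the user gave their name is no longer available to the model.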

Management Strategy

Optimizing Working Memory

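def manage_working_memory(messages, max_tokens=4000, keep_recent=10):
    """Keep working memory within limits while preserving key info.

    total_tokens() and summarize() are assumed helpers that are not
    defined on this page.
    """
    # Nothing to do while the conversation fits in the budget
    if total_tokens(messages) <= max_tokens:
        return messages

    # Always keep the system message (assumed to be messages[0])
    system, history = messages[0], messages[1:]

    # Too few turns to split: there is nothing older to summarize
    if len(history) <= keep_recent:
        return messages

    # Summarize everything except the most recent turns
    old_messages = history[:-keep_recent]
    summary = summarize(old_messages)

    return [
        system,
        {"role": "system", "content": f"Previous context: {summary}"},
        *history[-keep_recent:],  # Recent messages, kept verbatim
    ]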

# Example working memory state:
{
    "system_prompt": "You are a travel assistant",
    "summary": "User Alice is planning Japan trip in March",
    "recent_turns": [
        {"user": "What about hotels in Tokyo?"},
        {"assistant": "Here are top recommendations..."},
        {"user": "Which is closest to Shibuya?"}
    ]
}
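manage_working_memory calls total_tokens and summarize, which this page does not define. The stand-ins below are assumptions for illustration only: total_tokens uses a rough four-characters-per-token estimate rather than a real tokenizer, and summarize just clips and joins the old messages where a real system would more likely ask a model to write the summary.

```python
def total_tokens(messages, chars_per_token=4):
    """Rough estimate: assumes about 4 characters per token (a heuristic)."""
    total_chars = sum(len(m["content"]) for m in messages)
    return -(-total_chars // chars_per_token)  # ceiling division


def summarize(messages, max_chars_each=60):
    """Placeholder summary: clip each message and join them.

    A real implementation would likely ask a model to write the summary.
    """
    clipped = [f'{m["role"]}: {m["content"][:max_chars_each]}' for m in messages]
    return " | ".join(clipped)


old = [
    {"role": "user", "content": "I'm planning a trip to Japan in March"},
    {"role": "assistant", "content": "Japan is wonderful in spring!"},
]
print(total_tokens(old))
print(summarize(old))
```

For accurate budgeting, the estimate would need to be replaced with the tokenizer that matches the model in use, since token counts vary between models.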

Interactive Exercise

Design Memory Management

You have a 4000-token limit. How would you manage this conversation?

Scenario: A 20-message coding help session where the user shared code snippets early on, and is now debugging.

Pro Tips

Prioritize: system prompt > recent turns > summaries > old details
Code snippets and data structures often need full preservation
Summarize chitchat, preserve technical details
Consider using external storage for overflow
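One way to act on these tips is to pin messages that contain code, keep the most recent turns, and move everything else out of the context. The sketch below is a minimal illustration under stated assumptions: detecting code by a triple-backtick marker, the keep_recent default, and the plain list used as an overflow store are all hypothetical choices, not a prescribed design.

```python
CODE_FENCE = "`" * 3  # marker assumed to indicate a code snippet


def split_by_priority(messages, keep_recent=4):
    """Return (kept, overflow): kept stays in context, overflow is stored elsewhere."""
    system, history = messages[0], messages[1:]
    cutoff = max(len(history) - keep_recent, 0)

    kept, overflow = [system], []
    for index, message in enumerate(history):
        is_recent = index >= cutoff
        has_code = CODE_FENCE in message["content"]
        if is_recent or has_code:
            kept.append(message)      # recent turns and code are preserved in full
        else:
            overflow.append(message)  # chitchat and old details leave the context

    return kept, overflow


messages = [
    {"role": "system", "content": "You are a coding assistant"},
    {"role": "user", "content": f"Here is my code:\n{CODE_FENCE}\nprint(x)\n{CODE_FENCE}"},
    {"role": "assistant", "content": "Thanks, looking at it now."},
    {"role": "user", "content": "By the way, nice weather today."},
    {"role": "assistant", "content": "It is! Back to the bug."},
    {"role": "user", "content": "I get a NameError on x."},
]
kept, overflow = split_by_priority(messages, keep_recent=2)
external_store = list(overflow)  # stand-in for a database or file
print(len(kept), len(external_store))
```

The overflow messages could also be summarized and added back as a single short message, which trades a few tokens for continuity.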
