
Working Memory

Difficulty: Intermediate (3/5)
Also known as: Short-term memory, Session memory, Active context

Definition

Working memory in LLM systems refers to the information actively held in the context window during a conversation or task. Like human working memory, it has limited capacity but provides immediate access to recent information.

This includes the conversation history, any retrieved documents, and intermediate reasoning steps—everything the model can "see" when generating its next response.
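
As a rough illustration of the definition above, the sketch below gathers the three kinds of content (conversation history, retrieved documents, intermediate reasoning) into one message list. The `WorkingMemory` class, its field names, and the "Reference:" / "Reasoning:" prefixes are hypothetical choices made for this example, not part of any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Everything the model can 'see' for its next response (illustrative only)."""
    system_prompt: str
    history: list = field(default_factory=list)         # conversation turns
    retrieved_docs: list = field(default_factory=list)  # retrieved documents
    scratchpad: list = field(default_factory=list)      # intermediate reasoning steps

    def to_messages(self):
        """Flatten all parts into a single chat-style message list."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for doc in self.retrieved_docs:
            messages.append({"role": "system", "content": f"Reference: {doc}"})
        messages.extend(self.history)
        for step in self.scratchpad:
            messages.append({"role": "assistant", "content": f"Reasoning: {step}"})
        return messages


memory = WorkingMemory(system_prompt="You are a helpful assistant")
memory.history.append({"role": "user", "content": "My name is Alice"})
memory.retrieved_docs.append("Japan travel guide excerpt")
memory.scratchpad.append("User introduced herself as Alice")
messages = memory.to_messages()
```

Whatever ends up in `messages` is the working memory for that call; anything left out is invisible to the model.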

Key Concepts

  • Context window: The token limit of working memory
  • Recency bias: Recent information often gets more attention
  • Capacity management: Strategies for what to keep vs. discard
  • Session-scoped: Cleared when conversation ends
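
To make the context-window and capacity-management ideas concrete, here is a minimal sliding-window sketch: it keeps the system message and as many of the newest turns as fit in a token budget, so the oldest turns fall out first. The `estimate_tokens` helper assumes roughly 4 characters per token, which is only a crude stand-in for a real tokenizer, and `fit_to_window` is a hypothetical name.

```python
def estimate_tokens(text):
    """Crude estimate; assumes roughly 4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_window(messages, max_tokens):
    """Keep the system message plus as many of the newest turns as fit."""
    system, history = messages[0], messages[1:]
    budget = max_tokens - estimate_tokens(system["content"])
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(message)
        budget -= cost
    return [system, *reversed(kept)]  # restore chronological order


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "I'm planning a trip to Japan"},
    {"role": "user", "content": "What's my name?"},
]
window = fit_to_window(conversation, max_tokens=20)
```

With this deliberately tiny budget, the turn where the user gave her name no longer fits, so a model given only `window` could not answer the final question. Plain truncation is simple but loses information silently, which is why summarization is often preferred.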

Examples

Concept
Working Memory in Practice
Context Window (Working Memory):
┌─────────────────────────────────────────┐
│ System: You are a helpful assistant     │
│                                         │
│ User: My name is Alice                  │
│ Assistant: Nice to meet you, Alice!     │
│                                         │
│ User: I'm planning a trip to Japan      │
│ Assistant: Japan is wonderful! When...  │
│                                         │
│ User: What's my name?                   │ ← Can recall
│ Assistant: Your name is Alice.          │   from working
│                                         │   memory
│ [More conversation...]                  │
│                                         │
│ ~~~~~~~ CONTEXT LIMIT ~~~~~~~           │
│ [Older messages get truncated]          │ ← Information
└─────────────────────────────────────────┘   lost
Management Strategy
Optimizing Working Memory
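def total_tokens(messages):
    """Rough token count (placeholder: assumes about 4 characters per token)."""
    return sum(max(1, len(m["content"]) // 4) for m in messages)


def summarize(messages):
    """Placeholder summarizer; in practice this would be an LLM call."""
    return " ".join(m["content"] for m in messages)[:200]


def manage_working_memory(messages, max_tokens=4000, keep_recent=10):
    """Keep working memory within limits while preserving key info."""
    # Always keep the system message; treat the rest as history
    system, history = messages[0], messages[1:]

    # Nothing to do while under the limit, or when there are no old turns to summarize
    if total_tokens(messages) <= max_tokens or len(history) <= keep_recent:
        return messages

    # Summarize everything except the most recent turns
    old_messages = history[:-keep_recent]
    summary = summarize(old_messages)
    return [
        system,
        {"role": "system", "content": f"Previous context: {summary}"},
        *history[-keep_recent:],  # Recent messages, kept verbatim
    ]


# Example working memory state:
example_state = {
    "system_prompt": "You are a travel assistant",
    "summary": "User Alice is planning Japan trip in March",
    "recent_turns": [
        {"user": "What about hotels in Tokyo?"},
        {"assistant": "Here are top recommendations..."},
        {"user": "Which is closest to Shibuya?"},
    ],
}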

Interactive Exercise

Design Memory Management

You have a 4,000-token limit. How would you manage this conversation?

Scenario: A 20-message coding help session where the user shared code snippets early on, and is now debugging.

Pro Tips
  • Prioritize: system prompt > recent turns > summaries > old details
  • Code snippets and data structures often need full preservation
  • Summarize chitchat, preserve technical details
  • Consider using external storage for overflow
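
One way to apply the tips above is a greedy, priority-based packer, sketched below. The priority numbers (0 = system prompt, 1 = recent turns and code snippets, 2 = summaries, 3 = old details), the `position` field, and the 4-characters-per-token estimate are all assumptions made for this illustration. The `overflow` list stands in for whatever external storage a real system would use.

```python
def estimate_tokens(text):
    """Crude estimate; assumes roughly 4 characters per token."""
    return max(1, len(text) // 4)


def pack_by_priority(items, max_tokens):
    """Greedily keep the most important items; return (kept, overflow)."""
    kept, overflow, used = [], [], 0
    # Lower priority number means more important; sorted() is stable
    for item in sorted(items, key=lambda i: i["priority"]):
        cost = estimate_tokens(item["content"])
        if used + cost <= max_tokens:
            kept.append(item)
            used += cost
        else:
            overflow.append(item)  # candidate for external storage
    kept.sort(key=lambda i: i["position"])  # restore conversation order
    return kept, overflow


items = [
    {"position": 0, "priority": 0, "content": "You are a coding assistant"},
    {"position": 1, "priority": 1, "content": "def parse(line): return line.split(',')"},
    {"position": 2, "priority": 3, "content": "Thanks, that was helpful! By the way, nice weather today."},
    {"position": 3, "priority": 2, "content": "Summary: user is debugging a CSV parser"},
    {"position": 4, "priority": 1, "content": "Why does parse() fail on quoted commas?"},
]
kept, overflow = pack_by_priority(items, max_tokens=40)
```

Here the early code snippet is kept verbatim because it shares the priority of recent turns, while the chitchat is the first thing pushed to `overflow`. A greedy pass like this is not optimal packing, but it is easy to reason about and matches the stated priority order.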
