
Prompt Caching


Definition

Prompt caching stores the results of processing frequently reused prompt segments. When you have large static context (such as documentation or a system prompt) that is reused across many requests, caching lets the model skip reprocessing that content.

This technique reduces latency and costs by avoiding redundant computation on identical prompt prefixes.
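Some providers expose caching explicitly in their API, while others (such as OpenAI) apply prefix caching automatically with no code change. The sketch below shows the explicit style using Anthropic's cache_control content-block field; the model name and prompt text are placeholders, and exact API details may change, so treat this as an illustration rather than a reference:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; substitute a current model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal assistant... [large static context]",
            # Mark the static prefix as cacheable; subsequent requests that
            # begin with an identical prefix can be served from the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Now analyze this contract: ..."}],
)
print(response.content[0].text)
```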

Key Concepts

  • Prefix caching: Caching the beginning portion of prompts
  • Cache hits: When the cached prefix matches a new request
  • TTL (Time to Live): How long cached content remains valid (see the sketch after this list)
  • Cost reduction: Cached tokens typically cost less to process
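The sketch below models these mechanics with a hypothetical in-process cache. Real systems cache internal model state (attention key/value pairs) rather than text, and providers choose their own TTLs, but the bookkeeping of hashing a prefix, checking for a hit, and expiring entries is the same in spirit:

```python
import hashlib
import time

CACHE_TTL_SECONDS = 300.0            # assumed TTL; real providers pick their own
prefix_cache: dict[str, float] = {}  # prefix hash -> time the prefix was last used

def lookup_prefix(static_prefix: str) -> str:
    """Simulate prefix caching: report hit or miss, then (re)cache the prefix."""
    key = hashlib.sha256(static_prefix.encode()).hexdigest()
    last_used = prefix_cache.get(key)
    hit = last_used is not None and time.time() - last_used < CACHE_TTL_SECONDS
    prefix_cache[key] = time.time()  # cache the prefix and refresh its TTL
    return "cache hit" if hit else "cache miss"

system_prompt = "You are a legal assistant. " * 100  # large static prefix
print(lookup_prefix(system_prompt))  # cache miss: first time this prefix is seen
print(lookup_prefix(system_prompt))  # cache hit: identical prefix within the TTL
```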

Examples

Without Caching
Repeated Processing
Request 1: [Large system prompt] + [User query A] → Process all tokens
Request 2: [Large system prompt] + [User query B] → Process all tokens again (redundant)
Request 3: [Large system prompt] + [User query C] → Process all tokens again (redundant)
The same system prompt is processed repeatedly.
With Caching
Cached Prefix
Request 1: [Large system prompt*] + [User query A] → Process all tokens, cache the system prompt (*)
Request 2: [Cache hit] + [User query B] → Only process the new user query
Request 3: [Cache hit] + [User query C] → Only process the new user query
Result: ~90% reduction in tokens processed
The cached prefix is reused; only new content is processed.
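To make the arithmetic behind that figure concrete, here is a quick calculation with illustrative token counts (not measurements). Note that the headline ~90% applies per request once the cache is warm; the overall saving depends on how large the static prefix is relative to each query:

```python
SYSTEM_TOKENS = 2_000  # hypothetical static system prompt size
QUERY_TOKENS = 100     # hypothetical per-query size
N_REQUESTS = 3

# Without caching: every request reprocesses the full prefix.
without_cache = N_REQUESTS * (SYSTEM_TOKENS + QUERY_TOKENS)                     # 6,300
# With caching: full processing once, then only the new query each time.
with_cache = (SYSTEM_TOKENS + QUERY_TOKENS) + (N_REQUESTS - 1) * QUERY_TOKENS  # 2,300

print(f"overall: {1 - with_cache / without_cache:.0%} fewer tokens")                      # ~63%
print(f"per cached request: {1 - QUERY_TOKENS / (SYSTEM_TOKENS + QUERY_TOKENS):.0%} fewer")  # ~95%
```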

Interactive Exercise

Identify Cacheable Content

Which parts of this prompt would benefit from caching?

"You are a legal assistant... [500 words of legal context]... Now analyze this contract: [user's contract]"

Pro Tips
  • Put static content at the beginning of prompts for cache hits
  • Larger cached prefixes = greater cost savings
  • Monitor cache hit rates to optimize prompt structure (see the sketch after this list)
  • Consider cache invalidation when system prompts change
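For the monitoring tip above, one way to compute a token-level hit rate from Anthropic-style usage objects, which report input_tokens, cache_read_input_tokens, and cache_creation_input_tokens per response. The field names follow that provider's current API and the aggregation logic here is our own; adapt both for other providers:

```python
def cache_hit_rate(usages) -> float:
    """Fraction of prompt tokens served from cache across a batch of responses."""
    read = sum(getattr(u, "cache_read_input_tokens", 0) or 0 for u in usages)
    total = sum(
        (u.input_tokens or 0)
        + (getattr(u, "cache_read_input_tokens", 0) or 0)
        + (getattr(u, "cache_creation_input_tokens", 0) or 0)
        for u in usages
    )
    return read / total if total else 0.0
```

A persistently low hit rate usually means the prefix is changing between requests, for example a timestamp or request ID embedded in the system prompt.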
