
Prompt Caching


Definition

Prompt caching stores the results of processing frequently reused prompt segments. When you have large static context (such as documentation or a system prompt) that is reused across many requests, caching lets the model skip reprocessing that content.

This technique reduces latency and costs by avoiding redundant computation on identical prompt prefixes.
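Some providers expose caching explicitly in their API, while others (such as OpenAI) apply prefix caching automatically with no code change. The sketch below shows the explicit style using Anthropic's cache_control content-block field; the model name and prompt text are placeholders, and exact API details may change, so treat this as an illustration rather than a reference:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; substitute a current model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal assistant... [large static context]",
            # Mark the static prefix as cacheable; subsequent requests that
            # begin with an identical prefix can be served from the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Now analyze this contract: ..."}],
)
print(response.content[0].text)
```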

Key Concepts

  • Prefix caching: Caching the beginning portion of prompts
  • Cache hits: When the cached prefix matches a new request
  • TTL (Time to Live): How long cached content remains valid (see the sketch after this list)
  • Cost reduction: Cached tokens typically cost less to process
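The sketch below models these mechanics with a hypothetical in-process cache. Real systems cache internal model state (attention key/value pairs) rather than text, and providers choose their own TTLs, but the bookkeeping of hashing a prefix, checking for a hit, and expiring entries is the same in spirit:

```python
import hashlib
import time

CACHE_TTL_SECONDS = 300.0            # assumed TTL; real providers pick their own
prefix_cache: dict[str, float] = {}  # prefix hash -> time the prefix was last used

def lookup_prefix(static_prefix: str) -> str:
    """Simulate prefix caching: report hit or miss, then (re)cache the prefix."""
    key = hashlib.sha256(static_prefix.encode()).hexdigest()
    last_used = prefix_cache.get(key)
    hit = last_used is not None and time.time() - last_used < CACHE_TTL_SECONDS
    prefix_cache[key] = time.time()  # cache the prefix and refresh its TTL
    return "cache hit" if hit else "cache miss"

system_prompt = "You are a legal assistant. " * 100  # large static prefix
print(lookup_prefix(system_prompt))  # cache miss: first time this prefix is seen
print(lookup_prefix(system_prompt))  # cache hit: identical prefix within the TTL
```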

Examples

Without Caching
Repeated Processing
Request 1: [Large system prompt] + [User query A] → Process all tokens
Request 2: [Large system prompt] + [User query B] → Process all tokens again (redundant)
Request 3: [Large system prompt] + [User query C] → Process all tokens again (redundant)
The same system prompt is processed repeatedly.
With Caching
Cached Prefix
Request 1: [Large system prompt*] + [User query A] → Process all tokens, cache the system prompt (*)
Request 2: [Cache hit] + [User query B] → Only process the new user query
Request 3: [Cache hit] + [User query C] → Only process the new user query
Result: ~90% reduction in tokens processed
The cached prefix is reused; only new content is processed.
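To make the arithmetic behind that figure concrete, here is a quick calculation with illustrative token counts (not measurements). Note that the headline ~90% applies per request once the cache is warm; the overall saving depends on how large the static prefix is relative to each query:

```python
SYSTEM_TOKENS = 2_000  # hypothetical static system prompt size
QUERY_TOKENS = 100     # hypothetical per-query size
N_REQUESTS = 3

# Without caching: every request reprocesses the full prefix.
without_cache = N_REQUESTS * (SYSTEM_TOKENS + QUERY_TOKENS)                     # 6,300
# With caching: full processing once, then only the new query each time.
with_cache = (SYSTEM_TOKENS + QUERY_TOKENS) + (N_REQUESTS - 1) * QUERY_TOKENS  # 2,300

print(f"overall: {1 - with_cache / without_cache:.0%} fewer tokens")                      # ~63%
print(f"per cached request: {1 - QUERY_TOKENS / (SYSTEM_TOKENS + QUERY_TOKENS):.0%} fewer")  # ~95%
```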

Interactive Exercise

Identify Cacheable Content

Which parts of this prompt would benefit from caching?

"You are a legal assistant... [500 words of legal context]... Now analyze this contract: [user's contract]"

Pro Tips
  • Put static content at the beginning of prompts for cache hits
  • Larger cached prefixes = greater cost savings
  • Monitor cache hit rates to optimize prompt structure (see the sketch after this list)
  • Consider cache invalidation when system prompts change
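For the monitoring tip above, one way to compute a token-level hit rate from Anthropic-style usage objects, which report input_tokens, cache_read_input_tokens, and cache_creation_input_tokens per response. The field names follow that provider's current API and the aggregation logic here is our own; adapt both for other providers:

```python
def cache_hit_rate(usages) -> float:
    """Fraction of prompt tokens served from cache across a batch of responses."""
    read = sum(getattr(u, "cache_read_input_tokens", 0) or 0 for u in usages)
    total = sum(
        (u.input_tokens or 0)
        + (getattr(u, "cache_read_input_tokens", 0) or 0)
        + (getattr(u, "cache_creation_input_tokens", 0) or 0)
        for u in usages
    )
    return read / total if total else 0.0
```

A persistently low hit rate usually means the prefix is changing between requests, for example a timestamp or request ID embedded in the system prompt.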
