Model Architecture / Attention Mechanisms

Cross-Attention

Advanced [4/5]
Also known as: encoder-decoder attention, source-target attention, context attention

Definition

Cross-Attention is an attention mechanism in which the queries come from one sequence (e.g., the decoder) while the keys and values come from a different sequence (e.g., the encoder). This lets a model attend to the relevant parts of its input while generating each output token.

Unlike self-attention (where Q, K, V all come from the same sequence), cross-attention bridges two different representations, enabling tasks like translation, image captioning, and retrieval-augmented generation.
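A minimal single-head sketch in PyTorch (the class name CrossAttention, d_model, and the toy tensors are illustrative, not taken from any particular library):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from one sequence, keys/values from another."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # projects the target (decoder) sequence
        self.w_k = nn.Linear(d_model, d_model)  # projects the source (encoder) sequence
        self.w_v = nn.Linear(d_model, d_model)  # projects the source (encoder) sequence
        self.scale = d_model ** 0.5

    def forward(self, target, source):
        q = self.w_q(target)                             # (batch, tgt_len, d_model)
        k = self.w_k(source)                             # (batch, src_len, d_model)
        v = self.w_v(source)                             # (batch, src_len, d_model)
        scores = q @ k.transpose(-2, -1) / self.scale    # (batch, tgt_len, src_len)
        weights = F.softmax(scores, dim=-1)              # alignment over source positions
        return weights @ v                               # (batch, tgt_len, d_model)

# Self-attention is the special case where target and source are the same tensor.
attn = CrossAttention(d_model=64)
encoder_out = torch.randn(1, 3, 64)   # source, e.g. "The cat sat"
decoder_in  = torch.randn(1, 2, 64)   # target, e.g. "Le chat"
out = attn(decoder_in, encoder_out)   # (1, 2, 64): one context vector per target token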

Key Concepts

  • Query source: Typically from decoder/output sequence
  • Key-Value source: Typically from encoder/input sequence
  • Alignment: Learning which input parts matter for each output
  • Conditioning: How output generation depends on input

Examples

Comparison
Self-Attention vs Cross-Attention
SELF-ATTENTION (within one sequence):

  Input: "The cat sat"
            ↓
  ┌─────────────────────────────────────────┐
  │ Q, K, V all from same sequence          │
  │                                         │
  │ "The" attends to: [The, cat, sat]       │
  │ "cat" attends to: [The, cat, sat]       │
  │ "sat" attends to: [The, cat, sat]       │
  └─────────────────────────────────────────┘
  Each token attends to ALL tokens in same sequence.

---

CROSS-ATTENTION (between two sequences):

  Source (Encoder): "The cat sat"
  Target (Decoder): "Le chat"
            ↓
  ┌─────────────────────────────────────────┐
  │ Q from target, K/V from source          │
  │                                         │
  │ "Le" attends to: [The, cat, sat]        │
  │   → high attention on "The"             │
  │                                         │
  │ "chat" attends to: [The, cat, sat]      │
  │   → high attention on "cat"             │
  └─────────────────────────────────────────┘
  Target tokens attend to source to find relevant info.

---

MATHEMATICAL COMPARISON:

  Self-Attention:
    Q = W_q × X   (X is input sequence)
    K = W_k × X   (same X)
    V = W_v × X   (same X)
    Attention = softmax(QK^T / √d) × V

  Cross-Attention:
    Q = W_q × Y   (Y is target/decoder sequence)
    K = W_k × X   (X is source/encoder sequence)
    V = W_v × X   (same X as K)
    Attention = softmax(QK^T / √d) × V

  Key difference: Q and K/V come from different sources!

VISUALIZATION:

  Self-Attention matrix (5×5 for 5 tokens):

    ┌─────────────────────────┐
    │  T1   T2   T3   T4   T5 │ ← Same sequence
    ├─────────────────────────┤
    │ 0.3  0.2  0.2  0.1  0.2 │ T1
    │ 0.1  0.4  0.2  0.2  0.1 │ T2
    │ 0.2  0.3  0.3  0.1  0.1 │ T3   Same
    │ 0.1  0.2  0.2  0.4  0.1 │ T4   sequence
    │ 0.2  0.1  0.2  0.2  0.3 │ T5
    └─────────────────────────┘

  Cross-Attention matrix (3×5, decoder×encoder):

    ┌──────────────────────────┐
    │  E1   E2   E3   E4   E5  │ ← Encoder (source)
    ├──────────────────────────┤
    │ 0.7  0.1  0.1  0.05 0.05 │ D1 ─┐
    │ 0.1  0.6  0.2  0.05 0.05 │ D2  │ Decoder
    │ 0.05 0.1  0.7  0.1  0.05 │ D3 ─┘ (target)
    └──────────────────────────┘
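The same contrast in code, using PyTorch's nn.MultiheadAttention; the random tensors are placeholders, and the shapes mirror the 5×5 and 3×5 matrices above:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 5, 64)   # source / encoder sequence (5 tokens)
y = torch.randn(1, 3, 64)   # target / decoder sequence (3 tokens)

# Self-attention: Q, K, V all from the same sequence -> 5x5 weight matrix
self_out, self_w = mha(x, x, x)
print(self_w.shape)    # torch.Size([1, 5, 5])

# Cross-attention: Q from the target, K and V from the source -> 3x5 weight matrix
cross_out, cross_w = mha(y, x, x)
print(cross_w.shape)   # torch.Size([1, 3, 5])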
Applications
Cross-Attention Use Cases
CROSS-ATTENTION APPLICATIONS:

1. MACHINE TRANSLATION (original use):

   Encoder: "I love programming"
        ↓ encode
   [hidden states]

   Decoder generating "J'adore programmer":
   - "J'"         queries encoder, attends to "I"
   - "adore"      queries encoder, attends to "love"
   - "programmer" queries encoder, attends to "programming"

2. IMAGE CAPTIONING (vision-language):

   Image → CNN/ViT → [patch embeddings]
                          ↑ K, V
   Caption decoder generates: "A cat sitting..."
                          ↓ Q queries image patches
   "cat"     → high attention on cat region
   "sitting" → attention on pose

3. RETRIEVAL-AUGMENTED GENERATION (RAG):

   Retrieved docs → encode → [doc embeddings]
                                  ↑ K, V
   LLM decoder generates answer:
                                  ↓ Q queries retrieved context
   Attends to relevant passages

4. TEXT-TO-IMAGE (diffusion models):

   Text: "A sunset over mountains"
        ↓ encode
   [text embeddings] → K, V

   Image being denoised:
   [noisy image patches] → Q
        ↓
   Cross-attention tells the model WHERE to put each concept:
   - "sunset"    → sky region attention
   - "mountains" → bottom region attention

---

ARCHITECTURE PATTERNS:

Encoder-Decoder (T5, BART):

   ┌──────────────┐      ┌──────────────┐
   │   Encoder    │      │   Decoder    │
   │ ┌──────────┐ │      │ ┌──────────┐ │
   │ │Self-Attn │ │      │ │Self-Attn │ │
   │ └──────────┘ │      │ └────┬─────┘ │
   │ ┌──────────┐ │      │ ┌────┴─────┐ │
   │ │   FFN    │ │─────→│ │Cross-Attn│ │
   │ └──────────┘ │      │ └──────────┘ │
   └──────────────┘      │ ┌──────────┐ │
                         │ │   FFN    │ │
                         │ └──────────┘ │
                         └──────────────┘

Decoder-Only with Context (GPT-style + RAG):

   ┌──────────────────────────────────────┐
   │ [Context tokens]    [Query tokens]   │
   │         ↑                 ↑          │
   │     K,V from          Q from         │
   │     context           query          │
   │                                      │
   │ Implemented as masked self-attention │
   │ where query can attend to context    │
   └──────────────────────────────────────┘
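A sketch of the encoder-decoder pattern above as a single decoder layer in PyTorch; the names (DecoderBlock, memory) and sizes are illustrative, not taken from T5 or BART:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Decoder layer sketch: masked self-attention, then cross-attention, then FFN."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        # 1) Masked self-attention over the target sequence (causal mask supplied by caller)
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # 2) Cross-attention: Q from the decoder state, K/V from the encoder output ("memory")
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3) Position-wise feed-forward network
        return self.norm3(x + self.ffn(x))

memory = torch.randn(1, 6, 64)      # encoder output for the source sequence
tgt    = torch.randn(1, 4, 64)      # decoder input so far
out = DecoderBlock()(tgt, memory)   # (1, 4, 64)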

Interactive Exercise

Identify Attention Type

For each scenario, identify whether self-attention or cross-attention is being used:

1. GPT generating the next word based on previous words
2. A translation model using source sentence to generate target
3. BERT encoding a sentence for classification
4. DALL-E using text prompt to guide image generation
5. A summarizer attending to the original document

Pro Tips
  • Cross-attention is key for conditioning generation on external info
  • In decoder-only models, context + query is handled via masked self-attention (see the sketch after these tips)
  • Cross-attention creates a "bridge" between encoder and decoder
  • Attention weights in cross-attention show alignment/relevance
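To make the second tip concrete, a small sketch (assuming PyTorch) of why a decoder-only model needs no separate cross-attention layer: the ordinary causal mask over the concatenated [context, query] sequence already lets every query token attend to all context tokens.

import torch

n_ctx, n_query = 4, 3
n = n_ctx + n_query

# Standard causal mask: position i may attend to positions <= i.
causal = torch.tril(torch.ones(n, n)).bool()

# Rows for the query tokens: each one sees ALL context tokens (columns 0..3)
# plus the query tokens before it -- cross-attention "for free".
print(causal[n_ctx:].long())
# tensor([[1, 1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1, 1]])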

Related Terms