Model Architecture / Attention Mechanisms

Cross-Attention

Advanced [4/5]
Also known as: encoder-decoder attention, source-target attention, context attention

Definition

Cross-Attention is an attention mechanism in which the queries come from one sequence (e.g., the decoder) while the keys and values come from a different sequence (e.g., the encoder). This lets a model attend to the relevant parts of its input while generating each output token.

Unlike self-attention (where Q, K, V all come from the same sequence), cross-attention bridges two different representations, enabling tasks like translation, image captioning, and retrieval-augmented generation.
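A minimal single-head sketch in PyTorch (the class name CrossAttention, d_model, and the toy tensors are illustrative, not taken from any particular library):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from one sequence, keys/values from another."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # projects the target (decoder) sequence
        self.w_k = nn.Linear(d_model, d_model)  # projects the source (encoder) sequence
        self.w_v = nn.Linear(d_model, d_model)  # projects the source (encoder) sequence
        self.scale = d_model ** 0.5

    def forward(self, target, source):
        q = self.w_q(target)                             # (batch, tgt_len, d_model)
        k = self.w_k(source)                             # (batch, src_len, d_model)
        v = self.w_v(source)                             # (batch, src_len, d_model)
        scores = q @ k.transpose(-2, -1) / self.scale    # (batch, tgt_len, src_len)
        weights = F.softmax(scores, dim=-1)              # alignment over source positions
        return weights @ v                               # (batch, tgt_len, d_model)

# Self-attention is the special case where target and source are the same tensor.
attn = CrossAttention(d_model=64)
encoder_out = torch.randn(1, 3, 64)   # source, e.g. "The cat sat"
decoder_in  = torch.randn(1, 2, 64)   # target, e.g. "Le chat"
out = attn(decoder_in, encoder_out)   # (1, 2, 64): one context vector per target token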

Key Concepts

  • Query source: Typically from decoder/output sequence
  • Key-Value source: Typically from encoder/input sequence
  • Alignment: Learning which input parts matter for each output
  • Conditioning: How output generation depends on input

Examples

Comparison
Self-Attention vs Cross-Attention
SELF-ATTENTION (within one sequence):

  Input: "The cat sat"
            ↓
  ┌─────────────────────────────────────────┐
  │ Q, K, V all from same sequence          │
  │                                         │
  │ "The" attends to: [The, cat, sat]       │
  │ "cat" attends to: [The, cat, sat]       │
  │ "sat" attends to: [The, cat, sat]       │
  └─────────────────────────────────────────┘
  Each token attends to ALL tokens in same sequence.

---

CROSS-ATTENTION (between two sequences):

  Source (Encoder): "The cat sat"
  Target (Decoder): "Le chat"
            ↓
  ┌─────────────────────────────────────────┐
  │ Q from target, K/V from source          │
  │                                         │
  │ "Le" attends to: [The, cat, sat]        │
  │   → high attention on "The"             │
  │                                         │
  │ "chat" attends to: [The, cat, sat]      │
  │   → high attention on "cat"             │
  └─────────────────────────────────────────┘
  Target tokens attend to source to find relevant info.

---

MATHEMATICAL COMPARISON:

  Self-Attention:
    Q = W_q × X   (X is input sequence)
    K = W_k × X   (same X)
    V = W_v × X   (same X)
    Attention = softmax(QK^T / √d) × V

  Cross-Attention:
    Q = W_q × Y   (Y is target/decoder sequence)
    K = W_k × X   (X is source/encoder sequence)
    V = W_v × X   (same X as K)
    Attention = softmax(QK^T / √d) × V

  Key difference: Q and K/V come from different sources!

VISUALIZATION:

  Self-Attention matrix (5×5 for 5 tokens):

    ┌─────────────────────────┐
    │  T1   T2   T3   T4   T5 │ ← Same sequence
    ├─────────────────────────┤
    │ 0.3  0.2  0.2  0.1  0.2 │ T1
    │ 0.1  0.4  0.2  0.2  0.1 │ T2
    │ 0.2  0.3  0.3  0.1  0.1 │ T3   Same
    │ 0.1  0.2  0.2  0.4  0.1 │ T4   sequence
    │ 0.2  0.1  0.2  0.2  0.3 │ T5
    └─────────────────────────┘

  Cross-Attention matrix (3×5, decoder×encoder):

    ┌──────────────────────────┐
    │  E1   E2   E3   E4   E5  │ ← Encoder (source)
    ├──────────────────────────┤
    │ 0.7  0.1  0.1  0.05 0.05 │ D1 ─┐
    │ 0.1  0.6  0.2  0.05 0.05 │ D2  │ Decoder
    │ 0.05 0.1  0.7  0.1  0.05 │ D3 ─┘ (target)
    └──────────────────────────┘
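The same contrast in code, using PyTorch's nn.MultiheadAttention; the random tensors are placeholders, and the shapes mirror the 5×5 and 3×5 matrices above:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 5, 64)   # source / encoder sequence (5 tokens)
y = torch.randn(1, 3, 64)   # target / decoder sequence (3 tokens)

# Self-attention: Q, K, V all from the same sequence -> 5x5 weight matrix
self_out, self_w = mha(x, x, x)
print(self_w.shape)    # torch.Size([1, 5, 5])

# Cross-attention: Q from the target, K and V from the source -> 3x5 weight matrix
cross_out, cross_w = mha(y, x, x)
print(cross_w.shape)   # torch.Size([1, 3, 5])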
Applications
Cross-Attention Use Cases
CROSS-ATTENTION APPLICATIONS:

1. MACHINE TRANSLATION (original use):

   Encoder: "I love programming"
        ↓ encode
   [hidden states]

   Decoder generating "J'adore programmer":
   - "J'"         queries encoder, attends to "I"
   - "adore"      queries encoder, attends to "love"
   - "programmer" queries encoder, attends to "programming"

2. IMAGE CAPTIONING (vision-language):

   Image → CNN/ViT → [patch embeddings]
                          ↑ K, V
   Caption decoder generates: "A cat sitting..."
                          ↓ Q queries image patches
   "cat"     → high attention on cat region
   "sitting" → attention on pose

3. RETRIEVAL-AUGMENTED GENERATION (RAG):

   Retrieved docs → encode → [doc embeddings]
                                  ↑ K, V
   LLM decoder generates answer:
                                  ↓ Q queries retrieved context
   Attends to relevant passages

4. TEXT-TO-IMAGE (diffusion models):

   Text: "A sunset over mountains"
        ↓ encode
   [text embeddings] → K, V

   Image being denoised:
   [noisy image patches] → Q
        ↓
   Cross-attention tells the model WHERE to put each concept:
   - "sunset"    → sky region attention
   - "mountains" → bottom region attention

---

ARCHITECTURE PATTERNS:

Encoder-Decoder (T5, BART):

   ┌──────────────┐      ┌──────────────┐
   │   Encoder    │      │   Decoder    │
   │ ┌──────────┐ │      │ ┌──────────┐ │
   │ │Self-Attn │ │      │ │Self-Attn │ │
   │ └──────────┘ │      │ └────┬─────┘ │
   │ ┌──────────┐ │      │ ┌────┴─────┐ │
   │ │   FFN    │ │─────→│ │Cross-Attn│ │
   │ └──────────┘ │      │ └──────────┘ │
   └──────────────┘      │ ┌──────────┐ │
                         │ │   FFN    │ │
                         │ └──────────┘ │
                         └──────────────┘

Decoder-Only with Context (GPT-style + RAG):

   ┌──────────────────────────────────────┐
   │ [Context tokens]    [Query tokens]   │
   │         ↑                 ↑          │
   │     K,V from          Q from         │
   │     context           query          │
   │                                      │
   │ Implemented as masked self-attention │
   │ where query can attend to context    │
   └──────────────────────────────────────┘
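A sketch of the encoder-decoder pattern above as a single decoder layer in PyTorch; the names (DecoderBlock, memory) and sizes are illustrative, not taken from T5 or BART:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Decoder layer sketch: masked self-attention, then cross-attention, then FFN."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        # 1) Masked self-attention over the target sequence (causal mask supplied by caller)
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # 2) Cross-attention: Q from the decoder state, K/V from the encoder output ("memory")
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3) Position-wise feed-forward network
        return self.norm3(x + self.ffn(x))

memory = torch.randn(1, 6, 64)      # encoder output for the source sequence
tgt    = torch.randn(1, 4, 64)      # decoder input so far
out = DecoderBlock()(tgt, memory)   # (1, 4, 64)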

Interactive Exercise

Identify Attention Type

For each scenario, identify whether self-attention or cross-attention is being used:

1. GPT generating the next word based on previous words
2. A translation model using source sentence to generate target
3. BERT encoding a sentence for classification
4. DALL-E using text prompt to guide image generation
5. A summarizer attending to the original document

Pro Tips
  • Cross-attention is key for conditioning generation on external info
  • In decoder-only models, context + query is handled via masked self-attention (see the sketch after these tips)
  • Cross-attention creates a "bridge" between encoder and decoder
  • Attention weights in cross-attention show alignment/relevance
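To make the second tip concrete, a small sketch (assuming PyTorch) of why a decoder-only model needs no separate cross-attention layer: the ordinary causal mask over the concatenated [context, query] sequence already lets every query token attend to all context tokens.

import torch

n_ctx, n_query = 4, 3
n = n_ctx + n_query

# Standard causal mask: position i may attend to positions <= i.
causal = torch.tril(torch.ones(n, n)).bool()

# Rows for the query tokens: each one sees ALL context tokens (columns 0..3)
# plus the query tokens before it -- cross-attention "for free".
print(causal[n_ctx:].long())
# tensor([[1, 1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1, 1]])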

Related Terms