
Self-Attention

Advanced [4/5]
Also known as: intra-attention, self-attention layer

Definition

Self-attention is attention applied within a single sequence: each position attends to every position (including itself) in that same sequence. Unlike cross-attention, where decoder positions attend to encoder outputs, self-attention helps the model understand relationships within the input text itself.

This allows every token to directly consider every other token, regardless of distance, capturing global context.
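
A minimal sketch in NumPy makes this concrete (the dimensions, weights, and toy sequence below are assumptions for illustration, not values from any specific model): queries, keys, and values are all projections of the same input sequence, and the row-wise softmax over QKᵀ/√d_k yields attention weights like the matrices shown in the examples further down.

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: Q, K, and V all come from the same sequence X.

    X: (seq_len, d_model) token embeddings for ONE sequence.
    Returns (output, weights), where weights[i, j] is how much token i attends to token j.
    """
    Q = X @ W_q                       # queries (seq_len, d_k)
    K = X @ W_k                       # keys    (seq_len, d_k)
    V = X @ W_v                       # values  (seq_len, d_k)

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len): all-to-all similarities

    # Softmax over each row -> every token's attention weights sum to 1
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V, weights       # contextual embeddings + attention matrix

# Toy example: 5 tokens ("The bank by the river"), d_model = 8 (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn.round(2))                  # each row sums to 1, like the matrices below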

Key Concepts

  • All-to-all connections: Each token can attend to every other token
  • Causal masking: For generation, tokens can only attend to past and current positions, never future ones (see the sketch after this list)
  • Contextual embeddings: Same word gets different representations based on context
  • O(n²) complexity: Quadratic in sequence length
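
The causal-masking bullet can be sketched in a few lines (NumPy, with made-up toy scores): positions above the diagonal are set to −∞ before the softmax, so the weights on future tokens become exactly zero, matching the "-" entries in the causal matrix in the examples below.

import numpy as np

seq_len = 5
rng = np.random.default_rng(1)
scores = rng.normal(size=(seq_len, seq_len))     # raw QK^T / sqrt(d_k) scores (toy values)

# Causal mask: token i may only attend to tokens 0..i (no future positions).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Softmax per row: masked positions get weight 0, visible positions sum to 1.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # lower-triangular, like the causal matrix in the example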

Examples

Visualization
Self-Attention Matrix
Sentence: "The bank by the river"

SELF-ATTENTION MATRIX (bidirectional):

           The   bank  by    the   river
  The    [ 0.3   0.2   0.1   0.1   0.3 ]
  bank   [ 0.1   0.3   0.1   0.1   0.4 ]  ← "bank" attends to "river"
  by     [ 0.2   0.2   0.2   0.2   0.2 ]
  the    [ 0.2   0.2   0.1   0.3   0.2 ]
  river  [ 0.1   0.4   0.1   0.1   0.3 ]  ← "river" attends to "bank"

Each row shows what one word attends to.
"bank" strongly attends to "river" → learns it's a riverbank!

CAUSAL SELF-ATTENTION (for generation):

           The   bank  by    the   river
  The    [ 1.0    -     -     -     -  ]
  bank   [ 0.4   0.6    -     -     -  ]
  by     [ 0.2   0.3   0.5    -     -  ]
  the    [ 0.2   0.2   0.2   0.4    -  ]
  river  [ 0.1   0.3   0.1   0.2   0.3 ]

"-" = masked (can't see future tokens)
Disambiguation
Context-Dependent Meaning
SELF-ATTENTION enables contextual understanding:

Sentence 1: "I went to the bank to deposit money"
  "bank" self-attends to: "deposit", "money"
  → Learns: financial institution meaning

Sentence 2: "I sat by the bank watching the river"
  "bank" self-attends to: "river", "sat"
  → Learns: riverbank meaning

SAME WORD, DIFFERENT REPRESENTATIONS!

Without self-attention:
  "bank" → single static embedding
  → Can't distinguish meanings

With self-attention:
  "bank" → dynamic embedding based on context
  → Different vector for each usage

This is why transformers excel at understanding nuance!
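
The contrast above can be reproduced with a rough sketch (the vocabulary, embedding size, and shared projection weight here are invented purely for illustration): both sentences feed the identical static embedding for "bank" into one self-attention layer, but the contextual output vectors differ because the surrounding tokens differ.

import numpy as np

def attend(X, W):
    """One toy self-attention layer (a single shared projection W for Q, K, V keeps it short)."""
    Q = K = V = X @ W
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
vocab = ["i", "went", "to", "the", "bank", "deposit", "money", "sat", "by", "river"]
emb = {w: rng.normal(size=8) for w in vocab}       # one STATIC embedding per word
W = rng.normal(size=(8, 8)) * 0.1

sent1 = ["i", "went", "to", "the", "bank", "to", "deposit", "money"]
sent2 = ["i", "sat", "by", "the", "bank", "by", "the", "river"]

X1 = np.stack([emb[w] for w in sent1])
X2 = np.stack([emb[w] for w in sent2])
out1, out2 = attend(X1, W), attend(X2, W)

i1, i2 = sent1.index("bank"), sent2.index("bank")
print(np.allclose(X1[i1], X2[i2]))       # True:  "bank" enters with the same static embedding
print(np.allclose(out1[i1], out2[i2]))   # False: it leaves with different contextual vectors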

Interactive Exercise

Identify Self-Attention Benefits

Why is self-attention better than RNNs for long-range dependencies?

Hint: Think about how information flows in each architecture.

Pro Tips
  • Self-attention is parallelizable; RNNs are sequential
  • Constant path length between any two tokens in self-attention
  • Causal masking is crucial for autoregressive generation
  • Efficient attention variants (sparse, linear) address the O(n²) cost (see the sketch below)
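
As an illustration of the last tip (the sliding-window pattern and window size are assumptions, not a specific library's implementation): a local attention mask lets each token attend only to its w nearest neighbours, so the number of scored pairs grows roughly as O(n·w) rather than O(n²).

import numpy as np

def sliding_window_mask(seq_len, window):
    """True where attention is ALLOWED: token i sees tokens j with |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, w = 8, 2
mask = sliding_window_mask(n, w)
print(mask.astype(int))                          # banded matrix instead of a full n x n block
print(f"scored pairs: {mask.sum()} of {n * n}")  # O(n * w) vs O(n^2)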

Related Terms