
Self-Attention

Advanced [4/5]
Also known as: intra-attention, self-attention layer

Definition

Self-attention is attention applied within a single sequence: each position attends to every position (including itself) in that same sequence. Unlike cross-attention, where decoder positions attend to encoder outputs, self-attention helps the model understand relationships within the input text itself.

This allows every token to directly consider every other token, regardless of distance, capturing global context.
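
A minimal sketch in NumPy makes this concrete (the dimensions, weights, and toy sequence below are assumptions for illustration, not values from any specific model): queries, keys, and values are all projections of the same input sequence, and the row-wise softmax over QKᵀ/√d_k yields attention weights like the matrices shown in the examples further down.

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: Q, K, and V all come from the same sequence X.

    X: (seq_len, d_model) token embeddings for ONE sequence.
    Returns (output, weights), where weights[i, j] is how much token i attends to token j.
    """
    Q = X @ W_q                       # queries (seq_len, d_k)
    K = X @ W_k                       # keys    (seq_len, d_k)
    V = X @ W_v                       # values  (seq_len, d_k)

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len): all-to-all similarities

    # Softmax over each row -> every token's attention weights sum to 1
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V, weights       # contextual embeddings + attention matrix

# Toy example: 5 tokens ("The bank by the river"), d_model = 8 (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn.round(2))                  # each row sums to 1, like the matrices below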

Key Concepts

  • All-to-all connections: Each token can attend to every other token
  • Causal masking: For generation, tokens can only attend to past and current positions, never future ones (see the sketch after this list)
  • Contextual embeddings: Same word gets different representations based on context
  • O(n²) complexity: Quadratic in sequence length
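
The causal-masking bullet can be sketched in a few lines (NumPy, with made-up toy scores): positions above the diagonal are set to −∞ before the softmax, so the weights on future tokens become exactly zero, matching the "-" entries in the causal matrix in the examples below.

import numpy as np

seq_len = 5
rng = np.random.default_rng(1)
scores = rng.normal(size=(seq_len, seq_len))     # raw QK^T / sqrt(d_k) scores (toy values)

# Causal mask: token i may only attend to tokens 0..i (no future positions).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Softmax per row: masked positions get weight 0, visible positions sum to 1.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # lower-triangular, like the causal matrix in the example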

Examples

Visualization
Self-Attention Matrix
Sentence: "The bank by the river"

SELF-ATTENTION MATRIX (bidirectional):

           The   bank  by    the   river
  The    [ 0.3   0.2   0.1   0.1   0.3 ]
  bank   [ 0.1   0.3   0.1   0.1   0.4 ]  ← "bank" attends to "river"
  by     [ 0.2   0.2   0.2   0.2   0.2 ]
  the    [ 0.2   0.2   0.1   0.3   0.2 ]
  river  [ 0.1   0.4   0.1   0.1   0.3 ]  ← "river" attends to "bank"

Each row shows what one word attends to.
"bank" strongly attends to "river" → learns it's a riverbank!

CAUSAL SELF-ATTENTION (for generation):

           The   bank  by    the   river
  The    [ 1.0    -     -     -     -  ]
  bank   [ 0.4   0.6    -     -     -  ]
  by     [ 0.2   0.3   0.5    -     -  ]
  the    [ 0.2   0.2   0.2   0.4    -  ]
  river  [ 0.1   0.3   0.1   0.2   0.3 ]

"-" = masked (can't see future tokens)
Disambiguation
Context-Dependent Meaning
SELF-ATTENTION enables contextual understanding:

Sentence 1: "I went to the bank to deposit money"
  "bank" self-attends to: "deposit", "money"
  → Learns: financial institution meaning

Sentence 2: "I sat by the bank watching the river"
  "bank" self-attends to: "river", "sat"
  → Learns: riverbank meaning

SAME WORD, DIFFERENT REPRESENTATIONS!

Without self-attention:
  "bank" → single static embedding
  → Can't distinguish meanings

With self-attention:
  "bank" → dynamic embedding based on context
  → Different vector for each usage

This is why transformers excel at understanding nuance!
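
The contrast above can be reproduced with a rough sketch (the vocabulary, embedding size, and shared projection weight here are invented purely for illustration): both sentences feed the identical static embedding for "bank" into one self-attention layer, but the contextual output vectors differ because the surrounding tokens differ.

import numpy as np

def attend(X, W):
    """One toy self-attention layer (a single shared projection W for Q, K, V keeps it short)."""
    Q = K = V = X @ W
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
vocab = ["i", "went", "to", "the", "bank", "deposit", "money", "sat", "by", "river"]
emb = {w: rng.normal(size=8) for w in vocab}       # one STATIC embedding per word
W = rng.normal(size=(8, 8)) * 0.1

sent1 = ["i", "went", "to", "the", "bank", "to", "deposit", "money"]
sent2 = ["i", "sat", "by", "the", "bank", "by", "the", "river"]

X1 = np.stack([emb[w] for w in sent1])
X2 = np.stack([emb[w] for w in sent2])
out1, out2 = attend(X1, W), attend(X2, W)

i1, i2 = sent1.index("bank"), sent2.index("bank")
print(np.allclose(X1[i1], X2[i2]))       # True:  "bank" enters with the same static embedding
print(np.allclose(out1[i1], out2[i2]))   # False: it leaves with different contextual vectors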

Interactive Exercise

Identify Self-Attention Benefits

Why is self-attention better than RNNs for long-range dependencies?

Hint: Think about how information flows in each architecture.

Pro Tips
  • Self-attention is parallelizable; RNNs are sequential
  • Constant path length between any two tokens in self-attention
  • Causal masking is crucial for autoregressive generation
  • Efficient attention variants (sparse, linear) address the O(n²) cost (see the sketch below)
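
As an illustration of the last tip (the sliding-window pattern and window size are assumptions, not a specific library's implementation): a local attention mask lets each token attend only to its w nearest neighbours, so the number of scored pairs grows roughly as O(n·w) rather than O(n²).

import numpy as np

def sliding_window_mask(seq_len, window):
    """True where attention is ALLOWED: token i sees tokens j with |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, w = 8, 2
mask = sliding_window_mask(n, w)
print(mask.astype(int))                          # banded matrix instead of a full n x n block
print(f"scored pairs: {mask.sum()} of {n * n}")  # O(n * w) vs O(n^2)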

Related Terms