Model Architecture / Transformer Components

Positional Encoding

Advanced [4/5]
Also known as: position embeddings, positional embeddings, PE

Definition

Positional encoding adds information about token position to the input embeddings. Since self-attention is permutation-invariant (treats input as a set), positional encodings are essential for the model to understand word order.

Without positional information, "dog bites man" and "man bites dog" would be indistinguishable to the model.

Key Concepts

  • Sinusoidal: Original transformer uses sin/cos functions (fixed)
  • Learned: Modern models learn position embeddings during training
  • Relative: RoPE, ALiBi encode relative positions for longer contexts
  • Added to embeddings: Position info is combined with token embeddings (see the sketch after this list)
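The last bullet is easy to show concretely. Below is a minimal PyTorch sketch of learned absolute position embeddings being added to token embeddings; the sizes (vocab_size, max_len, d_model) and the token ids are illustrative assumptions, not values from this card.

```python
import torch
import torch.nn as nn

# Minimal sketch of the "added to embeddings" idea (learned absolute positions, GPT/BERT-style).
# vocab_size, max_len, d_model and the token ids below are illustrative values.
vocab_size, max_len, d_model = 50_000, 2048, 512
token_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(max_len, d_model)

tokens = torch.tensor([[17, 42, 99, 42]])               # (batch=1, seq_len=4); token 42 appears twice
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

x = token_embed(tokens) + pos_embed(positions)          # same token, different position -> different vector
print(x.shape)                                          # torch.Size([1, 4, 512])
```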

Examples

Why Needed
Self-Attention is Position-Blind
WITHOUT POSITIONAL ENCODING:

  Input 1: "The cat ate the fish"
  Input 2: "The fish ate the cat"

  Self-attention sees:
    {The, cat, ate, the, fish}
    {The, fish, ate, the, cat}   (the same set of tokens!)

  Both inputs produce identical attention patterns, just reordered.
  The model cannot distinguish word order.

WITH POSITIONAL ENCODING:

  Input 1: "The₀ cat₁ ate₂ the₃ fish₄"
  Input 2: "The₀ fish₁ ate₂ the₃ cat₄"

  Now each token carries position info:
    • "cat" at position 1 ≠ "cat" at position 4
    • Order is preserved in the representation

  Embedding = TokenEmbedding + PositionalEncoding

  "cat₁" = embed("cat") + pos_embed(1)
  "cat₄" = embed("cat") + pos_embed(4)
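A small numerical sketch of the point above: with plain scaled dot-product attention (no projections, no positions), permuting the input only permutes the output rows, so word order carries no information; adding position vectors breaks that symmetry. The random vectors and the permutation are illustrative stand-ins for real token embeddings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x):
    # Plain scaled dot-product self-attention: no projections, no positional information.
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

seq = torch.randn(5, 16)                 # "The cat ate the fish" as 5 random token vectors
perm = torch.tensor([0, 4, 2, 3, 1])     # swap "cat" and "fish"

out = self_attention(seq)
out_perm = self_attention(seq[perm])

# Without positions, permuting the input just permutes the output rows:
print(torch.allclose(out[perm], out_perm, atol=1e-6))   # True -> order carries no information

# Adding position vectors (learned or sinusoidal) breaks this symmetry:
pos = torch.randn(5, 16)
out_pos = self_attention(seq + pos)
out_pos_perm = self_attention(seq[perm] + pos)
print(torch.allclose(out_pos[perm], out_pos_perm, atol=1e-6))   # False -> positions matter
```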
Types
Different Positional Encoding Methods
1. SINUSOIDAL (original Transformer)

   PE(pos, 2i)   = sin(pos / 10000^(2i/d))
   PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

   Properties:
   • Fixed, no learned parameters
   • Can generalize to longer sequences
   • Each dimension has a different frequency

2. LEARNED ABSOLUTE (GPT, BERT)

   pos_embed = nn.Embedding(max_length, d_model)
   output = token_embed + pos_embed[position]

   Properties:
   • Learned during training
   • Limited to max_length positions
   • Simple and effective

3. ROTARY (RoPE - LLaMA, GPT-NeoX)

   Rotates query/key vectors based on position,
   encoding RELATIVE position in the attention scores.

   Properties:
   • Better length generalization
   • Captures relative position naturally
   • State of the art for long contexts

4. ALiBi (BLOOM)

   Adds a linear bias to the attention scores:
   bias = -m × |query_pos - key_pos|

   Properties:
   • No positional embeddings at all!
   • Extrapolates well to longer contexts
   • Simple and efficient
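For concreteness, here is a hedged PyTorch sketch of three of these schemes: the sinusoidal formula above, a minimal interleaved RoPE rotation, and the ALiBi bias matrix. The function names, single-head shapes, and the ALiBi slope of 0.5 are assumptions for illustration, not the exact implementations used by the models named above.

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    # Assumes an even d_model for simplicity.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dimensions
    angle = pos / (10000.0 ** (i / d_model))                            # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def rope_rotate(x, base=10000.0):
    # Rotary positions: rotate (even, odd) pairs of q/k dimensions by a position-dependent
    # angle so the q·k dot product depends on relative position. Pairing conventions vary
    # between implementations; this is the interleaved variant.
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    theta = pos / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    cos, sin = torch.cos(theta), torch.sin(theta)
    out = torch.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def alibi_bias(seq_len, slope=0.5):
    # ALiBi: bias = -m * |query_pos - key_pos|, added to attention scores before softmax.
    # In BLOOM each head gets its own slope m (a geometric series); 0.5 is just an example.
    q_pos = torch.arange(seq_len).unsqueeze(1)
    k_pos = torch.arange(seq_len).unsqueeze(0)
    return -slope * (q_pos - k_pos).abs().float()

print(sinusoidal_pe(4, 8))
print(rope_rotate(torch.randn(4, 8)).shape)   # torch.Size([4, 8])
print(alibi_bias(4))
```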

Interactive Exercise

Position Impact

If a model was trained with a maximum position of 2048, what happens when it is given a sequence of 3000 tokens?

Consider different encoding types.
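One way to probe the question empirically is sketched below. The sizes are illustrative, and a learned absolute embedding table is assumed for the first case.

```python
import math
import torch
import torch.nn as nn

max_len, d_model = 2048, 64                 # illustrative sizes
learned = nn.Embedding(max_len, d_model)    # learned absolute table covers positions 0..2047

positions = torch.arange(3000)
try:
    learned(positions)                      # positions 2048..2999 are outside the table
except IndexError as err:
    print("learned absolute PE fails:", err)

# A formula-based encoding (sinusoidal, RoPE, ALiBi) can at least be *computed* for any
# position; whether the model still behaves well that far out is a separate question.
print(math.sin(2999 / 10000 ** (0 / d_model)))   # PE(2999, 0) exists with no lookup table
```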

Pro Tips
  • RoPE is now the most popular choice for new models
  • Length generalization is an active research area
  • Position interpolation can extend trained models to longer contexts (sketched after this list)
  • Some architectures explore "position-free" designs
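The interpolation tip can be sketched in a few lines. This assumes a RoPE-style model and the simple linear rescaling variant; real recipes differ in detail.

```python
import torch

# Conceptual sketch of linear position interpolation (assumption: RoPE-style angles).
# Trained range is 0..2047; a 4096-token input is squeezed into that range by rescaling
# positions instead of extrapolating past them.
train_len, new_len, d = 2048, 4096, 64
positions = torch.arange(new_len, dtype=torch.float32)
scaled = positions * (train_len / new_len)              # 0.0, 0.5, 1.0, ..., 2047.5

theta = scaled.unsqueeze(1) / (10000.0 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
print(scaled.max().item())   # 2047.5 -> positions stay inside the range seen during training
print(theta.shape)           # torch.Size([4096, 32]) rotation angles
```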

Related Terms