Model Architecture / Transformer Components

Positional Encoding

Advanced [4/5]
Also known as: position embeddings, positional embeddings, PE

Definition

Positional encoding adds information about token position to the input embeddings. Since self-attention is permutation-invariant (treats input as a set), positional encodings are essential for the model to understand word order.

Without positional information, "dog bites man" and "man bites dog" would be indistinguishable to the model.

Key Concepts

  • Sinusoidal: Original transformer uses sin/cos functions (fixed)
  • Learned: Modern models learn position embeddings during training
  • Relative: RoPE, ALiBi encode relative positions for longer contexts
  • Added to embeddings: Position info is combined with token embeddings (see the sketch after this list)
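The last bullet is easy to show concretely. Below is a minimal PyTorch sketch of learned absolute position embeddings being added to token embeddings; the sizes (vocab_size, max_len, d_model) and the token ids are illustrative assumptions, not values from this card.

```python
import torch
import torch.nn as nn

# Minimal sketch of the "added to embeddings" idea (learned absolute positions, GPT/BERT-style).
# vocab_size, max_len, d_model and the token ids below are illustrative values.
vocab_size, max_len, d_model = 50_000, 2048, 512
token_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(max_len, d_model)

tokens = torch.tensor([[17, 42, 99, 42]])               # (batch=1, seq_len=4); token 42 appears twice
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

x = token_embed(tokens) + pos_embed(positions)          # same token, different position -> different vector
print(x.shape)                                          # torch.Size([1, 4, 512])
```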

Examples

Why Needed
Self-Attention is Position-Blind
WITHOUT POSITIONAL ENCODING:

  Input 1: "The cat ate the fish"
  Input 2: "The fish ate the cat"

  Self-attention sees:
    {The, cat, ate, the, fish}
    {The, fish, ate, the, cat}   (the same set of tokens!)

  Both inputs produce identical attention patterns, just reordered.
  The model cannot distinguish word order.

WITH POSITIONAL ENCODING:

  Input 1: "The₀ cat₁ ate₂ the₃ fish₄"
  Input 2: "The₀ fish₁ ate₂ the₃ cat₄"

  Now each token carries position info:
    • "cat" at position 1 ≠ "cat" at position 4
    • Order is preserved in the representation

  Embedding = TokenEmbedding + PositionalEncoding

  "cat₁" = embed("cat") + pos_embed(1)
  "cat₄" = embed("cat") + pos_embed(4)
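A small numerical sketch of the point above: with plain scaled dot-product attention (no projections, no positions), permuting the input only permutes the output rows, so word order carries no information; adding position vectors breaks that symmetry. The random vectors and the permutation are illustrative stand-ins for real token embeddings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x):
    # Plain scaled dot-product self-attention: no projections, no positional information.
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

seq = torch.randn(5, 16)                 # "The cat ate the fish" as 5 random token vectors
perm = torch.tensor([0, 4, 2, 3, 1])     # swap "cat" and "fish"

out = self_attention(seq)
out_perm = self_attention(seq[perm])

# Without positions, permuting the input just permutes the output rows:
print(torch.allclose(out[perm], out_perm, atol=1e-6))   # True -> order carries no information

# Adding position vectors (learned or sinusoidal) breaks this symmetry:
pos = torch.randn(5, 16)
out_pos = self_attention(seq + pos)
out_pos_perm = self_attention(seq[perm] + pos)
print(torch.allclose(out_pos[perm], out_pos_perm, atol=1e-6))   # False -> positions matter
```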
Types
Different Positional Encoding Methods
1. SINUSOIDAL (original Transformer)

   PE(pos, 2i)   = sin(pos / 10000^(2i/d))
   PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

   Properties:
   • Fixed, no learned parameters
   • Can generalize to longer sequences
   • Each dimension has a different frequency

2. LEARNED ABSOLUTE (GPT, BERT)

   pos_embed = nn.Embedding(max_length, d_model)
   output = token_embed + pos_embed[position]

   Properties:
   • Learned during training
   • Limited to max_length positions
   • Simple and effective

3. ROTARY (RoPE - LLaMA, GPT-NeoX)

   Rotates query/key vectors based on position,
   encoding RELATIVE position in the attention scores.

   Properties:
   • Better length generalization
   • Captures relative position naturally
   • State of the art for long contexts

4. ALiBi (BLOOM)

   Adds a linear bias to the attention scores:
   bias = -m × |query_pos - key_pos|

   Properties:
   • No positional embeddings at all!
   • Extrapolates well to longer contexts
   • Simple and efficient
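For concreteness, here is a hedged PyTorch sketch of three of these schemes: the sinusoidal formula above, a minimal interleaved RoPE rotation, and the ALiBi bias matrix. The function names, single-head shapes, and the ALiBi slope of 0.5 are assumptions for illustration, not the exact implementations used by the models named above.

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    # Assumes an even d_model for simplicity.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dimensions
    angle = pos / (10000.0 ** (i / d_model))                            # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def rope_rotate(x, base=10000.0):
    # Rotary positions: rotate (even, odd) pairs of q/k dimensions by a position-dependent
    # angle so the q·k dot product depends on relative position. Pairing conventions vary
    # between implementations; this is the interleaved variant.
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    theta = pos / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    cos, sin = torch.cos(theta), torch.sin(theta)
    out = torch.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def alibi_bias(seq_len, slope=0.5):
    # ALiBi: bias = -m * |query_pos - key_pos|, added to attention scores before softmax.
    # In BLOOM each head gets its own slope m (a geometric series); 0.5 is just an example.
    q_pos = torch.arange(seq_len).unsqueeze(1)
    k_pos = torch.arange(seq_len).unsqueeze(0)
    return -slope * (q_pos - k_pos).abs().float()

print(sinusoidal_pe(4, 8))
print(rope_rotate(torch.randn(4, 8)).shape)   # torch.Size([4, 8])
print(alibi_bias(4))
```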

Interactive Exercise

Position Impact

If a model was trained with a maximum position of 2048, what happens when it is given a sequence of 3000 tokens?

Consider different encoding types.
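One way to probe the question empirically is sketched below. The sizes are illustrative, and a learned absolute embedding table is assumed for the first case.

```python
import math
import torch
import torch.nn as nn

max_len, d_model = 2048, 64                 # illustrative sizes
learned = nn.Embedding(max_len, d_model)    # learned absolute table covers positions 0..2047

positions = torch.arange(3000)
try:
    learned(positions)                      # positions 2048..2999 are outside the table
except IndexError as err:
    print("learned absolute PE fails:", err)

# A formula-based encoding (sinusoidal, RoPE, ALiBi) can at least be *computed* for any
# position; whether the model still behaves well that far out is a separate question.
print(math.sin(2999 / 10000 ** (0 / d_model)))   # PE(2999, 0) exists with no lookup table
```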

Pro Tips
  • RoPE is now the most popular choice for new models
  • Length generalization is an active research area
  • Position interpolation can extend trained models to longer contexts (sketched after this list)
  • Some architectures explore "position-free" designs
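The interpolation tip can be sketched in a few lines. This assumes a RoPE-style model and the simple linear rescaling variant; real recipes differ in detail.

```python
import torch

# Conceptual sketch of linear position interpolation (assumption: RoPE-style angles).
# Trained range is 0..2047; a 4096-token input is squeezed into that range by rescaling
# positions instead of extrapolating past them.
train_len, new_len, d = 2048, 4096, 64
positions = torch.arange(new_len, dtype=torch.float32)
scaled = positions * (train_len / new_len)              # 0.0, 0.5, 1.0, ..., 2047.5

theta = scaled.unsqueeze(1) / (10000.0 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
print(scaled.max().item())   # 2047.5 -> positions stay inside the range seen during training
print(theta.shape)           # torch.Size([4096, 32]) rotation angles
```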

Related Terms