Model Architecture & Training / Neural Network Architectures

Transformer

Transformer architecture · Attention-based model

Definition

The Transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need," it revolutionized the field by enabling models to process entire sequences in parallel using a mechanism called attention.

GPT, Claude, Llama, and virtually all current language models are built on transformer architecture.

Key Concepts

  • Self-attention: Allows each token to attend to all other tokens (see the sketch after this list)
  • Parallel processing: Processes entire sequences at once (unlike RNNs)
  • Positional encoding: Adds position information, since attention on its own has no sense of word order
  • Multi-head attention: Multiple attention patterns learned simultaneously
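
To make the self-attention and parallel-processing bullets concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The dimensions and the random projection matrices (Wq, Wk, Wv) are illustrative assumptions, not values from any real model; the point is that every token's query is compared against every other token's key in one matrix operation, which is what allows the whole sequence to be processed at once.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative shapes only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token vectors; Wq/Wk/Wv: (d_model, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # all tokens projected in parallel
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (seq_len, seq_len): each token vs. all others
    weights = softmax(scores, axis=-1)          # each row sums to 1: "where does this token look?"
    return weights @ V                          # weighted mix of value vectors

# Tiny example: 5 tokens, model width 8, one attention head of size 4 (made-up sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4) -> one output vector per token, computed in a single pass
```

In a real model this is repeated across several heads (multi-head attention), each with its own projections, so different heads can learn different relationship patterns.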

Examples

Attention Mechanism
How Attention Works
Sentence: "The cat sat on the mat because it was tired"
Question: What does "it" refer to?

ATTENTION VISUALIZATION (attention weights for the token "it"):
  "The"     → low attention
  "cat"     → HIGH ATTENTION  ← "it" looks here!
  "sat"     → low attention
  "on"      → low attention
  "the"     → low attention
  "mat"     → some attention
  "because" → low attention
  "it"      → (self)
  "was"     → low attention
  "tired"   → some attention

The attention mechanism learns that "it" should attend strongly to "cat" (the thing that was tired) rather than "mat" (which could also fit grammatically).
Attention lets the model understand relationships between words regardless of distance.
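
The same idea can be shown with toy numbers. The vectors below are hand-picked for illustration (real models learn them during training): because the vector for "it" points in a similar direction to the vector for "cat", their dot product, and therefore the softmax attention weight, comes out highest.

```python
# Toy illustration (made-up vectors, not learned weights): why "it" attends to "cat".
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "tired"]
vecs = np.array([
    [0.1, 0.0],   # The
    [0.9, 0.8],   # cat   <- chosen to be similar to "it"
    [0.2, 0.1],   # sat
    [0.0, 0.1],   # on
    [0.1, 0.0],   # the
    [0.5, 0.2],   # mat
    [0.1, 0.1],   # because
    [0.8, 0.9],   # it    <- the query token
    [0.2, 0.0],   # was
    [0.4, 0.3],   # tired
])

query = vecs[tokens.index("it")]
scores = vecs @ query                              # similarity of "it" to every token
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights

for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{tok:8s} {w:.2f}")                     # "it" (self) and "cat" get the highest weights
```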
Transformer Architecture
Simplified Structure
INPUT: "Hello world"
        ↓
┌─────────────────────────────┐
│ Token Embeddings            │  Convert words to vectors
│ + Positional Encoding       │  Add position info
└─────────────────────────────┘
        ↓
┌─────────────────────────────┐
│ Transformer Block (×N)      │
│ ┌───────────────────────┐   │
│ │ Multi-Head Attention  │   │  Each token attends to all others
│ └───────────────────────┘   │
│            ↓                │
│ ┌───────────────────────┐   │
│ │ Feed Forward          │   │  Process information
│ └───────────────────────┘   │
└─────────────────────────────┘
        ↓
┌─────────────────────────────┐
│ Output Layer                │  Predict next token
└─────────────────────────────┘
        ↓
OUTPUT: probability distribution over vocabulary
Multiple transformer blocks are stacked to create deep models.
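
Here is a minimal PyTorch sketch of that stacked structure. The sizes, names, and the choice of a learned positional embedding are illustrative assumptions rather than any particular model's configuration; the flow mirrors the diagram: embeddings plus positions, N blocks of attention and feed-forward, then an output layer over the vocabulary.

```python
# Minimal decoder-style transformer sketch (illustrative sizes, not a real model's config).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, mask):
        # Multi-head attention: each token attends to the (allowed) others, with a residual connection
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        # Feed-forward: process each position independently, with a residual connection
        return x + self.ff(self.ln2(x))

class TinyTransformer(nn.Module):
    def __init__(self, vocab=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)       # token embeddings
        self.pos = nn.Embedding(max_len, d_model)     # learned positional encoding
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.out = nn.Linear(d_model, vocab)          # output layer over the vocabulary

    def forward(self, ids):                           # ids: (batch, seq_len) token indices
        T = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        # Causal mask: True above the diagonal = "not allowed to attend to future tokens"
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=ids.device), diagonal=1)
        for blk in self.blocks:
            x = blk(x, mask)
        return self.out(x)                            # (batch, seq_len, vocab) next-token logits

logits = TinyTransformer()(torch.randint(0, 1000, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 1000]) -> a next-token distribution at every position
```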
Transformer Variants
Different Transformer Types
ENCODER-ONLY (BERT-style)
  • Bidirectional attention
  • Good for: classification, embeddings
  • Examples: BERT, RoBERTa

DECODER-ONLY (GPT-style)
  • Causal/autoregressive attention
  • Good for: text generation
  • Examples: GPT-4, Claude, Llama

ENCODER-DECODER (T5-style)
  • Full sequence-to-sequence
  • Good for: translation, summarization
  • Examples: T5, BART

Modern LLMs mostly use the DECODER-ONLY architecture with causal attention.
Different transformer configurations serve different purposes.
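
The main mechanical difference between these variants is the attention mask. The sketch below (illustrative NumPy, not any library's internal code) contrasts a bidirectional mask, where every position can see every other, with the causal mask used by decoder-only models, where each position can only see itself and earlier positions.

```python
# Bidirectional vs. causal attention masks (illustrative sketch).
import numpy as np

seq_len = 5
bidirectional = np.ones((seq_len, seq_len), dtype=bool)       # encoder-style: all positions visible
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))     # decoder-style: lower triangle only

print(causal.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
# Row i = which positions token i may look at when predicting the next token.

# Blocked positions are set to -inf before the softmax, so they get zero attention weight.
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
masked_scores = np.where(causal, scores, -np.inf)
```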

Interactive Exercise

Understand Attention

In this sentence, which words would "she" most attend to?

"Mary told John that she would help him with the project."

Rank the attention from highest to lowest for the word "she".

Pro Tips
  • Attention is what enables long-range dependencies in text
  • More attention heads = more relationship patterns learned
  • Transformer scale (parameters) correlates with capability
  • Understanding attention helps with prompt engineering

Related Terms