Model Architecture & Training / Neural Network Architectures

Transformer

Transformer architecture · Attention-based model

Definition

The Transformer is the neural network architecture behind virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need," it revolutionized the field by enabling models to process entire sequences in parallel using a mechanism called attention.

GPT, Claude, Llama, and virtually all current language models are built on transformer architecture.

Key Concepts

  • Self-attention: Allows each token to attend to all other tokens (see the sketch after this list)
  • Parallel processing: Processes entire sequences at once (unlike RNNs)
  • Positional encoding: Adds position information, since attention on its own has no sense of word order
  • Multi-head attention: Multiple attention patterns learned simultaneously
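
To make the self-attention and parallel-processing bullets concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The dimensions and the random projection matrices (Wq, Wk, Wv) are illustrative assumptions, not values from any real model; the point is that every token's query is compared against every other token's key in one matrix operation, which is what allows the whole sequence to be processed at once.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative shapes only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token vectors; Wq/Wk/Wv: (d_model, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # all tokens projected in parallel
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (seq_len, seq_len): each token vs. all others
    weights = softmax(scores, axis=-1)          # each row sums to 1: "where does this token look?"
    return weights @ V                          # weighted mix of value vectors

# Tiny example: 5 tokens, model width 8, one attention head of size 4 (made-up sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4) -> one output vector per token, computed in a single pass
```

In a real model this is repeated across several heads (multi-head attention), each with its own projections, so different heads can learn different relationship patterns.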

Examples

Attention Mechanism
How Attention Works
Sentence: "The cat sat on the mat because it was tired"
Question: What does "it" refer to?

ATTENTION VISUALIZATION (attention weights for the token "it"):
  "The"     → low attention
  "cat"     → HIGH ATTENTION  ← "it" looks here!
  "sat"     → low attention
  "on"      → low attention
  "the"     → low attention
  "mat"     → some attention
  "because" → low attention
  "it"      → (self)
  "was"     → low attention
  "tired"   → some attention

The attention mechanism learns that "it" should attend strongly to "cat" (the thing that was tired) rather than "mat" (which could also fit grammatically).
Attention lets the model understand relationships between words regardless of distance.
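
The same idea can be shown with toy numbers. The vectors below are hand-picked for illustration (real models learn them during training): because the vector for "it" points in a similar direction to the vector for "cat", their dot product, and therefore the softmax attention weight, comes out highest.

```python
# Toy illustration (made-up vectors, not learned weights): why "it" attends to "cat".
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "tired"]
vecs = np.array([
    [0.1, 0.0],   # The
    [0.9, 0.8],   # cat   <- chosen to be similar to "it"
    [0.2, 0.1],   # sat
    [0.0, 0.1],   # on
    [0.1, 0.0],   # the
    [0.5, 0.2],   # mat
    [0.1, 0.1],   # because
    [0.8, 0.9],   # it    <- the query token
    [0.2, 0.0],   # was
    [0.4, 0.3],   # tired
])

query = vecs[tokens.index("it")]
scores = vecs @ query                              # similarity of "it" to every token
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights

for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{tok:8s} {w:.2f}")                     # "it" (self) and "cat" get the highest weights
```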
Transformer Architecture
Simplified Structure
INPUT: "Hello world"
        ↓
┌─────────────────────────────┐
│ Token Embeddings            │  Convert words to vectors
│ + Positional Encoding       │  Add position info
└─────────────────────────────┘
        ↓
┌─────────────────────────────┐
│ Transformer Block (×N)      │
│ ┌───────────────────────┐   │
│ │ Multi-Head Attention  │   │  Each token attends to all others
│ └───────────────────────┘   │
│            ↓                │
│ ┌───────────────────────┐   │
│ │ Feed Forward          │   │  Process information
│ └───────────────────────┘   │
└─────────────────────────────┘
        ↓
┌─────────────────────────────┐
│ Output Layer                │  Predict next token
└─────────────────────────────┘
        ↓
OUTPUT: probability distribution over vocabulary
Multiple transformer blocks are stacked to create deep models.
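
Here is a minimal PyTorch sketch of that stacked structure. The sizes, names, and the choice of a learned positional embedding are illustrative assumptions rather than any particular model's configuration; the flow mirrors the diagram: embeddings plus positions, N blocks of attention and feed-forward, then an output layer over the vocabulary.

```python
# Minimal decoder-style transformer sketch (illustrative sizes, not a real model's config).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, mask):
        # Multi-head attention: each token attends to the (allowed) others, with a residual connection
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        # Feed-forward: process each position independently, with a residual connection
        return x + self.ff(self.ln2(x))

class TinyTransformer(nn.Module):
    def __init__(self, vocab=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)       # token embeddings
        self.pos = nn.Embedding(max_len, d_model)     # learned positional encoding
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.out = nn.Linear(d_model, vocab)          # output layer over the vocabulary

    def forward(self, ids):                           # ids: (batch, seq_len) token indices
        T = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        # Causal mask: True above the diagonal = "not allowed to attend to future tokens"
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=ids.device), diagonal=1)
        for blk in self.blocks:
            x = blk(x, mask)
        return self.out(x)                            # (batch, seq_len, vocab) next-token logits

logits = TinyTransformer()(torch.randint(0, 1000, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 1000]) -> a next-token distribution at every position
```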
Transformer Variants
Different Transformer Types
ENCODER-ONLY (BERT-style)
  • Bidirectional attention
  • Good for: classification, embeddings
  • Examples: BERT, RoBERTa

DECODER-ONLY (GPT-style)
  • Causal/autoregressive attention
  • Good for: text generation
  • Examples: GPT-4, Claude, Llama

ENCODER-DECODER (T5-style)
  • Full sequence-to-sequence
  • Good for: translation, summarization
  • Examples: T5, BART

Modern LLMs mostly use the DECODER-ONLY architecture with causal attention.
Different transformer configurations serve different purposes.
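
The main mechanical difference between these variants is the attention mask. The sketch below (illustrative NumPy, not any library's internal code) contrasts a bidirectional mask, where every position can see every other, with the causal mask used by decoder-only models, where each position can only see itself and earlier positions.

```python
# Bidirectional vs. causal attention masks (illustrative sketch).
import numpy as np

seq_len = 5
bidirectional = np.ones((seq_len, seq_len), dtype=bool)       # encoder-style: all positions visible
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))     # decoder-style: lower triangle only

print(causal.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
# Row i = which positions token i may look at when predicting the next token.

# Blocked positions are set to -inf before the softmax, so they get zero attention weight.
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
masked_scores = np.where(causal, scores, -np.inf)
```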

Interactive Exercise

Understand Attention

In this sentence, which words would "she" most attend to?

"Mary told John that she would help him with the project."

Rank the attention from highest to lowest for the word "she".

Pro Tips
  • Attention is what enables long-range dependencies in text
  • More attention heads = more relationship patterns learned
  • Transformer scale (parameters) correlates with capability
  • Understanding attention helps with prompt engineering

Related Terms