Model Internals / Text Processing

Tokenizer

Beginner [2/5]
Tokenization · Text encoder · Subword tokenizer

Definition

A tokenizer converts text into tokens—the fundamental units that language models process. Modern tokenizers typically use subword algorithms (like BPE) that break text into pieces smaller than words but larger than characters, balancing vocabulary size with representation quality.

The same tokenizer must be used for training and inference to ensure consistent text-to-token mapping.
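To make the subword idea concrete, here is a minimal sketch of BPE-style training: start each word as a sequence of characters, count adjacent symbol pairs across a small corpus, and greedily merge the most frequent pair until a merge budget is used up. The helper name (bpe_train) and the tiny word list are purely illustrative; real tokenizers add byte-level handling, pre-tokenization rules, and train on far larger corpora.

# Toy BPE trainer (illustrative sketch, not a production tokenizer)
from collections import Counter

def bpe_train(words, num_merges=5):
    # Each word starts as a tuple of characters, with its corpus frequency
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest", "newest", "widest"])
print(merges)  # learned merge rules, e.g. ('l', 'o') then ('lo', 'w') ...
print(corpus)  # frequent pieces collapse toward whole tokens, rare ones stay split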

Key Concepts

  • Vocabulary: Fixed set of tokens the model knows (~32K-100K)
  • Subword tokenization: Splits rare words, keeps common ones whole
  • Token IDs: Integer indices into the vocabulary
  • Special tokens: [CLS], [SEP], <|endoftext|>, etc. (see the example below)
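The last two concepts are easy to see in practice. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (the exact subword split may differ); it shows how text maps to integer token IDs and how special tokens are added automatically.

# Token IDs and special tokens in practice (Hugging Face transformers assumed)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.vocab_size)  # size of the fixed vocabulary (~30K for this model)

out = tok("Tokenizers are neat")
print(out["input_ids"])  # integer indices into the vocabulary
print(tok.convert_ids_to_tokens(out["input_ids"]))
# Something like ['[CLS]', 'token', '##izer', '##s', 'are', 'neat', '[SEP]']
# [CLS] and [SEP] are special tokens this model expects around every input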

Examples

Tokenization
How Text Becomes Tokens
Text: "Tokenization is surprisingly important!" GPT-4 / cl100k_base tokenizer: ["Token", "ization", " is", " surprisingly", " important", "!"] → 6 tokens Claude tokenizer: ["Token", "ization", " is", " surpr", "isingly", " important", "!"] → 7 tokens OBSERVATIONS: - "Tokenization" split into "Token" + "ization" - Space often attached to following word (" is") - Common words stay whole ("important") - Rare words get split more ("surprisingly") WORD vs SUBWORD vs CHARACTER: Word-level: ["Tokenization", "is", "surprisingly", "important"] → Small vocab problem: can't handle new words Character: ["T","o","k","e","n","i","z","a","t","i","o","n",...] → Too many tokens, loses meaning Subword: ["Token", "ization", " is", ...] → Best of both! Efficient + handles new words
Implementation
Using Tokenizers
# OpenAI tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Hello, world!"
tokens = enc.encode(text)
print(tokens)       # [9906, 11, 1917, 0]
print(len(tokens))  # 4 tokens

# Decode back to text
decoded = enc.decode(tokens)
print(decoded)      # "Hello, world!"

# Count tokens (for cost estimation)
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# HuggingFace transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello, world!")
decoded = tokenizer.decode(tokens)

# See individual tokens
token_strings = tokenizer.tokenize("Hello, world!")
print(token_strings)  # ['Hello', ',', 'Ġworld', '!']
                      # Ġ represents a space

Interactive Exercise

Estimate Token Count

Roughly how many tokens would this text have?

Text: "Artificial intelligence and machine learning are transforming industries worldwide."

Hint: ~4 characters per token on average for English
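One way to check an estimate like this is to compare the 4-characters-per-token heuristic against the actual tokenizer, as in the implementation above (a small sketch using tiktoken; the exact count depends on the encoding):

import tiktoken

text = ("Artificial intelligence and machine learning are "
        "transforming industries worldwide.")
enc = tiktoken.encoding_for_model("gpt-4")

estimate = len(text) / 4        # rule of thumb: ~4 characters per token
actual = len(enc.encode(text))  # ground truth for this encoding
print(f"estimate ~ {estimate:.0f} tokens, actual = {actual} tokens")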

Pro Tips
  • Rule of thumb: 1 token ≈ 4 characters or ≈ 0.75 words in English
  • Non-English text often uses more tokens per word
  • Code uses more tokens than prose (special characters, whitespace, and identifiers split into many pieces); see the quick comparison below
  • Always test with the actual tokenizer for accurate counts
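The quick comparison below illustrates these tips by counting tokens for English prose, non-English text, and code with tiktoken. The sample strings are purely illustrative, and exact numbers vary by tokenizer:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
    "code": "for (let i = 0; i < items.length; i++) { total += items[i]; }",
}
for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.1f} chars/token)")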

Related Terms