Tokenization
How Text Becomes Tokens
Text: "Tokenization is surprisingly important!"
GPT-4 / cl100k_base tokenizer:
["Token", "ization", " is", " surprisingly", " important", "!"]
→ 6 tokens
Claude tokenizer:
["Token", "ization", " is", " surpr", "isingly", " important", "!"]
→ 7 tokens
OBSERVATIONS:
- "Tokenization" split into "Token" + "ization"
- Space often attached to following word (" is")
- Common words stay whole ("important")
- Rare words get split more ("surprisingly")
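You can reproduce splits like these yourself. A minimal sketch using tiktoken's public cl100k_base encoding (the encoding used for GPT-4); exact splits can vary with tokenizer version, so treat the printed pieces as illustrative:
# Sketch: inspect how cl100k_base splits a sentence (assumes tiktoken is installed)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is surprisingly important!"

# Decode each token id individually to see the pieces
pieces = [enc.decode([t]) for t in enc.encode(text)]
print(pieces)            # e.g. ['Token', 'ization', ' is', ' surprisingly', ' important', '!']
print(len(pieces), "tokens")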
WORD vs SUBWORD vs CHARACTER:
Word-level: ["Tokenization", "is", "surprisingly", "important"]
→ Small vocab problem: can't handle new words
Character: ["T","o","k","e","n","i","z","a","t","i","o","n",...]
→ Too many tokens, loses meaning
Subword: ["Token", "ization", " is", ...]
→ Best of both! Efficient + handles new words
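To make the trade-off concrete, here is a small sketch comparing the three granularities on the same sentence (assuming tiktoken is available; the subword count matches the GPT-4 example above):
# Sketch: compare token counts at character, subword, and word granularity
import tiktoken

text = "Tokenization is surprisingly important!"
enc = tiktoken.get_encoding("cl100k_base")

char_tokens = list(text)           # character-level: one token per character
subword_tokens = enc.encode(text)  # subword-level: BPE pieces
word_tokens = text.split()         # naive word-level: whitespace split

print(len(char_tokens), "characters")         # 39
print(len(subword_tokens), "subword tokens")  # 6 with cl100k_base
print(len(word_tokens), "words")              # 4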
Implementation
Using Tokenizers
# OpenAI tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Hello, world!"
tokens = enc.encode(text)
print(tokens) # [9906, 11, 1917, 0]
print(len(tokens)) # 4 tokens
# Decode back to text
decoded = enc.decode(tokens)
print(decoded) # "Hello, world!"
# Count tokens (for cost estimation)
def count_tokens(text, model="gpt-4"):
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
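Building on count_tokens, a hedged sketch of rough input-cost estimation; the price per 1K tokens below is a made-up placeholder, not a real rate, so check your provider's current pricing:
# Sketch: rough input-cost estimate using count_tokens
PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical USD rate, placeholder only

def estimate_input_cost(text, model="gpt-4"):
    n_tokens = count_tokens(text, model)
    return n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(estimate_input_cost("Hello, world!"))  # 4 tokens -> 0.00004 at the placeholder rate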
# HuggingFace transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello, world!")
decoded = tokenizer.decode(tokens)
# See individual tokens
token_strings = tokenizer.tokenize("Hello, world!")
print(token_strings) # ['Hello', ',', 'Ġworld', '!']
# Ġ represents a space
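To connect the ids and the "Ġ" pieces, a short sketch (assuming transformers is installed) that maps ids back to token strings and then back to plain text:
# Sketch: map ids back to token strings and plain text (GPT-2 tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Hello, world!")

token_strings = tokenizer.convert_ids_to_tokens(ids)
print(token_strings)                                       # ['Hello', ',', 'Ġworld', '!']
print(tokenizer.convert_tokens_to_string(token_strings))   # "Hello, world!"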