Model Internals / Text Processing

Tokenizer

Beginner [2/5]
Tokenization · Text encoder · Subword tokenizer

Definition

A tokenizer converts text into tokens—the fundamental units that language models process. Modern tokenizers typically use subword algorithms (like BPE) that break text into pieces smaller than words but larger than characters, balancing vocabulary size with representation quality.

The same tokenizer must be used for training and inference to ensure consistent text-to-token mapping.
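To make the subword idea concrete, here is a minimal sketch of BPE-style training: start each word as a sequence of characters, count adjacent symbol pairs across a small corpus, and greedily merge the most frequent pair until a merge budget is used up. The helper name (bpe_train) and the tiny word list are purely illustrative; real tokenizers add byte-level handling, pre-tokenization rules, and train on far larger corpora.

# Toy BPE trainer (illustrative sketch, not a production tokenizer)
from collections import Counter

def bpe_train(words, num_merges=5):
    # Each word starts as a tuple of characters, with its corpus frequency
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest", "newest", "widest"])
print(merges)  # learned merge rules, e.g. ('l', 'o') then ('lo', 'w') ...
print(corpus)  # frequent pieces collapse toward whole tokens, rare ones stay split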

Key Concepts

  • Vocabulary: Fixed set of tokens the model knows (~32K-100K)
  • Subword tokenization: Splits rare words, keeps common ones whole
  • Token IDs: Integer indices into the vocabulary
  • Special tokens: [CLS], [SEP], <|endoftext|>, etc. (see the example below)
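The last two concepts are easy to see in practice. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (the exact subword split may differ); it shows how text maps to integer token IDs and how special tokens are added automatically.

# Token IDs and special tokens in practice (Hugging Face transformers assumed)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.vocab_size)  # size of the fixed vocabulary (~30K for this model)

out = tok("Tokenizers are neat")
print(out["input_ids"])  # integer indices into the vocabulary
print(tok.convert_ids_to_tokens(out["input_ids"]))
# Something like ['[CLS]', 'token', '##izer', '##s', 'are', 'neat', '[SEP]']
# [CLS] and [SEP] are special tokens this model expects around every input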

Examples

Tokenization
How Text Becomes Tokens
Text: "Tokenization is surprisingly important!" GPT-4 / cl100k_base tokenizer: ["Token", "ization", " is", " surprisingly", " important", "!"] → 6 tokens Claude tokenizer: ["Token", "ization", " is", " surpr", "isingly", " important", "!"] → 7 tokens OBSERVATIONS: - "Tokenization" split into "Token" + "ization" - Space often attached to following word (" is") - Common words stay whole ("important") - Rare words get split more ("surprisingly") WORD vs SUBWORD vs CHARACTER: Word-level: ["Tokenization", "is", "surprisingly", "important"] → Small vocab problem: can't handle new words Character: ["T","o","k","e","n","i","z","a","t","i","o","n",...] → Too many tokens, loses meaning Subword: ["Token", "ization", " is", ...] → Best of both! Efficient + handles new words
Implementation
Using Tokenizers
# OpenAI tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Hello, world!"
tokens = enc.encode(text)
print(tokens)       # [9906, 11, 1917, 0]
print(len(tokens))  # 4 tokens

# Decode back to text
decoded = enc.decode(tokens)
print(decoded)      # "Hello, world!"

# Count tokens (for cost estimation)
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# HuggingFace transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello, world!")
decoded = tokenizer.decode(tokens)

# See individual tokens
token_strings = tokenizer.tokenize("Hello, world!")
print(token_strings)  # ['Hello', ',', 'Ġworld', '!']
                      # Ġ represents a space

Interactive Exercise

Estimate Token Count

Roughly how many tokens would this text have?

Text: "Artificial intelligence and machine learning are transforming industries worldwide."

Hint: ~4 characters per token on average for English
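One way to check an estimate like this is to compare the 4-characters-per-token heuristic against the actual tokenizer, as in the implementation above (a small sketch using tiktoken; the exact count depends on the encoding):

import tiktoken

text = ("Artificial intelligence and machine learning are "
        "transforming industries worldwide.")
enc = tiktoken.encoding_for_model("gpt-4")

estimate = len(text) / 4        # rule of thumb: ~4 characters per token
actual = len(enc.encode(text))  # ground truth for this encoding
print(f"estimate ~ {estimate:.0f} tokens, actual = {actual} tokens")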

Pro Tips
  • Rule of thumb: 1 token ≈ 4 characters or ≈ 0.75 words in English
  • Non-English text often uses more tokens per word
  • Code uses more tokens than prose (special characters, whitespace, and identifiers split into many pieces); see the quick comparison below
  • Always test with the actual tokenizer for accurate counts
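The quick comparison below illustrates these tips by counting tokens for English prose, non-English text, and code with tiktoken. The sample strings are purely illustrative, and exact numbers vary by tokenizer:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
    "code": "for (let i = 0; i < items.length; i++) { total += items[i]; }",
}
for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.1f} chars/token)")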

Related Terms