Foundation Concepts / Core Definitions

Token

Level: Beginner (2/5)
Tags: Subword, Text unit

Definition

A token is the smallest unit of text that a language model processes. Rather than reading character by character or word by word, LLMs break text into tokens: typically whole common words, pieces of longer or rarer words (subwords), or individual characters.

Understanding tokens is crucial because LLM pricing, context limits, and processing all work in terms of tokens rather than words or characters.

Key Concepts

  • Tokenization: The process of converting text into tokens
  • Vocabulary: The set of all tokens a model knows (see the sketch after this list)
  • Subword: A portion of a word that forms a token
  • Token limit: The maximum number of tokens allowed across input and output combined (the context window)
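
A small, concrete way to see the vocabulary and tokenization concepts together: the sketch below assumes the open-source tiktoken library (pip install tiktoken) and its cl100k_base encoding, which is just one example vocabulary; other models use different ones.

```python
# Minimal sketch, assuming the tiktoken library is installed.
# A vocabulary is the fixed set of tokens an encoding knows, each with an integer id.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)                 # size of this encoding's vocabulary
print(enc.encode("tokenization"))  # the word becomes a short list of token ids
```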

Examples

Tokenization Example
How Text Becomes Tokens
Text: "Hello, how are you?" Tokens: ["Hello", ",", " how", " are", " you", "?"] Count: 6 tokens Text: "Tokenization is fascinating!" Tokens: ["Token", "ization", " is", " fasci", "nating", "!"] Count: 6 tokens Text: "AI" Tokens: ["AI"] Count: 1 token
Common words are usually single tokens; uncommon or long words get split into subwords.
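
To reproduce splits like these programmatically, a rough sketch using tiktoken's cl100k_base encoding (one example encoding; the exact pieces and counts differ between models) might look like this:

```python
# Sketch assuming tiktoken (pip install tiktoken); splits vary by model/encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, how are you?", "Tokenization is fascinating!", "AI"]:
    ids = enc.encode(text)                    # text -> token ids
    pieces = [enc.decode([i]) for i in ids]   # ids -> readable token strings
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```

Common words come back as single tokens while longer words split into subword pieces, though the exact pieces may not match the hand-written lists above.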
Token Estimation
Quick Rules of Thumb
English text:
  • 1 token ≈ 4 characters
  • 1 token ≈ 0.75 words
  • 100 tokens ≈ 75 words

Code:
  • Often uses more tokens due to special characters
  • Variable names may be split into multiple tokens

Other languages:
  • Non-Latin scripts often use more tokens per word
  • Chinese: ~1.5-2 tokens per character
These are rough estimates - actual tokenization varies by model and content.
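
The character-based rule translates directly into a quick estimator; the function name and the 4-characters-per-token default below are assumptions taken from the list above, not an exact count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text using the ~4 characters/token rule."""
    # max() keeps very short strings from estimating to zero tokens.
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # 13 characters -> roughly 3 tokens
```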

Interactive Exercise

Estimate Token Counts

Using the rule "1 token ≈ 4 characters", estimate the token count for these texts (a short self-check script follows the list):

1. "The quick brown fox" (19 characters)

2. "Artificial Intelligence" (23 characters)

3. "Hello!" (6 characters)

Pro Tips
  • Both input AND output count toward token limits
  • Spaces are often included with the following word as one token
  • Most APIs charge per token (input and output separately); see the cost sketch after these tips
  • Use tokenizer tools to get exact counts for important prompts
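
Since input and output tokens are billed separately, a back-of-the-envelope cost check is a short script; the rates below are placeholders, not any provider's real prices, so substitute the current numbers from your API's pricing page.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Rough request cost with input and output tokens priced separately."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates in dollars per 1,000 tokens -- check your provider's pricing page.
print(f"${estimate_cost(1200, 350, price_in_per_1k=0.001, price_out_per_1k=0.002):.4f}")
```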

Related Terms