Foundation Concepts / Core Definitions

Token

Level: Beginner (2/5)
Tags: Subword, Text unit

Definition

A token is the smallest unit of text that a language model processes. Rather than reading character by character or word by word, LLMs break text into tokens: typically whole common words, pieces of longer or rarer words (subwords), or individual characters.

Understanding tokens is crucial because LLM pricing, context limits, and processing all work in terms of tokens rather than words or characters.

Key Concepts

  • Tokenization: The process of converting text into tokens
  • Vocabulary: The set of all tokens a model knows (see the sketch after this list)
  • Subword: A portion of a word that forms a token
  • Token limit: The maximum number of tokens allowed across input and output combined (the context window)
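
A small, concrete way to see the vocabulary and tokenization concepts together: the sketch below assumes the open-source tiktoken library (pip install tiktoken) and its cl100k_base encoding, which is just one example vocabulary; other models use different ones.

```python
# Minimal sketch, assuming the tiktoken library is installed.
# A vocabulary is the fixed set of tokens an encoding knows, each with an integer id.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)                 # size of this encoding's vocabulary
print(enc.encode("tokenization"))  # the word becomes a short list of token ids
```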

Examples

Tokenization Example
How Text Becomes Tokens
Text: "Hello, how are you?" Tokens: ["Hello", ",", " how", " are", " you", "?"] Count: 6 tokens Text: "Tokenization is fascinating!" Tokens: ["Token", "ization", " is", " fasci", "nating", "!"] Count: 6 tokens Text: "AI" Tokens: ["AI"] Count: 1 token
Common words are usually single tokens; uncommon or long words get split into subwords.
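
To reproduce splits like these programmatically, a rough sketch using tiktoken's cl100k_base encoding (one example encoding; the exact pieces and counts differ between models) might look like this:

```python
# Sketch assuming tiktoken (pip install tiktoken); splits vary by model/encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, how are you?", "Tokenization is fascinating!", "AI"]:
    ids = enc.encode(text)                    # text -> token ids
    pieces = [enc.decode([i]) for i in ids]   # ids -> readable token strings
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```

Common words come back as single tokens while longer words split into subword pieces, though the exact pieces may not match the hand-written lists above.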
Token Estimation
Quick Rules of Thumb
English text:
  • 1 token ≈ 4 characters
  • 1 token ≈ 0.75 words
  • 100 tokens ≈ 75 words

Code:
  • Often uses more tokens due to special characters
  • Variable names may be split into multiple tokens

Other languages:
  • Non-Latin scripts often use more tokens per word
  • Chinese: ~1.5-2 tokens per character
These are rough estimates - actual tokenization varies by model and content.
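
The character-based rule translates directly into a quick estimator; the function name and the 4-characters-per-token default below are assumptions taken from the list above, not an exact count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text using the ~4 characters/token rule."""
    # max() keeps very short strings from estimating to zero tokens.
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # 13 characters -> roughly 3 tokens
```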

Interactive Exercise

Estimate Token Counts

Using the rule "1 token ≈ 4 characters", estimate the token count for these texts (a short self-check script follows the list):

1. "The quick brown fox" (19 characters)

2. "Artificial Intelligence" (23 characters)

3. "Hello!" (6 characters)

Pro Tips
  • Both input AND output count toward token limits
  • Spaces are often included with the following word as one token
  • Most APIs charge per token (input and output separately); see the cost sketch after these tips
  • Use tokenizer tools to get exact counts for important prompts
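
Since input and output tokens are billed separately, a back-of-the-envelope cost check is a short script; the rates below are placeholders, not any provider's real prices, so substitute the current numbers from your API's pricing page.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Rough request cost with input and output tokens priced separately."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates in dollars per 1,000 tokens -- check your provider's pricing page.
print(f"${estimate_cost(1200, 350, price_in_per_1k=0.001, price_out_per_1k=0.002):.4f}")
```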

Related Terms