Inference · intermediate
Tokenization
Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.
Explanation
BPE starts with raw bytes and iteratively merges the most frequent adjacent pairs into single tokens, building a vocabulary up to a fixed size (often 50K-200K). Common subwords end up as single tokens; rare strings stay split.
Different models use different tokenizers, so token counts differ across providers for the same text. GPT-4 uses o200k_base; Claude has its own tokenizer; Llama uses SentencePiece-trained variants.
A tokenizer is a fixed mapping at inference time — no learning happens — but it dictates how efficiently the model can represent different languages, code, and special tokens.
Examples
- "reading" tokenizes to one token in many tokenizers.
- "antidisestablishmentarianism" tokenizes to multiple tokens.
Frequently asked
What is Tokenization?
Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.
What is an example of tokenization?
"reading" tokenizes to one token in many tokenizers.
How is Tokenization related to Token?
Tokenization and Token are both inference concepts. A token is the basic unit an LLM reads and writes — usually a word piece (3-4 characters). LLMs are priced and sized by tokens, not words.
Is Tokenization considered intermediate?
Tokenization is generally considered intermediate-level material in the AI and LLM space.