Skip to main content
ModelTerms

Inference · intermediate

Tokenization

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

Explanation

BPE starts with raw bytes and iteratively merges the most frequent adjacent pairs into single tokens, building a vocabulary up to a fixed size (often 50K-200K). Common subwords end up as single tokens; rare strings stay split.

Different models use different tokenizers, so token counts differ across providers for the same text. GPT-4 uses o200k_base; Claude has its own tokenizer; Llama uses SentencePiece-trained variants.

A tokenizer is a fixed mapping at inference time — no learning happens — but it dictates how efficiently the model can represent different languages, code, and special tokens.

Examples

  • "reading" tokenizes to one token in many tokenizers.
  • "antidisestablishmentarianism" tokenizes to multiple tokens.

Frequently asked

What is Tokenization?

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

What is an example of tokenization?

"reading" tokenizes to one token in many tokenizers.

How is Tokenization related to Token?

Tokenization and Token are both inference concepts. A token is the basic unit an LLM reads and writes — usually a word piece (3-4 characters). LLMs are priced and sized by tokens, not words.

Is Tokenization considered intermediate?

Tokenization is generally considered intermediate-level material in the AI and LLM space.

Side-by-side comparisons

Sources