Inference · intermediate

Tokenization

Tokenization is the process of splitting raw text into the discrete tokens an LLM consumes. Most modern LLMs use a learned byte-pair-encoding (BPE) tokenizer.

Published May 29, 2026

Explanation

Byte-level BPE starts from raw bytes and, during training, repeatedly merges the most frequent adjacent pair into a new single token until the vocabulary reaches a fixed size (often roughly 32K-200K entries). Common words and subwords end up as single tokens; rare strings stay split into several smaller pieces.
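
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not any production tokenizer: the names train_bpe and merge_pair are hypothetical, the corpus is a single string, and real tokenizers usually pre-split text (for example on whitespace) before counting pairs.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from `corpus` (toy, unoptimized)."""
    # Base vocabulary: the 256 possible byte values (ids 0-255).
    ids = list(corpus.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(len(merges), "merges learned;", len(ids), "tokens remain")
```

On this 11-byte string, three merges shrink the sequence to 5 tokens. The final vocabulary size is simply 256 plus the number of merges learned.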

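Different models use different tokenizers, so the same text produces different token counts across providers. GPT-4 uses the cl100k_base encoding and GPT-4o uses o200k_base; Claude has its own tokenizer; Llama 1 and 2 use a SentencePiece-trained BPE tokenizer, while Llama 3 moved to a tiktoken-based BPE vocabulary.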

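At inference time a tokenizer is a fixed, deterministic mapping: its vocabulary and merge rules were learned once, ahead of time, and no further learning happens. That fixed vocabulary still dictates how efficiently the model can represent different languages and code, and it defines the special tokens the model recognizes.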
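
To make the "fixed mapping" point concrete, here is a minimal sketch of encoding with an already-learned merge table. The table MERGES and the functions encode and decode are hypothetical and hand-written for illustration; a real vocabulary holds tens of thousands of merges, which is why a common word can collapse to a single token there.

```python
# Hypothetical merge table, as if produced by an earlier training run.
# Keys are (left_id, right_id); values are the new token id.
# Lower ids were learned earlier and are applied first.
MERGES = {
    (ord("r"), ord("e")): 256,  # "re"
    (ord("i"), ord("n")): 257,  # "in"
    (257, ord("g")): 258,       # "ing"
}


def encode(text, merges=MERGES):
    """Apply the fixed merges to the UTF-8 bytes of `text`; nothing is learned here."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=merges.get)  # earliest-learned merge wins
        new_id = merges[pair]
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids


def decode(ids, merges=MERGES):
    """Invert `encode` by expanding each merged id back into bytes."""
    expand = {new_id: pair for pair, new_id in merges.items()}

    def to_bytes(token_id):
        if token_id < 256:
            return bytes([token_id])
        left, right = expand[token_id]
        return to_bytes(left) + to_bytes(right)

    return b"".join(to_bytes(t) for t in ids).decode("utf-8")


tokens = encode("reading")
print(tokens)          # [256, 97, 100, 258]
print(decode(tokens))  # reading
```

With only three merges, "reading" becomes four tokens ("re", "a", "d", "ing"). The same input always yields the same ids, and text with no matching merges falls back to one token per byte.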

Examples

"reading" tokenizes to one token in many tokenizers.
"antidisestablishmentarianism" tokenizes to multiple tokens.

Frequently asked

How is Tokenization related to Token?

Tokenization is the process and a token is its output. A token is the basic unit an LLM reads and writes, usually a word piece of about 3-4 characters in English text. LLMs are priced and sized by tokens, not words.

Token · Inference

A token is the basic unit an LLM reads and writes — usually a word piece (3-4 characters). LLMs are priced and sized by tokens, not words.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.
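
Because prompt and output share one budget, a request can be checked with simple arithmetic before it is sent. The sketch below uses hypothetical helper names and an assumed 8,192-token window purely for illustration; actual limits depend on the model.

```python
def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def max_output_budget(prompt_tokens, context_window):
    """Tokens left for generation after the prompt (never negative)."""
    return max(0, context_window - prompt_tokens)


WINDOW = 8192  # assumed window size, for illustration only

print(fits_context(7000, 1000, WINDOW))  # True:  8000 <= 8192
print(fits_context(7000, 1500, WINDOW))  # False: 8500 >  8192
print(max_output_budget(7000, WINDOW))   # 1192
```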

Embedding · Architecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.

Token Count · Inference

Token count is the number of tokens in a piece of text under a specific tokenizer. It is the unit in which LLM pricing, context limits, and rate limits are measured.

Sources

Hugging Face — Tokenizer summary