Architecture · intermediate

Decoder (decoder-only)

A decoder is a transformer module that generates a sequence one token at a time, using causal self-attention so each position attends only to itself and earlier positions. GPT-style LLMs are decoder-only.

Published May 29, 2026

Explanation

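In a decoder, position t can attend only to positions up to and including t, never to later ones. This causality is what makes next-token prediction well-defined: the prediction for position t+1 cannot peek at the token it is supposed to predict. It is also what lets the model generate text autoregressively at inference time, feeding each newly generated token back in as input.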
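The masking rule can be illustrated with a small sketch. It assumes the standard convention in which a position attends to itself as well as to earlier positions, and it uses toy scores rather than real query-key products; the function names (causal_mask, causal_attention_weights) are illustrative and not taken from any library.

```python
import math

def causal_mask(n):
    """Entry [t][s] is True when position t is allowed to attend to position s."""
    return [[s <= t for s in range(n)] for t in range(n)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask a square matrix of raw attention scores, then softmax each row."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for t in range(n):
        # Future positions get -inf, so softmax gives them exactly zero weight.
        row = [scores[t][s] if mask[t][s] else -math.inf for s in range(n)]
        weights.append(softmax(row))
    return weights

# Toy input (assumption): all raw scores equal, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row spreads its weight over the current and earlier positions only, which is the lower-triangular pattern that causal self-attention enforces.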

The vast majority of modern LLMs (GPT-4, Claude, Llama, Mistral, Gemini) are decoder-only. They drop the encoder entirely and just stack decoder layers. The design is simple and scales well, and the same stack both reads the prompt and generates the continuation.

Examples

GPT-4 generating a paragraph token by token.
Claude continuing a conversation in the same decoder loop.
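
The token-by-token generation in these examples can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not how any particular model is implemented: toy_next_token_scores is a hypothetical stand-in for a real decoder forward pass, the vocabulary is five integer tokens, token 0 is assumed to mean end-of-sequence, and decoding is greedy (always pick the highest score).

```python
def toy_next_token_scores(prefix, vocab_size):
    """Hypothetical stand-in for a decoder forward pass.

    Scores depend only on the tokens generated so far (here, just the last
    one): the token after the last one, wrapping around, gets the top score.
    """
    last = prefix[-1]
    return [1.0 if tok == (last + 1) % vocab_size else 0.0
            for tok in range(vocab_size)]

def generate(prompt, max_new_tokens, vocab_size=5, eos_token=0):
    """Greedy autoregressive generation: append one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(tokens, vocab_size)
        # Greedy choice: take the highest-scoring token.
        next_token = max(range(vocab_size), key=lambda tok: scores[tok])
        tokens.append(next_token)  # the new token becomes part of the input
        if next_token == eos_token:
            break
    return tokens

print(generate([1, 2], max_new_tokens=10))  # [1, 2, 3, 4, 0]
```

A real system would replace the toy scoring function with the model's forward pass and would typically sample from the scores rather than always taking the maximum.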

Frequently asked

How is Decoder related to Encoder?

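Decoder and Encoder are both transformer architecture concepts. An encoder reads an input sequence and produces a contextualized representation, a vector per token that captures meaning in context. A decoder instead generates a sequence one token at a time, with causal self-attention restricting each position to itself and earlier positions.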

Is Decoder considered intermediate?

Decoder is generally considered intermediate-level material in the AI and LLM space.

Encoder · Architecture

An encoder is a transformer module that reads an input sequence and produces a contextualized representation — a vector per token that captures meaning in context.

Encoder-Decoder · Architecture

An encoder-decoder model has a separate encoder that reads the input and a decoder that generates the output, with cross-attention linking them. T5 and the original transformer are encoder-decoders.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is a major memory cost of LLM inference, growing with the length of the sequence being generated.

Sources

GPT-2 paper (OpenAI)