Architecture · intermediate
Transformer
The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.
Explanation
Introduced in the 2017 paper "Attention Is All You Need", the transformer replaced earlier sequence models (RNNs, LSTMs) that processed text one token at a time. During training, a transformer processes the whole sequence in parallel, which suits GPUs and lets training scale to enormous datasets.
The core operation is self-attention: every token computes how much it should "attend" to every other token, then mixes their representations accordingly. This lets the model capture long-range dependencies directly: in "the cat that sat on the mat that was bought yesterday", any word can draw on any other word in a single step, however far apart they are.
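The self-attention step described above can be sketched in a few lines. This is a minimal single-head illustration using only the Python standard library, not a production implementation: the helper names (`softmax`, `matmul`, `self_attention`), the toy input, and the identity projection matrices are all assumptions made for the example. Real models learn the projection matrices and run many heads in parallel.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is (seq_len, d_model); each weight matrix is (d_model, d_head).
    Returns (output, weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # scores[i][j]: how strongly token i attends to token j
    scores = [
        [sum(qi * kj for qi, kj in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of all value vectors.
    output = matmul(weights, v)
    return output, weights

# Toy input: three tokens with two features each (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1, and row i shows how token i distributes its attention over the whole sequence, including itself.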
GPT, Claude, Gemini, Llama, Mistral — they're all transformers. Variants differ in details (attention type, position encoding, sparsity) but share the same core.
Examples
- GPT-4: decoder-only transformer.
- BERT: encoder-only transformer.
- T5: encoder-decoder transformer.
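A large part of what separates these families is which positions each token is allowed to attend to. Encoder-style attention is bidirectional, decoder-style attention is causal (each position sees only itself and earlier positions), and encoder-decoder models use both, plus cross-attention from decoder to encoder. The sketch below illustrates only the masking difference; the function name `attention_weights` and the uniform toy scores are assumptions for the example.

```python
import math

def attention_weights(scores, causal):
    """Softmax each row of a square score matrix, optionally with a causal mask."""
    weights = []
    for i, row in enumerate(scores):
        # In causal mode, position i may only see positions 0..i.
        visible = row[: i + 1] if causal else row
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive a weight of exactly 0.
        padding = [0.0] * (len(row) - len(visible))
        weights.append([e / total for e in exps] + padding)
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy scores
encoder_style = attention_weights(scores, causal=False)  # BERT-like
decoder_style = attention_weights(scores, causal=True)   # GPT-like
```

With uniform scores, every row of `encoder_style` spreads attention evenly over all three positions, while in `decoder_style` the first token can attend only to itself and later tokens gradually see more of the sequence.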
When to use a transformer
Default choice for most sequence tasks in 2026: text, code, audio, even protein sequences.
Frequently asked
What is a transformer?
The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.
What is an example of a transformer?
GPT-4: decoder-only transformer.
How is Transformer related to Attention?
Transformer and Attention are both architecture concepts. Attention is the mechanism a transformer uses to decide which other tokens matter most when computing each token's representation; in decoder-only models such as GPT, only earlier tokens are visible. It mixes information across positions by a weighted sum.
When should I use a transformer?
Default choice for most sequence tasks in 2026: text, code, audio, even protein sequences.
Is Transformer considered intermediate?
Transformer is generally considered intermediate-level material in the AI and LLM space.