ModelTerms

Architecture · intermediate

Transformer

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Explanation

Introduced in the 2017 paper "Attention Is All You Need", the transformer largely replaced earlier sequence models (RNNs, LSTMs) that processed text one token at a time. During training, transformers process the whole sequence in parallel, which makes them GPU-friendly and lets training scale to enormous datasets.

The core operation is self-attention: every token computes how much it should "attend" to every other token, then mixes their representations accordingly. This lets the model capture long-range dependencies directly: in "the cat that sat on the mat that was bought yesterday", information can flow between any two words in a single step, however far apart they are.
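
The attend-and-mix step can be sketched in a few lines of NumPy. This is a minimal single-head illustration of scaled dot-product self-attention, not a full transformer layer: the projection matrices `w_q`, `w_k`, `w_v` and the toy sizes are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token representations.
    w_q, w_k, w_v: (d_model, d_head) projection matrices (illustrative names).
    Returns (output, weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # scores[i, j] says how strongly token i attends to token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted mix of value vectors

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Every row of `weights` is computed from the same matrix products, so all positions are handled at once instead of one step at a time.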

GPT, Claude, Gemini, Llama, and Mistral are all transformers. Variants differ in details (attention type, position encoding, sparsity) but share the same core.

Examples

  • GPT-4: decoder-only transformer.
  • BERT: encoder-only transformer.
  • T5: encoder-decoder transformer.

When to use a transformer

As of 2026, the transformer is the default starting point for most sequence tasks: text, code, audio, even protein sequences.

Frequently asked

How is the transformer related to attention?

Both are architecture concepts. Attention is the mechanism a transformer uses to decide which other tokens matter most when computing each token's representation; in a causal decoder, only earlier tokens are visible. It mixes information across positions through a weighted sum.

Is the transformer considered intermediate?

The transformer is generally considered intermediate-level material in the AI and LLM space.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which other tokens matter most when computing each token's representation; in a causal decoder, only earlier tokens are visible. It mixes information across positions through a weighted sum.

Self-Attention · Architecture

Self-attention is attention applied within a single sequence: each token attends to every other token in the same input, including itself.

Encoder · Architecture

An encoder is a transformer module that reads an input sequence and produces a contextualized representation — a vector per token that captures meaning in context.

Decoder · Architecture

A decoder is a transformer module that generates a sequence one token at a time, using causal self-attention so each token only sees earlier ones. GPT-style LLMs are decoder-only.
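
Causal self-attention is usually implemented by masking out future positions before the softmax. The sketch below shows only that masking step on a matrix of raw scores; the function name and the all-zero scores are assumptions made for illustration.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions in raw attention scores, then softmax each row.

    scores: (seq_len, seq_len); entry [i, j] scores token i attending to token j.
    """
    seq_len = scores.shape[0]
    # True above the diagonal marks future positions (j > i).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                   # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                # equal scores, for illustration only
weights = causal_attention_weights(scores)
```

With equal scores, row `i` spreads its weight evenly over positions `0..i` and gives exactly zero weight to every later position.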

Positional Encoding · Architecture

Positional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model.
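
One concrete scheme is the fixed sinusoidal encoding from the original transformer paper; learned and rotary encodings are common alternatives. The sketch below adds sinusoidal encodings to toy word embeddings. The three-word vocabulary and random embedding table are assumptions for the example, not part of any real model.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings, shape (seq_len, d_model).

    Assumes d_model is even: even columns hold sines, odd columns cosines.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy vocabulary and random embeddings (illustrative only).
rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
embed = rng.normal(size=(3, 8))

def encode(sentence):
    ids = [vocab[w] for w in sentence.split()]
    # Same word vectors, plus a vector that depends only on position.
    return embed[ids] + sinusoidal_positions(len(ids), 8)

a = encode("dog bites man")
b = encode("man bites dog")
```

Without the added term, both sentences would be the same set of word vectors in a different order; with it, "dog" at position 0 and "dog" at position 2 get different representations.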

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.
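
The next-token objective can be illustrated with a single prediction. This sketch assumes a four-word toy vocabulary and hand-picked logits; a real model produces logits over tens of thousands of tokens, and training averages this loss over many positions.

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one next-token prediction.

    logits: (vocab_size,) unnormalized scores for each candidate token.
    target_id: index of the token that actually came next.
    """
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return -log_probs[target_id]

# Toy vocabulary and made-up logits, for illustration only.
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([0.1, 0.2, 3.0, 0.5])
predicted = vocab[int(np.argmax(logits))]                # highest-scoring token
loss_right = next_token_loss(logits, vocab.index("sat"))
loss_wrong = next_token_loss(logits, vocab.index("mat"))
```

The loss is small when the model puts high probability on the true next token and large otherwise, which is the signal training uses to adjust the weights.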

Mixture of Experts · Architecture

Mixture of Experts is a transformer variant in which the feed-forward block of a layer is replaced by many parallel "expert" feed-forward networks, and a router activates only a few of them per token. Total parameters grow while per-token compute stays roughly constant.
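
A rough sketch of the routing idea, assuming top-k gating with a softmax over the selected experts. Production systems differ in details such as load balancing and batching, and the names `moe_layer`, `w_gate`, and `make_expert` are hypothetical.

```python
import numpy as np

def moe_layer(x, w_gate, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    x: (n_tokens, d_model); w_gate: (d_model, n_experts);
    experts: list of callables mapping (d_model,) -> (d_model,).
    """
    logits = x @ w_gate                              # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # top_k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gate = np.exp(chosen - chosen.max())
        gate /= gate.sum()                           # softmax over chosen experts only
        for g, e in zip(gate, top[t]):
            out[t] += g * experts[e](x[t])           # only top_k experts run per token
    return out, top

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4

def make_expert():
    # One-layer ReLU stand-in for a feed-forward network (illustrative only).
    w = rng.normal(size=(d_model, d_model)) * 0.1
    return lambda h: np.maximum(h @ w, 0.0)

experts = [make_expert() for _ in range(n_experts)]
w_gate = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(3, d_model))
out, top = moe_layer(x, w_gate, experts, top_k=2)
```

Adding more experts to the list increases the parameter count, but each token still calls only `top_k` of them.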

Mamba · Architecture

Mamba is a state-space model architecture that replaces transformer attention with selective state updates. Its compute scales linearly with sequence length, and it is reported to match transformer quality on many tasks.
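
To see where the linear scaling comes from, consider a toy input-dependent recurrence. This is a heavily simplified stand-in, not Mamba's actual parameterization or its hardware-aware scan: the function `selective_scan` and the parameters `w_a`, `w_b`, `c` are hypothetical, and the input is a sequence of scalars.

```python
import numpy as np

def selective_scan(x, w_a, w_b, c):
    """Toy input-dependent linear recurrence over a sequence of scalars.

    x: (seq_len,) inputs; w_a, w_b, c: (d_state,) parameters.
    h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c . h_t
    """
    h = np.zeros_like(c)
    ys = []
    for x_t in x:                                  # one step per token
        a_t = 1.0 / (1.0 + np.exp(-(w_a * x_t)))  # decay in (0, 1), set by the input
        b_t = 1.0 / (1.0 + np.exp(-(w_b * x_t)))  # write gate, set by the input
        h = a_t * h + b_t * x_t                    # state size is fixed
        ys.append(c @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d_state = 4
w_a, w_b, c = (rng.normal(size=d_state) for _ in range(3))
x = rng.normal(size=6)
y = selective_scan(x, w_a, w_b, c)
```

Each token triggers one fixed-size state update, so total work grows in proportion to sequence length, whereas full self-attention compares every pair of positions.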
