Architecture · advanced

Encoder-Decoder (seq2seq)

An encoder-decoder model has a separate encoder that reads the input and a decoder that generates the output, with cross-attention linking them. T5 and the original transformer are encoder-decoders.

Published May 29, 2026

Explanation

This architecture maps cleanly onto translation: encode the source sentence, then decode in the target language. At each decoding step, the decoder's cross-attention layers attend to the encoder's representation of the input, so every generated token can draw on the whole source sentence.
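
To make the cross-attention step concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not the implementation of any particular model: the sizes, the random weights, and the function names `softmax` and `cross_attention` are all made up for the example. The point it shows is that queries come from the decoder while keys and values come from the encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative helper, not a library API)."""
    q = decoder_states @ w_q                   # queries from the decoder: (tgt_len, d_k)
    k = encoder_states @ w_k                   # keys from the encoder:    (src_len, d_k)
    v = encoder_states @ w_v                   # values from the encoder:  (src_len, d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)         # each target row sums to 1 over the source
    return weights @ v, weights

# Toy sizes and random values, assumed purely for illustration.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
src_len, tgt_len = 5, 3
encoder_states = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(context.shape)  # (3, 4)
print(weights.shape)  # (3, 5)
```

No causal mask is applied here: every target position may look at every source position, because the whole input is available before decoding starts.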

Encoder-decoder models tend to be stronger than decoder-only models on tasks with a clear input-to-output structure, such as translation and summarization. Decoder-only models have nonetheless largely taken over, because they are simpler (a single stack instead of two) and are widely seen as benefiting more from massive pretraining scale.

Examples

T5: every NLP task framed as text-to-text.
Original transformer: machine translation.
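
The text-to-text framing can be sketched in a few lines: each task becomes a pair of plain strings, with a short prefix on the input naming the task. The prefixes and the translation example follow the T5 paper; the helper `to_text_to_text` and the `TASK_PREFIXES` table are hypothetical names introduced only for this illustration.

```python
# Hypothetical helper for illustration; not part of any library.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}

def to_text_to_text(task, source, target):
    """Return an (input_text, target_text) pair of plain strings."""
    return TASK_PREFIXES[task] + source, target

input_text, target_text = to_text_to_text(
    "translate_en_de", "That is good.", "Das ist gut."
)
print(input_text)   # translate English to German: That is good.
print(target_text)  # Das ist gut.
```

Because both sides are just text, the encoder reads `input_text` and the decoder is trained to produce `target_text`, whatever the task.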

Frequently asked

How is Encoder-Decoder related to Encoder?

The encoder is one of the two components of an encoder-decoder model. It is a transformer module that reads an input sequence and produces a contextualized representation: a vector per token that captures meaning in context. In an encoder-decoder model, the decoder's cross-attention reads those vectors while generating the output.

Is Encoder-Decoder considered advanced?

Encoder-Decoder is generally considered advanced-level material in the AI and LLM space.

Encoder · Architecture

An encoder is a transformer module that reads an input sequence and produces a contextualized representation — a vector per token that captures meaning in context.

Decoder · Architecture

A decoder is a transformer module that generates a sequence one token at a time, using causal self-attention so each token only sees earlier ones. GPT-style LLMs are decoder-only.
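
As a rough sketch of what "causal" means in practice, the snippet below builds a lower-triangular mask and applies it before the softmax, so position i can attend only to positions 0 through i (itself and earlier tokens). The function names `causal_mask` and `masked_softmax` and the all-zero scores are assumptions chosen for the illustration.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
# With all-zero scores, each row is uniform over its allowed positions.
weights = masked_softmax(np.zeros((4, 4)), mask)
print(mask.astype(int))
print(weights[2])  # equal weight on positions 0, 1, 2 and zero on position 3
```

The same mask is what lets a decoder be trained on a whole target sequence in parallel while still generating one token at a time at inference.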

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Sources

T5 paper (arXiv)