Architecture

How modern AI models are structured internally.

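Attention

Attention is the mechanism a transformer uses to decide which other tokens (in a decoder, the earlier ones) matter most when computing each token's representation. It mixes information across positions by taking a weighted sum of value vectors, with weights derived from query-key similarity.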
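
As a rough illustration, attention for a single query can be sketched in plain Python. The function names and the toy vectors below are made up for this example, and the sketch assumes the common scaled dot-product form; real models run this over batched tensors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# Toy data (illustrative only): three positions with 2-dimensional vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([1.0, 0.0], keys, values)
```

Positions whose keys align with the query receive larger weights, so their values dominate the output.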

intermediate

Decoder

A decoder is a transformer module that generates a sequence one token at a time, using causal self-attention so each token only sees earlier ones. GPT-style LLMs are decoder-only.
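
The "only sees earlier ones" rule is usually enforced with a causal mask. A minimal sketch, where `causal_mask` is a hypothetical helper name and `True` means "position i may attend to position j":

```python
def causal_mask(n):
    """Row i marks which positions token i may attend to: itself and earlier ones."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(3)
# Row 0 can see only position 0; row 2 can see positions 0, 1 and 2.
```

In practice, masked-out positions typically get a score of negative infinity before the softmax, so their attention weight becomes zero.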

intermediate

Embedding

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.
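
"Close together" is often measured with cosine similarity. The sketch below uses invented 3-dimensional vectors purely for illustration; real embeddings come from a trained model and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (not output from any real model).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
```

With these toy values, "cat" scores much closer to "dog" than to "car".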

intermediate

Encoder

An encoder is a transformer module that reads an input sequence and produces a contextualized representation — a vector per token that captures meaning in context.

intermediate

Encoder-Decoder

An encoder-decoder model has a separate encoder that reads the input and a decoder that generates the output, with cross-attention linking them. T5 and the original transformer are encoder-decoders.

advanced

FlashAttention

FlashAttention is an algorithm that computes exact attention faster and with much less memory by tiling the computation so each block fits in on-chip GPU SRAM, which avoids writing the full attention matrix to slower HBM.
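
The core numerical trick behind tiling is an "online" softmax that processes keys block by block while keeping a running maximum and running sum. The sketch below shows only that idea for a single query on the CPU; it is not the GPU kernel, and the names `attend_full` and `attend_tiled` are invented for this example.

```python
import math

def attend_full(query, keys, values):
    """Reference: score every key at once, then softmax and sum."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(len(values[0]))]

def attend_tiled(query, keys, values, tile=2):
    """Same result, but visiting keys and values one tile at a time."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

# Toy inputs (illustrative): five positions, so the last tile is partial.
q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attend_full(q, K, V)
tiled = attend_tiled(q, K, V, tile=2)
```

Both functions return the same values up to floating-point rounding, which is why the tiled approach counts as exact rather than approximate.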

advanced

KV Cache

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It grows linearly with context length and batch size, and at long contexts it can exceed the memory taken by the model weights.
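
A back-of-the-envelope size estimate helps make the memory cost concrete. The sketch below assumes a standard layout (one key and one value vector per layer, per KV head, per token); the configuration numbers are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Approximate KV cache size: the factor 2 covers keys plus values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical config: 32 layers, 32 KV heads of size 128, 16-bit values.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
size_gib = size / 2**30  # 2.0 GiB for a single 4,096-token sequence
```

Under these assumptions each token costs 0.5 MiB of cache, so doubling the context or the batch size doubles the total.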

advanced

Mamba

Mamba is a state-space model architecture that replaces transformer attention with selective (input-dependent) state updates. Its compute scales linearly with sequence length, and it has been reported to match similarly sized transformers on many tasks.
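
To give a feel for "selective state updates", here is a toy scalar recurrence in which the input itself decides how much of the old state to keep. This is deliberately not Mamba's actual parameterization (which uses a discretized state-space model with learned matrices); the function name and gating rule are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def toy_selective_scan(xs, gate_weight=1.0):
    """One pass over the sequence with a fixed-size state: O(n) time."""
    state = 0.0
    outputs = []
    for x in xs:
        keep = sigmoid(gate_weight * x)       # input-dependent: the "selective" part
        state = keep * state + (1.0 - keep) * x
        outputs.append(state)
    return outputs

ys = toy_selective_scan([2.0, -1.0, 0.5, 3.0])
```

The state never grows with sequence length, which is the property that gives recurrent models of this kind their linear scaling.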

advanced

Mixture of Experts

Mixture of Experts is a transformer variant in which each MoE layer has many parallel "expert" feed-forward networks, and a router activates only a few of them per token. Total parameters grow without a matching growth in per-token compute.
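
A minimal sketch of top-k routing follows. The experts here are toy scaling functions and the router scores are passed in directly; in a real model the experts are feed-forward networks and the scores come from a learned layer. All names are hypothetical.

```python
import math

def top_k_gates(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    top = max(router_logits[i] for i in ranked)
    exps = {i: math.exp(router_logits[i] - top) for i in ranked}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gates(router_logits, k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)  # unselected experts are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, sorted(gates)

# Four toy experts that scale the input by 1, 2, 3 and 4.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_layer([1.0, 2.0], experts, router_logits=[0.1, 2.0, -1.0, 2.0])
```

Only two of the four experts run for this token, so compute stays roughly constant as more experts are added.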

advanced

Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.
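
The per-head projection and concatenation can be sketched as below. The projection matrices are hand-picked rather than learned, and the final output projection that real implementations apply after concatenation is omitted; all names are illustrative.

```python
import math

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention for one query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

def multi_head_attention(tokens, heads):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    outputs = []
    for x in tokens:
        concatenated = []
        for w_q, w_k, w_v in heads:
            q = matvec(w_q, x)
            keys = [matvec(w_k, t) for t in tokens]
            values = [matvec(w_v, t) for t in tokens]
            concatenated.extend(attend(q, keys, values))  # join head outputs
        outputs.append(concatenated)
    return outputs

# Two hypothetical heads, each projecting 4-dim tokens to a 2-dim subspace.
head_a = ([[1, 0, 0, 0], [0, 1, 0, 0]],) * 3
head_b = ([[0, 0, 1, 0], [0, 0, 0, 1]],) * 3
tokens = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [0.3, 0.3, 0.7, 0.0]]
result = multi_head_attention(tokens, [head_a, head_b])
```

Each head computes its own attention weights, so one head can focus on a different set of positions than another.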

advanced

Parameter Count

Parameter count is the total number of learnable weights in a model — "7B" means 7 billion parameters. It is the most cited model-size metric, though not always the most informative.
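
One practical use of the number is estimating how much memory the weights alone need. A small sketch, assuming every parameter is stored at the same precision and ignoring activations and caches (the helper name is made up):

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

seven_b = 7_000_000_000
fp16_gb = weight_memory_gb(seven_b, 2)    # 16-bit weights: about 14 GB
int4_gb = weight_memory_gb(seven_b, 0.5)  # 4-bit quantized: about 3.5 GB
```

The same "7B" model can therefore need very different amounts of memory depending on precision, which is one reason the count alone is not always informative.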

beginner

Positional Encoding

Positional encoding tells the transformer where each token sits in the sequence. Without it, self-attention is order-agnostic, so "dog bites man" and "man bites dog" would look identical to the model.
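
One classic scheme is the sinusoidal encoding from the original transformer paper, sketched below. Other schemes exist (learned positions, rotary embeddings); this is shown only as an example, and the function name is an assumption.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sine/cosine pairs at geometrically spaced frequencies."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

pos0 = sinusoidal_encoding(0, 4)  # [0.0, 1.0, 0.0, 1.0]
pos1 = sinusoidal_encoding(1, 4)
pos2 = sinusoidal_encoding(2, 4)
```

Each position gets a distinct vector, which is added to the token embedding so that the same word at different positions looks different to the model.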

advanced

Reasoning Model

A reasoning model spends extra compute at inference time working through a problem step by step before answering. Examples include OpenAI o1 and o3, DeepSeek R1, and Anthropic's Claude models with extended thinking enabled.

intermediate

Rotary Position Embedding

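Rotary Position Embedding (RoPE) encodes token position by rotating the query and key vectors in attention by an angle proportional to their position, so the attention score depends only on the relative distance between tokens. Extending it well beyond the trained context length usually requires a scaling method such as position interpolation or YaRN.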
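
The rotation can be sketched in a few lines. This version rotates adjacent pairs of dimensions, which is one common convention (some implementations pair dimension i with i + d/2 instead); the vectors and the function names are illustrative.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each (even, odd) pair by an angle proportional to position."""
    d = len(vec)
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same offset of 3 between query and key, at very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores match up to rounding, because rotating both vectors by the same extra angle leaves their dot product unchanged.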

advanced

Self-Attention

Self-attention is attention applied within a single sequence: each token attends to the tokens of that same sequence, including itself. In causal (decoder) self-attention, each token sees only itself and earlier tokens.

intermediate

Sliding-Window Attention

Sliding-window attention limits each token to attending to only the most recent W tokens (e.g. 4,096), so attention cost grows linearly with sequence length instead of quadratically. Mistral 7B uses it, and Gemma 2 alternates it with full-attention layers.
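
The window can be expressed as a mask. The sketch below assumes the window of W tokens includes the current token, which is a convention choice; `sliding_window_mask` is a hypothetical helper name.

```python
def sliding_window_mask(n, window):
    """Row i allows positions j with i - window < j <= i."""
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]

mask = sliding_window_mask(5, window=2)
# Token 4 may attend only to tokens 3 and 4; token 0 only to itself.
```

Each row has at most W allowed positions, so the total work is about n times W rather than n squared.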

advanced

Transformer

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

intermediate