Category
Architecture
How modern AI models are structured internally.
Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.
intermediateA decoder is a transformer module that generates a sequence one token at a time, using causal self-attention so each token only sees earlier ones. GPT-style LLMs are decoder-only.
intermediateAn embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.
intermediateAn encoder is a transformer module that reads an input sequence and produces a contextualized representation — a vector per token that captures meaning in context.
intermediateAn encoder-decoder model has a separate encoder that reads the input and a decoder that generates the output, with cross-attention linking them. T5 and the original transformer are encoder-decoders.
advancedFlashAttention is an algorithm that computes exact attention faster and with much less memory by carefully tiling the computation to fit in GPU SRAM rather than going to HBM.
advancedThe KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.
advancedMamba is a state-space model architecture that replaces transformer attention with selective state updates. It scales linearly with sequence length and matches transformer quality on many tasks.
advancedMixture of Experts is a transformer variant where each layer has many parallel "expert" feed-forward networks, but only a few are activated per token. Total parameters grow without growing per-token compute.
advancedMulti-head attention runs several attention operations in parallel, each with its own learned projection, then concatenates the results. This lets the model attend to different kinds of relationships at once.
advancedParameter count is the total number of learnable weights in a model — "7B" means 7 billion parameters. It is the most cited model-size metric, though not always the most informative.
beginnerPositional encoding tells the transformer where each token sits in the sequence. Without it, "dog bites man" and "man bites dog" would look identical to the model.
advancedA reasoning model spends extra compute thinking step-by-step before answering. OpenAI o1/o3, DeepSeek R1, and Anthropic's extended thinking are reasoning models.
intermediateRoPE encodes token position by rotating the query and key vectors in attention by an angle proportional to their position. It generalizes well to longer sequences than the model was trained on.
advancedSelf-attention is attention applied within a single sequence: each token attends to every other token in the same input, including itself.
intermediateSliding-window attention limits each token to attending only the most recent W tokens (e.g. 4K), making attention linear in sequence length. Mistral and Gemma use it.
advancedThe transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.
intermediate