ModelTerms

Architecture · advanced

Mamba (state-space model, SSM)

Mamba is a state-space model architecture that replaces transformer attention with selective state updates. It scales linearly with sequence length and matches transformer quality on many tasks.

Explanation

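Standard attention costs O(n²) in sequence length n. Mamba instead processes the sequence as a recurrence over a fixed-size hidden state whose update parameters are learned and input-dependent (the "selective" part). This gives O(n) compute over the sequence and, at inference, constant memory and constant work per generated token, because the state does not grow with context the way a KV cache does. That makes it appealing for very long contexts and edge deployment.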
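The recurrence described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not Mamba's actual implementation: the function name `selective_scan`, the projection matrices `W_B`, `W_C` and `W_dt`, and the simple discretization are all hypothetical choices made for the example, and the real model adds gating, a short convolution, and a hardware-aware parallel scan.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus; keeps the step size positive.
    return np.logaddexp(0.0, x)

def selective_scan(x, A, W_B, W_C, W_dt, h=None):
    """Toy selective state-space recurrence (illustrative names, not a real API).

    x:    (seq_len, d_model) input sequence
    A:    (d_model, d_state) negative decay parameters
    W_B:  (d_model, d_state) projection producing the input-dependent B_t
    W_C:  (d_model, d_state) projection producing the input-dependent C_t
    W_dt: (d_model, d_model) projection producing the input-dependent step size
    h:    optional starting state of shape (d_model, d_state)
    Returns the outputs (seq_len, d_model) and the final state.
    """
    seq_len, d_model = x.shape
    d_state = A.shape[1]
    if h is None:
        h = np.zeros((d_model, d_state))  # fixed size, independent of seq_len
    y = np.empty((seq_len, d_model))
    for t in range(seq_len):              # single pass: O(seq_len) compute
        dt = softplus(x[t] @ W_dt)        # (d_model,)  input-dependent step
        B_t = x[t] @ W_B                  # (d_state,)  input-dependent write
        C_t = x[t] @ W_C                  # (d_state,)  input-dependent read
        A_bar = np.exp(dt[:, None] * A)   # decay in (0, 1) because A < 0
        B_bar = dt[:, None] * B_t[None, :]
        h = A_bar * h + B_bar * x[t][:, None]   # selective state update
        y[t] = h @ C_t                    # read the state out per channel
    return y, h

rng = np.random.default_rng(0)
d_model, d_state = 4, 8
A = -np.exp(rng.normal(size=(d_model, d_state)))
W_B = rng.normal(size=(d_model, d_state))
W_C = rng.normal(size=(d_model, d_state))
W_dt = rng.normal(size=(d_model, d_model))
x = rng.normal(size=(16, d_model))

y, h = selective_scan(x, A, W_B, W_C, W_dt)
print(y.shape, h.shape)  # (16, 4) (4, 8)
```

Because everything the model remembers lives in `h`, the sequence can be processed in chunks by passing the final state of one call as the starting state of the next. The state has the same shape whether 16 or 16 million tokens came before it.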

Mamba and its descendants (Mamba-2, Jamba — a hybrid Mamba+attention model) are the most prominent post-transformer architectures with real production traction. Most frontier labs still ship transformers, but hybrid designs are increasingly common.

The bet: as context windows stretch to millions of tokens, sub-quadratic sequence mixing becomes essential.

Examples

  • Mamba-2 reaching transformer-equivalent quality at the 1B-2B scale.
  • Jamba (AI21) interleaving Mamba blocks with a smaller number of attention layers.

Frequently asked

How is Mamba related to the transformer?

Both are neural network architectures for sequence modeling. The transformer, which underlies virtually every modern large language model, uses self-attention to relate all positions in a sequence in parallel, at O(n²) cost in sequence length. Mamba replaces attention with selective state updates that scale linearly, and hybrids such as Jamba combine the two.

Is Mamba considered advanced?

Mamba is generally considered advanced-level material in the AI and LLM space.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Attention · Architecture

Attention is the mechanism a transformer uses to decide which earlier tokens matter most when producing each new one. It mixes information across positions by weighted sum.
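As a rough illustration of that weighted sum, here is a minimal single-head causal attention in NumPy. The function name `causal_attention` is hypothetical, and the sketch omits learned projections, multiple heads, and batching. Note the (n, n) score matrix, which is where the quadratic cost in sequence length comes from.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal attention sketch. Q, K, V each have shape (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (n, n): quadratic in n
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)         # hide later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                        # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

Each row of `weights` sums to 1 and says how much every earlier position (and the current one) contributes to that token's output.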

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
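To make the memory contrast with a fixed-size recurrent state concrete, the back-of-the-envelope sketch below compares how a KV cache grows with context against a state that does not. All model dimensions are hypothetical and chosen only for illustration; the SSM estimate counts only the main recurrent state and ignores smaller buffers.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (the factor 2) are kept for every token in every layer.
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    # One (d_inner, d_state) state per layer, regardless of context length.
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical configurations, 16-bit values.
for n_tokens in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=32, d_inner=4096, d_state=16)
    print(f"{n_tokens:>9} tokens: KV cache {kv / 2**20:10.1f} MiB, "
          f"SSM state {ssm / 2**20:.1f} MiB")
```

Under these assumed dimensions the KV cache costs 128 KiB per token and grows linearly with context, while the recurrent state stays at 4 MiB.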

Sliding-Window Attention · Architecture

Sliding-window attention limits each token to attending to only the most recent W tokens (e.g. 4K), so attention cost grows as O(n·W), which is linear in sequence length for a fixed W. Mistral and Gemma use it.
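The windowing rule can be sketched as a boolean mask. The helper name `sliding_window_mask` is hypothetical, and this sketch assumes the window of W tokens includes the current token; implementations differ on that convention.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Return an (n, n) boolean mask; True where query i may attend to key j."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (j > i - window)   # causal, and within the last `window`

mask = sliding_window_mask(n=8, window=3)
print(mask.astype(int))
print(mask.sum(axis=1))  # [1 2 3 3 3 3 3 3]
```

No row ever has more than `window` allowed positions, so the number of attended pairs is at most n·W rather than growing with n².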
