ModelTerms

Architecture · beginner

Parameter Count (model size)

Parameter count is the total number of learnable weights in a model — "7B" means 7 billion parameters. It is the most cited model-size metric, though not always the most informative.

Explanation

Each weight is a number adjusted during training. Today's open models span roughly 1B to over 400B parameters; frontier closed models are presumed to be larger, and are often built as Mixture-of-Experts, where the "total" and "active" counts differ.

Parameter count correlates with capability, but only roughly. A well-trained 70B model can beat a poorly trained 175B one. The Chinchilla finding showed that many older "huge" models were undertrained: they had too many parameters for the number of tokens they were trained on.
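To make "undertrained" concrete, here is a minimal sketch that compares training tokens per parameter. The token counts are the commonly reported figures for GPT-3 and Chinchilla and should be read as approximate; the `tokens_per_param` helper is a hypothetical name used only for illustration.

```python
# Tokens seen in training divided by parameter count.
# A low ratio suggests the model had more capacity than data.
# (params, training tokens) are commonly reported, approximate figures.
MODELS = {
    "GPT-3 175B": (175e9, 300e9),
    "Chinchilla 70B": (70e9, 1.4e12),
}


def tokens_per_param(num_params, num_tokens):
    """Return how many training tokens the model saw per parameter."""
    return num_tokens / num_params


for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

Under these figures the 70B model saw roughly ten times more data per parameter than the 175B one, which is the sense in which the larger model was undertrained.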

For inference, parameter count multiplied by bytes per parameter sets the memory floor: a 70B model in BF16 (2 bytes per parameter) needs ~140 GB of GPU memory just to hold the weights.
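The weight-memory arithmetic can be sketched as below. This is a rough estimate under stated assumptions: it counts weights only (no KV cache, activations, or framework overhead) and uses decimal gigabytes. The `weight_memory_gb` helper and the `BYTES_PER_PARAM` table are illustrative names, not part of any library.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype="bf16"):
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"70B in {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

Real deployments need headroom beyond this figure, so it is best read as a lower bound.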

Examples

  • Llama 3 family: 8B, 70B, 405B.
  • Mixtral 8x7B: 47B total / ~13B active per token.
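The Mixtral figures show why "8x7B" does not mean 56B. As a back-of-envelope sketch, assume the model splits into shared parameters (attention, embeddings) plus equally sized experts, and that 2 of the 8 experts run per token. The published totals then pin down both pieces. The function name and the clean shared/expert split are simplifying assumptions for illustration.

```python
def split_moe_params(total, active, num_experts, experts_per_token):
    """Solve for shared and per-expert parameter counts.

    Assumes:
        total  = shared + num_experts * expert
        active = shared + experts_per_token * expert
    """
    expert = (total - active) / (num_experts - experts_per_token)
    shared = total - num_experts * expert
    return shared, expert


# Figures from the Mixtral 8x7B example: 47B total, ~13B active.
shared, expert = split_moe_params(47e9, 13e9, num_experts=8, experts_per_token=2)
print(f"shared: ~{shared / 1e9:.1f}B, per expert: ~{expert / 1e9:.1f}B")
```

For memory planning the total count is what matters, since every expert must be loaded; the active count is what drives per-token compute.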

Frequently asked

How is Parameter Count related to Large Language Model?

Parameter count is the usual way to state the size of a large language model. A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.

Is Parameter Count considered beginner?

Parameter Count is generally considered beginner-level material in the AI and LLM space.

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.

Mixture of Experts · Architecture

Mixture of Experts is a transformer variant where each layer has many parallel "expert" feed-forward networks, but only a few are activated per token. Total parameters grow without growing per-token compute.

Scaling Laws · Training

Scaling laws are the empirical power-law relationships between model size, training data, training compute, and resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.

Training Compute · Training

Training compute is the total number of floating-point operations used to pretrain a model, usually expressed in FLOPs (e.g. 10^25 FLOPs). It is the headline figure that some AI regulations use as a threshold.

Quantization · Infrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.
