ModelTerms

Training · intermediate

Gradient Descent (SGD, stochastic gradient descent)

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

Explanation

Compute the gradient of the loss with respect to the weights (via backpropagation), then update each weight by subtracting its gradient multiplied by the learning rate. Repeat this many times across batches of training data. With a suitably small learning rate, the weights typically move toward a local minimum of the loss.
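
The update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the names gradient_descent, toy_grad and TARGET are hypothetical, and the toy loss L(w) = sum((w_i - t_i)^2) is an assumption chosen because its gradient, 2 * (w_i - t_i), is easy to write by hand.

```python
def gradient_descent(grad_fn, weights, learning_rate=0.1, steps=100):
    """Repeatedly move each weight against its gradient."""
    weights = list(weights)  # copy so the caller's list is not modified
    for _ in range(steps):
        grads = grad_fn(weights)
        # Core update: w <- w - learning_rate * dL/dw, applied per weight.
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Toy loss (an assumption for illustration): L(w) = sum((w_i - t_i)^2).
TARGET = [3.0, -2.0]


def toy_grad(weights):
    """Gradient of the toy loss: dL/dw_i = 2 * (w_i - t_i)."""
    return [2.0 * (w - t) for w, t in zip(weights, TARGET)]


final_weights = gradient_descent(toy_grad, [0.0, 0.0])
print(final_weights)  # values close to TARGET
```

In a real model the gradient would come from backpropagation rather than a hand-written formula, but the update line stays the same.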

Modern training often uses variants such as Adam and AdamW, which adapt the per-parameter step size using running averages of the gradients and of their squares. Stochastic gradient descent (SGD) computes each step from a small random batch of examples rather than from the whole dataset.
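
To make the "running averages" idea concrete, here is a rough sketch of a single Adam update in plain Python. The function name adam_step and the toy quadratic loss are hypothetical, and the learning rate of 0.1 is picked for this toy problem only (real training runs usually use much smaller values). AdamW's decoupled weight decay is not shown.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (params, m, v) after one Adam step; t counts from 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g      # running average of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g  # running average of squared gradients
        m_hat = m_i / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: larger where gradients have been small, and vice versa.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Toy loss (an assumption for illustration): L(w) = sum((w_i - t_i)^2).
target = [3.0, -2.0]
params = [0.0, 0.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 501):
    grads = [2.0 * (p - tgt) for p, tgt in zip(params, target)]
    params, m, v = adam_step(params, grads, m, v, t)
print(params)  # values close to target
```

Because each parameter is divided by its own squared-gradient average, parameters with very different gradient scales end up taking steps of comparable size.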

Examples

  • A linear regression model learning the slope and intercept.
  • A 70B-parameter LLM trained with AdamW and a cosine learning-rate schedule.
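
The first example can be sketched with mini-batch SGD in plain Python. The function name fit_line_sgd, the synthetic noise-free data (y = 2x + 1), and the hyperparameters are all assumptions made for illustration; the loss is mean squared error over each batch.

```python
import random


def fit_line_sgd(xs, ys, learning_rate=0.1, epochs=200, batch_size=4, seed=0):
    """Fit y ~ slope * x + intercept with mini-batch SGD on squared error."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random batch order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(slope * xs[i] + intercept) - ys[i] for i in batch]
            # Gradients of the batch mean squared error.
            grad_slope = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_intercept = 2.0 * sum(errors) / len(batch)
            slope -= learning_rate * grad_slope
            intercept -= learning_rate * grad_intercept
    return slope, intercept


# Synthetic data (an assumption): points on the line y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
slope, intercept = fit_line_sgd(xs, ys)
print(slope, intercept)  # close to 2 and 1
```

Each step sees only four of the twenty points, which is what makes the procedure stochastic; with noisy real data the estimates would hover around the best-fit line rather than settle on it exactly.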

Frequently asked

What is Gradient Descent?

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

What is an example of gradient descent?

A linear regression model learning its slope and intercept by repeatedly adjusting both to reduce the squared prediction error.

How is Gradient Descent related to Backpropagation?

Backpropagation and gradient descent work together during training. Backpropagation computes the gradient of the loss with respect to each weight by propagating gradients backward through the network; gradient descent then uses those gradients to update the weights.
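
As a rough illustration of how the two fit together, the sketch below backpropagates by hand through a tiny two-weight network, y = w2 * tanh(w1 * x), and then takes one gradient descent step. The network, the function names, and the input values are all hypothetical and chosen only to keep the chain rule short.

```python
import math


def forward(w1, w2, x):
    """Tiny network: hidden value h = tanh(w1 * x), output y = w2 * h."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y


def loss_and_grads(w1, w2, x, target):
    """Squared-error loss plus its gradients, computed by the chain rule."""
    h, y = forward(w1, w2, x)
    loss = (y - target) ** 2
    # Backward pass: start at the loss and work back toward the inputs.
    dloss_dy = 2.0 * (y - target)
    grad_w2 = dloss_dy * h              # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1.0 - h * h)  # derivative of tanh(z) is 1 - tanh(z)^2
    grad_w1 = dloss_dz * x              # z = w1 * x
    return loss, grad_w1, grad_w2


w1, w2, x, target = 0.5, -0.3, 1.5, 1.0
loss_before, g1, g2 = loss_and_grads(w1, w2, x, target)

# Gradient descent uses the gradients that backpropagation produced.
learning_rate = 0.1
w1 = w1 - learning_rate * g1
w2 = w2 - learning_rate * g2
loss_after, _, _ = loss_and_grads(w1, w2, x, target)
print(loss_before, loss_after)  # the loss should go down after the step
```

Deep learning frameworks automate the backward pass, but the division of labor is the same: one routine produces gradients, the other consumes them.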

Is Gradient Descent considered intermediate?

Gradient Descent is generally considered intermediate-level material in the AI and LLM space.
