Training · intermediate

Gradient Descent (SGD, stochastic gradient descent)

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

Published May 29, 2026

Explanation

Compute the gradient of the loss with respect to the weights (via backpropagation), then update each weight by subtracting a small multiple of its gradient. Repeat this many times across batches of training data. Over many steps the weights move toward a local minimum of the loss.
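The update rule described above can be sketched in a few lines. This is a minimal illustration rather than a training loop from any particular library: the toy loss L(w) = (w - 3)^2 and the names `gradient_descent` and `toy_grad` are assumptions chosen for the example.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # the core update rule
    return w

# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def toy_grad(w):
    return 2.0 * (w - 3.0)

w_final = gradient_descent(toy_grad, w0=0.0)
print(w_final)  # close to 3.0, the minimizer of the toy loss
```

With a single weight there is nothing for backpropagation to do; in a real network the `grad` function would be supplied by backpropagation.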

Stochastic gradient descent (SGD) estimates the gradient from a small random batch of examples at each step rather than from the whole dataset. Modern training typically uses variants such as Adam and AdamW, which adapt the step size for each parameter using running averages of the gradients and of their squares.
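To make the per-parameter adaptation concrete, here is a sketch of a single Adam update in plain Python. The function name `adam_step` and the example gradients are hypothetical; the hyperparameter defaults are the commonly quoted ones, and AdamW's decoupled weight decay is left out for brevity.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters w with gradients g.

    m and v are the running averages of gradients and squared gradients;
    t is the 1-based step count used for bias correction.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # running mean of gradients
        vi = beta2 * vi + (1 - beta2) * gi * gi   # running mean of squared gradients
        m_hat = mi / (1 - beta1 ** t)             # correct the bias toward zero
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v

# Two parameters whose gradients differ by four orders of magnitude.
w = [1.0, 1.0]
g = [100.0, 0.01]
w1, m1, v1 = adam_step(w, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
steps_taken = [a - b for a, b in zip(w, w1)]
print(steps_taken)  # both entries are roughly lr, despite the gradient gap
```

Plain gradient descent would move the first parameter ten thousand times farther than the second here. Dividing by the root of the squared-gradient average is what evens out the two step sizes.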

Examples

A linear regression model learning the slope and intercept.
A 70B-parameter LLM trained with AdamW and a cosine learning-rate schedule.
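The linear regression example can be sketched end to end with mini-batch SGD. The data below is synthetic (generated from y = 2x + 1), and the function name `fit_line` and all hyperparameters are assumptions picked for illustration, not recommended settings.

```python
import random

def fit_line(xs, ys, learning_rate=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y = slope * x + intercept with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_slope = 0.0
            grad_intercept = 0.0
            for i in batch:
                error = (slope * xs[i] + intercept) - ys[i]
                grad_slope += 2.0 * error * xs[i] / len(batch)
                grad_intercept += 2.0 * error / len(batch)
            slope -= learning_rate * grad_slope
            intercept -= learning_rate * grad_intercept
    return slope, intercept

# Noise-free synthetic data from y = 2x + 1 (made-up numbers for illustration).
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # approaches 2 and 1
```

Each inner loop sees only four of the twenty points, which is the "stochastic" part: the batch gradient is a noisy estimate of the full-dataset gradient.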

Frequently asked

What is Gradient Descent?

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

What is an example of gradient descent?

A linear regression model learning the slope and intercept.

How is Gradient Descent related to Backpropagation?

Gradient descent and backpropagation work together during training. Backpropagation computes the gradient of the loss with respect to each weight by propagating gradients backward through the network; gradient descent then uses those gradients to update the weights.
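The division of labor can be shown on a tiny two-weight model, written out by hand. The model (a tanh unit followed by a scaling weight), the squared-error loss, and every number below are assumptions made up for this sketch.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation
    return h, w2 * h        # (hidden value, prediction)

def loss_and_grads(w1, w2, x, target):
    """Backpropagation by hand: apply the chain rule from the loss back to each weight."""
    h, y = forward(w1, w2, x)
    loss = (y - target) ** 2
    d_y = 2.0 * (y - target)          # dL/dy
    d_w2 = d_y * h                    # dL/dw2
    d_h = d_y * w2                    # dL/dh
    d_w1 = d_h * (1.0 - h * h) * x    # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return loss, d_w1, d_w2

# Made-up weights, input, and target for illustration.
w1, w2, x, target = 0.5, -0.3, 1.5, 1.0
loss_before, g1, g2 = loss_and_grads(w1, w2, x, target)

# Gradient descent consumes the gradients that backpropagation produced.
learning_rate = 0.1
w1_new = w1 - learning_rate * g1
w2_new = w2 - learning_rate * g2
loss_after, _, _ = loss_and_grads(w1_new, w2_new, x, target)
print(loss_before, loss_after)  # the loss drops after one update
```

In practice an automatic differentiation library performs the `loss_and_grads` step, and an optimizer performs the update.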

Is Gradient Descent considered intermediate?

Gradient Descent is generally considered intermediate-level material in the AI and LLM space.

Backpropagation · Training

Backpropagation is the algorithm used to compute how each weight in a neural network should change to reduce error, by propagating gradients backward through the network.

Learning Rate · Training

The learning rate is the step size used to update weights during training. Too high and training diverges; too low and it crawls or gets stuck.
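A toy run shows both failure modes. The loss L(w) = w^2 and the three learning rates are assumptions chosen so the effect is visible within 50 steps; real models have no such clean thresholds.

```python
def run(learning_rate, steps=50, w0=1.0):
    """Gradient descent on the toy loss L(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

too_high = run(1.1)      # each step overshoots and |w| grows
reasonable = run(0.1)    # |w| shrinks steadily toward 0
too_low = run(0.0001)    # barely moves from the starting point
print(abs(too_high), abs(reasonable), abs(too_low))
```

On this loss each step multiplies w by (1 - 2 * learning_rate), so any rate above 1.0 makes the magnitude grow instead of shrink.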

Loss Function · Training

A loss function measures how wrong a model's predictions are. Training minimizes it. For LLMs the loss is the cross-entropy of predicted vs. actual next tokens.
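For a single prediction, the next-token cross-entropy reduces to the negative log of the probability the model assigned to the token that actually came next. The three-word vocabulary and the probabilities below are invented for illustration, and `next_token_cross_entropy` is a hypothetical helper name.

```python
import math

def next_token_cross_entropy(predicted_probs, actual_token):
    """Negative log of the probability the model gave to the actual next token."""
    return -math.log(predicted_probs[actual_token])

# Hypothetical three-word vocabulary and made-up model outputs.
confident_right = {"cat": 0.9, "dog": 0.05, "car": 0.05}
confident_wrong = {"cat": 0.05, "dog": 0.9, "car": 0.05}

low_loss = next_token_cross_entropy(confident_right, "cat")
high_loss = next_token_cross_entropy(confident_wrong, "cat")
print(low_loss, high_loss)  # the confident wrong guess is penalized far more
```

Training averages this quantity over every token position in a batch, and gradient descent lowers that average.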

Neural Network · Foundations

A neural network is a stack of simple mathematical units ("neurons") that learn to transform inputs into outputs by adjusting numeric weights during training.

Sources

Wikipedia — Stochastic gradient descent