Training · intermediate

Loss Function (objective, cost function)

A loss function measures how wrong a model's predictions are. Training minimizes it. For LLMs the loss is typically the cross-entropy between the model's predicted next-token distribution and the actual next token.

Published May 29, 2026

Explanation

For LLMs, the loss is almost always cross-entropy over the vocabulary: the negative log-probability the model assigned to the correct next token. The per-token values are then averaged over all positions and all training examples.
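
The averaging described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular framework implements it: the function names, the three-token vocabulary, and the logit values are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean negative log-probability assigned to the correct next token."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical example: a 3-token vocabulary and two positions.
logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]
targets = [0, 2]  # index of the correct next token at each position
loss = next_token_loss(logits, targets)
```

A model that spreads probability evenly over the vocabulary gets a loss of log(vocabulary size); putting more probability on the correct token pushes the loss toward zero.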

Lower loss usually means a better model, but not always. The link is weakest after RLHF, where the goal shifts to maximizing a learned reward rather than matching specific text.

Examples

Cross-entropy loss in next-token prediction.
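Reward model loss in RLHF: measures how well the reward model ranks pairs of responses, penalizing it when the rejected response scores higher than the preferred one.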
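
As a rough sketch of the second example, one common way to score a ranked pair is the pairwise logistic loss, -log(sigmoid(r_chosen - r_rejected)). The entry itself does not say which formulation is meant, so treat this choice, along with the function and parameter names, as an assumption made for illustration.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumed formulation for illustration; computed in a numerically
    stable way using log(1 + exp(-margin)).
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward scores for a preferred and a rejected response.
good_ranking = pairwise_ranking_loss(2.0, 0.0)  # preferred scored higher
bad_ranking = pairwise_ranking_loss(0.0, 2.0)   # preferred scored lower
```

The loss is small when the preferred response gets the higher score and grows roughly linearly as the ranking becomes more wrong.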

Frequently asked

How is a loss function related to gradient descent?

The loss function defines what training should minimize; gradient descent is the optimization algorithm that does the minimizing. At each step it nudges each weight in the direction that reduces the loss, with a small step size set by the learning rate.
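
To make the relationship concrete, here is a small sketch of gradient descent minimizing a loss. It uses mean squared error on a one-weight model rather than an LLM's cross-entropy, purely to keep the example short; the data, learning rate, and function names are illustrative assumptions.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of mse_loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data generated by y = 2x, so the best weight is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0
learning_rate = 0.05
history = []  # loss before each update, to show it going down
for step in range(100):
    history.append(mse_loss(w, xs, ys))
    # Step against the gradient: the direction that reduces the loss.
    w -= learning_rate * mse_grad(w, xs, ys)
```

The loss function supplies the quantity and its gradient; the update rule only decides how far to move along it.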

Is the loss function considered an intermediate topic?

Yes. The loss function is generally considered intermediate-level material in the AI and LLM space.

Gradient Descent · Training

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

Backpropagation · Training

Backpropagation is the algorithm used to compute how each weight in a neural network should change to reduce error, by propagating gradients backward through the network.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Perplexity · Evaluation

Perplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the natural intrinsic eval for next-token prediction.
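
Perplexity is conventionally computed as the exponential of the mean cross-entropy loss (in nats) on held-out text, which is why it tracks the training loss so closely. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of each held-out token.

    token_probs holds the probability the model assigned to each
    correct token; the values used below are illustrative only.
    """
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that always assigns probability 1/4 to the correct token
# is as uncertain as a uniform choice among four options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8])
unsure = perplexity([0.1, 0.2])
```

Read this way, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.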

Sources

Wikipedia — Loss function