ModelTerms

Training · intermediate

Learning Rate

The learning rate is the step size used to update weights during training. Too high and training diverges; too low and it crawls or gets stuck.
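The trade-off can be seen on a toy problem. The sketch below assumes a one-dimensional loss f(w) = w**2 and plain gradient descent; the function name `run_gradient_descent` and the three step sizes are illustrative choices, not values from any real training run.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimize the toy loss f(w) = w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # the learning rate scales the step
    return w

too_high = run_gradient_descent(lr=1.1)    # |w| grows every step: diverges
too_low = run_gradient_descent(lr=0.001)   # w barely moves away from 1.0
reasonable = run_gradient_descent(lr=0.1)  # w shrinks toward the minimum at 0
```

On this toy loss each step multiplies w by (1 - 2 * lr), so any lr above 1.0 makes the iterate grow instead of shrink. Real loss surfaces are not this simple, but the same qualitative behavior shows up.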

Explanation

Pick the wrong learning rate and nothing else matters: the loss will not go down. Modern training uses a schedule rather than a fixed value: typically a brief warmup up to a peak (e.g., 1e-4), followed by cosine decay to near zero.
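A warmup-plus-cosine schedule can be sketched as a pure function of the step number. This is a minimal illustration, assuming linear warmup and a hypothetical helper named `lr_at_step`; the parameter names and the default `min_lr` of zero are assumptions, and real frameworks ship their own scheduler classes.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=1e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from a small value up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Sample the schedule at a few points of a 1000-step run with 100 warmup steps.
samples = [lr_at_step(s, total_steps=1000, warmup_steps=100)
           for s in (0, 99, 550, 1000)]
```

With these settings the rate climbs to the peak by the end of warmup, sits at about half the peak midway through the decay phase, and reaches `min_lr` at the final step.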

Different parameters often want different effective learning rates, which is why adaptive optimizers like Adam dominate. Full fine-tuning typically uses a learning rate 10-100x smaller than pretraining to avoid destroying the pretrained representations; parameter-efficient methods such as LoRA are usually run at higher rates.
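To see what "different effective learning rates" means, here is a minimal sketch of one Adam update for a single scalar parameter, written from the standard published update rule. The function name `adam_step` and the example gradient values are illustrative assumptions; this is not a replacement for a library optimizer.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by four orders of magnitude.
big, _, _ = adam_step(param=0.0, grad=100.0, m=0.0, v=0.0, t=1)
small, _, _ = adam_step(param=0.0, grad=0.01, m=0.0, v=0.0, t=1)
```

Plain gradient descent with the same `lr` would move these two parameters by 0.1 and 0.00001 respectively. Adam moves both by roughly `lr`, because each step is divided by that parameter's own gradient magnitude estimate.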

Examples

  • Pretraining: peak LR around 1e-4 with cosine decay.
  • LoRA fine-tuning: 1e-4 to 3e-4 typical.
  • Full fine-tuning: 1e-5 to 5e-5 typical.

Frequently asked


How is Learning Rate related to Gradient Descent?

Gradient descent is the optimization algorithm that nudges each weight in the direction that reduces the loss. The learning rate is the step size of that nudge: it scales the gradient before it is subtracted from the weights.


Gradient Descent · Training

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Loss Function · Training

A loss function measures how wrong a model's predictions are. Training minimizes it. For LLMs the loss is the cross-entropy of predicted vs. actual next tokens.

Scaling Laws · Training

Scaling laws are empirical power-law relationships between model size, training data, training compute, and resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.
