Training · intermediate
Learning Rate
The learning rate is the step size used to update weights during training. Too high and training diverges; too low and it crawls or gets stuck.
Explanation
A badly chosen learning rate undermines everything else in a training run: the loss diverges or stops improving no matter how good the model and data are. Modern training usually follows a schedule: a brief warmup, a peak (e.g., 1e-4), and a cosine decay to near zero.
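A minimal sketch of such a schedule, assuming linear warmup (the text says only "brief warmup") and a hypothetical helper name `lr_at_step`. The step counts are illustrative, not recommendations.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=1e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr.

    The linear warmup shape and the default values are assumptions
    made for illustration.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate for every step of a 1000-step run with 100 warmup steps.
schedule = [lr_at_step(s, total_steps=1000, warmup_steps=100) for s in range(1000)]
```

The function is stateless, so the same value can be recomputed when resuming from a checkpoint without storing the schedule.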
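Different parameters often want different effective learning rates, which is why adaptive optimizers like Adam dominate: they scale each parameter's step by a running estimate of its gradient magnitude. Full fine-tuning typically uses a learning rate several times to an order of magnitude smaller than the pretraining peak (e.g., 1e-5 to 5e-5 against 1e-4) to avoid destroying the pretrained representations. LoRA fine-tuning, which trains only small adapter matrices and leaves the pretrained weights frozen, typically runs at higher rates.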
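To illustrate the point about effective learning rates, here is a rough sketch of a single-parameter Adam update written from the standard update rule. The function name `adam_update` and the gradient values are made up for the example, and only the first step is shown.

```python
import math

def adam_update(grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    Returns (step, m, v); the caller subtracts `step` from the weight.
    """
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return step, m, v

lr = 1e-4
grads = [100.0, 0.001]  # two parameters with very different gradient scales

# Plain SGD: the step is lr * grad, so the two steps differ by a factor of 1e5.
sgd_steps = [lr * g for g in grads]

# Adam, first step (t = 1, moments start at zero): both steps come out close to lr.
adam_steps = [adam_update(g, m=0.0, v=0.0, t=1, lr=lr)[0] for g in grads]
```

With plain SGD the large-gradient parameter moves far more than the small-gradient one. Adam's normalization brings both steps to roughly the same size, so a single global learning rate works across parameters.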
Examples
- Pretraining: peak LR around 1e-4 with cosine decay.
- LoRA fine-tuning: 1e-4 to 3e-4 typical.
- Full fine-tuning: 1e-5 to 5e-5 typical.
Frequently asked
What is Learning Rate?
The learning rate is the step size used to update weights during training. Too high and training diverges; too low and it crawls or gets stuck.
What is an example of a learning rate?
Pretraining typically uses a peak LR around 1e-4 with cosine decay. Full fine-tuning typically uses 1e-5 to 5e-5, and LoRA fine-tuning 1e-4 to 3e-4.
How is Learning Rate related to Gradient Descent?
The learning rate is the step-size parameter of gradient descent. Gradient descent is the optimization algorithm at the heart of training: it nudges each weight in the direction that reduces the loss, and the learning rate sets how large each nudge is.
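As a toy illustration of that relationship, the sketch below runs plain gradient descent on the one-dimensional loss f(w) = w**2. The function name `gradient_descent` and the specific learning rates are chosen for the example. The thresholds depend on this particular loss and do not carry over to real models.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # step against the gradient, scaled by lr
    return w

too_high = gradient_descent(lr=1.1)    # each step overshoots; |w| keeps growing
too_low = gradient_descent(lr=1e-4)    # w has barely moved from 1.0 after 50 steps
reasonable = gradient_descent(lr=0.1)  # w shrinks steadily toward the minimum at 0
```

For this loss each update multiplies w by (1 - 2 * lr). Any lr above 1.0 therefore makes |w| grow, while a very small lr leaves w almost unchanged.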
Is Learning Rate considered intermediate?
Learning Rate is generally considered intermediate-level material in the AI and LLM space.