(Article #11 in the Build Your Own Small Language Model series)
Small Language Models (SLMs) are extremely sensitive to the learning rate — the number that determines how big each learning step is. But the learning rate doesn’t have to stay constant. In fact, the most stable and successful training runs depend heavily on learning rate schedules.
A schedule tells the optimizer:
- how fast to start learning
- how quickly to reach full speed
- how to slow down toward the end
- when to take tiny refinement steps
This article explains the schedules used in modern SLM training, why they’re necessary, and the best options for training your own specialized model.
1. Why Learning Rate Schedules Exist
If the learning rate stays constant:
- too high → unstable
- too low → slow
- dataset difficulty changes → a fixed LR no longer fits
- training early vs late → different needs
SLMs benefit from different learning speeds at different stages of training.
✔ Early training
Weights are random → updates must be gentle.
(Large changes cause instability.)
✔ Middle training
Model has grasped basics → can learn aggressively.
✔ Late training
Model needs refinement → small, precise steps required.
A schedule handles this progression automatically.
2. The Most Important Schedule: Warmup
Warmup increases the learning rate slowly from 0 to your target LR.
Without warmup:
- gradients explode
- training loss oscillates
- early instability ruins the run
- especially harmful for LoRA fine-tuning
With warmup:
- stable gradients
- predictable training
- smoother convergence
- fewer early spikes
Warmup typically lasts:
- 100–500 steps for small datasets
- 500–2,000 steps for large datasets
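As a minimal sketch (names like `warmup_lr` and `peak_lr` are my own, not from any particular library), linear warmup is just a ramp from 0 to the target LR:

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from 0 to peak_lr over warmup_steps."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps

# At step 0 the LR is 0; halfway through warmup it is half the peak.
print(warmup_lr(0, 2e-4, 300))    # 0.0
print(warmup_lr(150, 2e-4, 300))  # half of 2e-4
print(warmup_lr(300, 2e-4, 300))  # full peak LR
```

The early steps are tiny, which is exactly what randomly initialized (or freshly added LoRA) weights need.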
3. The Three Most Common Learning Rate Schedules
A. Linear Warmup → Linear Decay (Most Popular)
This is the default in many training scripts — and for good reason.
- Start slow
- Increase to peak LR
- Decrease linearly over the full run
When to use
✔ LoRA fine-tuning
✔ SLMs under 1B parameters
✔ Domain-specific training
✔ Stable, predictable learning
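The three phases above can be sketched in a few lines of plain Python (a self-contained illustration, not a specific framework's implementation):

```python
def linear_schedule(step: int, total_steps: int,
                    peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to 0 by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # start slow, rise to peak
    # fraction of the decay phase still remaining
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)      # fall linearly over the rest of the run

print(linear_schedule(300, 1000, 2e-4, 300))   # at the peak
print(linear_schedule(650, 1000, 2e-4, 300))   # halfway down
print(linear_schedule(1000, 1000, 2e-4, 300))  # 0.0 at the end
```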
B. Cosine Decay
After warmup, LR follows a cosine curve downward.
Advantages:
- smoother late-stage learning
- less aggressive changes
- often higher final accuracy
When to use:
✔ Mid-size datasets
✔ Longer training runs
✔ High-quality synthetic tasks
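The cosine shape is equally short to sketch. The half-cosine starts flat near the peak, steepens in the middle, and flattens again near zero, which is where the gentler late-stage learning comes from (again a self-contained sketch, with names of my own choosing):

```python
import math

def cosine_schedule(step: int, total_steps: int,
                    peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup, then cosine decay from peak_lr down to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # half-cosine: 1.0 at progress=0, 0.0 at progress=1
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(cosine_schedule(500, 1000, 1e-4, 500))   # peak, right after warmup
print(cosine_schedule(750, 1000, 1e-4, 500))   # half the peak, mid-decay
print(cosine_schedule(1000, 1000, 1e-4, 500))  # ~0 at the end
```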
C. Constant LR + Warmup
This is the simplest option, but rarely the best one.
When useful:
- very small datasets
- debugging runs
- quick experimental training cycles
Not recommended for serious training.
4. What Happens if the Schedule Is Wrong?
❌ No Warmup
Huge loss spikes
Training collapses
Weights destabilize
❌ No Decay
Model “plateaus” early
Overfits quickly
Fails to refine logic
❌ Decay too steep
Model stops learning too soon
Underfits
Loss stagnates
❌ Decay too slow
Model learns aggressively for too long
Overfits
Schedules are not optional — they shape the entire learning process.
5. Recommended Schedules for Granite-350M Excel SLM Training
✔ Best All-Purpose
learning_rate = 2e-4
lr_scheduler = linear
warmup_steps = 300
✔ For large datasets (80,000+ samples)
learning_rate = 1e-4
lr_scheduler = cosine
warmup_steps = 500–1000
✔ For laptop hardware
learning_rate = 5e-5
lr_scheduler = linear
warmup_steps = 200
✔ For LoRA training
learning_rate = 2e-4 to 1e-3
warmup_steps = 100
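The four presets above can be kept in one place as plain dictionaries. This is just a convenience sketch; the `PRESETS` name and scenario keys are my own, ranges are stored as `(low, high)` tuples, and the LoRA scheduler is assumed linear per section 3A:

```python
# Hyperparameter presets from this section (illustrative layout, not a framework API).
PRESETS = {
    "all_purpose":   {"learning_rate": 2e-4, "lr_scheduler": "linear", "warmup_steps": 300},
    "large_dataset": {"learning_rate": 1e-4, "lr_scheduler": "cosine", "warmup_steps": (500, 1000)},
    "laptop":        {"learning_rate": 5e-5, "lr_scheduler": "linear", "warmup_steps": 200},
    "lora":          {"learning_rate": (2e-4, 1e-3), "lr_scheduler": "linear", "warmup_steps": 100},
}

print(PRESETS["all_purpose"])
```

Keeping presets in data rather than scattered across scripts makes it easy to switch schedules between runs and log exactly which one was used.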
6. Visualizing the Learning Rate Curve
A good LR schedule looks like this:
- Phase 1: ramp up
- Phase 2: plateau
- Phase 3: gradual decrease
A bad LR schedule looks like:
- sudden spikes
- flat lines (no decay at all)
- sharp cliffs
- jagged stair-steps
A single plot can tell you immediately whether the schedule is healthy.
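You can also check the curve's shape numerically instead of eyeballing a plot. The sketch below samples a linear warmup→decay schedule (redefined here so the snippet is self-contained; a pure linear schedule has no Phase 2 plateau, so only the ramp and the decrease are checked):

```python
def linear_schedule(step: int, total_steps: int,
                    peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# Sample the full curve, then verify the healthy phases numerically.
lrs = [linear_schedule(s, 1000, 2e-4, 300) for s in range(1001)]
ramp = all(a < b for a, b in zip(lrs[:300], lrs[1:301]))           # strictly rising warmup
decay = all(a > b for a, b in zip(lrs[300:1000], lrs[301:1001]))   # strictly falling decay
print(ramp, decay)  # True True
```

The same two checks catch most of the "bad schedule" shapes listed above: spikes break the monotonic decay, and flat lines fail the strictly rising/falling tests.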
7. Final Tips for Effective LR Scheduling
✔ If loss oscillates → LR too high
✔ If loss flatlines → LR too low
✔ If model collapses early → warmup missing
✔ If model overfits → decay too slow
✔ If model underfits → decay too fast
Schedules are where mathematics meets intuition.
Learning rate schedules are one of the most powerful tools in SLM training. They keep learning stable early, aggressive in the middle, and refined at the end. For small models trained on domain-specific datasets — like your Excel SLM — choosing the right schedule may be the single most important hyperparameter decision you make.