Learning Rate Schedules — Warmup, Decay & Why They Matter

(Article #11 in the Build Your Own Small Language Model series)

Small Language Models (SLMs) are extremely sensitive to the learning rate — the number that determines how big each learning step is. But the learning rate doesn’t have to stay constant. In fact, the most stable and successful training runs depend heavily on learning rate schedules.

A schedule tells the optimizer:

  • how fast to start learning
  • how quickly to reach full speed
  • how to slow down toward the end
  • when to take tiny refinement steps

This article explains the schedules used in modern SLM training, why they’re necessary, and the best options for training your own specialized model.

1. Why Learning Rate Schedules Exist

If the learning rate stays constant:

  • too high → unstable
  • too low → slow
  • changing dataset difficulty → mismatch
  • training early vs late → different needs

SLMs benefit from different learning speeds at different stages of training.

✔ Early training

Weights are random → updates must be gentle.
(Large changes cause instability.)

✔ Middle training

Model has grasped basics → can learn aggressively.

✔ Late training

Model needs refinement → small, precise steps required.

A schedule handles this progression automatically.

2. The Most Important Schedule: Warmup

Warmup increases the learning rate slowly from 0 to your target LR.

Without warmup:

  • gradients explode
  • training loss oscillates
  • early instability ruins the run
  • especially harmful for LoRA fine-tuning

With warmup:

  • stable gradients
  • predictable training
  • smoother convergence
  • fewer early spikes

Warmup typically lasts:

  • 100–500 steps for small datasets
  • 500–2000 steps for large datasets

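As a concrete sketch, warmup is just a linear ramp from 0 to the peak LR (the function name and numbers below are illustrative, not from any particular library):

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to peak_lr over warmup_steps."""
    if step >= warmup_steps:
        return peak_lr                      # warmup finished: run at full speed
    return peak_lr * step / warmup_steps    # gentle early updates

print(warmup_lr(0, 2e-4, 300))    # step 0   -> 0.0
print(warmup_lr(150, 2e-4, 300))  # halfway  -> 0.0001
print(warmup_lr(300, 2e-4, 300))  # done     -> 0.0002
```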
3. The Three Most Common Learning Rate Schedules

A. Linear Warmup → Linear Decay (Most Popular)

This is the default in many training scripts — and for good reason.

  1. Start slow
  2. Increase to peak LR
  3. Decrease linearly over the full run

When to use

✔ LoRA fine-tuning
✔ SLMs under 1B parameters
✔ Domain-specific training
✔ Stable, predictable learning
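The three steps above can be written as one small function. This is a sketch of the shape, not a specific library's implementation (Hugging Face's `get_linear_schedule_with_warmup` produces the same curve):

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Ramp linearly to peak_lr, then decay linearly to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # start slow, climb to peak
    frac_left = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, frac_left)                  # decrease linearly to 0
```

For example, with a peak of 2e-4, 300 warmup steps, and 3000 total steps, the LR peaks at step 300, is half the peak around step 1650, and reaches 0 at step 3000.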

B. Cosine Decay

After warmup, LR follows a cosine curve downward.

Advantages:

  • smoother late-stage learning
  • less aggressive changes
  • often higher final accuracy

When to use:
✔ Mid-size datasets
✔ Longer training runs
✔ High-quality synthetic tasks
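The "cosine curve downward" is half a cosine period scaled to the peak LR. A minimal sketch, with an illustrative function name:

```python
import math

def cosine_with_warmup(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # 0.5 * (1 + cos(pi * progress)) runs from 1.0 down to 0.0,
    # flattening out near both ends of the decay
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The flat tail is what makes late-stage learning smoother than linear decay: the last stretch of training takes many tiny refinement steps instead of hitting zero at a constant slope.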

C. Constant LR + Warmup

This is the simplest option, but rarely the best.

When useful:

  • very small datasets
  • debugging runs
  • quick experimental training cycles

Not recommended for serious training.

4. What Happens if the Schedule Is Wrong?

❌ No Warmup

  • huge loss spikes
  • training collapses
  • weights destabilize

❌ No Decay

  • model “plateaus” early
  • overfits quickly
  • fails to refine logic

❌ Decay too steep

  • model stops learning too soon
  • underfits
  • loss stagnates

❌ Decay too slow

  • model learns aggressively for too long
  • overfits

Schedules are not optional — they shape the entire learning process.

5. Recommended Schedules for Granite-350M Excel SLM Training

✔ Best All-Purpose

learning_rate = 2e-4
lr_scheduler = linear
warmup_steps = 300

✔ For large datasets (80,000+ samples)

learning_rate = 1e-4
lr_scheduler = cosine
warmup_steps = 500–1000

✔ For laptop hardware

learning_rate = 5e-5
lr_scheduler = linear
warmup_steps = 200

✔ For LoRA training

learning_rate = 2e-4 to 1e-3
warmup_steps = 100
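Collected as plain config dictionaries for easy copying. The key names mirror common trainer conventions (e.g. Hugging Face's `TrainingArguments`) but are an assumption here; map them onto whatever training script you use. Where the article gives a range, one point inside it is chosen and noted:

```python
# The recommended schedules above as ready-to-copy presets.
SCHEDULE_PRESETS = {
    "all_purpose":   {"learning_rate": 2e-4, "lr_scheduler_type": "linear", "warmup_steps": 300},
    "large_dataset": {"learning_rate": 1e-4, "lr_scheduler_type": "cosine", "warmup_steps": 1000},  # within 500–1000
    "laptop":        {"learning_rate": 5e-5, "lr_scheduler_type": "linear", "warmup_steps": 200},
    "lora":          {"learning_rate": 2e-4, "lr_scheduler_type": "linear", "warmup_steps": 100},   # LR range: 2e-4 to 1e-3
}
```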

6. Visualizing the Learning Rate Curve

A good LR schedule looks like this:

  • Phase 1: ramp up
  • Phase 2: plateau
  • Phase 3: gradual decrease

A bad LR schedule looks like:

  • sudden spikes
  • flat lines
  • sharp cliffs
  • jagged stair-steps

A single plot can tell you immediately whether the schedule is healthy.
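That eyeball check can also be automated. Given the per-step LR values your trainer logs, a healthy curve rises monotonically through warmup and never rises again afterwards (a plateau is fine, a spike is not). A sketch, with toy traces as data:

```python
def lr_curve_is_healthy(lrs, warmup_steps):
    """Check a recorded per-step LR trace: ramp up, then never increase again."""
    ramp = lrs[: warmup_steps + 1]
    tail = lrs[warmup_steps:]
    rises = all(a < b for a, b in zip(ramp, ramp[1:]))   # phase 1: strict ramp up
    falls = all(a >= b for a, b in zip(tail, tail[1:]))  # phases 2–3: plateau, then decay
    return rises and falls

healthy = [0.0, 1e-4, 2e-4, 1.8e-4, 1.2e-4, 0.0]
spiky   = [0.0, 1e-4, 2e-4, 5e-4, 1e-4, 0.0]   # sudden spike after warmup

print(lr_curve_is_healthy(healthy, 2))  # True
print(lr_curve_is_healthy(spiky, 2))    # False
```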

7. Final Tips for Effective LR Scheduling

✔ If loss oscillates → LR too high

✔ If loss flatlines → LR too low

✔ If model collapses early → warmup missing

✔ If model overfits → decay too slow

✔ If model underfits → decay too fast

Schedules are where mathematics meets intuition.

Learning rate schedules are one of the most powerful tools in SLM training. They keep learning stable early, aggressive in the middle, and refined at the end. For small models trained on domain-specific datasets — like your Excel SLM — choosing the right schedule may be the single most important hyperparameter decision you make.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
