(Article #11 in the Build Your Own Small Language Model series)
Small Language Models (SLMs) are extremely sensitive to the learning rate — the number that determines how big each learning step is. But the learning rate doesn’t have to stay constant. In fact, the most stable and successful training runs depend heavily on learning rate schedules.
A schedule tells the optimizer:
- how fast to start learning
- how quickly to reach full speed
- how to slow down toward the end
- when to take tiny refinement steps
This article explains the schedules used in modern SLM training, why they’re necessary, and the best options for training your own specialized model.
1. Why Learning Rate Schedules Exist
If the learning rate stays constant:
- too high → unstable
- too low → slow
- dataset difficulty changes → a fixed LR no longer fits
- training early vs late → different needs
SLMs benefit from different learning speeds at different stages of training.
✔ Early training
Weights are random → updates must be gentle.
(Large changes cause instability.)
✔ Middle training
Model has grasped basics → can learn aggressively.
✔ Late training
Model needs refinement → small, precise steps required.
A schedule handles this progression automatically.
2. The Most Important Schedule: Warmup
Warmup increases the learning rate slowly from 0 to your target LR.
Without warmup:
- gradients explode
- training loss oscillates
- early instability ruins the run
- especially harmful for LoRA fine-tuning
With warmup:
- stable gradients
- predictable training
- smoother convergence
- fewer early spikes
Warmup typically lasts:
- 100–500 steps for small datasets
- 500–2,000 steps for large datasets
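As a minimal sketch (names like `warmup_lr` and `peak_lr` are my own, not from any particular library), linear warmup is just a ramp from 0 to the target LR:

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from 0 to peak_lr over warmup_steps."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps

# At step 0 the LR is 0; halfway through warmup it is half the peak.
print(warmup_lr(0, 2e-4, 300))    # 0.0
print(warmup_lr(150, 2e-4, 300))  # half of 2e-4
print(warmup_lr(300, 2e-4, 300))  # full peak LR
```

The early steps are tiny, which is exactly what randomly initialized (or freshly added LoRA) weights need.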
3. The Three Most Common Learning Rate Schedules
A. Linear Warmup → Linear Decay (Most Popular)
This is the default in many training scripts — and for good reason.
- Start slow
- Increase to peak LR
- Decrease linearly over the full run
When to use
✔ LoRA fine-tuning
✔ SLMs under 1B parameters
✔ Domain-specific training
✔ Stable, predictable learning
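The three phases above can be sketched in a few lines of plain Python (a self-contained illustration, not a specific framework's implementation):

```python
def linear_schedule(step: int, total_steps: int,
                    peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to 0 by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # start slow, rise to peak
    # fraction of the decay phase still remaining
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)      # fall linearly over the rest of the run

print(linear_schedule(300, 1000, 2e-4, 300))   # at the peak
print(linear_schedule(650, 1000, 2e-4, 300))   # halfway down
print(linear_schedule(1000, 1000, 2e-4, 300))  # 0.0 at the end
```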
B. Cosine Decay
After warmup, LR follows a cosine curve downward.
Advantages:
- smoother late-stage learning
- less aggressive changes
- often higher final accuracy
When to use:
✔ Mid-size datasets
✔ Longer training runs
✔ High-quality synthetic tasks
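The cosine shape is equally short to sketch. The half-cosine starts flat near the peak, steepens in the middle, and flattens again near zero, which is where the gentler late-stage learning comes from (again a self-contained sketch, with names of my own choosing):

```python
import math

def cosine_schedule(step: int, total_steps: int,
                    peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup, then cosine decay from peak_lr down to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # half-cosine: 1.0 at progress=0, 0.0 at progress=1
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(cosine_schedule(500, 1000, 1e-4, 500))   # peak, right after warmup
print(cosine_schedule(750, 1000, 1e-4, 500))   # half the peak, mid-decay
print(cosine_schedule(1000, 1000, 1e-4, 500))  # ~0 at the end
```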
C. Constant LR + Warmup
This is the simplest option, but rarely the best one.
When useful:
- very small datasets
- debugging runs
- quick experimental training cycles
Not recommended for serious training.
4. What Happens if the Schedule Is Wrong?
❌ No Warmup
Huge loss spikes
Training collapses
Weights destabilize
❌ No Decay
Model “plateaus” early
Overfits quickly
Fails to refine logic
❌ Decay too steep
Model stops learning too soon
Underfits
Loss stagnates
❌ Decay too slow
Model learns aggressively for too long
Overfits
Schedules are not optional — they shape the entire learning process.
5. Recommended Schedules for Granite-350M Excel SLM Training
✔ Best All-Purpose
learning_rate = 2e-4
lr_scheduler = linear
warmup_steps = 300
✔ For large datasets (80,000+ samples)
learning_rate = 1e-4
lr_scheduler = cosine
warmup_steps = 500–1000
✔ For laptop hardware
learning_rate = 5e-5
lr_scheduler = linear
warmup_steps = 200
✔ For LoRA training
learning_rate = 2e-4 to 1e-3
warmup_steps = 100
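The four presets above can be kept in one place as plain dictionaries. This is just a convenience sketch; the `PRESETS` name and scenario keys are my own, ranges are stored as `(low, high)` tuples, and the LoRA scheduler is assumed linear per section 3A:

```python
# Hyperparameter presets from this section (illustrative layout, not a framework API).
PRESETS = {
    "all_purpose":   {"learning_rate": 2e-4, "lr_scheduler": "linear", "warmup_steps": 300},
    "large_dataset": {"learning_rate": 1e-4, "lr_scheduler": "cosine", "warmup_steps": (500, 1000)},
    "laptop":        {"learning_rate": 5e-5, "lr_scheduler": "linear", "warmup_steps": 200},
    "lora":          {"learning_rate": (2e-4, 1e-3), "lr_scheduler": "linear", "warmup_steps": 100},
}

print(PRESETS["all_purpose"])
```

Keeping presets in data rather than scattered across scripts makes it easy to switch schedules between runs and log exactly which one was used.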
6. Visualizing the Learning Rate Curve
A good LR schedule looks like this:
- Phase 1: ramp up
- Phase 2: plateau
- Phase 3: gradual decrease
A bad LR schedule looks like:
- sudden spikes
- flat lines (no decay at all)
- sharp cliffs
- jagged stair-steps
A single plot can tell you immediately whether the schedule is healthy.
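You can also check the curve's shape numerically instead of eyeballing a plot. The sketch below samples a linear warmup→decay schedule (redefined here so the snippet is self-contained; a pure linear schedule has no Phase 2 plateau, so only the ramp and the decrease are checked):

```python
def linear_schedule(step: int, total_steps: int,
                    peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# Sample the full curve, then verify the healthy phases numerically.
lrs = [linear_schedule(s, 1000, 2e-4, 300) for s in range(1001)]
ramp = all(a < b for a, b in zip(lrs[:300], lrs[1:301]))           # strictly rising warmup
decay = all(a > b for a, b in zip(lrs[300:1000], lrs[301:1001]))   # strictly falling decay
print(ramp, decay)  # True True
```

The same two checks catch most of the "bad schedule" shapes listed above: spikes break the monotonic decay, and flat lines fail the strictly rising/falling tests.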
7. Final Tips for Effective LR Scheduling
✔ If loss oscillates → LR too high
✔ If loss flatlines → LR too low
✔ If model collapses early → warmup missing
✔ If model overfits → decay too slow
✔ If model underfits → decay too fast
Schedules are where mathematics meets intuition.
Learning rate schedules are one of the most powerful tools in SLM training. They keep learning stable early, aggressive in the middle, and refined at the end. For small models trained on domain-specific datasets — like your Excel SLM — choosing the right schedule may be the single most important hyperparameter decision you make.