Learning Rates & Optimizers — How SLMs Actually Improve

(Article #6 in the Build Your Own Small Language Model series)

Training a Small Language Model might look like a single monolithic process, but its success depends heavily on two key components:

  • the learning rate, which defines how big each learning step is, and
  • the optimizer, which defines how those steps are calculated.

If you choose these well, your SLM quickly becomes accurate, stable, and specialized. If you choose them poorly, the result is a model that:

  • refuses to learn
  • forgets earlier progress
  • becomes unstable
  • or outputs “hallucinated” formulas, code, or nonsense

This article breaks down the essentials you need to tune both learning rates and optimizers for your SLM training pipeline.

1. What Is a Learning Rate?

The learning rate (LR) controls how much the model’s weights are adjusted during each optimization step.

Think of it as the volume knob of learning:

  • Too low: Training becomes extremely slow
  • Too high: Training becomes unstable or chaotic
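
To make this concrete, here is a minimal PyTorch sketch of a single plain gradient-descent step (an illustration of where the LR enters, not the full AdamW update): the learning rate simply scales how far the weights move along the gradient.

import torch

lr = 1e-4
weight = torch.randn(4, 4, requires_grad=True)

loss = (weight ** 2).sum()   # toy loss for illustration
loss.backward()

with torch.no_grad():
    weight -= lr * weight.grad   # the learning rate scales every update
    weight.grad.zero_()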

A typical stable range for 350M–1B parameter SLMs:

2e-4  → fast but riskier  
1e-4  → safe default  
5e-5  → slow but very stable  

✔ For LoRA fine-tuning

LoRA adapters tend to use higher learning rates because only a tiny fraction of parameters are trained.

LR = 2e-4 to 1e-3
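
As a sketch, assuming you fine-tune with the Hugging Face peft library (the model ID and target module names below are placeholders, not values from this series), the higher LR is just a config value:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder ID

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common attention projections; adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

args = TrainingArguments(output_dir="out", learning_rate=5e-4)  # within the 2e-4 to 1e-3 range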

2. What Happens When the Learning Rate Is Wrong?

LR Too High

  • Loss spikes
  • Training becomes noisy
  • Model starts forgetting earlier patterns
  • Outputs random tokens
  • Final performance collapses

LR Too Low

  • Loss decreases painfully slowly
  • Model can stall or end up memorizing narrow patterns instead of generalizing
  • Training takes days instead of hours

The perfect LR decreases loss smoothly without oscillations.

3. What Is an Optimizer?

The optimizer is the algorithm that updates the model’s weights based on:

  • the current gradients
  • running statistics of past gradients (momentum and variance)
  • stability terms (such as epsilon)
  • the current learning rate supplied by the scheduler

For SLM training, the most common options are:

A. Adam (legacy but common)

Simple, stable, and widely used, but memory-heavy (it keeps two extra moment buffers per parameter) and it couples weight decay into the gradient update.

B. AdamW (industry standard)

The default for LLMs and SLMs.

Benefits:

  • decoupled weight decay
  • more stable
  • faster convergence
  • better generalization

You are already using AdamW in your training code, and that is the right choice.
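
In PyTorch this is a one-line choice; the sketch below (values illustrative) shows where the decoupled weight decay and stability terms live:

import torch

model = torch.nn.Linear(512, 512)  # stand-in for your SLM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.999),   # running averages of gradient and squared gradient
    eps=1e-8,             # stability term
    weight_decay=0.01,    # decoupled weight decay, applied directly to the weights
)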

C. Adafactor (memory-efficient)

Used for extremely large models or low-VRAM GPUs.

Pros:

  • Significantly lower memory use
  • Good for training on laptops

Cons:

  • Slightly less stable
  • Requires expert tuning

Not needed for Granite-350M.
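
If you ever do need it, the Hugging Face Trainer can swap it in with a flag (a sketch, assuming the transformers Trainer API):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    optim="adafactor",    # replaces the default AdamW
    learning_rate=1e-4,   # Adafactor is LR-sensitive; tune carefully
)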

4. Learning Rate Schedulers (Critical for Stable Training)

Most training pipelines use a scheduler to adjust the LR automatically over the course of training.

Common patterns:

A. Linear Warmup → Stable LR → Linear Decay

Warmup (500–1000 steps)
Hold steady
Gradually decay to near 0

Warmup is crucial because:

  • It prevents early training instability
  • It helps LoRA adapters converge quickly
  • It avoids weight explosions
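
A sketch of this warmup-then-decay pattern using the transformers scheduler helpers (the optimizer and step counts are illustrative):

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # stand-in for your SLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# LR ramps linearly from 0 over 500 steps, then decays linearly toward 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# In the training loop, call scheduler.step() right after optimizer.step().
# For pattern B below, use get_cosine_schedule_with_warmup instead.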

B. Cosine Decay

A smooth curved decay pattern, popular for long runs.

C. Constant LR

Simple but not recommended for long training.

5. Recommended Settings for Training Your Excel SLM

These settings work extremely well for:

  • Granite 350M base
  • LoRA fine-tuning
  • synthetic Excel datasets
  • 5k–100k examples

✔ Base configuration

optimizer = AdamW
lr = 2e-4
warmup_steps = 200
lr_scheduler_type = "linear"
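
If you train with the Hugging Face Trainer, that base configuration maps directly onto TrainingArguments (a sketch showing only the tuning-related fields):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,
    warmup_steps=200,
    lr_scheduler_type="linear",
    weight_decay=0.01,       # AdamW's decoupled decay
    optim="adamw_torch",     # explicit AdamW
)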

✔ For larger datasets (20k–80k examples)

lr = 1e-4
warmup_steps = 500
lr_scheduler_type = "cosine"

✔ For very small batches (batch size = 1)

lr = 5e-5 to 1e-4
gradient_accumulation_steps = 8

This simulates a large batch: gradients from 8 small batches are accumulated before each weight update, giving an effective batch size of 1 × 8 = 8.

6. How to Inspect Learning Behavior (Loss Curve)

Plot the curve during training:

  • Loss should decrease steadily
  • Occasional small bumps are fine
  • Sharp spikes indicate LR too high
  • Flat line indicates LR too low

If you see oscillation like:

1.4 → 1.0 → 1.3 → 1.0 → 1.4

then your LR is too high.
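
A sketch of the simplest possible check, assuming you append each step’s loss.item() to a list during training (the values below are just the oscillating example above):

import matplotlib.pyplot as plt

losses = [1.4, 1.0, 1.3, 1.0, 1.4]  # replace with your recorded per-step losses

plt.plot(losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Smooth decay is healthy; spikes mean LR too high, flat means too low")
plt.savefig("loss_curve.png")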

7. Practical Advice for SLM Builders

✔ Don’t start with a large LR

Always begin with a conservative value.

✔ Use warmup

Even 100–200 steps dramatically stabilizes early training.

✔ Monitor loss from the first batch

You’ll know immediately if LR is too extreme.

✔ LoRA can handle bigger learning rates

Because it only updates small low-rank adapter matrices, not the full weights.

✔ Do not change LR mid-training

Unless you fully understand the consequences.

Conclusion

Learning rates and optimizers are the hidden engines behind SLM training. If you choose them well, your model becomes stable, accurate, and smart. If you choose them poorly, training collapses. The good news is that SLMs like Granite-350M respond extremely well to modest learning rates, a simple AdamW optimizer, and a linear warmup schedule.

Master these, and you’ll have full control over how your SLM learns.

Read the next article: “Batch Size & Gradient Accumulation — Training Efficiently on Limited Hardware”.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
