(Article #6 in the Build Your Own Small Language Model series)
Training a Small Language Model might look like a single monolithic process, but its success depends heavily on two key components:
- the learning rate, which defines how big each learning step is, and
- the optimizer, which defines how those steps are calculated.
If you choose these well, your SLM quickly becomes accurate, stable, and specialized. If you choose them poorly, the result is a model that:
- refuses to learn
- forgets earlier progress
- becomes unstable
- or outputs “hallucinated” formulas, code, or nonsense
This article breaks down the essentials you need to tune both learning rates and optimizers for your SLM training pipeline.
1. What Is a Learning Rate?
The learning rate (LR) controls how much the model’s weights are adjusted during each optimization step.
Think of it as the volume knob of learning:
- Too low: Training becomes extremely slow
- Too high: Training becomes unstable or chaotic
A typical stable range for 350M–1B parameter SLMs:
- 2e-4 → fast but riskier
- 1e-4 → safe default
- 5e-5 → slow but very stable
✔ For LoRA fine-tuning
LoRA adapters tend to use higher learning rates because only a tiny fraction of parameters are trained.
LR = 2e-4 to 1e-3
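As a rough illustration, here is how a higher LoRA learning rate might be wired up with the Hugging Face peft and transformers libraries (a minimal sketch; the model ID, output directory, and LoRA hyperparameters are placeholders, not recommendations):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder ID

# Only the small adapter matrices are trainable, so a higher LR is tolerated.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)

args = TrainingArguments(
    output_dir="lora-out",  # placeholder
    learning_rate=2e-4,     # within the 2e-4 to 1e-3 LoRA range above
)
```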
2. What Happens When the Learning Rate Is Wrong?
LR Too High
- Loss spikes
- Training becomes noisy
- Model starts forgetting earlier patterns
- Outputs random tokens
- Final performance collapses
LR Too Low
- Loss decreases painfully slowly
- Model memorizes instead of generalizing
- Training takes days instead of hours
The perfect LR decreases loss smoothly without oscillations.
3. What Is an Optimizer?
The optimizer is the algorithm that updates the model’s weights based on:
- the current gradients
- gradient history (momentum and variance estimates)
- stability terms (such as a small epsilon constant)
- the learning rate schedule
For SLM training, the most common options are:
A. Adam (legacy but common)
Simple, stable, widely used — but memory-heavy.
B. AdamW (industry standard)
The default for LLMs and SLMs.
Benefits:
- decoupled weight decay
- more stable
- faster convergence
- better generalization
You are already using AdamW in your training code — this is good.
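For reference, this is what instantiating AdamW looks like in plain PyTorch (a sketch; the weight-decay value is a common default, not a prescription):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for your SLM

# AdamW applies weight decay directly to the weights ("decoupled"),
# rather than folding it into the gradient as classic Adam + L2 does.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # the "safe default" from Section 1
    weight_decay=0.01,  # common default
)
```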
C. Adafactor (memory-efficient)
Used for extremely large models or low-VRAM GPUs.
Pros:
- Significantly lower memory use
- Good for training on laptops
Cons:
- Slightly less stable
- Requires expert tuning
Not needed for Granite-350M.
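If you ever do need it, transformers ships an Adafactor implementation. A minimal sketch, disabling Adafactor's internal schedule so you control the learning rate explicitly:

```python
import torch
from transformers import Adafactor

model = torch.nn.Linear(8, 8)  # stand-in for your SLM

# scale_parameter=False and relative_step=False turn off Adafactor's
# built-in adaptive schedule, so the explicit lr below is used as-is.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
)
```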
4. Learning Rate Schedulers (Critical for Stable Training)
Most training pipelines use schedulers to adjust LR automatically across training.
Common patterns:
A. Linear Warmup → Stable LR → Linear Decay
- Warmup (500–1000 steps)
- Hold steady
- Gradually decay to near 0
Warmup is crucial because:
- It prevents early training instability
- It helps LoRA adapters converge quickly
- It avoids weight explosions
B. Cosine Decay
A smooth curved decay pattern, popular for long runs.
C. Constant LR
Simple but not recommended for long training.
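All three patterns are available through the `get_scheduler` helper in transformers. A sketch, assuming an AdamW optimizer like the one from Section 3 (the model and step count are placeholders):

```python
import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # stand-in for your SLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

num_training_steps = 10_000  # placeholder; derive from dataset size and batch size

scheduler = get_scheduler(
    "linear",                # or "cosine", or "constant"
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps,
)

# In the training loop, step the scheduler right after the optimizer:
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```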
5. Recommended Settings for Training Your Excel SLM
These settings work extremely well for:
- Granite 350M base
- LoRA fine-tuning
- synthetic Excel datasets
- 5k–100k examples
✔ Base configuration
optimizer = AdamW
lr = 2e-4
warmup_steps = 200
lr_scheduler_type = "linear"
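Expressed as Hugging Face `TrainingArguments`, the base configuration might look like this (a sketch; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="excel-slm",      # placeholder
    optim="adamw_torch",         # AdamW
    learning_rate=2e-4,
    warmup_steps=200,
    lr_scheduler_type="linear",
)
```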
✔ For larger datasets (20k–80k examples)
lr = 1e-4
warmup_steps = 500
lr_scheduler_type = "cosine"
✔ For very small batches (batch size = 1)
lr = 5e-5 to 1e-4
gradient_accumulation_steps = 8
This simulates a larger batch: gradients from 8 micro-batches accumulate before each weight update, for an effective batch size of 1 × 8 = 8 (see the sketch below).
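In `TrainingArguments` terms (a sketch; only the accumulation-related fields are shown, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="excel-slm",          # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch = 1 × 8 = 8
    learning_rate=5e-5,
)
```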
6. How to Inspect Learning Behavior (Loss Curve)
Plot the curve during training:
- Loss should decrease steadily
- Occasional small bumps are fine
- Sharp spikes indicate LR too high
- Flat line indicates LR too low
If you see oscillation like:
1.4 → 1.0 → 1.3 → 1.0 → 1.4
your LR is too high.
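One quick way to eyeball this is to plot the logged loss. A sketch, assuming you train with the Hugging Face `Trainer` (which records losses in `trainer.state.log_history`) and that a `trainer` object already exists:

```python
import matplotlib.pyplot as plt

# Assumes `trainer` is an existing Hugging Face Trainer instance;
# entries with a "loss" key correspond to logged training steps.
history = trainer.state.log_history
steps = [e["step"] for e in history if "loss" in e]
losses = [e["loss"] for e in history if "loss" in e]

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Watch for sharp spikes (LR too high) or a flat line (LR too low)")
plt.show()
```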
7. Practical Advice for SLM Builders
✔ Don’t start with a large LR
Always begin with a conservative value.
✔ Use warmup
Even 100–200 steps dramatically stabilizes early training.
✔ Monitor loss from the first batch
You’ll know immediately if LR is too extreme.
✔ LoRA can handle bigger learning rates
Because it only updates small matrices.
✔ Do not change LR mid-training
Unless you fully understand the consequences.
Conclusion
Learning rates and optimizers are the hidden engines behind SLM training. If you choose them well, your model becomes stable, accurate, and smart. If you choose them poorly, training collapses. The good news is that SLMs like Granite-350M respond extremely well to modest learning rates, a simple AdamW optimizer, and a linear warmup schedule.
Master these, and you’ll have full control over how your SLM learns.
Read the next article: “Batch Size & Gradient Accumulation — Training Efficiently on Limited Hardware”