(Article #6 in the Build Your Own Small Language Model series)
Training a Small Language Model might look like a single monolithic process, but its success depends heavily on two key components:
- the learning rate, which defines how big each learning step is, and
- the optimizer, which defines how those steps are calculated.
If you choose these well, your SLM quickly becomes accurate, stable, and specialized. If you choose them poorly, the result is a model that:
- refuses to learn
- forgets earlier progress
- becomes unstable
- or outputs “hallucinated” formulas, code, or nonsense
This article breaks down the essentials you need to tune both learning rates and optimizers for your SLM training pipeline.
1. What Is a Learning Rate?
The learning rate (LR) controls how much the model’s weights are adjusted during each optimization step.
Think of it as the volume knob of learning:
- Too low: Training becomes extremely slow
- Too high: Training becomes unstable or chaotic
A typical stable range for 350M–1B parameter SLMs:
- 2e-4 → fast but riskier
- 1e-4 → safe default
- 5e-5 → slow but very stable
✔ For LoRA fine-tuning
LoRA adapters tend to use higher learning rates because only a tiny fraction of parameters are trained.
LR = 2e-4 to 1e-3
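As a rough illustration, here is how a higher LoRA learning rate might be wired up with the Hugging Face peft and transformers libraries (a minimal sketch; the model ID, output directory, and LoRA hyperparameters are placeholders, not recommendations):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder ID

# Only the small adapter matrices are trainable, so a higher LR is tolerated.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)

args = TrainingArguments(
    output_dir="lora-out",  # placeholder
    learning_rate=2e-4,     # within the 2e-4 to 1e-3 LoRA range above
)
```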
2. What Happens When the Learning Rate Is Wrong?
LR Too High
- Loss spikes
- Training becomes noisy
- Model starts forgetting earlier patterns
- Outputs random tokens
- Final performance collapses
LR Too Low
- Loss decreases painfully slowly
- Model memorizes instead of generalizing
- Training takes days instead of hours
The perfect LR decreases loss smoothly without oscillations.
3. What Is an Optimizer?
The optimizer is the algorithm that updates the model’s weights based on:
- the current gradients
- gradient history (momentum and variance estimates)
- stability terms (such as a small epsilon constant)
- the learning rate schedule
For SLM training, the most common options are:
A. Adam (legacy but common)
Simple, stable, widely used — but memory-heavy.
B. AdamW (industry standard)
The default for LLMs and SLMs.
Benefits:
- decoupled weight decay
- more stable
- faster convergence
- better generalization
You are already using AdamW in your training code — this is good.
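For reference, this is what instantiating AdamW looks like in plain PyTorch (a sketch; the weight-decay value is a common default, not a prescription):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for your SLM

# AdamW applies weight decay directly to the weights ("decoupled"),
# rather than folding it into the gradient as classic Adam + L2 does.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # the "safe default" from Section 1
    weight_decay=0.01,  # common default
)
```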
C. Adafactor (memory-efficient)
Used for extremely large models or low-VRAM GPUs.
Pros:
- Significantly lower memory use
- Good for training on laptops
Cons:
- Slightly less stable
- Requires expert tuning
Not needed for Granite-350M.
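If you ever do need it, transformers ships an Adafactor implementation. A minimal sketch, disabling Adafactor's internal schedule so you control the learning rate explicitly:

```python
import torch
from transformers import Adafactor

model = torch.nn.Linear(8, 8)  # stand-in for your SLM

# scale_parameter=False and relative_step=False turn off Adafactor's
# built-in adaptive schedule, so the explicit lr below is used as-is.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
)
```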
4. Learning Rate Schedulers (Critical for Stable Training)
Most training pipelines use schedulers to adjust LR automatically across training.
Common patterns:
A. Linear Warmup → Stable LR → Linear Decay
- Warmup (500–1000 steps)
- Hold steady
- Gradually decay to near 0
Warmup is crucial because:
- It prevents early training instability
- It helps LoRA adapters converge quickly
- It avoids weight explosions
B. Cosine Decay
A smooth curved decay pattern, popular for long runs.
C. Constant LR
Simple but not recommended for long training.
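All three patterns are available through the `get_scheduler` helper in transformers. A sketch, assuming an AdamW optimizer like the one from Section 3 (the model and step count are placeholders):

```python
import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # stand-in for your SLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

num_training_steps = 10_000  # placeholder; derive from dataset size and batch size

scheduler = get_scheduler(
    "linear",                # or "cosine", or "constant"
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps,
)

# In the training loop, step the scheduler right after the optimizer:
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```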
5. Recommended Settings for Training Your Excel SLM
These settings work extremely well for:
- Granite 350M base
- LoRA fine-tuning
- synthetic Excel datasets
- 5k–100k examples
✔ Base configuration
optimizer = AdamW
lr = 2e-4
warmup_steps = 200
lr_scheduler_type = "linear"
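Expressed as Hugging Face `TrainingArguments`, the base configuration might look like this (a sketch; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="excel-slm",      # placeholder
    optim="adamw_torch",         # AdamW
    learning_rate=2e-4,
    warmup_steps=200,
    lr_scheduler_type="linear",
)
```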
✔ For larger datasets (20k–80k examples)
lr = 1e-4
warmup_steps = 500
lr_scheduler_type = "cosine"
✔ For very small batches (batch size = 1)
lr = 5e-5 to 1e-4
gradient_accumulation_steps = 8
This simulates a larger batch: gradients from 8 micro-batches accumulate before each weight update, for an effective batch size of 1 × 8 = 8 (see the sketch below).
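In `TrainingArguments` terms (a sketch; only the accumulation-related fields are shown, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="excel-slm",          # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch = 1 × 8 = 8
    learning_rate=5e-5,
)
```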
6. How to Inspect Learning Behavior (Loss Curve)
Plot the curve during training:
- Loss should decrease steadily
- Occasional small bumps are fine
- Sharp spikes indicate LR too high
- Flat line indicates LR too low
If you see oscillation like:
1.4 → 1.0 → 1.3 → 1.0 → 1.4
your LR is too high.
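One quick way to eyeball this is to plot the logged loss. A sketch, assuming you train with the Hugging Face `Trainer` (which records losses in `trainer.state.log_history`) and that a `trainer` object already exists:

```python
import matplotlib.pyplot as plt

# Assumes `trainer` is an existing Hugging Face Trainer instance;
# entries with a "loss" key correspond to logged training steps.
history = trainer.state.log_history
steps = [e["step"] for e in history if "loss" in e]
losses = [e["loss"] for e in history if "loss" in e]

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Watch for sharp spikes (LR too high) or a flat line (LR too low)")
plt.show()
```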
7. Practical Advice for SLM Builders
✔ Don’t start with a large LR
Always begin with a conservative value.
✔ Use warmup
Even 100–200 steps dramatically stabilizes early training.
✔ Monitor loss from the first batch
You’ll know immediately if LR is too extreme.
✔ LoRA can handle bigger learning rates
Because it only updates small matrices.
✔ Do not change LR mid-training
Unless you fully understand the consequences.
Conclusion
Learning rates and optimizers are the hidden engines behind SLM training. If you choose them well, your model becomes stable, accurate, and smart. If you choose them poorly, training collapses. The good news is that SLMs like Granite-350M respond extremely well to modest learning rates, a simple AdamW optimizer, and a linear warmup schedule.
Master these, and you’ll have full control over how your SLM learns.
Read the next article: “Batch Size & Gradient Accumulation — Training Efficiently on Limited Hardware”