(Article #9 in the Build Your Own Small Language Model series)
Training a Small Language Model (SLM) is a balancing act. Train too little and the model remains weak. Train too much and the model becomes brittle, overly specific, or unable to generalize. These two failure modes are known as:
- underfitting
- overfitting
Finding the balance between them is one of the defining skills in SLM engineering. This article explains how both happen, how to detect them, and how to avoid them when training models like Granite-350M, Phi-2, or TinyLlama on domain-specific tasks.
1. What Is Underfitting?
Underfitting means the model hasn’t learned enough from your data.
It hasn’t captured the patterns, structure, or logic required to perform your task reliably.
Signs of Underfitting
- Loss stays high and barely improves
- Outputs are inconsistent or incomplete
- Syntax errors occur frequently
- Model behaves like the base model, not your specialized version
- Training accuracy is low
- Evaluation accuracy is also low
Causes of Underfitting
- Too little training time
- Dataset too small
- Dataset too diverse
- Learning rate too low
- Weak training signals (messy formatting, noisy examples)
- Poor prompt structure
Example (Excel SLM)
Model still outputs:
=SUM(A:A)
when asked for:
“Sum values in E where B="North"”
This shows it hasn’t internalized task-specific patterns yet.
2. What Is Overfitting?
Overfitting means the model memorized training data too closely and cannot generalize.
Instead of learning rules, it learns:
- exact examples
- exact phrase patterns
- exact column letters
- your dataset’s biases
Signs of Overfitting
- Training loss is very low
- Evaluation loss is much higher
- Model repeats training patterns verbatim
- Small perturbations in input cause large errors
- Model fails on edge cases
- Formulas look “memorized” rather than adapted
Example (Excel SLM)
You trained using A:A, B:B, C:C extensively.
Now the model outputs:
=SUMIF(B:B,"North",E:E)
even when asked for different column combinations.
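One quick way to surface this failure is a perturbation probe: ask for the same operation with different column letters and check whether the outputs actually change. The sketch below is illustrative; `generate` stands in for any prompt-to-formula function (your model behind an inference call), and the stub model is hypothetical:

```python
def perturbation_check(generate, base_prompt, variants):
    """Overfitting probe: the same request with different column
    letters should yield different formulas. Returns False when
    every output is identical, which is suspicious."""
    outputs = {p: generate(p) for p in [base_prompt, *variants]}
    return len(set(outputs.values())) > 1

# Stub standing in for an overfit model that ignores the prompt:
def overfit_model(prompt):
    return '=SUMIF(B:B,"North",E:E)'   # same formula no matter the prompt

print(perturbation_check(
    overfit_model,
    'Sum E where B="North"',
    ['Sum F where C="North"', 'Sum G where D="West"'],
))  # False -> outputs identical despite varied prompts
```

A healthy model passes this probe because changing the requested columns changes the generated formula.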
3. Why Small Models Are More Sensitive
Small Language Models (≤1B parameters) have:
- limited capacity
- smaller embedding layers
- fewer attention heads and layers
- restricted context windows
This means they:
- underfit fast (if dataset too small)
- overfit fast (if dataset too repetitive)
- reach their “optimal capacity” quickly
This is good news: your training cycles are shorter, faster to iterate on, and easier to tune.
4. How to Detect Underfitting and Overfitting
✔ Track training loss vs validation loss
| Condition | Training Loss | Validation Loss |
|---|---|---|
| Underfitting | high | high |
| Overfitting | low | high |
| Good Training | low | low (close to training) |
A small gap between the two is normal; if validation loss keeps rising while training loss keeps falling, you’re overfitting.
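The table above can be turned into a simple diagnostic you run at each checkpoint. This is a sketch: the `high` and `gap` thresholds are illustrative, since absolute loss values depend on your tokenizer and data.

```python
def diagnose(train_loss: float, val_loss: float,
             high: float = 2.0, gap: float = 0.5) -> str:
    """Rough fit diagnosis from the latest train/validation losses.

    `high` and `gap` are illustrative thresholds -- tune them for
    your task; absolute loss values are not comparable across setups.
    """
    if train_loss >= high and val_loss >= high:
        return "underfitting"        # both losses stuck high
    if val_loss - train_loss >= gap:
        return "overfitting"         # validation diverging from training
    return "good"                    # both low and close together

print(diagnose(2.8, 2.9))  # underfitting
print(diagnose(0.4, 1.6))  # overfitting
print(diagnose(0.9, 1.0))  # good
```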
✔ Use a gold benchmark dataset
A small set of 50–200 high-quality examples that are never used in training lets you check generalization without ambiguity.
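The important detail is that the gold set is carved out once, reproducibly, and never touches training. A minimal sketch:

```python
import random

def split_gold_benchmark(examples, gold_size=100, seed=42):
    """Carve a fixed, held-out gold set from a list of examples.

    The fixed seed keeps the split reproducible, so the gold set
    never leaks into training across runs.
    """
    rng = random.Random(seed)
    shuffled = examples[:]          # copy; leave the original untouched
    rng.shuffle(shuffled)
    gold = shuffled[:gold_size]     # never trained on
    train = shuffled[gold_size:]
    return train, gold

data = [f"example-{i}" for i in range(1000)]
train, gold = split_gold_benchmark(data, gold_size=100)
assert len(gold) == 100 and len(train) == 900
assert not set(gold) & set(train)   # no overlap between splits
```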
✔ Inspect qualitative outputs
Overfit models:
- repeat exact instructions
- reuse specific training formats
- output the same columns repeatedly
Underfit models:
- output random formulas
- skip logic
- produce syntax errors
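The overfitting signs above can be quantified with two cheap metrics over a batch of generations: how many outputs are copied verbatim from training targets, and how diverse the outputs are. A sketch (thresholds for "too high" are up to you):

```python
def inspect_outputs(outputs, training_outputs):
    """Flag signs of memorization in a batch of model outputs.

    `outputs` are generations for *varied* prompts; if many of them
    match training targets verbatim, or collapse to one formula,
    the model is likely overfit.
    """
    train_set = set(training_outputs)
    verbatim = sum(1 for o in outputs if o in train_set)
    distinct = len(set(outputs))
    return {
        "verbatim_rate": verbatim / len(outputs),  # share copied from training
        "distinct_rate": distinct / len(outputs),  # output diversity
    }

report = inspect_outputs(
    ['=SUMIF(B:B,"North",E:E)'] * 4,               # same formula every time
    ['=SUMIF(B:B,"North",E:E)', '=SUM(A:A)'],
)
print(report)  # verbatim_rate 1.0, distinct_rate 0.25 -> overfitting signature
```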
5. How to Prevent Underfitting
✔ Use enough training samples
For a domain like Excel:
- 5,000 samples → minimum
- 10,000–40,000 → ideal
- 80,000+ → excellent for multi-task Excel SLMs
✔ Increase training steps
Most SLMs need:
- 1,000–8,000 steps for small datasets
- 10,000–40,000 steps for large datasets
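Step counts follow directly from dataset size, effective batch size, and epoch count, so you can sanity-check a planned run against these ranges before launching it. A small helper (the example numbers are illustrative):

```python
import math

def training_steps(n_samples: int, effective_batch: int, epochs: int) -> int:
    """Total optimizer steps for a run: steps per epoch * epochs."""
    return math.ceil(n_samples / effective_batch) * epochs

# 10k-sample dataset, effective batch of 32, 4 epochs:
print(training_steps(10_000, 32, 4))  # 1252 -> inside the 1,000-8,000 range
```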
✔ Increase model capacity a bit
A 700M or 1.3B model underfits less on messy datasets.
✔ Improve dataset consistency
Stable formatting reduces cognitive load on the model.
6. How to Prevent Overfitting
✔ Add variation
Don’t use the same column letters or values repeatedly.
✔ Use synthetic diversity
Even tiny changes prevent memorization:
- random numbers
- random labels (“North”, “West”, “Online”, “Active”)
- different ranges
- different delimiters
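A small generator can inject this variation automatically. The sketch below randomizes columns and labels for SUMIF training pairs; the `COLUMNS` and `LABELS` pools are illustrative and should be widened for a real dataset:

```python
import random

COLUMNS = list("ABCDEFGH")                       # illustrative pool
LABELS = ["North", "South", "West", "Online", "Active"]

def make_sumif_sample(rng: random.Random) -> dict:
    """One randomized SUMIF training pair: varied columns and labels
    so the model learns the rule, not one fixed formula."""
    crit_col, sum_col = rng.sample(COLUMNS, 2)   # two distinct columns
    label = rng.choice(LABELS)
    return {
        "prompt": f'Sum values in {sum_col} where {crit_col}="{label}"',
        "target": f'=SUMIF({crit_col}:{crit_col},"{label}",{sum_col}:{sum_col})',
    }

rng = random.Random(0)
for sample in (make_sumif_sample(rng) for _ in range(5)):
    print(sample["target"])
```

The same pattern extends to values, ranges, and delimiters: every axis you randomize is one less surface pattern the model can memorize.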
✔ Increase data size
More samples = better generalization.
✔ Use early stopping
If validation loss increases for 3–5 checkpoints → stop.
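That rule is easy to implement as a checkpoint-time check. A minimal sketch, assuming you record validation loss at every checkpoint:

```python
def should_stop(val_losses, patience=4):
    """Early-stopping check: True once validation loss has risen
    for `patience` consecutive checkpoints (3-5 is typical)."""
    rises = 0
    for prev, curr in zip(val_losses, val_losses[1:]):
        rises = rises + 1 if curr > prev else 0
        if rises >= patience:
            return True
    return False

print(should_stop([1.2, 1.0, 0.9, 0.92, 0.95, 0.99, 1.05], patience=4))  # True
print(should_stop([1.2, 1.0, 0.9, 0.91, 0.89], patience=4))              # False
```

If you train with the Hugging Face Trainer, its built-in `EarlyStoppingCallback` implements the same idea via an `early_stopping_patience` argument.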
✔ Lower the learning rate
An overly high learning rate can push the model to latch onto specific training patterns instead of general rules.
7. How to Strike the Perfect Balance
Your SLM is properly trained when:
- training loss is low
- evaluation loss is also low
- benchmark results are consistent
- outputs vary appropriately
- outputs generalize to unseen cases
- no memorized patterns appear
This is the sweet spot — the point where the SLM understands the rules, not just the examples.
Recommended Settings for Granite-350M Excel SLM Training
✔ Dataset size
10k–80k examples
✔ Effective batch size
32–128 sequences (via gradient accumulation)
✔ Learning rate
1e-4 or 2e-4
✔ Sequence length
128–256 tokens
✔ Evaluation every
200–500 training steps
✔ Stop training when
Validation loss plateaus or increases
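Collected as a single config, the settings above might look like this. The key names are illustrative; map them onto your trainer's arguments (e.g. Hugging Face `TrainingArguments`) as appropriate:

```python
# Recommended Granite-350M Excel SLM settings as a config dict.
# Key names are illustrative, not a specific trainer's API.
config = {
    "dataset_size": (10_000, 80_000),   # examples
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 8,   # 8 * 8 = 64 effective sequences
    "effective_batch_size": 64,         # within the 32-128 range
    "learning_rate": 2e-4,              # 1e-4 or 2e-4
    "max_seq_length": 256,              # tokens
    "eval_steps": 250,                  # evaluate every 200-500 steps
    "stop_when": "validation loss plateaus or increases",
}

# Sanity check: accumulation actually yields the target batch size.
assert (config["per_device_batch_size"]
        * config["gradient_accumulation_steps"]
        == config["effective_batch_size"])
```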
Conclusion
Overfitting and underfitting are two sides of the same challenge: making sure your SLM learns just enough, but not too much.
If you monitor loss curves, maintain clean datasets, introduce variation, and evaluate regularly, you can train small models that:
- generalize well
- stay stable
- avoid hallucination
- and deliver accurate domain-specific behavior
Small models are powerful — but only when trained in balance.