(Article #8 in the Build Your Own Small Language Model series)
Every Small Language Model (SLM) learns through one mechanism: making mistakes and correcting them. To do that, the model needs a mathematical way to measure how wrong it was.
That measurement is called the loss function.
The loss determines:
- how fast your model learns
- how stable training is
- whether the model collapses or improves
- how well it generalizes (formula patterns, logic, reasoning)
In this article, we’ll explain loss functions in a simple, practical way — focusing on exactly what you need to know to train your own SLM.
1. What Is a Loss Function?
A loss function quantifies the difference between:
- the correct output
- the model’s predicted output
Loss is a single number:
lower = better.
If the model predicts the correct next token, loss is low.
If it predicts the wrong token, loss is high.
SLMs learn by minimizing loss — repeatedly and gradually.
2. The Most Important Loss Function for SLMs: Cross Entropy
Causal language models almost always use:
✔ Cross Entropy Loss
Why?
- It compares probability distributions
- It penalizes incorrect tokens
- It rewards confident, correct predictions
- It works extremely well with token-based models
- It’s stable and fully compatible with Transformers
Cross entropy calculates how “surprised” the model is by the correct answer.
If the model expected the correct answer → low surprise → low loss.
If the answer was unexpected → high surprise → high loss.
This is the core idea behind LLM learning.
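The "surprise" idea fits in a few lines of plain Python. Cross entropy at one position is just the negative log of the probability the model gave the correct token (`token_loss` here is an illustrative helper, not a library function):

```python
import math

def token_loss(p_correct: float) -> float:
    """Cross entropy contribution of one position: the negative log
    probability the model assigned to the correct token."""
    return -math.log(p_correct)

# Confident and correct: low surprise, low loss.
print(token_loss(0.95))   # ~0.05

# The correct token was unexpected: high surprise, high loss.
print(token_loss(0.02))   # ~3.91
```

Notice how the penalty grows sharply as the probability of the correct token shrinks: that asymmetry is what pushes the model toward confident, correct predictions.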
3. How Loss Is Computed in Training
Let’s say the target output is:
=SUMIF(B:B,"North",E:E)
But the model predicts:
=SUMIF(B:B,"Noth",E:E)
It missed one character (“North” → “Noth”).
Cross entropy loss will:
- penalize the incorrect token
- produce a large gradient at the position of the error
- adjust weights to favor “North” next time
Loss is computed token by token, then averaged across the entire sequence.
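The token-by-token averaging can be sketched in plain Python. The probabilities below are made up for illustration; they stand in for what a model might assign to each correct target token of the SUMIF formula:

```python
import math

# Hypothetical probabilities the model assigned to each correct target
# token of =SUMIF(B:B,"North",E:E). All values are illustrative.
p_correct = [0.98, 0.97, 0.99, 0.96, 0.95, 0.10, 0.97, 0.98]
#                                          ^^^^  the "North" token,
#                                          where the model preferred "Noth"

per_token_loss = [-math.log(p) for p in p_correct]
sequence_loss = sum(per_token_loss) / len(per_token_loss)

print(round(sequence_loss, 3))  # one bad token dominates the average
```

Seven near-perfect tokens contribute almost nothing; the single "Noth" mistake accounts for most of the sequence loss. That is exactly the signal gradient descent uses to fix it.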
4. Typical Loss Values for SLM Training
Here’s what to expect when training a small model like Granite-350M:
| Phase | Typical Loss |
|---|---|
| Initial batches | 3.0 – 5.0 |
| 1–2 hours in | 1.2 – 2.0 |
| Well-trained domain model | 0.5 – 1.0 |
| Excellent specialization | 0.2 – 0.6 |
Loss never goes to zero — language is too complex and probabilistic.
But a downward trend indicates learning.
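A handy companion to these numbers is perplexity, which is just `exp(loss)` when loss is measured in nats. Roughly, it is the number of tokens the model is "choosing between" at each step, so it turns the abstract loss values in the table into something more intuitive:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is exp(loss): roughly how many tokens the model
    is still 'choosing between' at each step."""
    return math.exp(loss)

for loss in (4.0, 1.5, 0.5):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.1f}")
```

A loss of 4.0 means the model is hesitating between dozens of tokens; at 0.5 it is nearly certain of the next one.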
5. What Does “Good Loss” Look Like?
✔ Downward curve (smooth decline)
Model is learning efficiently.
✔ Occasional small spikes
Normal — especially with small batch sizes.
✔ Plateau after long training
Model has learned all it can from the dataset.
❌ Huge oscillations
Learning rate too high.
❌ Flatline at high loss
Learning rate too low
OR dataset is too small
OR training broke.
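These curve shapes can be turned into rough automated checks. The sketch below uses made-up thresholds (`spike_factor`, `flat_eps`) purely for illustration; real monitoring tools are more sophisticated, but the logic is the same:

```python
def diagnose(losses, spike_factor=1.5, flat_eps=0.01):
    """Very rough heuristics for the curve shapes above.
    Thresholds are illustrative, not standard values."""
    avg = sum(losses) / len(losses)
    spikes = sum(1 for x in losses if x > spike_factor * avg)
    drop = losses[0] - losses[-1]
    if spikes > len(losses) // 4:
        return "oscillating: try lowering the learning rate"
    if abs(drop) < flat_eps and avg > 3.0:
        return "flatline at high loss: check lr, data, and the training loop"
    return "learning: loss trending down" if drop > 0 else "plateau"

print(diagnose([4.8, 3.9, 3.1, 2.4, 1.9, 1.6]))  # smooth downward curve
```

Running it on a healthy curve reports learning; feeding it a flat `[4.0, 4.0, 4.0, 4.0]` reports the flatline case.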
6. How to Interpret Loss for Your Excel SLM
For a domain like Excel formulas:
- patterns are consistent
- outputs are structured
- vocabulary is small
- syntax is predictable
This means:
- loss drops fast
- the model becomes specialized quickly
- you hit “excellent” performance with fewer examples
This is why your Excel SLM can reach near-perfect results using:
- 5,000–20,000 samples
- clean templates
- LoRA adapters
- small model sizes (≤1B)
7. Loss vs Accuracy: They Are Not the Same
A common misunderstanding:
“Low loss means 100% accurate outputs.”
Not exactly.
Loss is continuous, accuracy is discrete.
Example:
Model outputs:
=FILTER(A:B,A:A="North")
Expected:
=FILTER(A:B,(A:A="North"))
The difference is tiny, so the loss is low.
But exact-match accuracy counts the output as wrong.
This is why you must evaluate your SLM on a separate benchmark set.
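The continuous-vs-discrete gap is easy to demonstrate. Below, exact-match scoring gives the FILTER example a zero, while a similarity ratio (a crude stand-in for "how close the model was", not how loss is actually computed) shows the outputs nearly agree:

```python
import difflib

predicted = '=FILTER(A:B,A:A="North")'
expected  = '=FILTER(A:B,(A:A="North"))'

# Exact-match accuracy is all-or-nothing: this pair scores 0.
exact_match = predicted == expected
print(exact_match)  # False

# Loss, by contrast, is continuous. A similarity ratio shows
# the two strings differ by just a pair of parentheses:
similarity = difflib.SequenceMatcher(None, predicted, expected).ratio()
print(round(similarity, 2))
```

This is why a benchmark that only checks exact string equality can undersell a model whose loss curve looks excellent.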
8. How to Improve Loss During Training
✔ Clean your dataset
Remove formatting inconsistencies — they increase token confusion.
✔ Increase gradient accumulation
Stabilizes updates.
✔ Lower the learning rate
Especially if loss spikes.
✔ Increase training samples
More variety = better pattern learning.
✔ Use consistent templates
Better structure → fewer tokens → tighter loss.
✔ Use warmup
Reduces early instability.
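Warmup is usually implemented as a learning-rate schedule. Here is a minimal linear warmup-then-decay sketch; the base rate and step counts are illustrative values, not recommendations for any specific model:

```python
def lr_at(step, base_lr=2e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup from 0 to base_lr, then linear decay back to 0.
    All hyperparameters here are illustrative."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up gently
    remaining = total_steps - step                    # then decay
    return base_lr * max(remaining, 0) / (total_steps - warmup_steps)

print(lr_at(10))    # early step: tiny lr, avoids violent first updates
print(lr_at(100))   # end of warmup: full base_lr
print(lr_at(550))   # mid-training: halfway through the decay
```

Starting small is what tames the large, noisy gradients of the first batches; the decay afterward helps the model settle into a minimum.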
9. Loss in LoRA Fine-Tuning (Very Important)
LoRA modifies only a small set of weights.
This means:
- loss decreases faster
- it requires fewer steps
- it is less prone to catastrophic forgetting
- lower VRAM use
- faster training
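A back-of-the-envelope calculation shows why training "a small set of weights" translates into faster steps and lower VRAM. LoRA freezes a weight matrix W and instead trains two thin matrices A and B whose product is added to it; the dimensions below are illustrative:

```python
# Instead of updating a d x d weight matrix W directly, LoRA trains
# two low-rank matrices A (r x d) and B (d x r) and adds B @ A to W.
d, r = 1024, 8          # illustrative hidden size and LoRA rank

full_params = d * d              # parameters if we updated W itself
lora_params = (r * d) + (d * r)  # parameters in A and B combined

print(full_params)   # 1_048_576
print(lora_params)   # 16_384
print(round(100 * lora_params / full_params, 1), "% of full")
```

At rank 8, the adapter holds under 2% of the parameters of the full matrix, which is why loss drops in fewer steps and the frozen base model is protected from catastrophic forgetting.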
Even with LoRA, most real improvements come from dataset quality, not training volume.
Conclusion
Loss functions are the compass guiding your SLM through training. When you understand how loss works — and what “good loss” looks like — you gain the ability to:
- tune learning rates
- adjust batch sizes
- design better datasets
- detect training instability
- know when your model is done
Loss isn’t just a number.
It’s the heartbeat of your model’s learning journey.
Read the next article in the series: “Overfitting vs Underfitting — Finding the Sweet Spot in SLM Training”