Understanding Loss Functions — How SLMs Measure Mistakes

(Article #8 in the Build Your Own Small Language Model series)

Every Small Language Model (SLM) learns through one mechanism: making mistakes and correcting them. To do that, the model needs a mathematical way to measure how wrong it was.

That measurement is called the loss function.

The loss determines:

  • how fast your model learns
  • how stable training is
  • whether the model collapses or improves
  • how well it generalizes (formula patterns, logic, reasoning)

In this article, we’ll explain loss functions in a simple, practical way — focusing on exactly what you need to know to train your own SLM.

1. What Is a Loss Function?

A loss function quantifies the difference between:

  • the correct output
  • the model’s predicted output

Loss is a single number:
lower = better.

If the model predicts the correct next token, loss is low.
If it predicts the wrong token, loss is high.

SLMs learn by minimizing loss — repeatedly and gradually.

2. The Most Important Loss Function for SLMs: Cross Entropy

Causal language models almost always use:

Cross Entropy Loss

Why?

  • It compares probability distributions
  • It penalizes incorrect tokens
  • It rewards confident, correct predictions
  • It works extremely well with token-based models
  • It’s stable and fully compatible with Transformers

Cross entropy calculates how “surprised” the model is by the correct answer.

If the model expected the correct answer → low surprise → low loss.
If the answer was unexpected → high surprise → high loss.

This is the core idea behind LLM learning.
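The "surprise" framing can be made concrete. Here is a minimal sketch in plain Python (the probabilities are made-up illustrative numbers): cross entropy for a single prediction is just the negative log of the probability the model assigned to the correct token.

```python
import math

def token_loss(probs, correct_token):
    # Cross entropy for one prediction: -log(probability given to the truth)
    return -math.log(probs[correct_token])

# Confident and correct -> low surprise -> low loss
print(token_loss({"North": 0.9, "Noth": 0.1}, "North"))    # ~0.11
# Truth was unexpected -> high surprise -> high loss
print(token_loss({"North": 0.05, "Noth": 0.95}, "North"))  # ~3.00
```

Notice the asymmetry: going from 90% to 5% confidence in the right answer multiplies the loss by roughly 30×.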

3. How Loss Is Computed in Training

Let’s say the target output is:

=SUMIF(B:B,"North",E:E)

But the model predicts:

=SUMIF(B:B,"Noth",E:E)

It missed one character (“North” → “Noth”).
Cross entropy loss will:

  • penalize the incorrect token
  • produce a larger gradient at that position
  • adjust weights to favor “North” next time

Loss is computed token by token, then averaged across the entire sequence.
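As a sketch (plain Python, illustrative probabilities): average the per-token surprises, and note how a single badly predicted token dominates the sequence loss.

```python
import math

def sequence_loss(correct_token_probs):
    # Per-token cross entropy, averaged across the sequence
    return sum(-math.log(p) for p in correct_token_probs) / len(correct_token_probs)

# Three tokens predicted confidently, one ("North") given only 2% probability:
print(sequence_loss([0.95, 0.90, 0.02, 0.90]))  # ~1.04 — one bad token dominates
```

This is why a formula that is right except for one rare token can still show a noticeably elevated loss.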

4. Typical Loss Values for SLM Training

Here’s what to expect when training a small model like Granite-350M:

| Phase | Typical Loss |
| --- | --- |
| Initial batches | 3.0 – 5.0 |
| 1–2 hours in | 1.2 – 2.0 |
| Well-trained domain model | 0.5 – 1.0 |
| Excellent specialization | 0.2 – 0.6 |

Loss never goes to zero — language is too complex and probabilistic.
But a downward trend indicates learning.
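One way to read these numbers (an aside, not part of the table above): exp(loss) gives perplexity, roughly the number of tokens the model is effectively "choosing between" at each step.

```python
import math

# Perplexity = exp(loss): the model's effective number of choices per token
for loss in (4.0, 1.5, 0.6):
    print(f"loss {loss} -> perplexity {math.exp(loss):.1f}")
# loss 4.0 -> ~54.6 choices; loss 1.5 -> ~4.5; loss 0.6 -> ~1.8
```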

5. What Does “Good Loss” Look Like?

✔ Downward curve (smooth decline)

Model is learning efficiently.

✔ Occasional small spikes

Normal — especially with small batch sizes.

✔ Plateau after long training

Model has learned all it can from the dataset.

❌ Huge oscillations

Learning rate too high.

❌ Flatline at high loss

Learning rate too low,
OR dataset too small,
OR training is broken.
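Spikes vs. trend can be hard to judge from raw batch losses. A minimal sketch (plain Python, synthetic numbers): smooth the curve with a moving average before deciding whether it is really declining.

```python
def smooth(losses, window=4):
    # Moving average: separates the real trend from batch-to-batch noise
    return [sum(losses[i:i + window]) / window
            for i in range(len(losses) - window + 1)]

raw = [3.0, 2.6, 2.9, 2.2, 2.5, 1.9, 2.1, 1.7]  # noisy, but declining
print(smooth(raw))  # the smoothed values fall steadily despite the spikes
```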

6. How to Interpret Loss for Your Excel SLM

For a domain like Excel formulas:

  • patterns are consistent
  • outputs are structured
  • vocabulary is small
  • syntax is predictable

This means:

  • loss drops fast
  • the model becomes specialized quickly
  • you hit “excellent” performance with fewer examples

This is why your Excel SLM can reach near-perfect results using:

  • 5,000–20,000 samples
  • clean templates
  • LoRA adapters
  • small model sizes (≤1B)

7. Loss vs Accuracy: They Are Not the Same

A common misunderstanding:

“Low loss means 100% accurate outputs.”

Not exactly.

Loss is continuous; accuracy is discrete.

Example:

Model outputs:

=FILTER(A:B,A:A="North")

Expected:

=FILTER(A:B,(A:A="North"))

The difference is tiny.
Loss is low.
By exact-match accuracy, the output is still “wrong.”

This is why you must evaluate your SLM on a separate benchmark set.
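Such a benchmark is usually scored with exact match: compare the generated formula to the reference string. A minimal sketch in plain Python, using the two formulas above:

```python
def exact_match(preds, targets):
    # Fraction of predictions that equal the target string exactly
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

preds   = ['=FILTER(A:B,A:A="North")']
targets = ['=FILTER(A:B,(A:A="North"))']
print(exact_match(preds, targets))  # 0.0: low loss, yet zero exact-match accuracy
```

In practice you may also want a normalized comparison (stripping optional parentheses and whitespace), but that is a design choice per benchmark.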

8. How to Improve Loss During Training

✔ Clean your dataset

Remove formatting inconsistencies — they increase token confusion.

✔ Increase gradient accumulation

Stabilizes updates.

✔ Lower the learning rate

Especially if loss spikes.

✔ Increase training samples

More variety = better pattern learning.

✔ Use consistent templates

Better structure → fewer tokens → tighter loss.

✔ Use warmup

Reduces early instability.
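The gradient-accumulation tip above can be sketched numerically (plain Python, a single scalar weight and made-up gradients): several micro-batch gradients are averaged into one update, which behaves like one larger, steadier batch.

```python
def accumulated_step(weight, micro_batch_grads, lr=0.01):
    # Average gradients from several micro-batches, then apply ONE update —
    # numerically equivalent to a single larger batch
    g = sum(micro_batch_grads) / len(micro_batch_grads)
    return weight - lr * g

# Four noisy micro-batch gradients collapse into one smooth step
print(accumulated_step(0.5, [0.8, 1.2, 0.9, 1.1]))  # one update with mean grad 1.0
```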

9. Loss in LoRA Fine-Tuning (Very Important)

LoRA modifies only a small set of weights.

This means:

  • loss decreases faster
  • fewer training steps are needed
  • catastrophic forgetting is less likely
  • VRAM use is lower
  • training is faster

Most real improvements come from dataset quality, not training volume.
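At its core, LoRA adds a low-rank update to a frozen weight: y = Wx + (α/r)·B·A·x, where only A and B are trained. A toy sketch in plain Python (made-up 3×3 weight, rank 1; real fine-tuning uses a library such as PEFT rather than hand-rolled matrices):

```python
def matvec(M, v):
    # Plain-Python matrix–vector product
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=2, r=1):
    # Frozen base weight W plus trainable low-rank update B @ A, scaled by alpha/r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

x = [1.0, 2.0, 3.0]
W = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]  # frozen: 9 params
A = [[0.5, 0.5, 0.5]]        # trainable, rank 1 (3 params)
B = [[0.0], [0.0], [0.0]]    # zero-initialized: the adapter starts as a no-op
print(lora_forward(x, W, A, B))  # equals W @ x at step 0 (adapter adds nothing yet)
```

The zero-initialized B is the standard LoRA trick: training starts exactly at the base model's behavior and only drifts as far as the data pushes it.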

Conclusion

Loss functions are the compass guiding your SLM through training. When you understand how loss works — and what “good loss” looks like — you gain the ability to:

  • tune learning rates
  • adjust batch sizes
  • design better datasets
  • detect training instability
  • know when your model is done

Loss isn’t just a number.
It’s the heartbeat of your model’s learning journey.

Read the next article in the series: “Overfitting vs Underfitting — Finding the Sweet Spot in SLM Training.”

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
