(Article #8 in the Build Your Own Small Language Model series)
Every Small Language Model (SLM) learns through one mechanism: making mistakes and correcting them. To do that, the model needs a mathematical way to measure how wrong it was.
That measurement is called the loss function.
The loss determines:
- how fast your model learns
- how stable training is
- whether the model collapses or improves
- how well it generalizes (formula patterns, logic, reasoning)
In this article, we’ll explain loss functions in a simple, practical way — focusing on exactly what you need to know to train your own SLM.
1. What Is a Loss Function?
A loss function quantifies the difference between:
- the correct output
- the model’s predicted output
Loss is a single number:
lower = better.
If the model predicts the correct next token, loss is low.
If it predicts the wrong token, loss is high.
SLMs learn by minimizing loss — repeatedly and gradually.
2. The Most Important Loss Function for SLMs: Cross Entropy
Causal language models almost always use:
✔ Cross Entropy Loss
Why?
- It compares probability distributions
- It penalizes incorrect tokens
- It rewards confident, correct predictions
- It works extremely well with token-based models
- It’s stable and fully compatible with Transformers
Cross entropy calculates how “surprised” the model is by the correct answer.
If the model expected the correct answer → low surprise → low loss.
If the answer was unexpected → high surprise → high loss.
This is the core idea behind LLM learning.
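The "surprise" idea fits in a few lines of plain Python. Cross entropy at one position is just the negative log of the probability the model gave the correct token (`token_loss` here is an illustrative helper, not a library function):

```python
import math

def token_loss(p_correct: float) -> float:
    """Cross entropy contribution of one position: the negative log
    probability the model assigned to the correct token."""
    return -math.log(p_correct)

# Confident and correct: low surprise, low loss.
print(token_loss(0.95))   # ~0.05

# The correct token was unexpected: high surprise, high loss.
print(token_loss(0.02))   # ~3.91
```

Notice how the penalty grows sharply as the probability of the correct token shrinks: that asymmetry is what pushes the model toward confident, correct predictions.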
3. How Loss Is Computed in Training
Let’s say the target output is:
=SUMIF(B:B,"North",E:E)
But the model predicts:
=SUMIF(B:B,"Noth",E:E)
It missed one character (“North” → “Noth”).
Cross entropy loss will:
- penalize the incorrect token
- produce a large gradient at the position of the error
- adjust weights to favor “North” next time
Loss is computed token by token, then averaged across the entire sequence.
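The token-by-token averaging can be sketched in plain Python. The probabilities below are made up for illustration; they stand in for what a model might assign to each correct target token of the SUMIF formula:

```python
import math

# Hypothetical probabilities the model assigned to each correct target
# token of =SUMIF(B:B,"North",E:E). All values are illustrative.
p_correct = [0.98, 0.97, 0.99, 0.96, 0.95, 0.10, 0.97, 0.98]
#                                          ^^^^  the "North" token,
#                                          where the model preferred "Noth"

per_token_loss = [-math.log(p) for p in p_correct]
sequence_loss = sum(per_token_loss) / len(per_token_loss)

print(round(sequence_loss, 3))  # one bad token dominates the average
```

Seven near-perfect tokens contribute almost nothing; the single "Noth" mistake accounts for most of the sequence loss. That is exactly the signal gradient descent uses to fix it.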
4. Typical Loss Values for SLM Training
Here’s what to expect when training a small model like Granite-350M:
| Phase | Typical Loss |
|---|---|
| Initial batches | 3.0 – 5.0 |
| 1–2 hours in | 1.2 – 2.0 |
| Well-trained domain model | 0.5 – 1.0 |
| Excellent specialization | 0.2 – 0.6 |
Loss never goes to zero — language is too complex and probabilistic.
But a downward trend indicates learning.
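A handy companion to these numbers is perplexity, which is just `exp(loss)` when loss is measured in nats. Roughly, it is the number of tokens the model is "choosing between" at each step, so it turns the abstract loss values in the table into something more intuitive:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is exp(loss): roughly how many tokens the model
    is still 'choosing between' at each step."""
    return math.exp(loss)

for loss in (4.0, 1.5, 0.5):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.1f}")
```

A loss of 4.0 means the model is hesitating between dozens of tokens; at 0.5 it is nearly certain of the next one.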
5. What Does “Good Loss” Look Like?
✔ Downward curve (smooth decline)
Model is learning efficiently.
✔ Occasional small spikes
Normal — especially with small batch sizes.
✔ Plateau after long training
Model has learned all it can from the dataset.
❌ Huge oscillations
Learning rate too high.
❌ Flatline at high loss
Learning rate too low
OR dataset is too small
OR training broke.
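These curve shapes can be turned into rough automated checks. The sketch below uses made-up thresholds (`spike_factor`, `flat_eps`) purely for illustration; real monitoring tools are more sophisticated, but the logic is the same:

```python
def diagnose(losses, spike_factor=1.5, flat_eps=0.01):
    """Very rough heuristics for the curve shapes above.
    Thresholds are illustrative, not standard values."""
    avg = sum(losses) / len(losses)
    spikes = sum(1 for x in losses if x > spike_factor * avg)
    drop = losses[0] - losses[-1]
    if spikes > len(losses) // 4:
        return "oscillating: try lowering the learning rate"
    if abs(drop) < flat_eps and avg > 3.0:
        return "flatline at high loss: check lr, data, and the training loop"
    return "learning: loss trending down" if drop > 0 else "plateau"

print(diagnose([4.8, 3.9, 3.1, 2.4, 1.9, 1.6]))  # smooth downward curve
```

Running it on a healthy curve reports learning; feeding it a flat `[4.0, 4.0, 4.0, 4.0]` reports the flatline case.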
6. How to Interpret Loss for Your Excel SLM
For a domain like Excel formulas:
- patterns are consistent
- outputs are structured
- vocabulary is small
- syntax is predictable
This means:
- loss drops fast
- the model becomes specialized quickly
- you hit “excellent” performance with fewer examples
This is why your Excel SLM can reach near-perfect results using:
- 5,000–20,000 samples
- clean templates
- LoRA adapters
- small model sizes (≤1B)
7. Loss vs Accuracy: They Are Not the Same
A common misunderstanding:
“Low loss means 100% accurate outputs.”
Not exactly.
Loss is continuous, accuracy is discrete.
Example:
Model outputs:
=FILTER(A:B,A:A="North")
Expected:
=FILTER(A:B,(A:A="North"))
The difference is tiny, so the loss is low.
But exact-match accuracy counts the output as wrong.
This is why you must evaluate your SLM on a separate benchmark set.
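The continuous-vs-discrete gap is easy to demonstrate. Below, exact-match scoring gives the FILTER example a zero, while a similarity ratio (a crude stand-in for "how close the model was", not how loss is actually computed) shows the outputs nearly agree:

```python
import difflib

predicted = '=FILTER(A:B,A:A="North")'
expected  = '=FILTER(A:B,(A:A="North"))'

# Exact-match accuracy is all-or-nothing: this pair scores 0.
exact_match = predicted == expected
print(exact_match)  # False

# Loss, by contrast, is continuous. A similarity ratio shows
# the two strings differ by just a pair of parentheses:
similarity = difflib.SequenceMatcher(None, predicted, expected).ratio()
print(round(similarity, 2))
```

This is why a benchmark that only checks exact string equality can undersell a model whose loss curve looks excellent.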
8. How to Improve Loss During Training
✔ Clean your dataset
Remove formatting inconsistencies — they increase token confusion.
✔ Increase gradient accumulation
Stabilizes updates.
✔ Lower the learning rate
Especially if loss spikes.
✔ Increase training samples
More variety = better pattern learning.
✔ Use consistent templates
Better structure → fewer tokens → tighter loss.
✔ Use warmup
Reduces early instability.
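Warmup is usually implemented as a learning-rate schedule. Here is a minimal linear warmup-then-decay sketch; the base rate and step counts are illustrative values, not recommendations for any specific model:

```python
def lr_at(step, base_lr=2e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup from 0 to base_lr, then linear decay back to 0.
    All hyperparameters here are illustrative."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up gently
    remaining = total_steps - step                    # then decay
    return base_lr * max(remaining, 0) / (total_steps - warmup_steps)

print(lr_at(10))    # early step: tiny lr, avoids violent first updates
print(lr_at(100))   # end of warmup: full base_lr
print(lr_at(550))   # mid-training: halfway through the decay
```

Starting small is what tames the large, noisy gradients of the first batches; the decay afterward helps the model settle into a minimum.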
9. Loss in LoRA Fine-Tuning (Very Important)
LoRA modifies only a small set of weights.
This means:
- loss decreases faster
- it requires fewer steps
- it is less prone to catastrophic forgetting
- lower VRAM use
- faster training
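A back-of-the-envelope calculation shows why training "a small set of weights" translates into faster steps and lower VRAM. LoRA freezes a weight matrix W and instead trains two thin matrices A and B whose product is added to it; the dimensions below are illustrative:

```python
# Instead of updating a d x d weight matrix W directly, LoRA trains
# two low-rank matrices A (r x d) and B (d x r) and adds B @ A to W.
d, r = 1024, 8          # illustrative hidden size and LoRA rank

full_params = d * d              # parameters if we updated W itself
lora_params = (r * d) + (d * r)  # parameters in A and B combined

print(full_params)   # 1_048_576
print(lora_params)   # 16_384
print(round(100 * lora_params / full_params, 1), "% of full")
```

At rank 8, the adapter holds under 2% of the parameters of the full matrix, which is why loss drops in fewer steps and the frozen base model is protected from catastrophic forgetting.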
Even with LoRA, most real improvements come from dataset quality, not training volume.
Conclusion
Loss functions are the compass guiding your SLM through training. When you understand how loss works — and what “good loss” looks like — you gain the ability to:
- tune learning rates
- adjust batch sizes
- design better datasets
- detect training instability
- know when your model is done
Loss isn’t just a number.
It’s the heartbeat of your model’s learning journey.
Read the next article in the series: “Overfitting vs Underfitting — Finding the Sweet Spot in SLM Training”