(Article #10 in the Build Your Own Small Language Model series)
You can’t improve what you can’t measure.
For Small Language Models (SLMs), evaluation is not optional — it is the difference between:
- a model that merely runs, and
- a model that is trustworthy, stable, and true to its domain.
Unlike general-purpose LLMs, SLMs must excel at very specific outputs (like Excel formulas, Sheets functions, or structured code). This makes evaluation both simpler and more exact.
This article explains the metrics that actually matter, how to build meaningful tests, and how to know when your SLM is ready for real-world use.
1. Why Evaluation Matters for SLMs
Unlike massive LLMs, SLMs:
- have smaller capacity
- learn narrow tasks
- require precise outputs
- can overfit or underfit quickly
Evaluation prevents:
- hallucinations
- syntax errors
- logic mistakes
- inconsistent formatting
- loss of domain rules
Your model is only as good as how you measure it.
2. The Three Levels of SLM Evaluation
A proper evaluation strategy has three layers:
A. Token-Level Evaluation
“How close was the predicted output to the ground truth?”
B. Structure-Level Evaluation
“Does the output follow the correct syntax and format?”
C. Task-Level Evaluation
“Does the output work for the intended purpose?”
For Excel SLMs, Task-Level is king — the formula must run without errors.
3. Key Metrics for Small Language Models
1. Accuracy (Binary Correctness)
Did the model produce exactly the expected output?
Great for deterministic domains like:
- Excel formulas
- SQL queries
- Regex patterns
- Code snippets
score = 1 if predicted == expected else 0
This gives a clean, meaningful benchmark.
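The check above is a one-liner per case; over a batch it becomes a simple scorer (the function name here is illustrative):

```python
def accuracy(predictions, expected):
    """Fraction of predictions that exactly match the expected output."""
    assert len(predictions) == len(expected), "lists must align one-to-one"
    hits = sum(1 for p, e in zip(predictions, expected) if p == e)
    return hits / len(expected)

# One correct formula out of two -> 0.5
print(accuracy(["=SUM(A:A)", "=COUNT(B:B)"],
               ["=SUM(A:A)", "=COUNTA(B:B)"]))  # 0.5
```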
2. Precision (How Many of Its Answers Were Correct?)
Precision measures:
“Out of all outputs the model generated, how many were correct?”
Useful when you care more about correctness than coverage.
3. Recall (How Many Correct Answers Did It Find?)
Recall measures:
“Out of all expected correct answers, how many did the model successfully produce?”
Useful for tasks with multiple valid outputs or variants.
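For a generation task, one way to operationalize the two definitions above is: precision over the outputs the model actually produced, recall over all expected answers. The dict-based interface below is an assumption about how you store results, not a fixed API:

```python
def precision_recall(outputs, gold):
    """outputs: {case_id: prediction or None if the model abstained}
    gold:    {case_id: expected answer}"""
    produced = {k: v for k, v in outputs.items() if v is not None}
    correct = sum(1 for k, v in produced.items() if gold.get(k) == v)
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = {1: "=SUM(A:A)", 2: "=FILTER(A:A,B:B=1)", 3: "=COUNT(C:C)"}
outputs = {1: "=SUM(A:A)", 2: None, 3: "=COUNT(C:C)"}
# Everything it produced was right (precision 1.0),
# but it skipped the hard FILTER case (recall ~0.67).
print(precision_recall(outputs, gold))
```

The two metrics diverge exactly when the model abstains on, or mangles, the complex cases, which is why the red-flags section below watches the precision/recall gap.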
4. Exact Match Score
The strictest metric.
If the reference is:
=SUMIF(B:B,"North",E:E)
And the model outputs:
=SUMIF(B:B,"NORTH",E:E)
→ Not an exact match
→ Score = 0
Exact Match is unforgiving, but necessary for formula-based SLMs.
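Whether to normalize before comparing is a policy decision, not a given. A common middle ground is to strip insignificant whitespace but leave case alone, since string arguments like "North" can be semantically meaningful:

```python
import re

def exact_match(pred, ref):
    """Strict character-for-character comparison."""
    return int(pred == ref)

def normalized_match(pred, ref):
    """Collapse whitespace only; case is deliberately left untouched."""
    clean = lambda s: re.sub(r"\s+", "", s)
    return int(clean(pred) == clean(ref))

print(exact_match('=SUMIF(B:B,"North",E:E)', '=SUMIF(B:B,"NORTH",E:E)'))        # 0
print(normalized_match('=SUMIF(B:B, "North", E:E)', '=SUMIF(B:B,"North",E:E)'))  # 1
```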
5. Syntax Validity Rate
Does the output pass a syntax check?
For Excel:
- Parentheses balanced
- Quotes matched
- Valid Excel functions
- Valid separators
A model that produces invalid syntax is unreliable.
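The four checks above can be approximated with a cheap validator. This is a sketch, not a real Excel parser; the list of known functions is a stand-in for whatever whitelist your domain uses:

```python
import re

KNOWN_FUNCTIONS = {"SUM", "SUMIF", "IF", "INDEX", "MATCH", "FILTER", "COUNT"}

def is_syntactically_plausible(formula):
    """Cheap sanity checks: leading '=', matched quotes, balanced
    parentheses, and every called function drawn from a known list.
    Deliberately not a full parser."""
    if not formula.startswith("="):
        return False
    if formula.count('"') % 2 != 0:   # quotes must pair up
        return False
    depth, in_string = 0, False
    for ch in formula:
        if ch == '"':
            in_string = not in_string
        elif not in_string:           # ignore parens inside string literals
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:         # closing before opening
                    return False
    if depth != 0:
        return False
    calls = re.findall(r"([A-Z][A-Z0-9.]*)\(", formula)
    return all(name in KNOWN_FUNCTIONS for name in calls)
```

Run this on every generated output and report the pass rate as your syntax validity rate.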
6. Functional Accuracy (the ultimate metric)
Run the output inside Excel or a formula evaluator.
If:
- Output parses
- Output returns correct results on test data
- Output behaves as expected
→ Functional accuracy = PASS
This is the gold standard.
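A real functional harness drives Excel itself or a headless spreadsheet engine. To show just the shape of the pass/fail loop, the sketch below swaps in a toy evaluator that only understands `=SUM(...)` over numeric literals; everything named here is a stand-in for your actual engine:

```python
import re

def toy_eval(formula):
    """Toy stand-in for a real formula engine: handles only
    =SUM(n1,n2,...) over numeric literals. A production harness
    would call Excel or a spreadsheet library here instead."""
    m = re.fullmatch(r"=SUM\(([\d.,\s]+)\)", formula)
    if not m:
        raise ValueError(f"unsupported formula: {formula}")
    return sum(float(x) for x in m.group(1).split(","))

def functional_pass(formula, expected_value):
    """PASS only if the output parses AND returns the expected result."""
    try:
        return toy_eval(formula) == expected_value
    except ValueError:
        return False

print(functional_pass("=SUM(1, 2, 3)", 6.0))  # True
print(functional_pass("=SUM(1, 2", 3.0))      # False: does not even parse
```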
4. Building a Proper Benchmark Set
Your benchmark should include:
✔ 50–200 manually vetted cases
Never used in training.
✔ Full distribution of tasks
SUMIF, FILTER, INDEX/MATCH, LAMBDA…
Or their equivalents in Sheets, SQL, etc.
✔ Easy, medium, and hard tasks
So you can tell whether your SLM learned patterns or just memorized answers.
✔ Edge cases
Missing values, date functions, nested logic.
✔ Ambiguous phrasing
To measure generalization.
✔ Structural traps
Mismatched parentheses, tricky ranges.
This benchmark becomes your single source of truth for progress.
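One JSON object per case keeps the benchmark diffable and easy to version. The field names below are illustrative, but tagging each case with its task family and difficulty is what lets you slice results later:

```python
import json

# Illustrative schema: prompt, expected formula, task family, difficulty.
benchmark = [
    {"id": "sumif-001",
     "prompt": "Total of column E where column B equals 'North'",
     "expected": '=SUMIF(B:B,"North",E:E)',
     "task": "SUMIF",
     "difficulty": "easy"},
    {"id": "indexmatch-014",
     "prompt": "Look up the price in D for the product named in G1",
     "expected": "=INDEX(D:D,MATCH(G1,A:A,0))",
     "task": "INDEX/MATCH",
     "difficulty": "medium"},
]

with open("benchmark.json", "w") as f:
    json.dump(benchmark, f, indent=2)
```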
5. How to Evaluate After Each Training Phase
✔ After 1,000 steps
Check basic syntax stability.
✔ After 5,000 steps
Check accuracy on common tasks.
✔ After 10,000–20,000 steps
Check generalization and edge-case performance.
✔ Before finalizing
Ensure functional correctness on 95%+ of your benchmark set.
6. Tools to Automate Evaluation
You can write a simple evaluator in Python:
- use the tokenizer to encode prompts
- generate with the fine-tuned model
- compare outputs with expected targets
- run optional Excel formula parsing
- log metrics to CSV or JSON
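The steps above fit in one loop. Generation is kept behind a `generate_fn` callback so the skeleton runs anywhere; in practice you would plug in your fine-tuned model there (typically tokenizer encode → generate → decode), which is an assumption about your stack, not shown here. A stub generator stands in for the model:

```python
import json

def evaluate(benchmark, generate_fn):
    """Run each benchmark case through generate_fn, score exact match,
    and log per-case results plus an aggregate rate to JSON."""
    results, hits = [], 0
    for case in benchmark:
        pred = generate_fn(case["prompt"])
        ok = pred == case["expected"]
        hits += ok
        results.append({"id": case["id"], "prediction": pred, "exact_match": ok})
    metrics = {"exact_match_rate": hits / len(benchmark), "cases": results}
    with open("eval_report.json", "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics

# Stub generator standing in for the fine-tuned model.
canned = {"Sum column A": "=SUM(A:A)", "Count column B": "=COUNT(B:B)"}
bench = [{"id": "c1", "prompt": "Sum column A", "expected": "=SUM(A:A)"},
         {"id": "c2", "prompt": "Count column B", "expected": "=COUNTA(B:B)"}]
report = evaluate(bench, lambda p: canned[p])
print(report["exact_match_rate"])  # 0.5
```

Swapping the exact-match comparison for the syntax and functional checks from earlier sections turns this into a full three-level evaluator.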
7. Red Flags During Evaluation
❌ High training accuracy, low benchmark accuracy
→ overfitting
❌ Good token-level score, bad functional accuracy
→ model is “close” but inconsistent
❌ Good precision, low recall
→ model avoids generating complex answers
❌ Outputs differ only in style
→ dataset has formatting inconsistencies
❌ Syntax errors appear late in training
→ learning rate too high
Conclusion
Evaluation is where SLM training becomes real.
Without metrics, you’re flying blind.
With the right metrics — accuracy, syntax validity, functional correctness — you can build a Small Language Model that is:
- precise
- consistent
- reliable
- domain-aligned
- and production ready
A well-evaluated SLM can outperform much larger models inside its niche.