Evaluation Metrics — How to Measure SLM Performance Properly

(Article #10 in the Build Your Own Small Language Model series)

You can’t improve what you can’t measure.
For Small Language Models (SLMs), evaluation is not optional — it is the difference between:

  • a model that merely runs, and
  • a model that is trustworthy, stable, and true to its domain.

Unlike general-purpose LLMs, SLMs must excel at very specific outputs (like Excel formulas, Sheets functions, or structured code). This makes evaluation both simpler and more exact.

This article explains the metrics that actually matter, how to build meaningful tests, and how to know when your SLM is ready for real-world use.

1. Why Evaluation Matters for SLMs

Unlike massive LLMs, SLMs:

  • have smaller capacity
  • learn narrow tasks
  • require precise outputs
  • can overfit or underfit quickly

Evaluation prevents:

  • hallucinations
  • syntax errors
  • logic mistakes
  • inconsistent formatting
  • loss of domain rules

Your model is only as good as how you measure it.

2. The Three Levels of SLM Evaluation

A proper evaluation strategy has three layers:

A. Token-Level Evaluation

“How close was the predicted output to the ground truth?”

B. Structure-Level Evaluation

“Does the output follow the correct syntax and format?”

C. Task-Level Evaluation

“Does the output work for the intended purpose?”

For Excel SLMs, Task-Level is king — the formula must run without errors.

3. Key Metrics for Small Language Models

1. Accuracy (Binary Correctness)

Did the model produce exactly the expected output?

Great for deterministic domains like:

  • Excel formulas
  • SQL queries
  • Regex patterns
  • Code snippets

Predicted == Expected ? 1 : 0

This gives a clean, meaningful benchmark.
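As a minimal sketch in Python (the function name and sample formulas are ours, purely for illustration), binary accuracy is just an average of exact comparisons:

```python
def accuracy(predictions, references):
    """Binary correctness: 1 if the prediction equals the reference, else 0,
    averaged over all test cases."""
    scores = [int(p == r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

preds = ["=SUM(A:A)", "=AVERAGE(B:B)", "=SUM(A1:A10)"]
refs  = ["=SUM(A:A)", "=AVERAGE(B:B)", "=SUM(A1:A9)"]
print(accuracy(preds, refs))  # two of three match -> 0.666...
```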

2. Precision (How Many of Its Answers Were Correct?)

Precision measures:

“Out of all outputs the model generated, how many were correct?”

Useful when you care more about correctness than coverage.

3. Recall (How Many Correct Answers Did It Find?)

Recall measures:

“Out of all expected correct answers, how many did the model successfully produce?”

Useful for tasks with multiple valid outputs or variants.
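For a task where several formula variants are acceptable, one simple way to compute both metrics (a sketch under the assumption that outputs can be compared as sets of strings) is:

```python
def precision_recall(generated, expected):
    """Precision: fraction of generated outputs that are correct.
    Recall: fraction of expected outputs the model actually produced."""
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Illustrative: the model proposes three formulas, two are valid,
# and it misses one expected variant.
gen = {"=SUM(E:E)", "=AVERAGE(E:E)", "=COUNT(E:E)"}
exp = {"=SUM(E:E)", "=AVERAGE(E:E)", "=MEDIAN(E:E)"}
p, r = precision_recall(gen, exp)  # p = 2/3, r = 2/3
```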

4. Exact Match Score

The strictest metric.

If the reference is:

=SUMIF(B:B,"North",E:E)

And the model outputs:

=SUMIF(B:B,"NORTH",E:E)

Not an exact match
→ Score = 0

(Excel's SUMIF text criteria are in fact case-insensitive, so both formulas would return the same result; Exact Match still scores this as a failure.)

Exact Match is unforgiving, but that strictness is necessary for formula-based SLMs.
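A sketch of the strict check, alongside a whitespace-only normalized variant (both function names are ours):

```python
def exact_match(pred, ref):
    """Strict exact match: no normalization at all."""
    return int(pred == ref)

def normalized_match(pred, ref):
    """A slightly looser variant: trim surrounding whitespace only.
    Case is deliberately NOT folded, so "North" vs "NORTH" still fails."""
    return int(pred.strip() == ref.strip())

print(exact_match('=SUMIF(B:B,"North",E:E)', '=SUMIF(B:B,"NORTH",E:E)'))  # -> 0
```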

5. Syntax Validity Rate

Does the output pass a syntax check?

For Excel:

  • Parentheses balanced
  • Quotes matched
  • Valid Excel functions
  • Valid separators

A model that produces invalid syntax is unreliable.
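A minimal, illustrative checker for these rules might look like the following. This is a sketch, not a real Excel grammar: it checks the leading `=`, balanced parentheses outside of strings, matched quotes, and function names against a small allow-list you would expand for your own domain.

```python
import re

KNOWN_FUNCTIONS = {"SUM", "SUMIF", "AVERAGE", "IF", "INDEX", "MATCH", "FILTER"}

def is_syntax_valid(formula, known_functions=KNOWN_FUNCTIONS):
    """Rough Excel syntax check: leading '=', balanced parentheses
    outside of quoted strings, matched double quotes, known function names.
    A production checker would use a proper formula parser."""
    if not formula.startswith("="):
        return False
    depth, in_string = 0, False
    for ch in formula:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:  # closing paren before its opener
                    return False
    if in_string or depth != 0:
        return False
    # Every NAME( must be a function we recognize
    names = re.findall(r"([A-Z][A-Z0-9.]*)\(", formula)
    return all(name in known_functions for name in names)
```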

6. Functional Accuracy (the ultimate metric)

Run the output inside Excel or a formula evaluator.

If:

  • Output parses
  • Output returns correct results on test data
  • Output behaves as expected

Functional accuracy = PASS

This is the gold standard.
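A harness for this can keep the evaluator pluggable: `evaluate_formula` below is a hypothetical callable you would back with a real engine (for example, Excel driven over COM, or LibreOffice in headless mode). The sketch only shows the scoring loop.

```python
def functional_accuracy(cases, evaluate_formula):
    """cases: list of (formula, test_data, expected_result) tuples.
    evaluate_formula: hypothetical callable(formula, test_data) -> result,
    backed by a real spreadsheet engine; it raises on parse/runtime errors."""
    passed = 0
    for formula, data, expected in cases:
        try:
            if evaluate_formula(formula, data) == expected:
                passed += 1
        except Exception:
            pass  # a formula that errors out counts as a failure
    return passed / len(cases)
```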

4. Building a Proper Benchmark Set

Your benchmark should include:

✔ 50–200 manually vetted cases

Never used in training.

✔ Full distribution of tasks

SUMIF, FILTER, INDEX/MATCH, LAMBDA…
Or their equivalents in Sheets, SQL, etc.

✔ Easy, medium, and hard tasks

Make sure your SLM has learned patterns rather than memorized answers.

✔ Edge cases

Missing values, date functions, nested logic.

✔ Ambiguous phrasing

To measure generalization.

✔ Structural traps

Mismatched parentheses, tricky ranges.

This benchmark becomes your single source of truth for progress.
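One convenient storage format for such a benchmark is JSONL: one vetted case per line, tagged by difficulty and task type so you can slice results later. The fields below are illustrative, not a fixed schema:

```python
import json

# A single hand-vetted benchmark case (never seen in training).
case = {
    "id": "sumif-001",
    "difficulty": "easy",            # easy / medium / hard
    "prompt": "Sum column E where column B equals North",
    "expected": '=SUMIF(B:B,"North",E:E)',
    "tags": ["SUMIF", "conditional-sum"],
}
print(json.dumps(case))  # append one line like this per case to benchmark.jsonl
```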

5. How to Evaluate After Each Training Phase

✔ After 1,000 steps

Check basic syntax stability.

✔ After 5,000 steps

Check accuracy on common tasks.

✔ After 10,000–20,000 steps

Check generalization and edge-case performance.

✔ Before finalizing

Ensure functional correctness on 95%+ of your benchmark set.

6. Tools to Automate Evaluation

You can write a simple evaluator in Python:

  • use the tokenizer to encode prompts
  • generate with the fine-tuned model
  • compare outputs with expected targets
  • run optional Excel formula parsing
  • log metrics to CSV or JSON

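Putting those steps together, a minimal harness might look like the sketch below. The generation step is pluggable so the harness itself runs without a model; the commented snippet shows one possible way to wire in a fine-tuned Hugging Face checkpoint (the path and field names are assumptions):

```python
import csv
import json

def evaluate(benchmark_path, generate_fn, out_path="results.csv"):
    """Run every benchmark case through generate_fn and log results to CSV.
    benchmark_path: JSONL file, one {"prompt": ..., "expected": ...} per line.
    generate_fn: callable(prompt) -> predicted formula string.
    Returns the exact-match rate."""
    results = []
    with open(benchmark_path) as f:
        for line in f:
            case = json.loads(line)
            predicted = generate_fn(case["prompt"]).strip()
            results.append({
                "prompt": case["prompt"],
                "expected": case["expected"],
                "predicted": predicted,
                "exact_match": int(predicted == case["expected"]),
            })
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)
    return sum(r["exact_match"] for r in results) / len(results)

# With Hugging Face transformers, generate_fn could look like this (assumption):
#   tok = AutoTokenizer.from_pretrained("./my-excel-slm")
#   mdl = AutoModelForCausalLM.from_pretrained("./my-excel-slm")
#   def generate_fn(prompt):
#       ids = tok(prompt, return_tensors="pt")
#       out = mdl.generate(**ids, max_new_tokens=64)
#       return tok.decode(out[0][ids["input_ids"].shape[1]:],
#                         skip_special_tokens=True)
```

Swap the exact-match comparison for the syntax or functional checks from Section 3 as your evaluation matures.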

7. Red Flags During Evaluation

❌ High training accuracy, low benchmark accuracy

→ overfitting

❌ Good token-level score, bad functional accuracy

→ model is “close” but inconsistent

❌ Good precision, low recall

→ model avoids generating complex answers

❌ Outputs differ only in style

→ dataset has formatting inconsistencies

❌ Syntax errors appear late in training

→ learning rate too high

Conclusion

Evaluation is where SLM training becomes real.
Without metrics, you’re flying blind.
With the right metrics — accuracy, syntax validity, functional correctness — you can build a Small Language Model that is:

  • precise
  • consistent
  • reliable
  • domain-aligned
  • and production-ready

A well-evaluated SLM outperforms much larger models inside its niche — every time.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
