Evaluation Metrics — How to Measure SLM Performance Properly

(Article #10 in the Build Your Own Small Language Model series)

You can’t improve what you can’t measure.
For Small Language Models (SLMs), evaluation is not optional — it is the difference between:

  • a model that merely runs, and
  • a model that is trustworthy, stable, and true to its domain.

Unlike general-purpose LLMs, SLMs must excel at very specific outputs (like Excel formulas, Sheets functions, or structured code). This makes evaluation both simpler and more exact.

This article explains the metrics that actually matter, how to build meaningful tests, and how to know when your SLM is ready for real-world use.

1. Why Evaluation Matters for SLMs

Unlike massive LLMs, SLMs:

  • have smaller capacity
  • learn narrow tasks
  • require precise outputs
  • can overfit or underfit quickly

Evaluation prevents:

  • hallucinations
  • syntax errors
  • logic mistakes
  • inconsistent formatting
  • loss of domain rules

Your model is only as good as how you measure it.

2. The Three Levels of SLM Evaluation

A proper evaluation strategy has three layers:

A. Token-Level Evaluation

“How close was the predicted output to the ground truth?”

B. Structure-Level Evaluation

“Does the output follow the correct syntax and format?”

C. Task-Level Evaluation

“Does the output work for the intended purpose?”

For Excel SLMs, Task-Level is king — the formula must run without errors.

3. Key Metrics for Small Language Models

1. Accuracy (Binary Correctness)

Did the model produce exactly the expected output?

Great for deterministic domains like:

  • Excel formulas
  • SQL queries
  • Regex patterns
  • Code snippets

Predicted == Expected ? 1 : 0

This gives a clean, meaningful benchmark.
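As a minimal sketch in Python (the function name and sample formulas are ours, purely for illustration), binary accuracy is just an average of exact comparisons:

```python
def accuracy(predictions, references):
    """Binary correctness: 1 if the prediction equals the reference, else 0,
    averaged over all test cases."""
    scores = [int(p == r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

preds = ["=SUM(A:A)", "=AVERAGE(B:B)", "=SUM(A1:A10)"]
refs  = ["=SUM(A:A)", "=AVERAGE(B:B)", "=SUM(A1:A9)"]
print(accuracy(preds, refs))  # two of three match -> 0.666...
```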

2. Precision (How Many of Its Answers Were Correct?)

Precision measures:

“Out of all outputs the model generated, how many were correct?”

Useful when you care more about correctness than coverage.

3. Recall (How Many Correct Answers Did It Find?)

Recall measures:

“Out of all expected correct answers, how many did the model successfully produce?”

Useful for tasks with multiple valid outputs or variants.
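For a task where several formula variants are acceptable, one simple way to compute both metrics (a sketch under the assumption that outputs can be compared as sets of strings) is:

```python
def precision_recall(generated, expected):
    """Precision: fraction of generated outputs that are correct.
    Recall: fraction of expected outputs the model actually produced."""
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Illustrative: the model proposes three formulas, two are valid,
# and it misses one expected variant.
gen = {"=SUM(E:E)", "=AVERAGE(E:E)", "=COUNT(E:E)"}
exp = {"=SUM(E:E)", "=AVERAGE(E:E)", "=MEDIAN(E:E)"}
p, r = precision_recall(gen, exp)  # p = 2/3, r = 2/3
```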

4. Exact Match Score

The strictest metric.

If the reference is:

=SUMIF(B:B,"North",E:E)

And the model outputs:

=SUMIF(B:B,"NORTH",E:E)

Not an exact match
→ Score = 0

(Excel's SUMIF text criteria are in fact case-insensitive, so both formulas would return the same result; Exact Match still scores this as a failure.)

Exact Match is unforgiving, but that strictness is necessary for formula-based SLMs.
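A sketch of the strict check, alongside a whitespace-only normalized variant (both function names are ours):

```python
def exact_match(pred, ref):
    """Strict exact match: no normalization at all."""
    return int(pred == ref)

def normalized_match(pred, ref):
    """A slightly looser variant: trim surrounding whitespace only.
    Case is deliberately NOT folded, so "North" vs "NORTH" still fails."""
    return int(pred.strip() == ref.strip())

print(exact_match('=SUMIF(B:B,"North",E:E)', '=SUMIF(B:B,"NORTH",E:E)'))  # -> 0
```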

5. Syntax Validity Rate

Does the output pass a syntax check?

For Excel:

  • Parentheses balanced
  • Quotes matched
  • Valid Excel functions
  • Valid separators

A model that produces invalid syntax is unreliable.
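A minimal, illustrative checker for these rules might look like the following. This is a sketch, not a real Excel grammar: it checks the leading `=`, balanced parentheses outside of strings, matched quotes, and function names against a small allow-list you would expand for your own domain.

```python
import re

KNOWN_FUNCTIONS = {"SUM", "SUMIF", "AVERAGE", "IF", "INDEX", "MATCH", "FILTER"}

def is_syntax_valid(formula, known_functions=KNOWN_FUNCTIONS):
    """Rough Excel syntax check: leading '=', balanced parentheses
    outside of quoted strings, matched double quotes, known function names.
    A production checker would use a proper formula parser."""
    if not formula.startswith("="):
        return False
    depth, in_string = 0, False
    for ch in formula:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:  # closing paren before its opener
                    return False
    if in_string or depth != 0:
        return False
    # Every NAME( must be a function we recognize
    names = re.findall(r"([A-Z][A-Z0-9.]*)\(", formula)
    return all(name in known_functions for name in names)
```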

6. Functional Accuracy (the ultimate metric)

Run the output inside Excel or a formula evaluator.

If:

  • Output parses
  • Output returns correct results on test data
  • Output behaves as expected

Functional accuracy = PASS

This is the gold standard.
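A harness for this can keep the evaluator pluggable: `evaluate_formula` below is a hypothetical callable you would back with a real engine (for example, Excel driven over COM, or LibreOffice in headless mode). The sketch only shows the scoring loop.

```python
def functional_accuracy(cases, evaluate_formula):
    """cases: list of (formula, test_data, expected_result) tuples.
    evaluate_formula: hypothetical callable(formula, test_data) -> result,
    backed by a real spreadsheet engine; it raises on parse/runtime errors."""
    passed = 0
    for formula, data, expected in cases:
        try:
            if evaluate_formula(formula, data) == expected:
                passed += 1
        except Exception:
            pass  # a formula that errors out counts as a failure
    return passed / len(cases)
```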

4. Building a Proper Benchmark Set

Your benchmark should include:

✔ 50–200 manually vetted cases

Never used in training.

✔ Full distribution of tasks

SUMIF, FILTER, INDEX/MATCH, LAMBDA…
Or their equivalents in Sheets, SQL, etc.

✔ Easy, medium, and hard tasks

Make sure your SLM has learned patterns rather than memorized answers.

✔ Edge cases

Missing values, date functions, nested logic.

✔ Ambiguous phrasing

To measure generalization.

✔ Structural traps

Mismatched parentheses, tricky ranges.

This benchmark becomes your single source of truth for progress.
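One convenient storage format for such a benchmark is JSONL: one vetted case per line, tagged by difficulty and task type so you can slice results later. The fields below are illustrative, not a fixed schema:

```python
import json

# A single hand-vetted benchmark case (never seen in training).
case = {
    "id": "sumif-001",
    "difficulty": "easy",            # easy / medium / hard
    "prompt": "Sum column E where column B equals North",
    "expected": '=SUMIF(B:B,"North",E:E)',
    "tags": ["SUMIF", "conditional-sum"],
}
print(json.dumps(case))  # append one line like this per case to benchmark.jsonl
```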

5. How to Evaluate After Each Training Phase

✔ After 1,000 steps

Check basic syntax stability.

✔ After 5,000 steps

Check accuracy on common tasks.

✔ After 10,000–20,000 steps

Check generalization and edge-case performance.

✔ Before finalizing

Ensure functional correctness on 95%+ of your benchmark set.

6. Tools to Automate Evaluation

You can write a simple evaluator in Python:

  • use the tokenizer to encode prompts
  • generate with the fine-tuned model
  • compare outputs with expected targets
  • run optional Excel formula parsing
  • log metrics to CSV or JSON

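Putting those steps together, a minimal harness might look like the sketch below. The generation step is pluggable so the harness itself runs without a model; the commented snippet shows one possible way to wire in a fine-tuned Hugging Face checkpoint (the path and field names are assumptions):

```python
import csv
import json

def evaluate(benchmark_path, generate_fn, out_path="results.csv"):
    """Run every benchmark case through generate_fn and log results to CSV.
    benchmark_path: JSONL file, one {"prompt": ..., "expected": ...} per line.
    generate_fn: callable(prompt) -> predicted formula string.
    Returns the exact-match rate."""
    results = []
    with open(benchmark_path) as f:
        for line in f:
            case = json.loads(line)
            predicted = generate_fn(case["prompt"]).strip()
            results.append({
                "prompt": case["prompt"],
                "expected": case["expected"],
                "predicted": predicted,
                "exact_match": int(predicted == case["expected"]),
            })
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)
    return sum(r["exact_match"] for r in results) / len(results)

# With Hugging Face transformers, generate_fn could look like this (assumption):
#   tok = AutoTokenizer.from_pretrained("./my-excel-slm")
#   mdl = AutoModelForCausalLM.from_pretrained("./my-excel-slm")
#   def generate_fn(prompt):
#       ids = tok(prompt, return_tensors="pt")
#       out = mdl.generate(**ids, max_new_tokens=64)
#       return tok.decode(out[0][ids["input_ids"].shape[1]:],
#                         skip_special_tokens=True)
```

Swap the exact-match comparison for the syntax or functional checks from Section 3 as your evaluation matures.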

7. Red Flags During Evaluation

❌ High training accuracy, low benchmark accuracy

→ overfitting

❌ Good token-level score, bad functional accuracy

→ model is “close” but inconsistent

❌ Good precision, low recall

→ model avoids generating complex answers

❌ Outputs differ only in style

→ dataset has formatting inconsistencies

❌ Syntax errors appear late in training

→ learning rate too high

Conclusion

Evaluation is where SLM training becomes real.
Without metrics, you’re flying blind.
With the right metrics — accuracy, syntax validity, functional correctness — you can build a Small Language Model that is:

  • precise
  • consistent
  • reliable
  • domain-aligned
  • and production-ready

A well-evaluated SLM outperforms much larger models inside its niche — every time.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
