Small AI models are getting powerful enough to become real tools — fast, cheap, and accurate in specialized domains.
To push the limits of 6SigmaMind, I built a public benchmark where three cutting-edge small models answer the same Excel prompt side by side:
🔹 SmolLM-1.7B — the 6SigmaMind baseline
🔹 Qwen2.5-1.5B-Instruct — one of the most capable small instruction models
🔹 DeepSeek-R1-Distill-Qwen-1.5B — the new reasoning-focused small model
👉 Try all 3 live: https://huggingface.co/spaces/benkemp/6SigmaMindV3
👉 The Python code used for the benchmarking is available here on GitHub
👉 The side-by-side benchmark comparison of the SLMs is available here on Google Docs
👉 The Excel workbook used for the function benchmark is available here
These three models are all in the 1.5–1.7B range — small enough to run on CPU, but smart enough to generate valid Excel logic.
Let’s see how they behave.
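Here is a minimal sketch of how a side-by-side harness like this can be wired up with Hugging Face `transformers`. The Hub IDs below are my assumption of the checkpoints involved; the actual Space and GitHub code may pin different ones:

```python
# Load all three models as CPU text-generation pipelines.
from transformers import pipeline

MODEL_IDS = {
    "SmolLM-1.7B": "HuggingFaceTB/SmolLM-1.7B-Instruct",               # assumed Hub ID
    "Qwen2.5-1.5B": "Qwen/Qwen2.5-1.5B-Instruct",                      # assumed Hub ID
    "DeepSeek-R1-1.5B": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",   # assumed Hub ID
}

generators = {
    name: pipeline("text-generation", model=model_id, device=-1)  # device=-1 -> CPU
    for name, model_id in MODEL_IDS.items()
}
```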
⚡ Why Compare These Three Models?
Because the future of AI is not only about scale — it’s about specialization.
Small models:
- run anywhere
- respond quickly
- cost nothing to operate
- can be fine-tuned into extremely focused assistants
6SigmaMind is designed to answer one question:
How good can a tiny model become at Excel formulas?
To find out, we test three different architectures, each with its own strengths.
🥊 The 6SigmaMind Battle Arena
One prompt → three different reasoning styles.
Try these prompts in the Space to see the differences:
- “Sum values in column C where column B equals ‘Closed’.”
- “Return the last non-empty cell in column B.”
- “Calculate correlation between A and B.”
- “Perform a two-tailed t-test between C2:C50 and D2:D50.”
- “XLOOKUP the price in D where A matches H2.”
Each model gives a unique “flavor” of output.
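With the pipelines loaded, the arena itself is just a loop: one prompt, three generations. A hedged sketch, reusing the `generators` dict from the snippet above and assuming a recent `transformers` release that accepts chat-style message lists:

```python
PROMPTS = [
    "Sum values in column C where column B equals 'Closed'.",
    "Return the last non-empty cell in column B.",
    "Calculate correlation between A and B.",
    "Perform a two-tailed t-test between C2:C50 and D2:D50.",
    "XLOOKUP the price in D where A matches H2.",
]

SYSTEM = "You are an Excel assistant. Reply with a single Excel formula only."

for prompt in PROMPTS:
    print(f"\n### {prompt}")
    for name, gen in generators.items():
        messages = [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ]
        out = gen(messages, max_new_tokens=128, do_sample=False)
        # Chat-style input returns the whole conversation; the last
        # message is the model's reply.
        print(f"{name}: {out[0]['generated_text'][-1]['content'].strip()}")
```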
🧪 1. SmolLM-1.7B — The 6SigmaMind Baseline
⭐ Strengths
- Fastest responses
- Strong at SUMIFS / COUNTIFS
- Reliable IF, AND, OR logic
- Good at simple lookups
- Concise formulas
⚠️ Weaknesses
- Argument ordering mistakes
- Rare statistical errors
- Occasionally echoes the prompt
🎯 Best use
Lightweight, embedded Excel assistants.
🧪 2. Qwen2.5-1.5B-Instruct
⭐ Strengths
- Most consistent reasoning of the small trio
- Excellent at modern Excel functions (XLOOKUP, FILTER)
- Structure-aware (very good argument ordering)
- Handles unusual phrasings better than SmolLM
⚠️ Weaknesses
- Sometimes too verbose
- Occasionally over-explains unless constrained (see the formula-extraction sketch after this section)
🎯 Best use
When accuracy and structure matter most.
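Since both Qwen and DeepSeek can wrap the formula in explanation, the harness can post-process replies to keep only the formula itself. A minimal sketch; `extract_formula` is an illustrative helper of mine, not part of the published benchmark code:

```python
import re

def extract_formula(reply: str) -> str:
    """Keep only the first formula-shaped line in a model reply;
    fall back to the raw reply if nothing starts with '='."""
    for line in reply.splitlines():
        candidate = line.strip().strip("`")
        if candidate.startswith("="):
            return candidate
    match = re.search(r"=\S[^\n]*", reply)
    return match.group(0) if match else reply.strip()

print(extract_formula("You can use:\n`=XLOOKUP(H2, A:A, D:D)`\nThis searches column A..."))
```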
🧪 3. DeepSeek-R1-Distill-Qwen-1.5B
This model is the newest in the group — designed with a distilled-reasoning approach, meaning it tries to “think” more carefully than typical 1B–2B models.
⭐ Strengths
- Very strong reasoning for its size
- Good at multi-step formula logic
- Often provides notably precise function choices
- Handles text-processing functions well (LEFT, MID, SEARCH)
⚠️ Weaknesses
- Sometimes produces “explanatory” text unless guided (a tag-stripping sketch follows this section)
- Statistical formulas can be hit-or-miss
- Sometimes generates slightly unusual but logically valid formulas
🎯 Best use
Exploring how reasoning-distilled small models behave on structured tasks.
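One practical note: R1-distilled checkpoints typically emit their chain of thought inside `<think>...</think>` tags before the final answer. If you only want the formula, a small filter helps; `strip_reasoning` is an illustrative helper, not from the benchmark repo:

```python
import re

def strip_reasoning(reply: str) -> str:
    """Remove the <think>...</think> block that R1-distilled models
    commonly emit before their final answer."""
    return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()

print(strip_reasoning('<think>The user wants a conditional sum...</think>\n=SUMIFS(C:C, B:B, "Closed")'))
```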
🧠 What the Side-by-Side Results Reveal
✔ Small models already understand Excel semantics
All three know:
SUMIFS, COUNTIFS, IF logic, XLOOKUP, STDEV.S, CORREL, FILTER.
✔ They each have unique reasoning tendencies
- SmolLM → direct & fast
- Qwen → structured & accurate
- DeepSeek R1 → reflective & reasoning-driven
✔ 1.5–1.7B is a “sweet spot”
Large enough to handle structured tasks,
small enough to run anywhere.
✔ Fine-tuning will make the difference
With a specialized Excel dataset, these models will jump another level.
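One plausible shape for such a dataset is a JSONL file of (request, formula) pairs, sketched below. The layout and the `excel_sft.jsonl` file name are illustrative assumptions, not the actual 6SigmaMind training set:

```python
import json

# Illustrative only: one possible record layout for an Excel fine-tuning set.
examples = [
    {"prompt": "Sum values in column C where column B equals 'Closed'.",
     "completion": '=SUMIFS(C:C, B:B, "Closed")'},
    {"prompt": "Return the last non-empty cell in column B.",
     "completion": '=LOOKUP(2, 1/(B:B<>""), B:B)'},
]

with open("excel_sft.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```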
📊 Mini-Benchmark: Example Outputs
Prompt:
“Return the last non-empty value in column B.”
| Model | Typical Output |
|---|---|
| SmolLM-1.7B | =LOOKUP(2,1/(B:B<>""),B:B) |
| Qwen-1.5B | Same as above, very consistent |
| DeepSeek R1 | Same formula, sometimes adds brief reasoning |
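Why this LOOKUP trick works: `1/(B:B<>"")` evaluates to 1 for non-empty cells and a `#DIV/0!` error for empty ones, and `LOOKUP` with a lookup value of 2 (larger than any 1 in the array) skips the errors and returns the value aligned with the last 1. A toy Python equivalent, purely for illustration:

```python
def last_non_empty(column):
    """Mirror of =LOOKUP(2, 1/(B:B<>""), B:B): return the last
    non-empty value in a column, or None if there is none."""
    non_empty = [v for v in column if v not in ("", None)]
    return non_empty[-1] if non_empty else None

assert last_non_empty(["a", "", "b", "", ""]) == "b"
```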
Prompt:
“Calculate the standard deviation of B2:B80.”
| Model | Typical Output |
|---|---|
| SmolLM-1.7B | =STDEV.S(B2:B80) |
| Qwen-1.5B | Same formula, most reliable |
| DeepSeek R1 | Usually correct, sometimes extra commentary |
Prompt:
“Perform a two-tailed t-test on C2:C50 vs D2:D50.”
| Model | Typical Output |
|---|---|
| SmolLM-1.7B | Attempts but inconsistent |
| Qwen-1.5B | =T.TEST(C2:C50, D2:D50, 2, 2) (most accurate) |
| DeepSeek R1 | Usually correct, but may attempt an alternate approach |
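These statistical prompts are easy to sanity-check outside Excel: `=T.TEST(C2:C50, D2:D50, 2, 2)` is a two-tailed, two-sample equal-variance t-test, which corresponds to SciPy's `ttest_ind` with `equal_var=True`. A minimal cross-check with made-up sample data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
c = rng.normal(10.0, 2.0, size=49)  # stand-in for C2:C50
d = rng.normal(10.5, 2.0, size=49)  # stand-in for D2:D50

# Two-tailed, two-sample equal-variance t-test, i.e. T.TEST(..., 2, 2).
t_stat, p_value = stats.ttest_ind(c, d, equal_var=True)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```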
🚀 Why This Matters for 6SigmaMind
Because 6SigmaMind isn’t just a demo —
it’s the start of a small-model specialization journey.
This comparison shows:
- Small models are already close to what’s needed
- Fine-tuning will push them from “good” to “expert”
- A benchmark helps identify which model becomes 6SigmaMindV3
This is the foundation for an Excel-focused mini-LLM.
🎯 Try the Live Benchmark Yourself
👉 Launch the model-comparison Space:
https://huggingface.co/spaces/benkemp/6SigmaMindV3
Try prompts like:
- “Sum C where B = ‘Closed’.”
- “Lookup D where A matches H2.”
- “IF A2 > 100 then return ‘High’ else ‘OK’.”
- “Correlation between A and B.”
- “Two-sample t-test for C2:C50 vs D2:D50.”
You’ll immediately feel which model “thinks” the way you like.
Small models are the future — and 6SigmaMind is being built in the open.