6SigmaMind Model Comparison: SmolLM-1.7B vs Qwen-1.5B vs DeepSeek R1 — Which Small Model Understands Excel Best?

Small AI models are getting powerful enough to become real tools — fast, cheap, and accurate in specialized domains.
To push the limits of 6SigmaMind, I built a public benchmark where three cutting-edge small models answer the same Excel prompt side by side:

🔹 SmolLM-1.7B — the 6SigmaMind baseline

🔹 Qwen2.5-1.5B-Instruct — one of the most capable small instruction models

🔹 DeepSeek-R1-Distill-Qwen-1.5B — the new reasoning-focused small model

👉 Try all 3 live: https://huggingface.co/spaces/benkemp/6SigmaMindV3

👉 The Python code used for the benchmarking is available here on GitHub

👉 The benchmark side-by-side comparison of the SLMs is available here on Google Docs

👉 The Excel function benchmark test workbook is available here

These three models are all in the 1.5–1.7B range — small enough to run on CPU, but smart enough to generate valid Excel logic.
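
As a concrete starting point, here is a minimal sketch of loading one of these checkpoints on CPU with Hugging Face transformers. This is not the published 6SigmaMind benchmark code, and the checkpoint ID is an assumption inferred from the model names above.

```python
# Minimal CPU sketch using Hugging Face transformers. NOT the published
# 6SigmaMind benchmark code; the checkpoint ID is an assumption based on
# the model names in this post.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-1.7B-Instruct"  # assumed SmolLM baseline checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # stays on CPU when no GPU is available

prompt = "Write an Excel formula: sum values in column C where column B equals 'Closed'."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

# Greedy decoding keeps the formula output deterministic for benchmarking.
outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```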

Let’s see how they behave.

⚡ Why Compare These Three Models?

Because the future of AI is not only about scale — it’s about specialization.

Small models:

  • run anywhere
  • respond quickly
  • cost nothing to operate
  • can be fine-tuned into extremely focused assistants

6SigmaMind is designed to answer one question:

How good can a tiny model become at Excel formulas?

To find out, we test three different architectures, each with its own strengths.

🥊 The 6SigmaMind Battle Arena

One prompt → three different reasoning styles.

Try these prompts in the Space to see the differences:

  • “Sum values in column C where column B equals ‘Closed’.”
  • “Return the last non-empty cell in column B.”
  • “Calculate correlation between A and B.”
  • “Perform a two-tailed t-test between C2:C50 and D2:D50.”
  • “XLOOKUP the price in D where A matches H2.”

Each model gives a unique “flavor” of output.
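
To reproduce the side-by-side behavior outside the Space, a simple loop over the three checkpoints is enough. A hedged sketch, assuming the Hub IDs below; the Space's exact wiring may differ.

```python
# Side-by-side sketch: the same prompts through all three small models.
# Hub IDs are assumptions inferred from the model names in this post.
from transformers import pipeline

MODELS = {
    "SmolLM-1.7B": "HuggingFaceTB/SmolLM-1.7B-Instruct",
    "Qwen2.5-1.5B": "Qwen/Qwen2.5-1.5B-Instruct",
    "DeepSeek-R1-Distill-1.5B": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
}

PROMPTS = [
    "Sum values in column C where column B equals 'Closed'.",
    "Return the last non-empty cell in column B.",
    "Perform a two-tailed t-test between C2:C50 and D2:D50.",
]

for name, model_id in MODELS.items():
    generator = pipeline("text-generation", model=model_id)  # CPU by default
    for prompt in PROMPTS:
        chat = [{"role": "user", "content": f"Answer with an Excel formula only. {prompt}"}]
        result = generator(chat, max_new_tokens=64)
        # In chat mode the pipeline returns the full conversation; the last message is the reply.
        print(f"[{name}] {prompt}\n  -> {result[0]['generated_text'][-1]['content']}\n")
```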

🧪 1. SmolLM-1.7B — The 6SigmaMind Baseline

⭐ Strengths

  • Fastest responses
  • Strong at SUMIFS / COUNTIFS
  • Reliable IF, AND, OR logic
  • Good at simple lookups
  • Concise formulas

⚠️ Weaknesses

  • Argument ordering mistakes
  • Rare statistical errors
  • Occasionally echoes the prompt

🎯 Best use

Lightweight, embedded Excel assistants.

🧪 2. Qwen2.5-1.5B-Instruct

⭐ Strengths

  • Most consistent reasoning of the small trio
  • Excellent at modern Excel functions (XLOOKUP, FILTER)
  • Structure-aware (very good argument ordering)
  • Handles unusual phrasings better than SmolLM

⚠️ Weaknesses

  • Sometimes too verbose
  • Occasionally over-explains unless constrained (a mitigation is sketched after the model profiles)

🎯 Best use

When accuracy and structure matter most.

🧪 3. DeepSeek-R1-Distill-Qwen-1.5B

This model is the newest in the group. It was distilled from DeepSeek-R1's reasoning outputs, meaning it tries to "think" through a problem more carefully than typical 1B–2B models.

⭐ Strengths

  • Very strong reasoning for its size
  • Good at multi-step formula logic
  • Often provides notably precise function choices
  • Handles text-processing functions well (LEFT, MID, SEARCH)

⚠️ Weaknesses

  • Sometimes produces “explanatory” text unless guided (see the sketch below)
  • Statistical formulas can be hit-or-miss
  • Sometimes generates slightly unusual but logically valid formulas

🎯 Best use

Exploring how reasoning-distilled small models behave on structured tasks.
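
Both Qwen and the R1 distill share the verbosity weakness noted above. A minimal mitigation, sketched here as a generic pattern rather than the Space's actual code: pin the model with a strict system prompt, then extract the first '='-prefixed line from the reply.

```python
# Sketch of a verbosity guard: constrain the chat, then extract the bare formula.
# A generic mitigation, not code from the 6SigmaMind Space.
SYSTEM = (
    "You are an Excel formula assistant. "
    "Reply with a single Excel formula starting with '=' and nothing else."
)

def formula_chat(user_prompt: str) -> list[dict]:
    """Wrap a user request in a chat that pins the model to formula-only output."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_prompt},
    ]

def extract_formula(reply: str) -> str | None:
    """Return the first '='-prefixed line from a possibly chatty model reply."""
    for line in reply.splitlines():
        line = line.strip().strip("`")
        if line.startswith("="):
            return line
    return None
```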

🧠 What the Side-by-Side Results Reveal

✔ Small models already understand Excel semantics

All three know:
SUMIFS, COUNTIFS, IF logic, XLOOKUP, STDEV.S, CORREL, FILTER.

✔ They each have unique reasoning tendencies

  • SmolLM → direct & fast
  • Qwen → structured & accurate
  • DeepSeek R1 → reflective & reasoning-driven

✔ 1.5–1.7B is a “sweet spot”

Large enough to handle structured tasks,
small enough to run anywhere.

✔ Fine-tuning will make the difference

With a specialized Excel dataset, these models will jump another level.
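
For illustration, a LoRA fine-tune of the SmolLM baseline might look like the sketch below, using TRL and PEFT. The dataset file name is a placeholder (6SigmaMind's training data isn't published here), and each record is assumed to hold a single "text" field pairing a request with its formula.

```python
# Hypothetical LoRA fine-tuning sketch with TRL + PEFT. The dataset path is a
# placeholder and the checkpoint ID is an assumption; neither comes from the post.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumed JSONL with one {"text": "<request> => <formula>"} record per line.
dataset = load_dataset("json", data_files="excel_formula_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM-1.7B-Instruct",  # assumed baseline checkpoint
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(output_dir="6sigmamind-smollm-lora"),
)
trainer.train()
```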

📊 Mini-Benchmark: Example Outputs

Prompt:

“Return the last non-empty value in column B.”

  • SmolLM-1.7B → =LOOKUP(2,1/(B:B<>""),B:B)
  • Qwen-1.5B → same formula, very consistent
  • DeepSeek R1 → same formula, sometimes adds brief reasoning

All three converge on the classic trick: 1/(B:B<>"") maps every non-empty cell to 1 and every empty cell to a #DIV/0! error, and because LOOKUP never finds 2 in an array of 1s, it falls back to the last non-error entry, which is the last non-empty cell.

Prompt:

“Calculate the standard deviation of B2:B80.”

  • SmolLM-1.7B → =STDEV.S(B2:B80)
  • Qwen-1.5B → same formula, the most reliable of the three
  • DeepSeek R1 → usually correct, sometimes adds extra commentary

Prompt:

“Perform a two-tailed t-test on C2:C50 vs D2:D50.”

  • SmolLM-1.7B → attempts a formula but is inconsistent
  • Qwen-1.5B → =T.TEST(C2:C50, D2:D50, 2, 2) (most accurate)
  • DeepSeek R1 → usually correct, but may attempt an alternate approach

In T.TEST, the third argument (2) selects a two-tailed test and the fourth (2) a two-sample, equal-variance test.
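
Scoring outputs like these fairly means treating "=t.test(C2:C50, D2:D50, 2, 2)" and "=T.TEST(C2:C50,D2:D50,2,2)" as the same answer. A small normalizer, shown here as a sketch of the scoring idea rather than the benchmark's actual grading code, handles case and whitespace while leaving quoted strings intact.

```python
# Crude formula normalizer for scoring: uppercase and strip whitespace outside
# quoted strings so cosmetic differences don't count as wrong answers.
import re

def normalize_formula(formula: str) -> str:
    # Capturing split keeps quoted segments (Excel doubles inner quotes) at odd indices.
    parts = re.split(r'("(?:[^"]|"")*")', formula.strip())
    return "".join(
        part if i % 2 else re.sub(r"\s+", "", part).upper()
        for i, part in enumerate(parts)
    )

assert normalize_formula('=t.test(C2:C50, D2:D50, 2, 2)') == "=T.TEST(C2:C50,D2:D50,2,2)"
assert normalize_formula('=SUMIFS(C:C, B:B, "Closed")') == '=SUMIFS(C:C,B:B,"Closed")'
```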

🚀 Why This Matters for 6SigmaMind

Because 6SigmaMind isn’t just a demo —
it’s the start of a small-model specialization journey.

This comparison shows:

  • Small models are already close to what’s needed
  • Fine-tuning will push them from “good” to “expert”
  • A benchmark helps identify which model becomes 6SigmaMindv3

This is the foundation for an Excel-focused mini-LLM.

🎯 Try the Live Benchmark Yourself

👉 Launch the model-comparison Space:
https://huggingface.co/spaces/benkemp/6SigmaMindV3

Try prompts like:

  • “Sum C where B = ‘Closed’.”
  • “Lookup D where A matches H2.”
  • “IF A2 > 100 then return ‘High’ else ‘OK’.”
  • “Correlation between A and B.”
  • “Two-sample t-test for C2:C50 vs D2:D50.”

You’ll immediately feel which model “thinks” the way you like.

Small models are the future — and 6SigmaMind is being built in the open.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
