Collecting and Cleaning Your Dataset: The Foundation of Any Small Language Model

(Article #2 in the Build Your Own Small Language Model series)

If you want to build your own Small Language Model (SLM), the single most important step is the dataset. The model is just math — your dataset is the intelligence. Even a perfectly engineered SLM becomes useless when trained on noisy, inconsistent, or irrelevant data. That is why professional AI teams invest up to 80% of their development time in dataset construction and cleaning.

In this article, we’ll walk through how to collect, structure, sanitize, and prepare a high-quality dataset for training a domain-specialized SLM, whether you’re building an Excel model, a Google Sheets model, a summarization agent, or any niche assistant.

1. Start With Your Task Definition (Narrow Beats Broad)

Before writing a single line of JSONL, you must answer one question:

“What exactly should my SLM be able to do — and what should it refuse to do?”

An SLM trained on too many task types becomes blurry and inconsistent. A tightly scoped SLM becomes shockingly good.

Examples:

Goal | Good Dataset | Bad Dataset
Excel formula generator | Human instructions → Excel formulas | Random QA, jokes, essays
Google Sheets assistant | Prompts → GS functions | Web scraping, code, math proofs
Python docstring explainer | Code → explanations | CSV tables, text dumps
Price tracking agent | URLs → extracted price fields | News articles, blog posts

Rule:
Your dataset should contain only examples of the behavior you want the model to produce.

2. Choose a Dataset Format That Works for SLMs

Most small models expect training samples in a single-string prompt → response format, packaged as JSONL.

A clean structured example:

{
  "input": "<INSTRUCTION>Sort this list by date.</INSTRUCTION>",
  "output": "=SORT(A2:B100,2,TRUE)"
}

Or, for multi-mode SLMs:

{
  "task": "fix_formula",
  "broken": "=SUM(A:A)",
  "output": "=SUM(A:A)"  
}

For most use cases, the recommended fields are:

  • task (optional)
  • input / instruction
  • output (the model’s expected answer)

Never include comments, metadata, or explanations unless you want the model to learn them.

Your dataset becomes more powerful if it uses consistent templating. For example:

<INSTRUCTION>…</INSTRUCTION>
<OUTPUT>…</OUTPUT>

This trains the model to always produce the right structure.
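
A minimal sketch of how you might stamp that template onto raw pairs before writing JSONL; the tag names and file path are just the ones used above, not a requirement:

import json

def to_sample(instruction, answer):
    # Wrap both sides in the same tags every time so the model
    # learns to emit the closing structure on its own.
    return {
        "input": f"<INSTRUCTION>{instruction}</INSTRUCTION>",
        "output": f"<OUTPUT>{answer}</OUTPUT>",
    }

with open("excel_slm_dataset.jsonl", "w") as f:
    pair = to_sample("Sort this list by date.", "=SORT(A2:B100,2,TRUE)")
    f.write(json.dumps(pair) + "\n")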

3. Collecting Data: Three Reliable Sources

A. Write Custom Synthetic Examples

For domain-specific SLMs, synthetic data is king.

Advantages:

  • Infinite volume
  • No copyright issues
  • No “human noise”
  • Perfect pattern diversity
  • Works extremely well for structured domains (Excel, SQL, regex, code)

Even OpenAI, Meta, and Microsoft use synthetic generators.
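
As a toy illustration of what a synthetic generator looks like for the Excel case (the templates, column letters, and region names below are invented for the example):

import json, random

COLUMNS = ["A", "B", "C", "D"]
REGIONS = ["North", "South", "East", "West"]

def make_sumif_pair():
    # Vary columns and criteria so the model learns the pattern,
    # not one literal formula.
    crit_col, sum_col = random.sample(COLUMNS, 2)
    region = random.choice(REGIONS)
    instruction = f"Sum column {sum_col} where column {crit_col} equals {region}."
    formula = f'=SUMIF({crit_col}2:{crit_col}100,"{region}",{sum_col}2:{sum_col}100)'
    return {"input": instruction, "output": formula}

with open("synthetic_sumif.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_sumif_pair()) + "\n")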

B. Convert Existing Knowledge Bases

You can safely transform:

  • Internal documentation
  • Public reference tables
  • Function lists
  • Tool instructions
  • Data schemas

into structured input/output pairs.

Example: take any Excel function documentation and convert it into:

  • Instruction → Formula
  • Formula → Explanation
  • Broken → Fixed
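
A sketch of that conversion, assuming the documentation is already available as simple (name, description, example) rows; the rows below are placeholders:

import json

# Hypothetical documentation rows: (function name, description, example formula)
DOCS = [
    ("SUMIF", "Adds the cells that meet a given criteria.",
     '=SUMIF(A2:A100,"North",C2:C100)'),
]

pairs = []
for name, description, example in DOCS:
    # One documentation row fans out into several training pairs.
    pairs.append({"input": f"Write an Excel formula using {name}: {description}",
                  "output": example})
    pairs.append({"input": f"Explain this formula: {example}",
                  "output": f"{name}: {description}"})

with open("docs_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")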

C. Collect Task-Specific User Queries

If you are training an assistant for an existing SaaS product, gather:

  • customer questions
  • task logs
  • real input patterns
  • example workflows

This gives your SLM “real-world context”.

4. Cleaning Your Dataset: The Non-Negotiable Step

A dataset is only as good as its cleanliness.

A. Remove duplicates

Two identical samples cause:

  • overfitting
  • reduced generalization
  • training instability

Always dedupe, even synthetic sets.
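
Exact-duplicate removal takes only a few lines; a sketch assuming the input/output JSONL layout from section 2:

import json

seen = set()
kept = []
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        # Key on the stripped pair, not the raw line, so whitespace-only
        # differences still count as duplicates.
        key = (obj["input"].strip(), obj["output"].strip())
        if key not in seen:
            seen.add(key)
            kept.append(obj)

with open("excel_slm_dataset.dedup.jsonl", "w") as f:
    for obj in kept:
        f.write(json.dumps(obj) + "\n")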

B. Remove conflicting answers

If the same input has multiple different outputs, your model becomes confused and unstable:

"Round to two decimals" → ROUND(A2,2)
"Round to two decimals" → FIXED(A2,2)

Pick one style and enforce it everywhere.
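
Conflicts are easy to surface mechanically before deciding which style wins; a sketch under the same field assumptions:

import json
from collections import defaultdict

outputs_by_input = defaultdict(set)
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        outputs_by_input[obj["input"].strip()].add(obj["output"].strip())

# Any input mapped to more than one distinct output needs a manual decision.
for text, outputs in outputs_by_input.items():
    if len(outputs) > 1:
        print(text, "->", sorted(outputs))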

C. Normalize formatting

Clean:

  • spacing
  • casing
  • punctuation
  • leading/trailing whitespace
  • token structure

For Excel:

❌ =sumif ( A2 : A100 , "North" , C2:C100 )
✔ =SUMIF(A2:A100,"North",C2:C100)

For NLP:

❌ inconsistent casing
✔ consistent templates
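
A sketch of a formula normalizer; real formulas need a proper parser (this naive version would also delete spaces inside quoted strings), so treat it as a starting point:

import re

def normalize_formula(formula: str) -> str:
    # Remove all whitespace, then uppercase any function name that
    # directly precedes an opening parenthesis.
    formula = re.sub(r"\s+", "", formula.strip())
    return re.sub(r"[A-Za-z][A-Za-z0-9.]*(?=\()",
                  lambda m: m.group(0).upper(), formula)

print(normalize_formula('=sumif ( A2 : A100 , "North" , C2:C100 )'))
# -> =SUMIF(A2:A100,"North",C2:C100)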

D. Ensure inputs and outputs match your model’s required shape

Every model type has rules.

Causal LM?
→ Put everything into a single string the model learns left-to-right.

Seq2Seq?
→ Keep input and output fields separate.

LoRA?
→ Keep training samples short (≤512 tokens recommended for 350M models).
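
One practical consequence: filter over-long samples out before training. A sketch assuming a Hugging Face tokenizer; swap "gpt2" for whatever base checkpoint you actually use:

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
MAX_TOKENS = 512

kept, dropped = [], 0
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        n_tokens = len(tokenizer(obj["input"] + obj["output"])["input_ids"])
        if n_tokens <= MAX_TOKENS:
            kept.append(obj)
        else:
            dropped += 1

print(f"kept {len(kept)} samples, dropped {dropped} over-length samples")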

5. Building a “Gold Standard” Dataset Slice (Your Evaluation Set)

Never train on 100% of your dataset.

Create a 1–2% gold set containing:

  • perfectly written examples
  • rare edge cases
  • deliberate traps
  • cases requiring reasoning

This becomes your “exam” after each training stage.

Example (Excel):

  • nested functions
  • multi-criteria filters
  • date math
  • text parsing
  • SCAN+REDUCE cases
  • array LAMBDA logic

Your SLM should never train on these — only be evaluated on them.
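
Whatever ends up in the gold slice, keep it in a separate file from day one; a minimal sketch of a reproducible hold-out split (hand-picked traps and edge cases can then be appended to the gold file):

import json, random

with open("excel_slm_dataset.jsonl") as f:
    samples = [json.loads(line) for line in f]

random.seed(42)                                  # fixed seed: reproducible split
random.shuffle(samples)

gold_size = max(1, int(0.02 * len(samples)))     # roughly 2% held out as the "exam"
gold, train = samples[:gold_size], samples[gold_size:]

for path, subset in [("gold_eval.jsonl", gold), ("train.jsonl", train)]:
    with open(path, "w") as f:
        for obj in subset:
            f.write(json.dumps(obj) + "\n")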

6. Final Step: Export to JSONL and Validate

Every line must be a valid JSON dictionary.

Quick check:

wc -l excel_slm_dataset.jsonl

Then validate using:

jq . excel_slm_dataset.jsonl >/dev/null
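
jq only proves that every line parses; a short sketch that also checks the fields you train on are present (adjust REQUIRED to your own schema):

import json

REQUIRED = {"input", "output"}

with open("excel_slm_dataset.jsonl") as f:
    for i, line in enumerate(f, 1):
        obj = json.loads(line)              # raises on malformed JSON
        missing = REQUIRED - obj.keys()
        if missing:
            print(f"line {i}: missing fields {sorted(missing)}")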

Also check average length:

python - <<'EOF'
import json
import statistics

# Character length of input + output per sample (a rough proxy for token count).
lengths = []
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        lengths.append(len(obj["input"]) + len(obj["output"]))

print("Avg chars:", statistics.mean(lengths))
EOF

Keep the average sample length under 200–300 tokens (very roughly 800–1,200 characters) for small models.

Conclusion

Collecting and cleaning your dataset is not glamorous — but it is responsible for 80% of your model’s quality. With clear task definitions, consistent formatting, synthetic data generation, and strict cleaning rules, you create a foundation strong enough to outperform much larger models in your domain.

A small model trained on a great dataset will always beat a large model trained on a messy one.

Next article in the series: “Tokenization — How SLMs Understand Text”.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
