Collecting and Cleaning Your Dataset: The Foundation of Any Small Language Model

(Article #2 in the Build Your Own Small Language Model series)

If you want to build your own Small Language Model (SLM), the single most important step is the dataset. The model is just math — your dataset is the intelligence. Even a perfectly engineered SLM becomes useless when trained on noisy, inconsistent, or irrelevant data. That is why professional AI teams invest up to 80% of their development time in dataset construction and cleaning.

In this article, we’ll walk through how to collect, structure, sanitize, and prepare a high-quality dataset for training a domain-specialized SLM, whether you’re building an Excel model, a Google Sheets model, a summarization agent, or any niche assistant.

1. Start With Your Task Definition (Narrow Beats Broad)

Before writing a single line of JSONL, you must answer one question:

“What exactly should my SLM be able to do — and what should it refuse to do?”

An SLM trained on too many task types becomes blurry and inconsistent. A tightly scoped SLM becomes shockingly good.

Examples:

Goal | Good Dataset | Bad Dataset
Excel formula generator | Human instructions → Excel formulas | Random QA, jokes, essays
Google Sheets assistant | Prompts → GS functions | Web scraping, code, math proofs
Python docstring explainer | Code → explanations | CSV tables, text dumps
Price tracking agent | URLs → extracted price fields | News articles, blog posts

Rule:
Your dataset should contain only examples of the behavior you want the model to produce.

2. Choose a Dataset Format That Works for SLMs

Most small models expect training samples in a single-string prompt → response format, packaged as JSONL.

A clean structured example:

{
  "input": "<INSTRUCTION>Sort this list by date.</INSTRUCTION>",
  "output": "=SORT(A2:B100,2,TRUE)"
}

Or, for multi-mode SLMs:

{
  "task": "fix_formula",
  "broken": "=SUM(A:A)",
  "output": "=SUM(A:A)"  
}

For most use cases, the recommended fields are:

  • task (optional)
  • input / instruction
  • output (the model’s expected answer)

Never include comments, metadata, or explanations unless you want the model to learn them.

Your dataset becomes more powerful if it uses consistent templating. For example:

<INSTRUCTION>…</INSTRUCTION>
<OUTPUT>…</OUTPUT>

This trains the model to always produce the right structure.
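
A minimal sketch of how you might stamp that template onto raw pairs before writing JSONL; the tag names and file path are just the ones used above, not a requirement:

import json

def to_sample(instruction, answer):
    # Wrap both sides in the same tags every time so the model
    # learns to emit the closing structure on its own.
    return {
        "input": f"<INSTRUCTION>{instruction}</INSTRUCTION>",
        "output": f"<OUTPUT>{answer}</OUTPUT>",
    }

with open("excel_slm_dataset.jsonl", "w") as f:
    pair = to_sample("Sort this list by date.", "=SORT(A2:B100,2,TRUE)")
    f.write(json.dumps(pair) + "\n")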

3. Collecting Data: Three Reliable Sources

A. Write Custom Synthetic Examples

For domain-specific SLMs, synthetic data is king.

Advantages:

  • Infinite volume
  • No copyright issues
  • No “human noise”
  • Perfect pattern diversity
  • Works extremely well for structured domains (Excel, SQL, regex, code)

Even OpenAI, Meta, and Microsoft use synthetic generators.
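
As a toy illustration of what a synthetic generator looks like for the Excel case (the templates, column letters, and region names below are invented for the example):

import json, random

COLUMNS = ["A", "B", "C", "D"]
REGIONS = ["North", "South", "East", "West"]

def make_sumif_pair():
    # Vary columns and criteria so the model learns the pattern,
    # not one literal formula.
    crit_col, sum_col = random.sample(COLUMNS, 2)
    region = random.choice(REGIONS)
    instruction = f"Sum column {sum_col} where column {crit_col} equals {region}."
    formula = f'=SUMIF({crit_col}2:{crit_col}100,"{region}",{sum_col}2:{sum_col}100)'
    return {"input": instruction, "output": formula}

with open("synthetic_sumif.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_sumif_pair()) + "\n")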

B. Convert Existing Knowledge Bases

You can safely transform:

  • Internal documentation
  • Public reference tables
  • Function lists
  • Tool instructions
  • Data schemas

into structured input/output pairs.

Example: take any Excel function documentation and convert it into:

  • Instruction → Formula
  • Formula → Explanation
  • Broken → Fixed
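
A sketch of that conversion, assuming the documentation is already available as simple (name, description, example) rows; the rows below are placeholders:

import json

# Hypothetical documentation rows: (function name, description, example formula)
DOCS = [
    ("SUMIF", "Adds the cells that meet a given criteria.",
     '=SUMIF(A2:A100,"North",C2:C100)'),
]

pairs = []
for name, description, example in DOCS:
    # One documentation row fans out into several training pairs.
    pairs.append({"input": f"Write an Excel formula using {name}: {description}",
                  "output": example})
    pairs.append({"input": f"Explain this formula: {example}",
                  "output": f"{name}: {description}"})

with open("docs_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")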

C. Collect Task-Specific User Queries

If you are training an assistant for an existing SaaS product, gather:

  • customer questions
  • task logs
  • real input patterns
  • example workflows

This gives your SLM “real-world context”.

4. Cleaning Your Dataset: The Non-Negotiable Step

A dataset is only as good as its cleanliness.

A. Remove duplicates

Two identical samples cause:

  • overfitting
  • reduced generalization
  • training instability

Always dedupe, even synthetic sets.
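
Exact-duplicate removal takes only a few lines; a sketch assuming the input/output JSONL layout from section 2:

import json

seen = set()
kept = []
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        # Key on the stripped pair, not the raw line, so whitespace-only
        # differences still count as duplicates.
        key = (obj["input"].strip(), obj["output"].strip())
        if key not in seen:
            seen.add(key)
            kept.append(obj)

with open("excel_slm_dataset.dedup.jsonl", "w") as f:
    for obj in kept:
        f.write(json.dumps(obj) + "\n")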

B. Remove conflicting answers

If the same input has multiple different outputs, your model becomes confused and unstable:

"Round to two decimals" → ROUND(A2,2)
"Round to two decimals" → FIXED(A2,2)

Pick one style and enforce it everywhere.
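
Conflicts are easy to surface mechanically before deciding which style wins; a sketch under the same field assumptions:

import json
from collections import defaultdict

outputs_by_input = defaultdict(set)
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        outputs_by_input[obj["input"].strip()].add(obj["output"].strip())

# Any input mapped to more than one distinct output needs a manual decision.
for text, outputs in outputs_by_input.items():
    if len(outputs) > 1:
        print(text, "->", sorted(outputs))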

C. Normalize formatting

Clean:

  • spacing
  • casing
  • punctuation
  • leading/trailing whitespace
  • token structure

For Excel:

❌ =sumif ( A2 : A100 , "North" , C2:C100 )
✔ =SUMIF(A2:A100,"North",C2:C100)

For NLP:

❌ inconsistent casing
✔ consistent templates
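
A sketch of a formula normalizer; real formulas need a proper parser (this naive version would also delete spaces inside quoted strings), so treat it as a starting point:

import re

def normalize_formula(formula: str) -> str:
    # Remove all whitespace, then uppercase any function name that
    # directly precedes an opening parenthesis.
    formula = re.sub(r"\s+", "", formula.strip())
    return re.sub(r"[A-Za-z][A-Za-z0-9.]*(?=\()",
                  lambda m: m.group(0).upper(), formula)

print(normalize_formula('=sumif ( A2 : A100 , "North" , C2:C100 )'))
# -> =SUMIF(A2:A100,"North",C2:C100)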

D. Ensure inputs and outputs match your model’s required shape

Every model type has rules.

Causal LM?
→ Put everything into a single string the model learns left-to-right.

Seq2Seq?
→ Keep input and output fields separate.

LoRA?
→ Keep training samples short (≤512 tokens recommended for 350M models).
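
One practical consequence: filter over-long samples out before training. A sketch assuming a Hugging Face tokenizer; swap "gpt2" for whatever base checkpoint you actually use:

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
MAX_TOKENS = 512

kept, dropped = [], 0
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        n_tokens = len(tokenizer(obj["input"] + obj["output"])["input_ids"])
        if n_tokens <= MAX_TOKENS:
            kept.append(obj)
        else:
            dropped += 1

print(f"kept {len(kept)} samples, dropped {dropped} over-length samples")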

5. Building a “Gold Standard” Dataset Slice (Your Evaluation Set)

Never train on 100% of your dataset.

Create a 1–2% gold set containing:

  • perfectly written examples
  • rare edge cases
  • deliberate traps
  • cases requiring reasoning

This becomes your “exam” after each training stage.

Example (Excel):

  • nested functions
  • multi-criteria filters
  • date math
  • text parsing
  • SCAN+REDUCE cases
  • array LAMBDA logic

Your SLM should never train on these — only be evaluated on them.
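
Whatever ends up in the gold slice, keep it in a separate file from day one; a minimal sketch of a reproducible hold-out split (hand-picked traps and edge cases can then be appended to the gold file):

import json, random

with open("excel_slm_dataset.jsonl") as f:
    samples = [json.loads(line) for line in f]

random.seed(42)                                  # fixed seed: reproducible split
random.shuffle(samples)

gold_size = max(1, int(0.02 * len(samples)))     # roughly 2% held out as the "exam"
gold, train = samples[:gold_size], samples[gold_size:]

for path, subset in [("gold_eval.jsonl", gold), ("train.jsonl", train)]:
    with open(path, "w") as f:
        for obj in subset:
            f.write(json.dumps(obj) + "\n")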

6. Final Step: Export to JSONL and Validate

Every line must be a valid JSON dictionary.

Quick check:

wc -l excel_slm_dataset.jsonl

Then validate using:

jq . excel_slm_dataset.jsonl >/dev/null
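
jq only proves that every line parses; a short sketch that also checks the fields you train on are present (adjust REQUIRED to your own schema):

import json

REQUIRED = {"input", "output"}

with open("excel_slm_dataset.jsonl") as f:
    for i, line in enumerate(f, 1):
        obj = json.loads(line)              # raises on malformed JSON
        missing = REQUIRED - obj.keys()
        if missing:
            print(f"line {i}: missing fields {sorted(missing)}")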

Also check average length:

python - <<'EOF'
import json
import statistics

# Character length of input + output per sample (a rough proxy for token count).
lengths = []
with open("excel_slm_dataset.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        lengths.append(len(obj["input"]) + len(obj["output"]))

print("Avg chars:", statistics.mean(lengths))
EOF

Keep the average sample length under 200–300 tokens (very roughly 800–1,200 characters) for small models.

Conclusion

Collecting and cleaning your dataset is not glamorous — but it is responsible for 80% of your model’s quality. With clear task definitions, consistent formatting, synthetic data generation, and strict cleaning rules, you create a foundation strong enough to outperform much larger models in your domain.

A small model trained on a great dataset will always beat a large model trained on a messy one.

Next article in the series: “Tokenization — How SLMs Understand Text”.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
