Collecting and Cleaning Your Dataset

How to build the raw text corpus that will teach your Small Language Model to “think.”

Here's the second part in the Building a Small Language Model from Scratch in Python series. This tutorial walks you through the steps from raw text to a clean, ready-to-train dataset, a foundational step before building your tokenizer and Transformer model.

🚀 Introduction — Garbage In, Garbage Out

Every language model, big or small, learns only as well as the data it’s trained on. Before we can teach our model to predict text, we need to give it clean, consistent, and meaningful text data.

Training data is the DNA of your model — if it’s messy, your model’s mind will be too.

In this article, you’ll learn how to:

  1. Collect open text datasets
  2. Clean and normalize the content
  3. Chunk it into sequences for model input

All using pure Python, plus the Hugging Face datasets library to download the corpus.

📚 Step 1: Choosing the Right Dataset

For Small Language Models, you don’t need terabytes of data — a few hundred megabytes is plenty.

Recommended sources:

Dataset              Description                Size
TinyStories          Child-friendly sentences   80 MB
Wikitext-2           Cleaned Wikipedia text     20 MB
Project Gutenberg    Classic books              variable
OpenWebText (small)  Reddit-linked content      300 MB

Let’s use Wikitext-2 for our first experiment.

pip install datasets


Then:

from datasets import load_dataset

# download Wikitext-2 (raw, untokenized version) from the Hugging Face Hub
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# join every line of the training split into one big string
text = "\n".join(dataset["train"]["text"])
print(len(text), "characters loaded.")
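
If you'd rather experiment with one of the other sources from the table, the same pattern usually works. For example, TinyStories is hosted on the Hugging Face Hub; here's a minimal sketch, assuming the roneneldan/TinyStories dataset ID and a "text" column like Wikitext's (both worth double-checking on the Hub):

dataset = load_dataset("roneneldan/TinyStories")   # assumed dataset ID, verify on the Hub
text = "\n".join(dataset["train"]["text"])          # assumes a "text" column, as with Wikitext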


🧼 Step 2: Cleaning the Raw Text

Raw text contains markup, newlines, and unwanted symbols.
We’ll normalize it using simple regex and string operations.

import re

def clean_text(text):
    text = text.lower()                                # normalize casing
    text = re.sub(r'\[.*?\]', '', text)                # remove bracketed markup such as [1]
    text = re.sub(r'[^a-z0-9\s.,!?;:\'-]', '', text)   # keep letters, digits, whitespace, basic punctuation
    text = re.sub(r'\s+', ' ', text).strip()           # collapse whitespace runs into single spaces
    return text

Now apply it:

cleaned_text = clean_text(text)
print(cleaned_text[:500])

✅ Converts to lowercase
✅ Removes stray characters
✅ Preserves punctuation and sentence flow
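
As a quick sanity check, you can run clean_text on a small made-up string (the sample below is purely illustrative):

raw_sample = "  The Quick Brown FOX [1] jumped over (the) lazy dog!  "
print(clean_text(raw_sample))
# expected output: the quick brown fox jumped over the lazy dog!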


✂️ Step 3: Splitting Text into Manageable Chunks

Transformers have fixed input lengths, often 128–512 tokens.
Since we haven't built a tokenizer yet, let's pre-chunk our text into fixed-size character windows:

max_length = 2000  # characters per chunk (tokenization into actual tokens comes later)
chunks = [cleaned_text[i:i+max_length] for i in range(0, len(cleaned_text), max_length)]
print(f"{len(chunks)} text chunks created.")


These chunks can later be tokenized into training sequences.
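
If you're curious what that will look like, here's a rough preview that uses naive whitespace splitting as a stand-in for the real tokenizer we'll build in the next article (purely illustrative):

seq_len = 128                               # a typical Transformer context length
tokens = chunks[0].split()                  # placeholder "tokenizer": split on whitespace
sequences = [tokens[i:i+seq_len] for i in range(0, len(tokens), seq_len)]
print(len(sequences), "sequences of up to", seq_len, "tokens from the first chunk")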

⚙️ Step 4: Removing Low-Quality Data

Bad data leads to noisy predictions.
Here’s a quick filter to skip incomplete or low-information text:

def is_valid(chunk):
    # keep chunks with more than 20 words and at least one full stop
    return len(chunk.split()) > 20 and "." in chunk

chunks = [c for c in chunks if is_valid(c)]
print(f"After filtering: {len(chunks)} chunks remain.")


Optionally, you can (see the sketch after this list):

  • Remove boilerplate (e.g., “Chapter 1”, “Advertisement”)
  • Drop duplicated lines
  • Keep only meaningful text
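
Here's a minimal sketch of the first two of these, dropping chunks that contain boilerplate markers and removing exact duplicates (the marker list is just an example to adapt to your corpus):

boilerplate_markers = ("chapter ", "advertisement", "table of contents")   # example markers only

def drop_boilerplate_and_duplicates(chunks):
    seen = set()
    kept = []
    for c in chunks:
        if any(marker in c for marker in boilerplate_markers):
            continue                    # skip chunks containing boilerplate markers
        if c in seen:
            continue                    # skip exact duplicates
        seen.add(c)
        kept.append(c)
    return kept

chunks = drop_boilerplate_and_duplicates(chunks)
print(f"After extra filtering: {len(chunks)} chunks remain.")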

🧩 Step 5: Optional — Add Your Own Text

You can easily augment your corpus with local text:

with open("my_articles.txt", encoding="utf-8") as f:
    extra = clean_text(f.read())

chunks += [extra[i:i+max_length] for i in range(0, len(extra), max_length)]

This is perfect for domain-specific models — for example, fine-tuning your SLM on scientific writing or programming tutorials.

(I will skip this part in Visual Studio Code to keep the code as simple as possible.)
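
If your domain text lives in a folder of files rather than a single one, a small variation handles that (the domain_texts/ folder name is just an example):

from pathlib import Path

# read, clean, and chunk every .txt file in a hypothetical domain_texts/ folder
for path in Path("domain_texts").glob("*.txt"):
    extra = clean_text(path.read_text(encoding="utf-8"))
    chunks += [extra[i:i+max_length] for i in range(0, len(extra), max_length)]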

📦 Step 6: Save Your Preprocessed Corpus

Save the cleaned, chunked dataset for reuse:

import json

with open("cleaned_dataset.json", "w") as f:
    json.dump(chunks, f)
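
To make sure the file round-trips cleanly, you can load it straight back and compare counts:

with open("cleaned_dataset.json") as f:
    reloaded = json.load(f)
print(len(reloaded) == len(chunks), "reload matches the in-memory chunk count")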

This will be your foundation for tokenization in the next step.

🔍 Step 7: Previewing the Data

Visualize random samples to confirm consistency:

import random
for i in range(3):
    print("--- SAMPLE ---")
    print(random.choice(chunks)[:300], "...")

If the samples read smoothly and cleanly, you’re ready to tokenize.


🔋 Step 8: The Takeaway

Good datasets don’t have to be big — they just need to be clean, representative, and consistent.
By curating your text properly, you give your small model a solid “mental world” to learn from.

Data preparation isn’t glamorous — but it’s where intelligence begins.

Follow NanoLanguageModels.com for the next article: “Building a Simple Tokenizer from Scratch” — where raw text turns into numbers your SLM can understand. ⚙️
