(Article #3 in the Build Your Own Small Language Model series)
Small Language Models don’t read text the way humans do. They don’t see words, sentences, or paragraphs. What they actually see is tokens — small, numerical building blocks that transform language into something mathematics can operate on.
If you want to train your own SLM, fine-tune a model, or design synthetic datasets, tokenization is the foundation you must understand. It affects training cost, model quality, inference speed, dataset size, and even whether your model produces valid outputs.
This article gives you a clean, practical guide to tokenization: what it is, why it matters, and how to choose the right tokenizer for your SLM.
1. What Is a Token?
A token is a small unit of text — not always a word, not always a character, but something in between.
Examples:
| Text | Tokenized Into |
|---|---|
| `"Hello"` | `["Hello"]` or `["Hel", "lo"]` |
| `"North-West 2024"` | `["North", "-", "West", "Ġ2024"]` |
| `=SUMIF(A:A,"North",C:C)` | many small tokens |
Different models break text differently depending on their tokenizer.
A token is also a number. Tokenizers convert:
text → tokens → token IDs → tensors
This is how your SLM “understands” text.
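The pipeline can be pictured with a toy vocabulary. Everything below — the sub-word split, the vocabulary, the IDs — is invented for illustration; real tokenizers learn their sub-word vocabulary from data, and the final IDs are loaded into framework tensors rather than plain lists:

```python
# Toy illustration of text → tokens → token IDs.
# The vocabulary and splits are made up for demonstration.
vocab = {"Hel": 17, "lo": 42, "Ġworld": 99}  # hypothetical sub-word → ID map

def encode(tokens):
    """Map sub-word tokens to integer IDs via the vocabulary."""
    return [vocab[t] for t in tokens]

tokens = ["Hel", "lo", "Ġworld"]   # "Hello world" split into sub-words
ids = encode(tokens)
print(ids)  # these integers are what the model's tensors actually hold
```

The model never sees "Hello world" — only the integer sequence.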
2. Why Tokenization Matters for SLMs
Tokenization affects model performance more than most people realize.
A. Training Cost
- Fewer tokens = cheaper training
- More tokens = more compute, and attention cost grows quadratically with sequence length
If your dataset contains 80,000 samples and each sample is ~120 tokens, that’s:
80,000 × 120 = 9.6M tokens per epoch
A small model like Granite-350M handles this efficiently.
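This kind of budgeting is worth wrapping in a tiny helper so you can compare dataset sizes before committing to a run (a sketch; `tokens_per_epoch` is a hypothetical name, not a library function):

```python
def tokens_per_epoch(num_samples: int, avg_tokens_per_sample: int) -> int:
    """Rough training budget: total tokens seen in one pass over the data."""
    return num_samples * avg_tokens_per_sample

total = tokens_per_epoch(80_000, 120)
print(f"{total:,} tokens per epoch")      # 9,600,000
print(f"{total * 3:,} tokens over 3 epochs")
```

Multiplying by the planned number of epochs gives the total token throughput your hardware has to deliver.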
B. How “Long” Your Prompts Can Be
Your tokenizer decides the model’s maximum context length in tokens.
For a 350M model trained with 512-token sequences:
- If your prompt is 700 tokens → it will be truncated
- If your dataset has longer sequences → you need to segment or shorten them
Your training script must match the tokenizer’s expectations.
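One common way to handle over-long samples is to segment them into overlapping windows. This is a minimal sketch; the 512-token limit matches the example above, while the 32-token overlap is an arbitrary choice to preserve some context across chunk boundaries:

```python
def segment(token_ids, max_len=512, overlap=32):
    """Split a long token-ID sequence into windows of at most max_len tokens,
    with `overlap` tokens repeated between consecutive windows."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - overlap
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - overlap, step)]

ids = list(range(700))           # stands in for a 700-token sample
chunks = segment(ids)
print([len(c) for c in chunks])  # [512, 220]
```

Short samples pass through unchanged; only sequences beyond the limit get split.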
C. Structured Outputs (Excel, JSON, Code)
Excel formulas (and code in general) tokenize into many tiny pieces, for example:
=SUMIF(A:A,"North",C:C)
might become:
["=", "SUM", "IF", "(", "A", ":", "A", ",", "\"", "North", "\"", ...]
This is why:
- Clean formatting matters
- Consistent templates help
- Noise is expensive (wasteful tokens)
Tokenization is where data cleanliness REALLY pays off.
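You can simulate the cost of noisy formatting even without a real tokenizer. The regex splitter below is a crude stand-in for BPE (an assumption for illustration only — real tokenizers split differently), but it shows the same pattern: every stray space costs a token:

```python
import re

def toy_tokenize(text: str):
    """Crude stand-in for a tokenizer: words, then single symbols, then spaces.
    Real BPE differs, but the effect is the same: extra characters cost tokens."""
    return re.findall(r"\w+|\S|\s", text)

clean = '=SUMIF(A:A,"North",C:C)'
noisy = '= SUMIF ( A : A , "North" , C : C )'
print(len(toy_tokenize(clean)), len(toy_tokenize(noisy)))  # 15 27
```

The noisy variant encodes the same formula yet nearly doubles the token count — at training scale, that waste multiplies across every sample.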
3. Types of Tokenizers Used in Modern SLMs
There are 3 main tokenizer families:
A. Byte-Level BPE (used by GPT, many open models)
- Splits text into frequent sub-words
- Works well across languages
- Good for code, formulas, punctuation
- Granite uses a BPE-style tokenizer
Advantages:
- Compact vocabulary
- Robust to unseen words
- Good for structured data (Excel, JSON)
B. WordPiece (BERT-era encoder models)
- Used in older encoders
- Less common now
- Not ideal for SLMs or generative models
C. SentencePiece (Unigram or BPE; LLaMA, Mistral, and related models)
- Excellent for multilingual
- Good at compressing long documents
- Very stable for training small models
For your Excel SLM, byte-level BPE is ideal because:
- Every operator (`=`, `+`, `"`, `:`) becomes its own token
- Numbers split reliably
- Upper/lowercase is preserved
- Functions (`SUMIF`, `FILTER`, `COUNTIFS`) appear as stable sub-tokens
This reduces token count and improves model accuracy.
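Under the hood, BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the training corpus. Here is a simplified sketch of one training step (the toy corpus and frequencies are invented for illustration):

```python
from collections import Counter

def most_frequent_pair(words):
    """One BPE training step: find the most frequent adjacent symbol pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: "IF" appears often, so its characters get merged early.
words = {tuple("SUMIF"): 10, tuple("SUM"): 5, tuple("IF"): 8}
pair = most_frequent_pair(words)
print(pair)                # ('I', 'F')
print(merge(words, pair))  # vocabulary after one merge
```

Repeat this loop a few thousand times and frequent fragments like `SUM` and `IF` become single, stable tokens — which is exactly why common Excel functions tokenize so compactly.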
4. Inspecting Tokenization in Practice (Python Example)
Here’s a quick demonstration using any Hugging Face tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-350m-base")

text = '=FILTER(A:Z,(F:F>100)*(G:G="Yes"))'
tokens = tokenizer.tokenize(text)

print(tokens)
print(len(tokens))
```
You’ll notice:
- Every operator becomes a separate token
- `"FILTER"` may become a single subword
- Spaces and quotes matter
- Tiny differences in formatting change token count
This is why dataset consistency is essential.
5. Best Practices for Tokenization When Training SLMs
✔ Use one tokenizer for the entire project
Never switch tokenizers halfway through training — it corrupts the model’s embeddings.
✔ Clean and normalize your data BEFORE tokenization
Tokenizers treat whitespace, tabs, commas, and case as meaningful differences.
Standardize everything:
- spacing
- quoting
- capitalization
- formatting of formulas or structured outputs
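A normalization pass might look like the following (a hypothetical sketch — the exact rules depend on your data; `normalize` is not a library function):

```python
import re

def normalize(sample: str) -> str:
    """Hypothetical pre-tokenization cleanup: collapse whitespace,
    standardize quote characters, and tighten formula formatting."""
    s = sample.strip()
    s = re.sub(r"\s+", " ", s)                  # collapse tabs/newlines/runs of spaces
    s = s.replace("\u201c", '"').replace("\u201d", '"')  # curly → straight quotes
    s = re.sub(r"=\s+", "=", s)                 # no space after the leading '='
    return s

raw = '=  SUMIF(A:A, \u201cNorth\u201d, C:C)\t'
print(normalize(raw))  # =SUMIF(A:A, "North", C:C)
```

Running every sample through the same pass before tokenization guarantees that identical formulas always produce identical token sequences.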
✔ Keep sequences short (under 300 tokens)
For small models, shorter inputs give:
- better stability
- better generalization
- faster training
- lower VRAM requirements
✔ Avoid rare symbol combinations unless absolutely needed
Rare Unicode symbols fall back to byte-level pieces, inflating token counts and diluting the patterns the model learns.
✔ Measure tokens, not characters
Use `len(tokenizer(text)["input_ids"])` for the real measurement.
6. Tokenization Strategy for Your Excel SLM
Your dataset uses a template token wrapper like:
<INSTRUCTION> ... </INSTRUCTION>
<OUTPUT> ... </OUTPUT>
This is perfect because:
- These become stable, reusable tokens
- The model easily learns structured delimiters
- They compress repeated patterns
- They reduce confusion during inference
- They reduce token count significantly
A small model learns faster if delimiters are consistent.
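A small helper keeps the wrapper identical across every sample (a sketch; `build_sample` is a hypothetical name, and the exact spacing around the delimiters is a design choice you should fix once and never vary):

```python
def build_sample(instruction: str, output: str) -> str:
    """Wrap one training pair in the delimiter template.
    Identical delimiters tokenize into the same stable IDs every time."""
    return (f"<INSTRUCTION> {instruction.strip()} </INSTRUCTION>\n"
            f"<OUTPUT> {output.strip()} </OUTPUT>")

sample = build_sample(
    "Sum column C where column A equals North",
    '=SUMIF(A:A,"North",C:C)',
)
print(sample)
```

Because the wrapper is generated rather than hand-typed, no sample can drift into a slightly different delimiter spelling that the model would treat as a new pattern.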
Conclusion
Tokenization is the hidden backbone of every SLM. It determines how text becomes numbers, how long prompts can be, how clean your dataset must be, and how much training will cost. Once you understand tokenization, you can control the model’s behavior at a fundamental level — and design datasets that small models can learn from efficiently.
Clean tokens lead to clean outputs. Understanding tokens makes you a better SLM engineer.
Read the next article in the series: “Sequence Length & Context Windows — How Much Can an SLM Remember?”