(Article #3 in the Build Your Own Small Language Model series)
Small Language Models don’t read text the way humans do. They don’t see words, sentences, or paragraphs. What they actually see is tokens — small, numerical building blocks that transform language into something mathematics can operate on.
If you want to train your own SLM, fine-tune a model, or design synthetic datasets, tokenization is the foundation you must understand. It affects training cost, model quality, inference speed, dataset size, and even whether your model produces valid outputs.
This article gives you a clean, practical guide to tokenization: what it is, why it matters, and how to choose the right tokenizer for your SLM.
1. What Is a Token?
A token is a small unit of text — not always a word, not always a character, but something in between.
Examples:
| Text | Tokenized Into |
|---|---|
| `"Hello"` | `["Hello"]` or `["Hel", "lo"]` |
| `"North-West 2024"` | `["North", "-", "West", "Ġ2024"]` |
| `=SUMIF(A:A,"North",C:C)` | many small tokens |
Different models break text differently depending on their tokenizer.
A token is also a number. Tokenizers convert:
text → tokens → token IDs → tensors
This is how your SLM “understands” text.
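The pipeline can be pictured with a toy vocabulary. Everything below — the sub-word split, the vocabulary, the IDs — is invented for illustration; real tokenizers learn their sub-word vocabulary from data, and the final IDs are loaded into framework tensors rather than plain lists:

```python
# Toy illustration of text → tokens → token IDs.
# The vocabulary and splits are made up for demonstration.
vocab = {"Hel": 17, "lo": 42, "Ġworld": 99}  # hypothetical sub-word → ID map

def encode(tokens):
    """Map sub-word tokens to integer IDs via the vocabulary."""
    return [vocab[t] for t in tokens]

tokens = ["Hel", "lo", "Ġworld"]   # "Hello world" split into sub-words
ids = encode(tokens)
print(ids)  # these integers are what the model's tensors actually hold
```

The model never sees "Hello world" — only the integer sequence.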
2. Why Tokenization Matters for SLMs
Tokenization affects model performance more than most people realize.
A. Training Cost
- Fewer tokens = cheaper training
- More tokens = more compute, and attention cost grows quadratically with sequence length
If your dataset contains 80,000 samples and each sample is ~120 tokens, that’s:
80,000 × 120 = 9.6M tokens per epoch
A small model like Granite-350M handles this efficiently.
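This kind of budgeting is worth wrapping in a tiny helper so you can compare dataset sizes before committing to a run (a sketch; `tokens_per_epoch` is a hypothetical name, not a library function):

```python
def tokens_per_epoch(num_samples: int, avg_tokens_per_sample: int) -> int:
    """Rough training budget: total tokens seen in one pass over the data."""
    return num_samples * avg_tokens_per_sample

total = tokens_per_epoch(80_000, 120)
print(f"{total:,} tokens per epoch")      # 9,600,000
print(f"{total * 3:,} tokens over 3 epochs")
```

Multiplying by the planned number of epochs gives the total token throughput your hardware has to deliver.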
B. How “Long” Your Prompts Can Be
Your tokenizer decides the model’s maximum context length in tokens.
For a 350M model trained with 512-token sequences:
- If your prompt is 700 tokens → it will be truncated
- If your dataset has longer sequences → you need to segment or shorten them
Your training script must match the tokenizer’s expectations.
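One common way to handle over-long samples is to segment them into overlapping windows. This is a minimal sketch; the 512-token limit matches the example above, while the 32-token overlap is an arbitrary choice to preserve some context across chunk boundaries:

```python
def segment(token_ids, max_len=512, overlap=32):
    """Split a long token-ID sequence into windows of at most max_len tokens,
    with `overlap` tokens repeated between consecutive windows."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - overlap
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - overlap, step)]

ids = list(range(700))           # stands in for a 700-token sample
chunks = segment(ids)
print([len(c) for c in chunks])  # [512, 220]
```

Short samples pass through unchanged; only sequences beyond the limit get split.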
C. Structured Outputs (Excel, JSON, Code)
Excel formulas (and code in general) tokenize into many tiny pieces, for example:
=SUMIF(A:A,"North",C:C)
might become:
["=", "SUM", "IF", "(", "A", ":", "A", ",", "\"", "North", "\"", ...]
This is why:
- Clean formatting matters
- Consistent templates help
- Noise is expensive (wasteful tokens)
Tokenization is where data cleanliness REALLY pays off.
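You can simulate the cost of noisy formatting even without a real tokenizer. The regex splitter below is a crude stand-in for BPE (an assumption for illustration only — real tokenizers split differently), but it shows the same pattern: every stray space costs a token:

```python
import re

def toy_tokenize(text: str):
    """Crude stand-in for a tokenizer: words, then single symbols, then spaces.
    Real BPE differs, but the effect is the same: extra characters cost tokens."""
    return re.findall(r"\w+|\S|\s", text)

clean = '=SUMIF(A:A,"North",C:C)'
noisy = '= SUMIF ( A : A , "North" , C : C )'
print(len(toy_tokenize(clean)), len(toy_tokenize(noisy)))  # 15 27
```

The noisy variant encodes the same formula yet nearly doubles the token count — at training scale, that waste multiplies across every sample.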
3. Types of Tokenizers Used in Modern SLMs
There are 3 main tokenizer families:
A. Byte-Level BPE (used by GPT, many open models)
- Splits text into frequent sub-words
- Works well across languages
- Good for code, formulas, punctuation
- Granite uses a BPE-style tokenizer
Advantages:
- Compact vocabulary
- Robust to unseen words
- Good for structured data (Excel, JSON)
B. WordPiece (BERT-era encoder models)
- Used in older encoders
- Less common now
- Not ideal for SLMs or generative models
C. SentencePiece (Unigram or BPE; LLaMA, Mistral, and related models)
- Excellent for multilingual
- Good at compressing long documents
- Very stable for training small models
For your Excel SLM, byte-level BPE is ideal because:
- Every operator (`=`, `+`, `"`, `:`) becomes its own token
- Numbers split reliably
- Upper/lowercase is preserved
- Functions (`SUMIF`, `FILTER`, `COUNTIFS`) appear as stable sub-tokens
This reduces token count and improves model accuracy.
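Under the hood, BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the training corpus. Here is a simplified sketch of one training step (the toy corpus and frequencies are invented for illustration):

```python
from collections import Counter

def most_frequent_pair(words):
    """One BPE training step: find the most frequent adjacent symbol pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: "IF" appears often, so its characters get merged early.
words = {tuple("SUMIF"): 10, tuple("SUM"): 5, tuple("IF"): 8}
pair = most_frequent_pair(words)
print(pair)                # ('I', 'F')
print(merge(words, pair))  # vocabulary after one merge
```

Repeat this loop a few thousand times and frequent fragments like `SUM` and `IF` become single, stable tokens — which is exactly why common Excel functions tokenize so compactly.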
4. Inspecting Tokenization in Practice (Python Example)
Here’s a quick demonstration using any Hugging Face tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-350m-base")

text = '=FILTER(A:Z,(F:F>100)*(G:G="Yes"))'
tokens = tokenizer.tokenize(text)

print(tokens)
print(len(tokens))
```
You’ll notice:
- Every operator becomes a separate token
- `"FILTER"` may become a single subword
- Spaces and quotes matter
- Tiny differences in formatting change token count
This is why dataset consistency is essential.
5. Best Practices for Tokenization When Training SLMs
✔ Use one tokenizer for the entire project
Never switch tokenizers halfway through training — it corrupts the model’s embeddings.
✔ Clean and normalize your data BEFORE tokenization
Tokenizers treat whitespace, tabs, commas, and case as meaningful differences.
Standardize everything:
- spacing
- quoting
- capitalization
- formatting of formulas or structured outputs
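A normalization pass might look like the following (a hypothetical sketch — the exact rules depend on your data; `normalize` is not a library function):

```python
import re

def normalize(sample: str) -> str:
    """Hypothetical pre-tokenization cleanup: collapse whitespace,
    standardize quote characters, and tighten formula formatting."""
    s = sample.strip()
    s = re.sub(r"\s+", " ", s)                  # collapse tabs/newlines/runs of spaces
    s = s.replace("\u201c", '"').replace("\u201d", '"')  # curly → straight quotes
    s = re.sub(r"=\s+", "=", s)                 # no space after the leading '='
    return s

raw = '=  SUMIF(A:A, \u201cNorth\u201d, C:C)\t'
print(normalize(raw))  # =SUMIF(A:A, "North", C:C)
```

Running every sample through the same pass before tokenization guarantees that identical formulas always produce identical token sequences.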
✔ Keep sequences short (under 300 tokens)
For small models, shorter inputs give:
- better stability
- better generalization
- faster training
- lower VRAM requirements
✔ Avoid rare symbol combinations unless absolutely needed
Rare Unicode symbols fall back to byte-level pieces, inflating token counts and diluting the patterns the model learns.
✔ Measure tokens, not characters
Use `len(tokenizer(text)["input_ids"])` for the real measurement.
6. Tokenization Strategy for Your Excel SLM
Your dataset uses a template token wrapper like:
<INSTRUCTION> ... </INSTRUCTION>
<OUTPUT> ... </OUTPUT>
This is perfect because:
- These become stable, reusable tokens
- The model easily learns structured delimiters
- They compress repeated patterns
- They reduce confusion during inference
- They reduce token count significantly
A small model learns faster if delimiters are consistent.
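A small helper keeps the wrapper identical across every sample (a sketch; `build_sample` is a hypothetical name, and the exact spacing around the delimiters is a design choice you should fix once and never vary):

```python
def build_sample(instruction: str, output: str) -> str:
    """Wrap one training pair in the delimiter template.
    Identical delimiters tokenize into the same stable IDs every time."""
    return (f"<INSTRUCTION> {instruction.strip()} </INSTRUCTION>\n"
            f"<OUTPUT> {output.strip()} </OUTPUT>")

sample = build_sample(
    "Sum column C where column A equals North",
    '=SUMIF(A:A,"North",C:C)',
)
print(sample)
```

Because the wrapper is generated rather than hand-typed, no sample can drift into a slightly different delimiter spelling that the model would treat as a new pattern.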
Conclusion
Tokenization is the hidden backbone of every SLM. It determines how text becomes numbers, how long prompts can be, how clean your dataset must be, and how much training will cost. Once you understand tokenization, you can control the model’s behavior at a fundamental level — and design datasets that small models can learn from efficiently.
Clean tokens lead to clean outputs. Understanding tokens makes you a better SLM engineer.
Read the next article in the series: “Sequence Length & Context Windows — How Much Can an SLM Remember?”