(Article #5 in the “Build Your Own Small Language Model” series)
Training a Small Language Model (SLM) may look like magic, but under the hood it follows a simple and elegant cycle repeated millions of times: the training loop. Whether you’re fine-tuning Granite-350M, Phi-2, TinyLlama, or any SLM, the same four steps always occur:
- Forward pass
- Loss calculation
- Backward pass
- Optimization step
If you understand this loop, you understand how models actually learn. In this article we break it down in clean, practical terms — no unnecessary math, no academic jargon — just the essentials you need to train your own SLM effectively.
1. The Training Loop: The Heart of All Learning
Every epoch of training is built from thousands of iterations of this loop:
for each batch:
    outputs = model(input)              # forward
    loss = compare(outputs, labels)     # loss calculation
    loss.backward()                     # backward
    optimizer.step()                    # update weights
    optimizer.zero_grad()               # reset gradients
This pattern is identical whether you’re training:
- an Excel formula SLM
- a Google Sheets assistant
- a code generator
- a summarizer
- a price tracking agent
Everything relies on this loop.
2. Step 1 — The Forward Pass
In the forward pass, data flows through the model:
input → embeddings → transformer layers → logits (predictions)
What happens here:
- The model tokenizes the input
- Converts tokens into vectors
- Passes them through attention layers
- Produces logits (scores for the next token)
Nothing is updated yet.
The model simply guesses the next token based on its current state.
Example:
Input: <INSTRUCTION>Sum values in E where B="North"</INSTRUCTION>
Model prediction (logits decoded token by token):
=SUMIF(B2:B100,"North",E2:E100)
It might be slightly wrong — and that’s good. We need that error.
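The forward pass can be sketched with a toy stand-in for a real SLM: an embedding layer followed by a single linear layer that maps vectors to next-token scores. The vocabulary size, dimensions, and token IDs below are illustrative, not Granite's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model = 100, 16

# Toy stand-in for an SLM: tokens -> vectors -> next-token logits.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),   # tokens -> vectors
    nn.Linear(d_model, vocab_size),      # vectors -> scores over the vocabulary
)

input_ids = torch.tensor([[5, 17, 42]])  # a "tokenized" prompt (batch of 1)

logits = model(input_ids)                # forward pass only; nothing is updated
print(logits.shape)                      # torch.Size([1, 3, 100])
print(logits[0, -1].argmax().item())     # the model's guess for the next token
```

One score per vocabulary entry, per position: that is all a forward pass produces.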
3. Step 2 — Loss Calculation
Loss is the numerical measure of how wrong the model was.
For causal LMs, we typically use:
- Cross Entropy Loss
It compares:
- the model’s predicted token distribution
- the correct target sequence
Lower loss = better predictions.
Example:
Target output: =SUMIF(B:B,"North",E:E)
If the model predicts:
=SUMIF(B:B,"NORTH",E:E)
Loss is small.
If the model predicts:
=VLOOKUP(...)
Loss is huge.
Loss is the model’s teacher.
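For a single token, cross entropy is just the negative log of the probability the model assigned to the correct token. A minimal sketch with made-up probabilities shows why a confident wrong answer is punished far more than a confident right one:

```python
import math

def cross_entropy(probs, target_index):
    """Cross entropy for one token: -log of the probability
    the model assigned to the correct token."""
    return -math.log(probs[target_index])

# Suppose the correct next token is index 0 (e.g. "SUMIF").
confident_right = [0.90, 0.05, 0.05]   # model strongly favours the right token
confident_wrong = [0.02, 0.95, 0.03]   # model strongly favours a wrong token

print(round(cross_entropy(confident_right, 0), 3))  # 0.105 (small loss)
print(round(cross_entropy(confident_wrong, 0), 3))  # 3.912 (large loss)
```

In practice PyTorch averages this over every token position in the batch.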
4. Step 3 — The Backward Pass (Backpropagation)
Now the magic happens.
During the backward pass, the model:
- computes gradients
- traces which weights caused the error
- determines how to adjust each layer
This is known as backpropagation.
In code:
loss.backward()
This step does not update the model yet.
It only computes how each weight should change.
Think of it as:
“How much did each parameter contribute to the mistake?”
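This is easy to see with a single trainable weight and a squared-error loss; the point is that `loss.backward()` fills in the gradient without touching the weight itself:

```python
import torch

# One trainable "weight" and a simple squared-error loss.
w = torch.tensor(3.0, requires_grad=True)
target = 1.0
loss = (w - target) ** 2        # loss = (3 - 1)^2 = 4

loss.backward()                  # backpropagation: compute d(loss)/dw

print(w.grad.item())             # 4.0  (analytically: 2 * (w - target))
print(w.item())                  # 3.0  -- the weight itself is unchanged
```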
5. Step 4 — Optimization Step
Finally, we update the model weights using an optimizer like:
- Adam
- AdamW
- Adafactor
Typical call:
optimizer.step()
optimizer.zero_grad()
This step:
- moves weights in the direction that lowers loss
- clears gradients
- prepares for the next batch
Over thousands of batches, this makes the model:
- smarter
- more consistent
- more confident
- more specialized
This is how your SLM becomes an Excel expert.
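The update step can be sketched on the same single-weight example, this time with AdamW. The learning rate and step count are arbitrary choices for illustration:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
optimizer = torch.optim.AdamW([w], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()        # clear old gradients
    loss = (w - 1.0) ** 2        # forward: measure the error
    loss.backward()              # backward: compute the gradient
    optimizer.step()             # update: move w toward lower loss

print(w.item())                  # close to 1.0 after 100 updates
```

The same three calls, repeated, are what drive every weight in a real SLM toward lower loss.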
6. Putting It All Together
Here’s a simplified loop that mirrors your training script:
for step, batch in enumerate(dataloader):
    # Forward pass
    outputs = model(
        input_ids=batch["input_ids"],
        labels=batch["labels"],
    )
    loss = outputs.loss

    # Backward
    loss.backward()

    # Optimize
    optimizer.step()
    optimizer.zero_grad()

    # Optional: learning rate scheduler
    lr_scheduler.step()
This loop runs millions of times.
Each iteration:
- Makes the model slightly better
- Reduces error
- Improves pattern recognition
That’s all training is — repeated improvement from thousands of tiny mistakes.
7. Why LoRA Uses the Same Loop (But With Fewer Trainable Weights)
When you use LoRA:
- the vast majority of model weights (often well over 99%) remain frozen
- Only tiny adapter matrices update
- Training becomes fast and stable
- You can train on a laptop or HF Space
The training loop is identical —
only fewer parameters change.
This is why LoRA is perfect for:
- Excel SLMs
- Sheets models
- Domain-specific assistants
- Small laptops
- Rapid experiments
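The idea behind LoRA can be sketched in plain PyTorch without the peft library: freeze a base layer and bolt on two small low-rank matrices that are the only trainable parameters. The dimensions and rank below are illustrative, and this is a conceptual sketch, not peft's actual implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d = 64
base = nn.Linear(d, d)
for p in base.parameters():
    p.requires_grad = False          # freeze the "pretrained" weight

r = 4                                 # LoRA rank: tiny compared to d
A = nn.Parameter(torch.randn(r, d) * 0.01)
B = nn.Parameter(torch.zeros(d, r))  # starts at zero: no change at step 0

def lora_forward(x):
    return base(x) + x @ A.t() @ B.t()   # frozen path + low-rank update

frozen = sum(p.numel() for p in base.parameters())
trainable = A.numel() + B.numel()
print(frozen, trainable)  # 4160 vs 512: only the adapters get updated
```

The training loop itself never changes; the optimizer simply receives `[A, B]` instead of every model parameter.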
8. Common Mistakes to Avoid
🔥 Mistake 1: Training on sequences that are too long
Your model wastes compute and trains more slowly.
🔥 Mistake 2: Using batch sizes that are too large
This causes out-of-memory errors.
🔥 Mistake 3: Forgetting to zero_grad()
Gradients accumulate and blow up your loss.
🔥 Mistake 4: Very high learning rates
Model becomes unstable and outputs garbage.
🔥 Mistake 5: Too little data
Training on 5 samples = no visible change.
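Mistake 3 is easy to demonstrate: PyTorch accumulates gradients across `backward()` calls by design, which is why skipping `zero_grad()` silently corrupts your updates:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Gradient of loss = w^2 is 2w = 4.0
(w ** 2).backward()
print(w.grad.item())   # 4.0

# Forgetting zero_grad(): the new gradient is ADDED to the old one.
(w ** 2).backward()
print(w.grad.item())   # 8.0, not 4.0 -- gradients accumulated

w.grad.zero_()         # what optimizer.zero_grad() does for each parameter
```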
Conclusion
The training loop is the beating heart of every SLM. Once you understand forward passes, loss, backpropagation, and optimization, you can:
- train your own model
- diagnose training problems
- design better datasets
- scale training efficiently
- understand exactly how your SLM learns
This is the core knowledge that transforms you from a model user into a model builder.
Read the next article in the series: “Learning Rates & Optimizers — How SLMs Actually Improve”