Training Loops Explained (Forward, Backward, Loss, Optimization)

(Article #5 in the “Build Your Own Small Language Model” series)

Training a Small Language Model (SLM) may look like magic, but under the hood it follows a simple and elegant cycle repeated millions of times: the training loop. Whether you’re fine-tuning Granite-350M, Phi-2, TinyLlama, or any SLM, the same four steps always occur:

  1. Forward pass
  2. Loss calculation
  3. Backward pass
  4. Optimization step

If you understand this loop, you understand how models actually learn. In this article we break it down in clean, practical terms — no unnecessary math, no academic jargon — just the essentials you need to train your own SLM effectively.

1. The Training Loop: The Heart of All Learning

Every epoch of training is built from thousands of iterations of this loop:

for batch in dataloader:
    outputs = model(inputs)          # 1. forward pass
    loss = loss_fn(outputs, labels)  # 2. loss calculation
    loss.backward()                  # 3. backward pass
    optimizer.step()                 # 4. update weights
    optimizer.zero_grad()            # reset gradients for the next batch

This pattern is identical whether you’re training:

  • an Excel formula SLM
  • a Google Sheets assistant
  • a code generator
  • a summarizer
  • a price tracking agent

Everything relies on this loop.

2. Step 1 — The Forward Pass

In the forward pass, data flows through the model:

input → embeddings → transformer layers → logits (predictions)

What happens here:

  • The model tokenizes the input
  • Converts tokens into vectors
  • Passes them through attention layers
  • Produces logits (scores for the next token)

Nothing is updated yet.
The model simply guesses the next token based on its current state.
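The last stage of the forward pass can be sketched in plain Python: the model emits one logit per vocabulary entry, and a softmax turns those scores into next-token probabilities. The tiny vocabulary and the logit values below are made up purely for illustration.

```python
import math

# Toy vocabulary and raw logits from a hypothetical forward pass
vocab = ["=SUM", "=SUMIF", "=VLOOKUP", "=IF"]
logits = [1.2, 3.5, 0.3, 0.9]  # one score per token (illustrative values)

# Softmax: convert logits into a probability distribution
exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for stability
probs = [e / sum(exps) for e in exps]

# The model's "guess" is the highest-probability token
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # =SUMIF
```

No weights change here; the model simply scores every candidate token given its current state.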

Example:

Input:
<INSTRUCTION>Sum values in E where B="North"</INSTRUCTION>

Model prediction (logits → tokens):
= S U M I F ( B 2 : B 1 0 0 , " N o r t h " , E 2 : E 1 0 0 )

It might be slightly wrong — and that’s good. We need that error.

3. Step 2 — Loss Calculation

Loss is the numerical measure of how wrong the model was.

For causal LMs, we typically use:

  • Cross Entropy Loss

It compares:

  • the model’s predicted token distribution
    vs
  • the correct target sequence

Lower loss = better predictions.

Example:

Target output:
=SUMIF(B:B,"North",E:E)

If the model predicts:

=SUMIF(B:B,"NORTH",E:E)

Loss is small.

If the model predicts:

=VLOOKUP(...)

Loss is huge.

Loss is the model’s teacher.
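For a single token, cross-entropy boils down to the negative log of the probability the model assigned to the correct token. The probability values below are invented to show the contrast between a confident-and-right prediction and a confident-and-wrong one.

```python
import math

def token_cross_entropy(probs, correct_token):
    """Cross-entropy for one position: -log(probability of the correct token)."""
    return -math.log(probs[correct_token])

# Model is confident in the right token ("=SUMIF"): small loss
good = {"=SUMIF": 0.90, "=VLOOKUP": 0.05, "=SUM": 0.05}
print(round(token_cross_entropy(good, "=SUMIF"), 3))  # 0.105

# Model is confident in the wrong token: large loss
bad = {"=SUMIF": 0.02, "=VLOOKUP": 0.93, "=SUM": 0.05}
print(round(token_cross_entropy(bad, "=SUMIF"), 3))   # 3.912
```

In practice the loss is averaged over every token position in the batch, but the intuition is the same: confidently wrong predictions are punished hardest.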

4. Step 3 — The Backward Pass (Backpropagation)

Now the magic happens.

During the backward pass, the model:

  • computes gradients
  • traces which weights caused the error
  • determines how to adjust each layer

This is known as backpropagation.

In code:

loss.backward()

This step does not update the model yet.
It only computes how each weight should change.

Think of it as:

“How much did each parameter contribute to the mistake?”
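Backpropagation is the chain rule applied automatically. For a model with one weight we can do it by hand: with prediction y = w*x and squared-error loss L = (y - t)^2, the gradient is dL/dw = 2*(y - t)*x. All numbers here are illustrative.

```python
# One-parameter "model": prediction = w * x
w, x, t = 0.5, 2.0, 3.0  # weight, input, target (illustrative values)

y = w * x                # forward pass: y = 1.0
loss = (y - t) ** 2      # loss = 4.0

# Backward pass by hand: chain rule gives dL/dw = 2 * (y - t) * x
grad_w = 2 * (y - t) * x
print(grad_w)            # -8.0 (negative: increasing w would lower the loss)
```

`loss.backward()` computes exactly this kind of quantity for every one of the model's millions of parameters at once.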

5. Step 4 — Optimization Step

Finally, we update the model weights using an optimizer like:

  • Adam
  • AdamW
  • Adafactor

Typical call:

optimizer.step()
optimizer.zero_grad()

This step:

  • moves weights in the direction that lowers loss
  • clears gradients
  • prepares for the next batch

Over thousands of batches, this makes the model:

  • smarter
  • more consistent
  • more confident
  • more specialized

This is how your SLM becomes an Excel expert.
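The update itself can be sketched as plain gradient descent; Adam and AdamW layer per-parameter scaling and momentum on top of this same idea. A minimal sketch using a toy one-weight model with made-up numbers:

```python
# Toy one-weight model: prediction = w * x, squared-error loss (illustrative)
w, x, t, lr = 0.5, 2.0, 3.0, 0.05  # weight, input, target, learning rate

for step in range(3):
    y = w * x
    loss = (y - t) ** 2
    grad = 2 * (y - t) * x         # backward pass (chain rule by hand)
    w = w - lr * grad              # optimizer step: move against the gradient
    print(f"step {step}: loss={loss:.3f}, w={w:.3f}")
```

The printed loss falls on every step (4.000, 1.440, 0.518): each update nudges the weight in the direction that reduces the error.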

6. Putting It All Together

Here’s a simplified loop that mirrors your training script:

for step, batch in enumerate(dataloader):
    # Forward pass
    outputs = model(
        input_ids=batch["input_ids"],
        labels=batch["labels"]
    )

    loss = outputs.loss

    # Backward
    loss.backward()

    # Optimize
    optimizer.step()
    optimizer.zero_grad()

    # Optional: learning rate scheduler
    lr_scheduler.step()

This loop runs millions of times.

Each iteration:

  • Makes the model slightly better
  • Reduces error
  • Improves pattern recognition

That’s all training is — repeated improvement from thousands of tiny mistakes.

7. Why LoRA Uses the Same Loop (But With Fewer Trainable Weights)

When you use LoRA:

  • 99.5% of model weights remain frozen
  • Only tiny adapter matrices update
  • Training becomes fast and stable
  • You can train on a laptop or HF Space

The training loop is identical —
only fewer parameters change.

This is why LoRA is perfect for:

  • Excel SLMs
  • Sheets models
  • Domain-specific assistants
  • Small laptops
  • Rapid experiments
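The parameter savings are easy to verify with back-of-the-envelope arithmetic. For a single 768×768 weight matrix (a common hidden size) and a rank of r = 8 (a typical LoRA choice; both values are assumptions for illustration), LoRA freezes the full matrix and trains only two small adapters A (768×r) and B (r×768):

```python
d, r = 768, 8         # hidden size and LoRA rank (illustrative choices)

full = d * d          # frozen weight matrix: 589,824 parameters
lora = d * r + r * d  # adapters A (d x r) and B (r x d): 12,288 parameters

print(full, lora)                                      # 589824 12288
print(f"{lora / full:.1%} of the original trainable")  # 2.1% of the original trainable
```

Roughly 2% of the parameters per adapted matrix get gradients, which is why the same loop suddenly fits on a laptop.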

8. Common Mistakes to Avoid

🔥 Mistake 1: Training on sequences that are too long

Your model wastes compute and trains slower.

🔥 Mistake 2: Using batch sizes that are too large

Causes out-of-memory errors.

🔥 Mistake 3: Forgetting to zero_grad()

Gradients accumulate and blow up your loss.
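This mistake is easy to reproduce. In PyTorch, `.backward()` adds to `.grad` rather than overwriting it, so skipping `zero_grad()` makes stale gradients pile up across batches:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3.0).backward()   # dL/dw = 3
first = w.grad.item()  # 3.0

(w * 3.0).backward()   # forgot zero_grad(): gradients accumulate
second = w.grad.item() # 6.0, twice the true gradient

w.grad.zero_()         # what optimizer.zero_grad() does for each parameter
print(first, second, w.grad.item())  # 3.0 6.0 0.0
```

(Intentional gradient accumulation over several small batches is a legitimate technique, but then you divide the loss accordingly and zero the gradients only after the accumulated optimizer step.)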

🔥 Mistake 4: Very high learning rates

Model becomes unstable and outputs garbage.

🔥 Mistake 5: Too little data

Training on 5 samples = no visible change.

Conclusion

The training loop is the beating heart of every SLM. Once you understand forward passes, loss, backpropagation, and optimization, you can:

  • train your own model
  • diagnose training problems
  • design better datasets
  • scale training efficiently
  • understand exactly how your SLM learns

This is the core knowledge that transforms you from a model user into a model builder.

Read the next article in the series: “Learning Rates & Optimizers — How SLMs Actually Improve”

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
