Training Loops Explained (Forward, Backward, Loss, Optimization)

(Article #5 in the “Build Your Own Small Language Model” series)

Training a Small Language Model (SLM) may look like magic, but under the hood it follows a simple and elegant cycle repeated millions of times: the training loop. Whether you’re fine-tuning Granite-350M, Phi-2, TinyLlama, or any SLM, the same four steps always occur:

  1. Forward pass
  2. Loss calculation
  3. Backward pass
  4. Optimization step

If you understand this loop, you understand how models actually learn. In this article we break it down in clean, practical terms — no unnecessary math, no academic jargon — just the essentials you need to train your own SLM effectively.

1. The Training Loop: The Heart of All Learning

Every epoch of training is built from thousands of iterations of this loop:

for batch in dataloader:
    outputs = model(inputs)          # 1. forward pass
    loss = loss_fn(outputs, labels)  # 2. loss calculation
    loss.backward()                  # 3. backward pass
    optimizer.step()                 # 4. update weights
    optimizer.zero_grad()            # reset gradients for the next batch

This pattern is identical whether you’re training:

  • an Excel formula SLM
  • a Google Sheets assistant
  • a code generator
  • a summarizer
  • a price tracking agent

Everything relies on this loop.

2. Step 1 — The Forward Pass

In the forward pass, data flows through the model:

input → embeddings → transformer layers → logits (predictions)

What happens here:

  • The model tokenizes the input
  • Converts tokens into vectors
  • Passes them through attention layers
  • Produces logits (scores for the next token)

Nothing is updated yet.
The model simply guesses the next token based on its current state.
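The last stage of the forward pass can be sketched in plain Python: the model emits one logit per vocabulary entry, and a softmax turns those scores into next-token probabilities. The tiny vocabulary and the logit values below are made up purely for illustration.

```python
import math

# Toy vocabulary and raw logits from a hypothetical forward pass
vocab = ["=SUM", "=SUMIF", "=VLOOKUP", "=IF"]
logits = [1.2, 3.5, 0.3, 0.9]  # one score per token (illustrative values)

# Softmax: convert logits into a probability distribution
exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for stability
probs = [e / sum(exps) for e in exps]

# The model's "guess" is the highest-probability token
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # =SUMIF
```

No weights change here; the model simply scores every candidate token given its current state.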

Example:

Input:
<INSTRUCTION>Sum values in E where B="North"</INSTRUCTION>

Model prediction (logits → tokens):
= S U M I F ( B 2 : B 1 0 0 , " N o r t h " , E 2 : E 1 0 0 )

It might be slightly wrong — and that’s good. We need that error.

3. Step 2 — Loss Calculation

Loss is the numerical measure of how wrong the model was.

For causal LMs, we typically use:

  • Cross Entropy Loss

It compares:

  • the model’s predicted token distribution
    vs
  • the correct target sequence

Lower loss = better predictions.

Example:

Target output:
=SUMIF(B:B,"North",E:E)

If the model predicts:

=SUMIF(B:B,"NORTH",E:E)

Loss is small.

If the model predicts:

=VLOOKUP(...)

Loss is huge.

Loss is the model’s teacher.
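For a single token, cross-entropy boils down to the negative log of the probability the model assigned to the correct token. The probability values below are invented to show the contrast between a confident-and-right prediction and a confident-and-wrong one.

```python
import math

def token_cross_entropy(probs, correct_token):
    """Cross-entropy for one position: -log(probability of the correct token)."""
    return -math.log(probs[correct_token])

# Model is confident in the right token ("=SUMIF"): small loss
good = {"=SUMIF": 0.90, "=VLOOKUP": 0.05, "=SUM": 0.05}
print(round(token_cross_entropy(good, "=SUMIF"), 3))  # 0.105

# Model is confident in the wrong token: large loss
bad = {"=SUMIF": 0.02, "=VLOOKUP": 0.93, "=SUM": 0.05}
print(round(token_cross_entropy(bad, "=SUMIF"), 3))   # 3.912
```

In practice the loss is averaged over every token position in the batch, but the intuition is the same: confidently wrong predictions are punished hardest.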

4. Step 3 — The Backward Pass (Backpropagation)

Now the magic happens.

During the backward pass, the model:

  • computes gradients
  • traces which weights caused the error
  • determines how to adjust each layer

This is known as backpropagation.

In code:

loss.backward()

This step does not update the model yet.
It only computes how each weight should change.

Think of it as:

“How much did each parameter contribute to the mistake?”
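Backpropagation is the chain rule applied automatically. For a model with one weight we can do it by hand: with prediction y = w*x and squared-error loss L = (y - t)^2, the gradient is dL/dw = 2*(y - t)*x. All numbers here are illustrative.

```python
# One-parameter "model": prediction = w * x
w, x, t = 0.5, 2.0, 3.0  # weight, input, target (illustrative values)

y = w * x                # forward pass: y = 1.0
loss = (y - t) ** 2      # loss = 4.0

# Backward pass by hand: chain rule gives dL/dw = 2 * (y - t) * x
grad_w = 2 * (y - t) * x
print(grad_w)            # -8.0 (negative: increasing w would lower the loss)
```

`loss.backward()` computes exactly this kind of quantity for every one of the model's millions of parameters at once.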

5. Step 4 — Optimization Step

Finally, we update the model weights using an optimizer like:

  • Adam
  • AdamW
  • Adafactor

Typical call:

optimizer.step()
optimizer.zero_grad()

This step:

  • moves weights in the direction that lowers loss
  • clears gradients
  • prepares for the next batch

Over thousands of batches, this makes the model:

  • smarter
  • more consistent
  • more confident
  • more specialized

This is how your SLM becomes an Excel expert.
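The update itself can be sketched as plain gradient descent; Adam and AdamW layer per-parameter scaling and momentum on top of this same idea. A minimal sketch using a toy one-weight model with made-up numbers:

```python
# Toy one-weight model: prediction = w * x, squared-error loss (illustrative)
w, x, t, lr = 0.5, 2.0, 3.0, 0.05  # weight, input, target, learning rate

for step in range(3):
    y = w * x
    loss = (y - t) ** 2
    grad = 2 * (y - t) * x         # backward pass (chain rule by hand)
    w = w - lr * grad              # optimizer step: move against the gradient
    print(f"step {step}: loss={loss:.3f}, w={w:.3f}")
```

The printed loss falls on every step (4.000, 1.440, 0.518): each update nudges the weight in the direction that reduces the error.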

6. Putting It All Together

Here’s a simplified loop that mirrors your training script:

for step, batch in enumerate(dataloader):
    # Forward pass
    outputs = model(
        input_ids=batch["input_ids"],
        labels=batch["labels"]
    )

    loss = outputs.loss

    # Backward
    loss.backward()

    # Optimize
    optimizer.step()
    optimizer.zero_grad()

    # Optional: learning rate scheduler
    lr_scheduler.step()

This loop runs millions of times.

Each iteration:

  • Makes the model slightly better
  • Reduces error
  • Improves pattern recognition

That’s all training is — repeated improvement from thousands of tiny mistakes.

7. Why LoRA Uses the Same Loop (But With Fewer Trainable Weights)

When you use LoRA:

  • 99.5% of model weights remain frozen
  • Only tiny adapter matrices update
  • Training becomes fast and stable
  • You can train on a laptop or HF Space

The training loop is identical —
only fewer parameters change.

This is why LoRA is perfect for:

  • Excel SLMs
  • Sheets models
  • Domain-specific assistants
  • Small laptops
  • Rapid experiments
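The parameter savings are easy to verify with back-of-the-envelope arithmetic. For a single 768×768 weight matrix (a common hidden size) and a rank of r = 8 (a typical LoRA choice; both values are assumptions for illustration), LoRA freezes the full matrix and trains only two small adapters A (768×r) and B (r×768):

```python
d, r = 768, 8         # hidden size and LoRA rank (illustrative choices)

full = d * d          # frozen weight matrix: 589,824 parameters
lora = d * r + r * d  # adapters A (d x r) and B (r x d): 12,288 parameters

print(full, lora)                                      # 589824 12288
print(f"{lora / full:.1%} of the original trainable")  # 2.1% of the original trainable
```

Roughly 2% of the parameters per adapted matrix get gradients, which is why the same loop suddenly fits on a laptop.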

8. Common Mistakes to Avoid

🔥 Mistake 1: Training on sequences that are too long

Your model wastes compute and trains slower.

🔥 Mistake 2: Using batch sizes that are too large

Causes out-of-memory errors.

🔥 Mistake 3: Forgetting to zero_grad()

Gradients accumulate and blow up your loss.
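This mistake is easy to reproduce. In PyTorch, `.backward()` adds to `.grad` rather than overwriting it, so skipping `zero_grad()` makes stale gradients pile up across batches:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3.0).backward()   # dL/dw = 3
first = w.grad.item()  # 3.0

(w * 3.0).backward()   # forgot zero_grad(): gradients accumulate
second = w.grad.item() # 6.0, twice the true gradient

w.grad.zero_()         # what optimizer.zero_grad() does for each parameter
print(first, second, w.grad.item())  # 3.0 6.0 0.0
```

(Intentional gradient accumulation over several small batches is a legitimate technique, but then you divide the loss accordingly and zero the gradients only after the accumulated optimizer step.)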

🔥 Mistake 4: Very high learning rates

Model becomes unstable and outputs garbage.

🔥 Mistake 5: Too little data

Training on 5 samples = no visible change.

Conclusion

The training loop is the beating heart of every SLM. Once you understand forward passes, loss, backpropagation, and optimization, you can:

  • train your own model
  • diagnose training problems
  • design better datasets
  • scale training efficiently
  • understand exactly how your SLM learns

This is the core knowledge that transforms you from a model user into a model builder.

Read the next article in the series: “Learning Rates & Optimizers — How SLMs Actually Improve”

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
