Batch Size & Gradient Accumulation — Training Efficiently on Limited Hardware

(Article #7 in the Build Your Own Small Language Model series)

When training a Small Language Model (SLM), one of the biggest bottlenecks is memory. The amount of VRAM on your GPU (or system RAM if you’re training on CPU) determines:

  • how big your batches can be
  • how stable your gradients are
  • how fast your model improves
  • and whether training will crash with an out-of-memory error

To solve this, machine learning introduces two important concepts:

  • Batch Size
  • Gradient Accumulation

Understand these two, and you can train a 350M-parameter SLM on hardware as small as a laptop.

1. What Is Batch Size?

The batch size is the number of training samples fed into the model before one optimization step.

Examples:

  • batch_size=1 → one sample at a time
  • batch_size=4 → four samples per update
  • General rule: larger batch sizes produce more stable learning

✔ Benefits of a larger batch size

  • Smooth, stable gradients
  • Faster convergence
  • Better hardware utilization per optimization step
  • Lower variance in training

✖ Downsides

  • Requires much more VRAM
  • Training crashes if VRAM runs out

2. VRAM Cost of Batch Size

VRAM consumption scales roughly like:

VRAM ∝ batch_size × sequence_length × model_size

For a 350M SLM:

  • batch_size=4 with 512-token sequences may require 8–10 GB of VRAM
  • batch_size=1 may fit in 4–6 GB of VRAM
  • On CPU, even batch size 1 works, though training is slow

This is why many developers hit memory errors during training.
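To build intuition for the scaling rule above, here is a rough back-of-envelope sketch. The constants (hidden size, layer count, bytes per value) are illustrative assumptions for a 350M-class model, not measured figures, and the estimate covers only the hidden states kept for the backward pass — real frameworks also store attention scores, weights, gradients, and optimizer state, so treat it strictly as a lower bound:

```python
def rough_activation_gb(batch_size, seq_len, hidden_size, n_layers,
                        bytes_per_value=2):
    """Very rough activation-memory estimate in GB (illustrative only).

    Assumes one hidden-state tensor per layer is kept for backprop,
    stored in fp16 (2 bytes per value). Real usage is several times
    higher, but the *scaling* with batch size and sequence length holds.
    """
    values = batch_size * seq_len * hidden_size * n_layers
    return values * bytes_per_value / 1024**3

# Assumed 350M-class shape: hidden size 1024, 24 layers.
mem_batch_4 = rough_activation_gb(4, 512, 1024, 24)
mem_batch_1 = rough_activation_gb(1, 512, 1024, 24)
```

The key takeaway is the linearity: quadruple the batch size and this component of memory quadruples, which is exactly why dropping to batch_size=1 rescues training on small GPUs.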

3. The Solution: Gradient Accumulation

Gradient accumulation simulates a larger batch size without increasing the memory needed for any single forward/backward pass.

It works like this:

batch_size = 1  
gradient_accumulation_steps = 16

Effective batch size = 1 × 16 = 16

Instead of updating weights every batch, you:

  1. Run forward + backward pass
  2. Accumulate gradients
  3. Do not update the model yet
  4. After N steps, perform one optimization step

This gives you the stability of big batches with the memory footprint of tiny batches.
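The four steps above can be sketched as a plain PyTorch loop. The model and data here are toy placeholders (not the SLM from this series) — the point is the accumulation pattern: divide each loss by the accumulation count so the summed gradient matches a real batch, and call the optimizer only every N micro-batches:

```python
import torch
from torch import nn

# Toy stand-ins so the pattern is runnable end to end.
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 16
updates = 0
optimizer.zero_grad()

for step in range(64):                  # 64 micro-batches of size 1
    x = torch.randn(1, 8)               # batch_size = 1
    y = torch.randint(0, 2, (1,))
    loss = loss_fn(model(x), y)
    # Scale the loss so accumulated gradients average like a batch of 16.
    (loss / accumulation_steps).backward()   # gradients add up; no update yet

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                # one update per 16 micro-batches
        optimizer.zero_grad()
        updates += 1                    # 64 / 16 = 4 optimizer steps total
```

Dividing the loss (rather than the final gradient) is the standard trick: `backward()` sums gradients across calls, so the scaled sum equals the mean gradient of one large batch.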

4. Example: Training Granite-350M on a 4–6 GB GPU

Without Gradient Accumulation (will crash)

batch_size=8
gradient_accumulation_steps=1

With Gradient Accumulation (works)

batch_size=1
gradient_accumulation_steps=32

Result:

  • Same effective batch = 32 samples
  • Same learning stability
  • No VRAM explosion

This is essential when training SLMs on laptops, consumer GPUs, and CPU-only machines.

5. How It Looks in Code

Most Hugging Face training pipelines allow this:

from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=2e-4,
    ...
)

This simulates a batch size of 32 while holding only 1 sample in memory at a time.
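The arithmetic behind "simulates a batch size of 32" is just a product of three factors, and it is worth sanity-checking whenever you change settings (the helper name here is mine, not a Hugging Face API):

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

# The TrainingArguments above: 1 sample per step x 32 accumulation steps.
assert effective_batch_size(1, 32) == 32

# A mid-range GPU config reaching the same effective batch:
assert effective_batch_size(4, 8) == 32
```

Two configurations with the same effective batch size give the optimizer statistically similar updates; they differ only in wall-clock speed and peak memory.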

6. How Batch Size Affects Model Quality

✔ Larger effective batch size (via accumulation)

  • smoother gradient updates
  • improved convergence stability
  • lower variance
  • fewer “jumps” in loss

✖ Extremely small batches (e.g., 1 without accumulation)

  • noisy training
  • unstable loss
  • poor specialization
  • model may “forget” patterns

This is why gradient accumulation is mandatory for training stable SLMs on limited hardware.

7. Recommended Settings for Granite-350M SLM Training

For laptops (4–6GB VRAM)

batch_size = 1
gradient_accumulation = 16–64

For mid GPUs (10–16GB VRAM)

batch_size = 2–4
gradient_accumulation = 8–16

For large GPUs (24GB+)

batch_size = 8–16
gradient_accumulation = 1–8

For CPU training

batch_size = 1
gradient_accumulation = 32–128
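The four tiers above can be encoded as a small starting-point helper. The thresholds and the specific values picked from each range are illustrative defaults drawn from the table, not tuned results — always adjust based on what your own hardware actually fits:

```python
def suggest_settings(vram_gb=None):
    """Return (batch_size, gradient_accumulation_steps) as a starting point.

    vram_gb=None means CPU-only training. Values are picked from the
    recommended ranges above; treat them as defaults to tune, not truth.
    """
    if vram_gb is None:
        return 1, 64        # CPU: within the 32-128 range
    if vram_gb <= 6:
        return 1, 32        # laptop GPUs (4-6 GB): within 16-64
    if vram_gb <= 16:
        return 4, 8         # mid GPUs (10-16 GB)
    return 8, 4             # large GPUs (24 GB+)

batch, accum = suggest_settings(vram_gb=6)   # laptop-class GPU
```

Notice that every tier lands in the 32-128 effective-batch sweet spot discussed below: the helper trades batch size against accumulation steps while keeping their product roughly constant.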

8. Signs Your Batch Size Is Too Small

You’ll notice:

  • Highly unstable loss curves
  • Random spikes
  • Slow convergence
  • Inconsistent outputs
  • Excel formulas hallucinating or changing style randomly

Increasing gradient accumulation usually fixes this.

9. What About Very Large Batches?

Large batches (128–1024+) are common in huge model pretraining.

But for SLMs under 1B parameters:

  • Diminishing returns on stability
  • Excessive memory requirements
  • No significant quality improvement

Optimal effective batch size is usually 32–128.

Conclusion

Batch size and gradient accumulation are the keys to training a Small Language Model efficiently — especially on constrained hardware. With the right settings, you can train Granite-350M, Phi-2, TinyLlama, or any domain-specific SLM on hardware almost anyone owns.

It’s not about having a massive GPU.
It’s about using the memory you have intelligently.

Read the next article: “Understanding Loss Functions — How SLMs Measure Mistakes”.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
