(Article #7 in the Build Your Own Small Language Model series)
When training a Small Language Model (SLM), one of the biggest bottlenecks is memory. The amount of VRAM on your GPU (or system RAM, if you're training on CPU) determines:
- how big your batches can be
- how stable your gradients are
- how fast your model improves
- and whether training will crash with an out-of-memory error
To solve this, machine learning introduces two important concepts:
- Batch Size
- Gradient Accumulation
Understand these two, and you can train a 350M-parameter SLM on hardware as small as a laptop.
1. What Is Batch Size?
The batch size is the number of training samples fed into the model before one optimization step.
Examples:
- batch_size=1 → one sample at a time
- batch_size=4 → four samples per update
- General rule: larger batch sizes produce more stable learning
✔ Benefits of a larger batch size
- Smooth, stable gradients
- Faster convergence
- Better generalization
- Lower variance in training
✖ Downsides
- Requires much more VRAM
- Training crashes if VRAM runs out
2. VRAM Cost of Batch Size
VRAM consumption scales roughly like:
VRAM ∝ batch_size × sequence_length × model_size
For a 350M SLM:
- batch_size=4 at 512-token sequences may require 8–10GB VRAM
- batch_size=1 may work on 4–6GB VRAM
- On CPU, even batch size 1 is slow but possible
This is why many developers hit memory errors during training.
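To make the proportionality above concrete, here is a rough back-of-the-envelope estimator. The constants (16 bytes per parameter for fp16 weights plus gradients and fp32 Adam states, and an assumed 2 MB of activations per token) are illustrative assumptions, not measured values; real usage depends on architecture, precision, and framework overhead.

```python
# Rough, illustrative VRAM estimate: model/optimizer state + activations.
# All constants below are assumptions for a mixed-precision transformer.

def estimate_vram_gb(batch_size, seq_len, n_params,
                     bytes_per_param=16,        # fp16 weights + grads + fp32 Adam states (rough)
                     bytes_per_token_act=2_000_000):  # assumed activation cost per token
    model_gb = n_params * bytes_per_param / 1e9
    activation_gb = batch_size * seq_len * bytes_per_token_act / 1e9
    return model_gb + activation_gb

# A 350M-parameter model at 512-token sequences:
print(round(estimate_vram_gb(4, 512, 350e6), 1))  # batch 4 -> ~9.7 GB
print(round(estimate_vram_gb(1, 512, 350e6), 1))  # batch 1 -> ~6.6 GB
```

The point is not the exact numbers but the shape: the activation term scales linearly with batch size, which is why halving the batch is the first lever when you run out of memory.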
3. The Solution: Gradient Accumulation
Gradient accumulation simulates a larger batch size without increasing the number of samples held in memory at once.
It works like this:
batch_size = 1
gradient_accumulation_steps = 16
Effective batch size = 1 × 16 = 16
Instead of updating weights every batch, you:
- Run forward + backward pass
- Accumulate gradients
- Do not update the model yet
- After N steps, perform one optimization step
This gives you the stability of big batches with the memory footprint of tiny batches.
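The four steps above can be sketched as a manual PyTorch training loop. The model, data, and hyperparameters here are toy placeholders; the pattern (scale the loss, call backward every micro-batch, step the optimizer only every N micro-batches) is the general technique.

```python
import torch
import torch.nn as nn

# Toy model and random data, just to illustrate the accumulation pattern.
torch.manual_seed(0)
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accumulation_steps = 16   # effective batch = 1 x 16
updates = 0
optimizer.zero_grad()

for step in range(64):    # 64 micro-batches of size 1
    x = torch.randn(1, 8)
    y = torch.randn(1, 1)

    loss = loss_fn(model(x), y)
    # Divide by N so the accumulated gradient is the average over the
    # effective batch, not the sum.
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()          # one weight update per 16 micro-batches
        optimizer.zero_grad()
        updates += 1
```

Note the loss scaling: without dividing by `accumulation_steps`, the accumulated gradient would be N times larger than a true batch of N, effectively multiplying your learning rate.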
4. Example: Training Granite-350M on a 4–6 GB GPU
Without Gradient Accumulation (will crash)
batch_size=8
gradient_accumulation_steps=1
With Gradient Accumulation (works)
batch_size=1
gradient_accumulation_steps=32
Result:
- Same effective batch = 32 samples
- Same learning stability
- No VRAM explosion
This is essential when training SLMs on:
- RTX 3050/3060
- Apple Silicon (M1/M2/M3)
- Google Colab free tier
- Laptop CPUs with quantized models
5. How It Looks in Code
Most Hugging Face training pipelines allow this:
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=2e-4,
    ...
)
This simulates a batch size of 32 while holding only 1 sample in memory at a time.
6. How Batch Size Affects Model Quality
✔ Larger effective batch size (via accumulation)
- smoother gradient updates
- improved convergence stability
- lower variance
- fewer “jumps” in loss
✖ Extremely small batches (e.g., 1 without accumulation)
- noisy training
- unstable loss
- poor specialization
- model may “forget” patterns
This is why gradient accumulation is mandatory for training stable SLMs on limited hardware.
7. Recommended Settings for Granite-350M SLM Training
For laptops (4–6GB VRAM)
batch_size = 1
gradient_accumulation = 16–64
For mid GPUs (10–16GB VRAM)
batch_size = 2–4
gradient_accumulation = 8–16
For large GPUs (24GB+)
batch_size = 8–16
gradient_accumulation = 1–8
For CPU training
batch_size = 1
gradient_accumulation = 32–128
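The table above can be encoded as a small helper. This is a hypothetical convenience function, not part of any library: it picks a batch size by VRAM tier and derives the accumulation steps needed to hit a target effective batch of 32.

```python
# Hypothetical helper encoding the recommendations above.
# vram_gb=None means CPU-only training.

def recommend_settings(vram_gb, target_effective_batch=32):
    if vram_gb is None or vram_gb <= 6:   # CPU or laptop GPUs
        batch_size = 1
    elif vram_gb <= 16:                   # mid-range GPUs
        batch_size = 4
    else:                                 # 24GB+ cards
        batch_size = 8
    accumulation = max(1, target_effective_batch // batch_size)
    return batch_size, accumulation

print(recommend_settings(6))    # laptop GPU -> (1, 32)
print(recommend_settings(24))   # large GPU  -> (8, 4)
```

Whatever the hardware, the product batch_size × accumulation stays at the same effective batch, which is what actually shapes training stability.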
8. Signs Your Batch Size Is Too Small
You’ll notice:
- Highly unstable loss curves
- Random spikes
- Slow convergence
- Inconsistent outputs
- Domain-specific outputs (e.g., generated Excel formulas) hallucinating or changing style randomly
Increasing gradient accumulation nearly always fixes this.
9. What About Very Large Batches?
Large batches (128–1024+) are common in huge model pretraining.
But for SLMs under 1B parameters:
- They bring diminishing returns
- Require excessive memory
- Do not significantly improve quality
Optimal effective batch size is usually 32–128.
Conclusion
Batch size and gradient accumulation are the keys to training a Small Language Model efficiently — especially on constrained hardware. With the right settings, you can train Granite-350M, Phi-2, TinyLlama, or any domain-specific SLM on hardware almost anyone owns.
It’s not about having a massive GPU.
It’s about using the memory you have intelligently.
Read the next article: "Understanding Loss Functions — How SLMs Measure Mistakes"