Building a Small Language Model from Scratch in Python

Small Language Models (SLMs) aren’t just a trend — they are the new practical frontier of AI engineering. The best part? You can build a functioning transformer-based SLM yourself, entirely in Python, without specialized hardware.

This guide walks you step-by-step through the core components of building a tiny transformer model from scratch. By the end, you’ll understand each piece well enough to modify, extend, and eventually train your own Nano model.

1. Why Build Your Own Small Language Model?

Most developers interact with LLMs as black boxes.
But building one — even a tiny one — gives you:

  • A deep understanding of how transformers actually work
  • The ability to customize behavior at the architecture level
  • Insight into fine-tuning, tokenization, and training loops
  • The foundation to create domain-specific SLMs later (Excel, price tracking, cybersecurity, etc.)
  • Full control without API limits

This knowledge becomes incredibly valuable as companies move toward efficient models that run on CPUs, edge devices, and internal infrastructure.

2. The Minimum Viable Transformer (MVT)

A transformer has three essential components:

  1. Token embeddings
  2. Self-attention mechanism
  3. Feed-forward layers

Every modern model — Llama, Phi, SmolLM, Granite, Mistral, GPT — is built from these blocks.

Here’s the smallest version you can implement:

2.1 Token Embeddings

Turn tokens from integers → vectors.

import torch
import torch.nn as nn

# Converts token IDs into vectors
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()

        # Learnable embedding table of shape [vocab_size, dim]
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, x):
        # Input: x of shape [batch, sequence_length]
        # Output: embedded tokens [batch, sequence_length, dim]
        return self.embedding(x)


The full Python code is available on GitHub.
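
A quick sanity check of the shapes (the vocabulary size of 1000 and dimension of 64 below are arbitrary, just for illustration):

emb = TokenEmbedding(vocab_size=1000, dim=64)

# A batch of 2 sequences with 5 token IDs each
tokens = torch.randint(0, 1000, (2, 5))

print(emb(tokens).shape)  # torch.Size([2, 5, 64])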

2.2 Self-Attention

The heart of a transformer.

class SelfAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()

        assert dim % heads == 0, "dim must be divisible by heads"

        self.heads = heads
        self.head_dim = dim // heads
        self.scale = self.head_dim ** -0.5  # Scale by sqrt of the per-head dimension

        # One linear layer produces queries, keys, and values (Q, K, V)
        self.to_qkv = nn.Linear(dim, dim * 3)

        # Final linear layer mixes attention heads
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        # x shape: [batch, seq_len, dim]
        b, n, d = x.shape

        # Compute Q, K, V then split into three equal chunks
        qkv = self.to_qkv(x).chunk(3, dim=-1)

        # Reshape for multi-head attention:
        # [batch, heads, seq_len, dim_per_head]
        q, k, v = [
            t.view(b, n, self.heads, self.head_dim).transpose(1, 2)
            for t in qkv
        ]

        # Compute attention scores = QKᵀ / sqrt(dim_per_head)
        attn = (q @ k.transpose(-2, -1)) * self.scale

        # Causal mask: each position may only attend to itself and earlier tokens,
        # which is what makes next-token prediction meaningful
        mask = torch.triu(torch.ones(n, n, device=x.device, dtype=torch.bool), diagonal=1)
        attn = attn.masked_fill(mask, float("-inf"))

        # Apply softmax over keys (last dimension)
        attn = attn.softmax(dim=-1)

        # Weighted sum of values
        out = attn @ v

        # Merge heads back into [batch, seq_len, dim]
        out = out.transpose(1, 2).reshape(b, n, d)

        # Final linear projection
        return self.out(out)
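
A quick shape check with arbitrary sizes: the output has the same shape as the input, which is what lets attention layers be stacked.

attn = SelfAttention(dim=64, heads=4)

x = torch.randn(2, 10, 64)  # [batch, seq_len, dim]
print(attn(x).shape)        # torch.Size([2, 10, 64])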

2.3 Feed-Forward Network

A simple MLP that adds extra representational capacity on top of attention.

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim=None):
        super().__init__()

        # Default to the usual 4x expansion of the model dimension
        hidden_dim = hidden_dim or 4 * dim

        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),  # Expansion
            nn.GELU(),                   # Non-linearity
            nn.Linear(hidden_dim, dim)   # Compression
        )

    def forward(self, x):
        return self.net(x)


2.4 A Single Transformer Block

class TransformerBlock(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()

        self.attn = SelfAttention(dim, heads)
        self.ff = FeedForward(dim)

        # Layer normalization stabilizes training
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        # Attention with residual connection
        x = x + self.attn(self.ln1(x))

        # Feedforward with residual connection
        x = x + self.ff(self.ln2(x))

        return x


This is essentially how SmolLM and TinyLlama organize their internal blocks, just scaled up to more layers and wider dimensions.

3. The Tiny SLM Model Class

Combine:

  • Embedding
  • Positional encoding
  • Transformer blocks
  • Output projection

class TinySLM(nn.Module):
    def __init__(self, vocab_size, dim=256, depth=8, heads=4, max_seq_len=512):
        super().__init__()

        # Token embeddings: convert IDs to vectors
        self.token_embedding = nn.Embedding(vocab_size, dim)

        # Positional embeddings: learn where each token sits in the sequence
        # (max_seq_len caps the context length the model can handle)
        self.pos_embedding = nn.Embedding(max_seq_len, dim)

        # Stack multiple transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(dim, heads) for _ in range(depth)
        ])

        self.ln = nn.LayerNorm(dim)     # Final normalization
        self.output = nn.Linear(dim, vocab_size)  # Predict next token

    def forward(self, x):
        # x: [batch, seq_len]
        b, n = x.shape

        # Create a position index tensor on the same device as the input
        pos = torch.arange(n, device=x.device).unsqueeze(0)

        # Combine token + positional embeddings
        x = self.token_embedding(x) + self.pos_embedding(pos)

        # Pass through all transformer blocks
        for block in self.blocks:
            x = block(x)

        # Normalize and project back into vocabulary space
        x = self.ln(x)
        logits = self.output(x)

        # Output: [batch, seq_len, vocab_size]
        return logits


This is a fully working small language model.
No magic. No hidden layers. Just math.
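
To get a feel for its size, you can instantiate the model and count its learnable parameters (the vocabulary size of 30000 matches the training snippet in the next section):

model = TinySLM(vocab_size=30000, dim=256, depth=8, heads=4)

# Sum every learnable weight and bias in the model
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")

With these defaults that works out to roughly 22M parameters, most of them sitting in the token embedding table and the output projection.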

4. Training the Model

You can train this model with:

  • Cross-entropy loss
  • AdamW optimizer
  • Teacher forcing
  • A simple text dataset

# Train on a GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinySLM(vocab_size=30000).to(device)

# AdamW helps stabilize training
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# CrossEntropyLoss is standard for next-token prediction
criterion = nn.CrossEntropyLoss()

for epoch in range(200):

    # Fetch a batch: 
    # inputs: [batch, seq_len]
    # labels: same, shifted by one position
    inputs, labels = get_batch()

    # Move the batch onto the same device as the model
    inputs, labels = inputs.to(device), labels.to(device)

    logits = model(inputs)

    # Reshape logits for CE loss:
    # [batch * seq_len, vocab_size]
    loss = criterion(
        logits.view(-1, logits.size(-1)),
        labels.view(-1)
    )

    # Zero gradients, backprop, update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print("Epoch:", epoch, "Loss:", loss.item())

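The loop assumes a get_batch() helper that is not shown here. A minimal sketch, assuming the training text has already been tokenized into one long 1-D tensor of token IDs called data (data, batch_size, and seq_len are illustrative names, not part of the code above):

batch_size, seq_len = 16, 128  # illustrative values

def get_batch():
    # data: 1-D LongTensor of token IDs, prepared beforehand (assumed)
    # Pick random starting offsets into the token stream
    ix = torch.randint(0, data.size(0) - seq_len - 1, (batch_size,))

    # Inputs are seq_len-long windows; labels are the same windows shifted right by one
    inputs = torch.stack([data[i : i + seq_len] for i in ix.tolist()])
    labels = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix.tolist()])
    return inputs, labels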

On a single RTX 3060, you can train this tiny model for fun.
On a cloud A100, it will fly.

5. What Can You Do With This Model?

This tiny SLM can:

  • generate text (see the greedy-decoding sketch after this list)
  • complete sentences
  • learn specific tasks (e.g., Excel formulas)
  • act as a chatbot
  • run inside an agent
  • embed into automation workflows
  • fine-tune on structured logic data
  • serve as a lightweight private AI
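
For generation, here is a minimal greedy-decoding sketch (it assumes a trained model and a prompt already encoded into token IDs; greedy decoding simply picks the most likely next token at every step):

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50):
    # prompt_ids: [1, prompt_len] tensor of token IDs on the model's device;
    # the total length must stay below the model's max_seq_len (512 by default)
    model.eval()
    ids = prompt_ids

    for _ in range(max_new_tokens):
        logits = model(ids)                                      # [1, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=1)                   # append and predict again

    return ids

Swapping the argmax for temperature-scaled sampling (softmax over the logits divided by a temperature, then torch.multinomial) produces more varied text.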

And because it’s small, you can:

  • Quantize it
  • Export it to ONNX (a sketch follows this list)
  • Run it in llama.cpp
  • Deploy to Hugging Face Spaces
  • Serve it with FastAPI or Gradio
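
As one example from that list, here is a rough sketch of an ONNX export (the file name and dummy shapes are placeholders; quantization and llama.cpp conversion need extra tooling beyond this):

# Export on the CPU in eval mode
model = model.cpu().eval()

# A dummy batch of token IDs defines the input signature for tracing
dummy_input = torch.randint(0, 30000, (1, 64))

torch.onnx.export(
    model,
    dummy_input,
    "tiny_slm.onnx",
    input_names=["token_ids"],
    output_names=["logits"],
    # Only the batch axis is marked dynamic here; a dynamic sequence length
    # needs extra care because of the causal mask built inside forward()
    dynamic_axes={"token_ids": {0: "batch"}, "logits": {0: "batch"}},
)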

This is the foundation of everything NanoLanguageModels will teach.

6. Where to Go From Here

After building your own basic SLM, you can learn:

  • How tokenization affects model intelligence
  • How to optimize training
  • How to create an instruction-tuned version
  • How to fine-tune on domain-specific datasets
  • How to benchmark and compare models
  • How to package it into a product

Articles #3–50 will take you through these steps.

Get early access to the fastest way to turn plain language into Excel formulas—sign up for the waitlist.
