Large Language Models, Explained for Builders

A language model is a musical instrument that plays probability. Every token is a note, and attention is the conductor that decides who must be heard.

Why this post

We will move from intuition to equations, from small code to scaling laws, then to practical guardrails. The goal is simple: understand the essential pieces well enough to build with confidence.

1. From sequence to structure

A text sequence is a list of tokens \((x_1, x_2, \dots, x_n)\). The job of a causal language model is to estimate

\[p(x_1,\dots,x_n)=\prod_{t=1}^{n}p(x_t\mid x_{<t}).\]

Learning this factorization is what lets models write, translate, and reason by predicting the next token well.
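
To make the factorization concrete, here is a minimal sketch that sums per-token conditional log-probabilities; it assumes you already have per-position logits from some causal model, and the shapes are illustrative.

import torch
import torch.nn.functional as F

def sequence_log_prob(logits, tokens):
    # logits: (T, V), row t holds the model's scores for x_t given x_<t
    # tokens: (T,) the tokens that actually occurred
    log_probs = F.log_softmax(logits, dim=-1)        # log p(. | x_<t) per position
    picked = log_probs.gather(-1, tokens[:, None])   # log p(x_t | x_<t)
    return picked.sum()                              # log p(x_1, ..., x_n)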

Self-attention in one breath

For a matrix of token embeddings \(X\in\mathbb{R}^{n\times d}\), linear maps produce queries, keys, values:

\[Q= XW_Q,\quad K= XW_K,\quad V= XW_V.\]

Scaled dot-product attention for a single head is

\[\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V,\]

where \(M\) is a mask that blocks looking into the future. Multi-head attention concatenates several heads, then mixes them with a final linear layer. Residual connections and layer normalization stabilize learning. A feedforward block adds nonlinearity and width.

Complexity. Vanilla attention costs \(\Theta(n^2 d)\) per layer because of the \(QK^\top\) matrix. This is why long contexts are expensive, and why techniques like FlashAttention and sparse or linear attention matter (see references).

A micro implementation

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.d = d_model // n_heads
        self.n_heads = n_heads
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        q = self.Wq(x).view(B, T, self.n_heads, self.d).transpose(1,2)   # B, H, T, d
        k = self.Wk(x).view(B, T, self.n_heads, self.d).transpose(1,2)
        v = self.Wv(x).view(B, T, self.n_heads, self.d).transpose(1,2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)              # B, H, T, T
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        w = F.softmax(scores, dim=-1)
        y = w @ v                                                         # B, H, T, d
        y = y.transpose(1,2).contiguous().view(B, T, D)
        return self.Wo(y)
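
A quick smoke test, using random inputs and a causal mask built with torch.tril so position t only attends to positions up to t:

B, T, d_model = 2, 16, 64
x = torch.randn(B, T, d_model)
causal = torch.tril(torch.ones(T, T)).view(1, 1, T, T)   # broadcasts over batch and heads
attn = TinyAttention(d_model=64, n_heads=4)
y = attn(x, mask=causal)
print(y.shape)                                           # torch.Size([2, 16, 64])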

Mini‑recap. We now have the core operator that lets a token listen to others. Everything else in the stack, from positional encodings to layer scaling, is in service of making this operator efficient and stable.

2. Positional information

Since attention is permutation invariant, positions must be injected. Two common choices:

1) Sinusoidal: fixed embeddings with frequencies that let the model infer relative shifts.
2) Rotary: rotate queries and keys in a complex plane so that relative positions are implicit.

In practice, rotary embeddings often improve extrapolation to longer contexts.
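
Here is a rough sketch of the rotary idea, not a drop-in for any particular library: each pair of dimensions is rotated by an angle proportional to the position, and because the same rotation is applied to queries and keys, their dot product depends only on the relative offset.

import torch

def apply_rotary(x, base=10000.0):
    # x: (B, H, T, d) with even d; rotate each (even, odd) pair of dimensions
    B, H, T, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,) frequencies
    angles = torch.arange(T, dtype=torch.float32)[:, None] * theta       # (T, d/2) = position * frequency
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

In the TinyAttention module above, this rotation would be applied to q and k right after they are reshaped into heads, before the scores are computed.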

3. Training as compression

Pretraining is maximum likelihood over a large corpus. The loss is cross‑entropy

\[\mathcal{L}= -\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i\mid x_{<i}).\]

This is compression in disguise. If your model predicts tokens well, you can code text with fewer bits per token. Perplexity \(\mathrm{PPL}=e^{\mathcal{L}}\) is the exponential of the average surprise.
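
In code, the pair looks like this; a minimal sketch, assuming logits and targets shaped as in the comments:

import torch
import torch.nn.functional as F

def loss_and_perplexity(logits, targets):
    # logits: (B, T, V) next-token predictions, targets: (B, T) observed next tokens
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss, torch.exp(loss)   # PPL = e^loss because cross_entropy uses the natural log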

Optimization in practice

  • Adam or AdamW with warmup and cosine decay (a minimal step sketch follows this list).
  • Mixed precision to keep throughput high.
  • Gradient clipping to tame rare spikes.
  • Data and model parallelism when a single device is not enough.
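
Stitched together, a single training loop step might look like the sketch below; the model, data, and schedule numbers are placeholders, not a recipe.

import math
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 1000)                     # stand-in for a real transformer
loader = [(torch.randn(8, 64), torch.randint(0, 1000, (8,))) for _ in range(4)]

opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step, warmup=100, total=1_000):
    # linear warmup, then cosine decay to zero
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for x, y in loader:
    opt.zero_grad(set_to_none=True)
    with torch.autocast("cuda", enabled=use_cuda):               # mixed precision when a GPU exists
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(opt)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)      # gradient clipping
    scaler.step(opt)
    scaler.update()
    sched.step()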

Mini‑recap. You do not need magic. You need a clean data path, stable optimization, and patience.

4. Scaling laws in one page

Empirically, loss follows a power law with compute, parameters, and data. Two rules of thumb that guide design:

  • For a fixed training compute budget, there is an optimal balance between model size and tokens.
  • Under‑trained large models waste capacity; over‑trained small models hit a floor.

These observations led to the Chinchilla recipe: keep models modest, feed them more tokens.
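
As a back-of-the-envelope sketch: Hoffmann et al. land near 20 training tokens per parameter, and training compute is commonly approximated as 6 FLOPs per parameter per token. Treat both as rough constants, not laws.

def chinchilla_tokens(n_params, tokens_per_param=20):
    # rough compute-optimal token budget for a given parameter count
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    # common approximation: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

n = 7e9                                            # a 7B-parameter model, for example
t = chinchilla_tokens(n)
print(f"{t:.1e} tokens, {training_flops(n, t):.1e} training FLOPs")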

5. Efficient attention and long context

Why do long prompts feel slow? Because the attention matrix grows with \(n^2\). Practical tricks:

  • FlashAttention reorders operations so attention fits GPU memory, improving throughput.
  • Chunking and sliding windows limit interactions to local neighborhoods (see the mask sketch after this list).
  • KV caching avoids recomputing keys and values during generation, turning the cost closer to \(\Theta(n d)\) per generated token after the first pass.
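
Here is a minimal sketch of a causal sliding-window mask that drops into the TinyAttention module above; the window size is illustrative.

import torch

def sliding_window_mask(T, window=256):
    # 1 where query position i may attend key position j: causal and at most `window` tokens back
    i = torch.arange(T)[:, None]
    j = torch.arange(T)[None, :]
    return ((j <= i) & (i - j < window)).int().view(1, 1, T, T)

Each row of the mask now has at most window nonzero entries, so the useful work per query stops growing with the full sequence length.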

6. From foundation to function

A base model maps text to text. Real products need more:

  • Instruction tuning: supervised fine‑tuning on prompt‑response pairs.
  • RL from human feedback: align the model with qualitative preferences.
  • Tool use: extend the model with APIs for search, code execution, or calculators.
  • Retrieval augmentation: ground the model on private documents to reduce hallucinations.

These layers add behavior without retraining the base from scratch.
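
As one concrete flavor, a minimal retrieval-augmentation sketch: the document chunks and their embedding vectors are assumed to exist already, produced by any embedding model you trust.

import torch
import torch.nn.functional as F

def retrieve(query_vec, doc_vecs, docs, k=3):
    # cosine similarity between the query embedding and each chunk embedding
    sims = F.cosine_similarity(query_vec[None, :], doc_vecs, dim=-1)
    top = sims.topk(min(k, len(docs))).indices
    return [docs[i] for i in top.tolist()]

def build_prompt(question, passages):
    # ground the model: answer only from the retrieved passages
    context = "\n\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"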

7. Safety is a system property

Guardrails are not a single filter. Think in layers:

1) Prompt design and system messages.
2) Input validation and content filters.
3) Retrieval scopes that limit what the model can see.
4) Post‑generation checks, red‑team tests, and audit trails.

There is no perfect defense, but layered design reduces risk and improves trust.

8. Worked micro‑example: next‑token sampler

import torch

def sample_next(logits, temperature=1.0, top_p=0.9):
    # logits: 1 x V unnormalized scores for the next token
    if temperature <= 0:
        # greedy decoding; keepdim keeps the same 1 x 1 shape as the sampling branch
        return torch.argmax(logits, dim=-1, keepdim=True)
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)
    # nucleus sampling: keep the most probable tokens whose cumulative mass fits in top_p
    sorted_probs, idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    keep = cum <= top_p
    keep[..., 0] = True                      # always keep at least the single most likely token
    cutoff = keep.sum().item()               # after the descending sort, keep is a prefix of Trues
    filtered = torch.full_like(sorted_probs, float('-inf'))
    filtered[..., :cutoff] = torch.log(sorted_probs[..., :cutoff])
    # renormalize over the kept tokens, sample one, and map back to the original vocabulary id
    choice = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
    return idx.gather(-1, choice)

This sampler is enough to feel how temperature and nucleus sampling change style, repetition, and risk.
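
A quick way to feel those knobs, with random logits standing in for a real model's output:

torch.manual_seed(0)
logits = torch.randn(1, 50)                            # pretend vocabulary of 50 tokens
for temp in (0.2, 0.7, 1.3):
    picks = [sample_next(logits, temperature=temp, top_p=0.9).item() for _ in range(5)]
    print(temp, picks)                                 # higher temperature, more varied picks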

9. Costs you must budget

  • Latency depends on tokens in, tokens out, and hardware. KV caching and batching help.
  • Memory scales with parameters and activations; activation checkpointing trades compute for memory.
  • Money scales with throughput and uptime guarantees. Measure tokens per dollar, not just tokens per second (a rough calculator follows this list).
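
A back-of-the-envelope sketch for that last point; the throughput and price below are made-up numbers, so plug in your own.

def tokens_per_dollar(tokens_per_second, dollars_per_hour):
    # sustained throughput versus what the instance or endpoint costs you
    return tokens_per_second * 3600 / dollars_per_hour

print(f"{tokens_per_dollar(tokens_per_second=2500, dollars_per_hour=4.0):,.0f} tokens per dollar")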

Mini‑recap. Your design space is a triangle: quality, latency, cost. Pick two, engineer the third.

10. A small mental model

Think of an LLM as a choir that learned to sing the statistical music of the web. Pretraining taught the harmonies, fine‑tuning taught the lyrics of your task, retrieval hands the choir the correct sheet music, and tools bring instruments to the stage. Good engineering is arranging the performance so the audience hears the right song quickly and safely.

Exercises

1) Prove that causal masking makes the attention weight matrix lower triangular after the softmax. Explain how that structure shapes gradients.

2) Derive the memory footprint of KV caches for a decoder with \(L\) layers, \(H\) heads, head dimension \(d\), and sequence length \(n\). Express it in bytes for float16.

3) Implement FlashAttention v1 or a sliding‑window attention and measure speedups for \(n\in\{512, 2048, 8192\}\). Report throughput and exactness.

4) Using a public dataset, replicate a tiny instruction‑tuning run. Document prompt formats and failures you observe.

5) Design a retrieval pipeline for your company docs. Specify chunking strategy, embeddings, indexing, and security boundaries.

Beyond the Algorithm

The interesting part is not that models predict text. The interesting part is that prediction turns into synthesis, and synthesis into ideas that help us build. As in astronomy, you do not own the stars, yet you navigate by their patterns. Work with humility, measure honestly, and ship tools that make others stronger.

Bibliography (arXiv)

  • Vaswani et al., Attention Is All You Need, arXiv:1706.03762
  • Brown et al., Language Models are Few-Shot Learners, arXiv:2005.14165
  • Kaplan et al., Scaling Laws for Neural Language Models, arXiv:2001.08361
  • Hoffmann et al., Training Compute-Optimal Large Language Models, arXiv:2203.15556
  • Rae et al., Scaling Language Models: Methods, Analysis & Insights from Training Gopher, arXiv:2112.11446
  • Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, arXiv:2205.14135
  • Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, arXiv:2106.09685
  • Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, arXiv:2302.04761
  • Tay et al., Efficient Transformers: A Survey, arXiv:2009.06732