LLM Training - How It Works
What actually happens when you train a model from scratch — from data to architecture to a 16,384-GPU training loop, one rabbit hole at a time.
[Diagram: topic map. The root question ("...train a model from scratch?") branches into seven topics: Training Data (3 deeper topics), Model Architecture (8 deeper topics), Hardware & Scale (2 deeper topics), The Training Step (8 deeper topics), The Training Loop (3 deeper topics), Post-Training (3 deeper topics), and Evaluation (2 deeper topics).]
Tier 1
- Read
What is the training data?
Training data for a model like Llama 3 70B is a massive, curated corpus — Meta reported using over 15 trillion tokens from a mix of publicly available sources.
- Read
What is the model architecture?
A model architecture is a blueprint — it defines how information flows from input tokens to output predictions, without saying anything about what the model knows.
- Read
Hardware & scale
Let's ground this in the actual iron. We're using a GB200 NVL72 rack as our reference, so here's what's physically sitting in the datacenter.
- Read
The training step
One training step is where the model actually learns. It's the atomic unit of progress — everything else is about how many steps you take and how you configure them.
- Read
The training loop
The training step is what happens once. The training loop is what happens 3.75 million times.
- Read
Post-training
Pretraining teaches the model to continue text. Post-training teaches it what kind of continuation is acceptable, useful, formatted, safe, and rewarded.
- Read
Evaluation
Training loss tells you whether the model is getting better at predicting the next token. It does not tell you whether the model is becoming more useful, safe, honest, or robust, or whether its benchmark results are inflated by contamination.
Tier 2
- Read
What does the final training data actually look like?
After all that filtering, deduplication, and cleaning, the data isn't stored as raw text anymore. It's been tokenized — converted from strings into sequences of integer IDs.
- Read
Why can't training data be pre-tokenized at the source?
If Common Crawl is the backbone for everyone, why doesn't Common Crawl just ship pre-tokenized binary files? Several reasons, and they compound.
- Read
Data mixture & curation
The optimizer updates the weights, but the data mixture decides what the gradients are usually about.
- Read
How do you decide the number of layers?
Short answer: you don't calculate it from first principles. You inherit it from a body of empirical research, scaling laws, and ablation studies.
- Read
Vocabulary size: 128,000
The vocabulary is the set of tokens the model can recognize. The vocabulary size determines how many unique tokens exist in that lookup table.
- Read
Model dimension (d_model): 8,192
This is the width of the highway. Every token at every layer is a vector of exactly 8,192 floating-point numbers.
- Read
Attention heads: 64
Each layer's attention mechanism runs 64 independent attention computations in parallel.
- Read
Key-value heads: 8 (Grouped Query Attention)
GQA shares Key-Value projections across groups of heads. Llama 3 has 8 KV heads — cutting the KV cache size by 8x compared to full MHA.
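As a rough sketch of where the 8x figure comes from, the snippet below counts the bytes of keys and values cached per token under full multi-head attention and under GQA. The layer count, head counts, and model dimension are the figures quoted on this page; the BF16 cache (2 bytes per value) and the helper name `kv_cache_bytes_per_token` are assumptions made for illustration.

```python
# Sketch: KV cache size per token, full MHA vs. GQA.
N_LAYERS = 80
D_MODEL = 8192
N_HEADS = 64
N_KV_HEADS = 8
HEAD_DIM = D_MODEL // N_HEADS  # 128 dimensions per head
BYTES_PER_VALUE = 2            # assumption: cache stored in BF16


def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    """Bytes cached for one token across all layers."""
    # One key vector and one value vector per KV head, per layer.
    return 2 * n_kv_heads * HEAD_DIM * N_LAYERS * BYTES_PER_VALUE


mha_bytes = kv_cache_bytes_per_token(N_HEADS)     # every head keeps its own K/V
gqa_bytes = kv_cache_bytes_per_token(N_KV_HEADS)  # 8 query heads share each K/V

print(f"MHA: {mha_bytes / 1024:.0f} KiB per token")
print(f"GQA: {gqa_bytes / 1024:.0f} KiB per token")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

The reduction is simply the ratio of query heads to KV heads, so it does not depend on the assumed cache precision.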
- Read
FFN hidden dimension: 28,672
Each layer has a feed-forward network that takes the 8,192-dimensional vector, expands it to 28,672 dimensions, applies SiLU activation, then projects it back.
- Read
RMSNorm, RoPE, and SiLU
These three are the infrastructure choices — less glamorous than attention heads or model dimensions, but each one solves a specific problem that would otherwise break training at scale.
- Read
Orchestration — Slurm vs Kubernetes
Somebody has to actually launch the training job, allocate GPUs, handle failures, and manage the queue of researchers waiting for compute time.
- Read
Why 16,384 GPUs?
Training a 70B model on 15 trillion tokens cannot be done at small scale in a reasonable timeframe. Scale is not a luxury. It's the only way the math works.
- Read
The units of distributed training
Before walking through the seven phases, we need a shared vocabulary for the units involved in distributed training.
- Read
Phase 1 — Data loading
The simplest phase conceptually, but at 16,384 GPUs it becomes a distributed systems problem. Each data-parallel replica needs its own micro-batch of token sequences, and together the replicas consume millions of tokens per optimizer step.
- Read
Phase 2 — Forward pass
The forward pass pushes a batch of tokens through all 80 layers to produce predictions. At 16,384 GPUs, the model isn't sitting on one GPU — it's sliced up across thousands of them using three types of parallelism simultaneously.
- Read
Phase 3 — Loss calculation
The forward pass produced a prediction for every token position — a probability distribution over 128,000 tokens. The loss calculation is where you find out how wrong those predictions were.
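To make "how wrong" concrete, here is a minimal sketch that assumes the usual cross-entropy loss for next-token prediction: the loss at one position is the negative log of the probability the model gave to the correct token. The function name and the 0.9 example probability are illustrative, not taken from the series.

```python
import math

VOCAB_SIZE = 128_000  # vocabulary size quoted on this page


def cross_entropy(prob_of_correct_token: float) -> float:
    """Loss (in nats) at one position, given the probability assigned to the true next token."""
    return -math.log(prob_of_correct_token)


# An untrained model that spreads probability evenly over the vocabulary.
uniform_loss = cross_entropy(1.0 / VOCAB_SIZE)
# A model that puts 90% of its probability on the right token.
confident_loss = cross_entropy(0.9)

print(f"uniform guess:          {uniform_loss:.2f} nats")
print(f"90% on the right token: {confident_loss:.2f} nats")
```

Under these assumptions, the uniform-guess value is a useful sanity check: a loss near ln(128,000) at the start of training means the model has not learned anything yet.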
- Read
Phase 4 — Backward pass
The backward pass goes top-to-bottom through all 80 layers answering one question for every single parameter: how much did you contribute to the error, and in which direction?
- Read
Phase 5 — Gradient synchronization
After the backward pass, every GPU has local gradients computed from its slice of the data. Before anyone updates weights, all 512 data-parallel replicas need to agree on a single set of gradients via all-reduce.
- Read
Phase 6 — Optimizer step
The gradients are synced. Time to actually update the weights. This is the moment the model learns — and where AdamW's per-parameter adaptive learning rates and 840GB of optimizer state earn their keep.
- Read
Phase 7 — Checkpointing
Hardware fails. At 16,384 GPUs running 24/7 for weeks, it's not if but how often. Checkpointing saves a complete snapshot of training state so you can recover from inevitable failures without starting over.
- Read
Epochs
If your dataset has 2 trillion unique tokens and your training run processes 15 trillion tokens total, that's roughly 7-8 epochs — the model sees every piece of training data about 7-8 times.
- Read
Batch size & gradient accumulation
Global batch size is the total number of tokens processed before the optimizer updates the weights. For Llama 3 70B, Meta reportedly used ~4 million tokens per step — roughly 500 sequences of 8,192 tokens.
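The arithmetic behind these figures can be checked in a few lines. This is a sketch under stated assumptions: the sequence length, sequence count, 15-trillion-token budget, and 512 data-parallel replicas are the numbers used on this page, while the micro-batch size and accumulation count passed to `global_batch_sequences` are hypothetical values chosen only to show how the factors multiply.

```python
SEQ_LEN = 8192
SEQUENCES_PER_STEP = 500          # "roughly 500" per the text
TOTAL_TOKENS = 15 * 10**12        # 15 trillion training tokens
ROUNDED_TOKENS_PER_STEP = 4 * 10**6

tokens_per_step = SEQ_LEN * SEQUENCES_PER_STEP
optimizer_steps = TOTAL_TOKENS // ROUNDED_TOKENS_PER_STEP


def global_batch_sequences(micro_batch: int, accum_steps: int, dp_replicas: int) -> int:
    """Sequences consumed per optimizer step across the whole cluster."""
    # Each replica runs `accum_steps` micro-batches before the weights update.
    return micro_batch * accum_steps * dp_replicas


# Hypothetical configuration: one sequence per micro-batch, no accumulation.
example_sequences = global_batch_sequences(micro_batch=1, accum_steps=1, dp_replicas=512)

print(f"tokens per step:  {tokens_per_step:,}")
print(f"optimizer steps:  {optimizer_steps:,}")
print(f"example batch:    {example_sequences} sequences, {example_sequences * SEQ_LEN:,} tokens")
```

Gradient accumulation lets the same global batch be reached with fewer replicas or smaller micro-batches by raising `accum_steps`, at the cost of more sequential work per optimizer step.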
- Read
Scaling laws & compute economics
Scaling laws determine how much capability you get for your compute budget, and they explain why model training is a resource allocation problem, not just an ML procedure.
- Read
Supervised fine-tuning (SFT)
SFT trains the model on demonstrations of desired behavior. It is the first step after pretraining and begins shaping the model from a text completer into an assistant.
- Read
Preference training (RLHF, DPO)
SFT says 'here's what a good response looks like.' Preference training says 'this response is better than that one — learn why.'
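For a concrete picture of "this response is better than that one", here is a minimal sketch of the DPO loss for a single preference pair, written from the published DPO objective. The function name, argument names, and the `beta=0.1` default are illustrative assumptions; the log-probabilities would come from the policy being trained and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(f"start: {start:.4f}  improved: {improved:.4f}")
```

The loss only falls when the policy raises the chosen response relative to the rejected one by more than the reference model does, which is what keeps the policy anchored to its starting point.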
- Read
Reward hacking & objective mismatch
The model optimizes the signal, not the intent. This is the gap between what we measure and what we actually want.
- Read
Validation loss & benchmarks
The quantitative signals teams track during and after training — from validation loss curves to downstream benchmarks to human preference evaluations.
- Read
Contamination & evaluation integrity
If benchmark examples are in the training data, the model appears more capable than it really is. This is not a theoretical concern — it's an active problem.
Tier 3
- Read
Why does the FFN hold 80% of the parameters?
The FFN accounts for about 81% of Llama 3 70B's parameters: 56.4 billion out of 69.5 billion. SwiGLU added a third matrix while GQA shrank attention. The gap widened from both directions.
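The 56.4 billion and 69.5 billion figures can be reproduced from the dimensions listed on this page. The sketch below makes a few assumptions that are worth stating: the vocabulary is rounded to 128,000, the embedding table is counted once, and the small normalization weights are ignored.

```python
D_MODEL = 8192
D_FFN = 28672
N_LAYERS = 80
N_HEADS = 64
N_KV_HEADS = 8
VOCAB = 128_000
HEAD_DIM = D_MODEL // N_HEADS

# SwiGLU FFN: gate, up, and down projections (three matrices).
ffn_per_layer = 3 * D_MODEL * D_FFN

# GQA attention: full-width Q and output projections, narrower K and V.
kv_width = N_KV_HEADS * HEAD_DIM
attn_per_layer = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_width

embedding = VOCAB * D_MODEL  # assumption: embedding table counted once

ffn_total = N_LAYERS * ffn_per_layer
attn_total = N_LAYERS * attn_per_layer
total = ffn_total + attn_total + embedding

print(f"FFN:       {ffn_total / 1e9:.1f}B")
print(f"Attention: {attn_total / 1e9:.1f}B")
print(f"Total:     {total / 1e9:.1f}B")
print(f"FFN share: {ffn_total / total:.1%}")
```

Setting `N_KV_HEADS = 64` or dropping the factor of 3 to 2 in `ffn_per_layer` shows the two effects separately: full multi-head attention grows the attention total, and a two-matrix FFN shrinks the FFN total.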
- Read
Mixture of Experts (MoE)
MoE decouples total parameter count from per-token compute by routing each token through a subset of expert sub-networks. It is the architecture behind Mixtral, DeepSeek-V3, and reportedly GPT-4, and it comes with deep tradeoffs in memory, communication, and training stability.
- Read
Synthetic data & distillation
Synthetic data is not free data. It is a way of converting model capability, filtering, and compute into more training signal. For modern training at the frontier, it's no longer optional.
- Read
Long-context training & sequence packing
Llama 3 70B pretrains on 8K-token sequences. Llama 3.1 extended the family to 128K context through additional long-context training. The gap is bridged by sequence packing, RoPE scaling, and a short but carefully tuned long-context fine-tuning phase.
- Read
The parallelism orchestration stack
Slurm launches the job and allocates hardware. But Slurm doesn't know what tensor parallelism is. The actual orchestration — which GPU holds which slice of which layer — is a software stack inside the training job.
- Read
The training objective: shifted tokens & loss masking
Most explanations say the model learns to predict the next word. That's correct but papers over details that matter enormously — shifted token prediction, SFT loss masking, and why the three training stages shape different behaviors.
- Read
Activation checkpointing
Throw away most activations during the forward pass, recompute them during the backward pass. The universal answer to the activation memory wall — trading compute for memory at a ratio that makes large-scale training possible.
- Read
What actually changes in the weights
Training doesn't insert records into a lookup table. The model's knowledge lives in the geometry of its transformation space, distributed across billions of weights, each contributing a tiny piece to many different capabilities simultaneously.
- Read
Training stability: loss spikes, NaN, and precision
A training run at scale is a delicate numerical process running for millions of steps across thousands of GPUs. Training stability is the emergent property of many interacting design choices — normalization, precision, clipping, warmup — each contributing a small margin of safety.
- Read
ZeRO & memory optimization
In pure data parallelism, every GPU replica stores a complete copy of the 840GB training state — 430TB of redundant memory across the cluster. ZeRO eliminates that redundancy in three progressive stages, from sharding optimizer states to sharding everything.
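The redundancy figure is a one-line multiplication, sketched below along with the other extreme. The 840GB state and the 512 data-parallel replicas are the numbers used on this page; the even split in `sharded_per_replica_gb` is a simplifying assumption that ignores activations, communication buffers, and the differences between the individual ZeRO stages.

```python
STATE_GB = 840       # training state held by one full replica
DP_REPLICAS = 512    # data-parallel replicas in the cluster

# Pure data parallelism: every replica keeps its own full copy.
replicated_total_gb = STATE_GB * DP_REPLICAS

# Fully sharded (assumed even split): each replica keeps 1/512 of the state.
sharded_per_replica_gb = STATE_GB / DP_REPLICAS

print(f"replicated across cluster: {replicated_total_gb / 1000:.0f} TB")
print(f"fully sharded per replica: {sharded_per_replica_gb:.2f} GB")
```

The intermediate stages fall between these two endpoints, since each one shards only part of the state.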
- Read
Mixed precision training
Every floating-point operation in training has a precision. More bits means more accuracy but more memory and slower compute. Mixed precision uses different precisions for different parts of the computation — BF16 for speed where it's safe, FP32 where it's not.
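One common reason to keep some values in FP32 is that small weight updates can vanish in BF16. The sketch below emulates BF16 in pure Python by rounding an FP32 bit pattern to its top 16 bits; the helper names and the 1e-3 update size are assumptions for illustration, and NaN and infinity are not handled.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the FP32 encoding, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


UPDATE = 1e-3  # hypothetical per-step weight change
STEPS = 100

w_bf16 = 1.0    # weight stored and updated in BF16
w_master = 1.0  # weight stored and updated in FP32
for _ in range(STEPS):
    w_bf16 = to_bf16(w_bf16 + UPDATE)
    w_master = to_fp32(w_master + UPDATE)

print(f"BF16 weight after {STEPS} updates: {w_bf16}")
print(f"FP32 weight after {STEPS} updates: {w_master:.4f}")
```

Near 1.0 the gap between adjacent BF16 values is 2^-7, so each 0.001 update rounds away and the BF16 weight never moves, while the FP32 copy accumulates all of them.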