LLM Training - How It Works on Wes Kennedy

https://wes.today/series/training/
Recent content in LLM Training - How It Works on Wes Kennedy
Tue, 19 May 2026 00:00:00 +0000

What is the training data?
https://wes.today/series/training/train-from-scratch/training-data/
Thu, 14 May 2026 00:00:00 +0000
Training data for a model like Llama 3 70B is a massive, curated corpus — Meta reported using over 15 trillion tokens from a mix of publicly available sources.

What is the model architecture?
https://wes.today/series/training/train-from-scratch/model-architecture/
Thu, 14 May 2026 00:00:00 +0000
A model architecture is a blueprint — it defines how information flows from input tokens to output predictions, without saying anything about what the model knows.

Hardware & scale
https://wes.today/series/training/train-from-scratch/hardware-and-scale/
Thu, 14 May 2026 00:00:00 +0000
Let’s ground this in the actual iron. We’re using a GB200 NVL72 rack as our reference, so here’s what’s physically sitting in the datacenter.

The training step
https://wes.today/series/training/train-from-scratch/training-step/
Tue, 19 May 2026 00:00:00 +0000
One training step is where the model actually learns. It’s the atomic unit of progress — everything else is about how many steps you take and how you configure them.

The training loop
https://wes.today/series/training/train-from-scratch/training-loop/
Tue, 19 May 2026 00:00:00 +0000
The training step is what happens once.
The training loop is what happens 3.75 million times.

Post-training
https://wes.today/series/training/train-from-scratch/post-training/
Tue, 19 May 2026 00:00:00 +0000
Pretraining teaches the model to continue text. Post-training teaches it what kind of continuation is acceptable, useful, formatted, safe, and rewarded.

Evaluation
https://wes.today/series/training/train-from-scratch/evaluation/
Tue, 19 May 2026 00:00:00 +0000
Training loss tells you whether the model is getting better at predicting held-out tokens. It does not tell you whether the model is getting better at being useful, safe, honest, robust, or non-contaminated.

What does the final training data actually look like?
https://wes.today/series/training/train-from-scratch/training-data/final-data-format/
Thu, 14 May 2026 00:00:00 +0000
After all that filtering, deduplication, and cleaning, the data isn’t stored as raw text anymore. It’s been tokenized — converted from strings into sequences of integer IDs.

Why can't training data be pre-tokenized at the source?
https://wes.today/series/training/train-from-scratch/training-data/pre-tokenization/
Thu, 14 May 2026 00:00:00 +0000
If Common Crawl is the backbone for everyone, why doesn’t Common Crawl just ship pre-tokenized binary files?
Several reasons, and they compound.

Data mixture & curation
https://wes.today/series/training/train-from-scratch/training-data/data-mixture/
Tue, 19 May 2026 00:00:00 +0000
The optimizer updates the weights, but the data mixture decides what the gradients are usually about.

How do you decide the number of layers?
https://wes.today/series/training/train-from-scratch/model-architecture/layer-count/
Thu, 14 May 2026 00:00:00 +0000
Short answer: you don’t calculate it from first principles. You inherit it from a body of empirical research, scaling laws, and ablation studies.

Vocabulary size: 128,000
https://wes.today/series/training/train-from-scratch/model-architecture/vocabulary-size/
Thu, 14 May 2026 00:00:00 +0000
The vocabulary is the set of tokens the model can recognize. The vocabulary size determines how many unique tokens exist in that lookup table.

Model dimension (d_model): 8,192
https://wes.today/series/training/train-from-scratch/model-architecture/model-dimension/
Thu, 14 May 2026 00:00:00 +0000
This is the width of the highway.
Every token at every layer is a vector of exactly 8,192 floating-point numbers.

Attention heads: 64
https://wes.today/series/training/train-from-scratch/model-architecture/attention-heads/
Thu, 14 May 2026 00:00:00 +0000
Each layer’s attention mechanism runs 64 independent attention computations in parallel.

Key-value heads: 8 (Grouped Query Attention)
https://wes.today/series/training/train-from-scratch/model-architecture/gqa/
Thu, 14 May 2026 00:00:00 +0000
GQA shares Key-Value projections across groups of heads. Llama 3 has 8 KV heads — cutting the KV cache size by 8x compared to full MHA.

FFN hidden dimension: 28,672
https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/
Thu, 14 May 2026 00:00:00 +0000
Each layer has a feed-forward network that takes the 8,192-dimensional vector, expands it to 28,672 dimensions, applies SiLU activation, then projects it back.

RMSNorm, RoPE, and SiLU
https://wes.today/series/training/train-from-scratch/model-architecture/rmsnorm-rope-silu/
Thu, 14 May 2026 00:00:00 +0000
These three are the infrastructure choices — less glamorous than attention heads or model dimensions, but each one solves a specific problem that would otherwise break training at scale.

Orchestration — Slurm vs Kubernetes
https://wes.today/series/training/train-from-scratch/hardware-and-scale/orchestration/
Thu, 14 May 2026 00:00:00 +0000
Somebody has to actually launch the training job, allocate GPUs, handle failures, and manage the queue of researchers waiting for compute time.

Why 16,384 GPUs?
https://wes.today/series/training/train-from-scratch/hardware-and-scale/why-scale/
Tue, 19 May 2026 00:00:00 +0000
A 70B model trained on 15 trillion tokens literally cannot be done at small scale in a reasonable timeframe. Scale is not a luxury. It’s the only way the math works.

The units of distributed training
https://wes.today/series/training/train-from-scratch/training-step/training-units/
Tue, 19 May 2026 00:00:00 +0000
Before walking through the seven phases, we need a shared vocabulary for the units involved in distributed training.

Phase 1 — Data loading
https://wes.today/series/training/train-from-scratch/training-step/data-loading/
Thu, 14 May 2026 00:00:00 +0000
The simplest phase conceptually, but at 16,384 GPUs it becomes a distributed systems problem. Each data-parallel replica needs its micro-batch of token sequences, consuming millions of tokens per optimizer step.

Phase 2 — Forward pass
https://wes.today/series/training/train-from-scratch/training-step/forward-pass/
Thu, 14 May 2026 00:00:00 +0000
The forward pass pushes a batch of tokens through all 80 layers to produce predictions. At 16,384 GPUs, the model isn’t sitting on one GPU — it’s sliced up across thousands of them using three types of parallelism simultaneously.

Phase 3 — Loss calculation
https://wes.today/series/training/train-from-scratch/training-step/loss-calculation/
Thu, 14 May 2026 00:00:00 +0000
The forward pass produced a prediction for every token position — a probability distribution over 128,000 tokens.
The loss calculation is where you find out how wrong those predictions were.

Phase 4 — Backward pass
https://wes.today/series/training/train-from-scratch/training-step/backward-pass/
Thu, 14 May 2026 00:00:00 +0000
The backward pass goes top-to-bottom through all 80 layers answering one question for every single parameter: how much did you contribute to the error, and in which direction?

Phase 5 — Gradient synchronization
https://wes.today/series/training/train-from-scratch/training-step/gradient-sync/
Thu, 14 May 2026 00:00:00 +0000
After the backward pass, every GPU has local gradients computed from its slice of the data. Before anyone updates weights, all 512 data-parallel replicas need to agree on a single set of gradients via all-reduce.

Phase 6 — Optimizer step
https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/
Thu, 14 May 2026 00:00:00 +0000
The gradients are synced. Time to actually update the weights. This is the moment the model learns — and where AdamW’s per-parameter adaptive learning rates and 840GB of optimizer state earn their keep.

Phase 7 — Checkpointing
https://wes.today/series/training/train-from-scratch/training-step/checkpointing/
Thu, 14 May 2026 00:00:00 +0000
Hardware fails. At 16,384 GPUs running 24/7 for weeks, it’s not if but how often.
Checkpointing saves a complete snapshot of training state so you can recover from inevitable failures without starting over.

Epochs
https://wes.today/series/training/train-from-scratch/training-loop/epochs/
Tue, 19 May 2026 00:00:00 +0000
If your dataset has 2 trillion unique tokens and your training run processes 15 trillion tokens total, that’s roughly 7-8 epochs — the model sees every piece of training data about 7-8 times.

Batch size & gradient accumulation
https://wes.today/series/training/train-from-scratch/training-loop/batch-size/
Tue, 19 May 2026 00:00:00 +0000
Global batch size is the total number of tokens processed before the optimizer updates the weights. For Llama 3 70B, Meta reportedly used ~4 million tokens per step — roughly 500 sequences of 8,192 tokens.

Scaling laws & compute economics
https://wes.today/series/training/train-from-scratch/training-loop/scaling-laws/
Tue, 19 May 2026 00:00:00 +0000
Scaling laws determine how much capability you get for your compute budget, and they explain why model training is a resource allocation problem, not just an ML procedure.

Supervised fine-tuning (SFT)
https://wes.today/series/training/train-from-scratch/post-training/sft/
Tue, 19 May 2026 00:00:00 +0000
Demonstrations of desired behavior.
The first step after pretraining that begins shaping the model from a text completer into an assistant.

Preference training (RLHF, DPO)
https://wes.today/series/training/train-from-scratch/post-training/preference-training/
Tue, 19 May 2026 00:00:00 +0000
SFT says ‘here’s what a good response looks like.’ Preference training says ‘this response is better than that one — learn why.’

Reward hacking & objective mismatch
https://wes.today/series/training/train-from-scratch/post-training/reward-hacking/
Tue, 19 May 2026 00:00:00 +0000
The model optimizes the signal, not the intent. This is the gap between what we measure and what we actually want.

Validation loss & benchmarks
https://wes.today/series/training/train-from-scratch/evaluation/validation-benchmarks/
Tue, 19 May 2026 00:00:00 +0000
The quantitative signals teams track during and after training — from validation loss curves to downstream benchmarks to human preference evaluations.

Contamination & evaluation integrity
https://wes.today/series/training/train-from-scratch/evaluation/contamination/
Tue, 19 May 2026 00:00:00 +0000
If benchmark examples are in the training data, the model appears more capable than it really is. This is not a theoretical concern — it’s an active problem.

Why does the FFN hold 80% of the parameters?
https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/ffn-parameters/
Tue, 19 May 2026 00:00:00 +0000
The FFN accounts for 82% of Llama 3 70B’s parameters — 56.4 billion out of 69.5 billion.
SwiGLU added a third matrix while GQA shrank attention. The gap widened from both directions.

Mixture of Experts (MoE)
https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/moe/
Tue, 19 May 2026 00:00:00 +0000
MoE decouples total parameter count from per-token compute by routing each token through a subset of expert sub-networks. The architecture behind Mixtral, DeepSeek-V3, and reportedly GPT-4 – with deep tradeoffs in memory, communication, and training stability.

Synthetic data & distillation
https://wes.today/series/training/train-from-scratch/training-data/data-mixture/synthetic-data/
Tue, 19 May 2026 00:00:00 +0000
Synthetic data is not free data. It is a way of converting model capability, filtering, and compute into more training signal. For modern training at the frontier, it’s no longer optional.

Long-context training & sequence packing
https://wes.today/series/training/train-from-scratch/training-data/data-mixture/long-context/
Tue, 19 May 2026 00:00:00 +0000
Llama 3 70B pretrains on 8K-token sequences. Llama 3.1 extended the family to 128K context through additional long-context training. The gap is bridged by sequence packing, RoPE scaling, and a short but carefully tuned long-context fine-tuning phase.

The parallelism orchestration stack
https://wes.today/series/training/train-from-scratch/training-step/forward-pass/parallelism-stack/
Tue, 19 May 2026 00:00:00 +0000
Slurm launches the job and allocates hardware. But Slurm doesn’t know what tensor parallelism is.
The actual orchestration — which GPU holds which slice of which layer — is a software stack inside the training job.

The training objective: shifted tokens & loss masking
https://wes.today/series/training/train-from-scratch/training-step/loss-calculation/training-objective/
Tue, 19 May 2026 00:00:00 +0000
Most explanations say the model learns to predict the next word. That’s correct but papers over details that matter enormously — shifted token prediction, SFT loss masking, and why the three training stages shape different behaviors.

Activation checkpointing
https://wes.today/series/training/train-from-scratch/training-step/backward-pass/activation-checkpointing/
Tue, 19 May 2026 00:00:00 +0000
Throw away most activations during the forward pass, recompute them during the backward pass. The universal answer to the activation memory wall — trading compute for memory at a ratio that makes large-scale training possible.

What actually changes in the weights
https://wes.today/series/training/train-from-scratch/training-loop/epochs/weight-changes/
Tue, 19 May 2026 00:00:00 +0000
Training doesn’t insert records into a lookup table.
The model’s knowledge lives in the geometry of its transformation space — distributed across millions of weights, each contributing a tiny piece to many different capabilities simultaneously.

Training stability: loss spikes, NaN, and precision
https://wes.today/series/training/train-from-scratch/training-loop/scaling-laws/training-stability/
Tue, 19 May 2026 00:00:00 +0000
A training run at scale is a delicate numerical process running for millions of steps across thousands of GPUs. Training stability is the emergent property of many interacting design choices — normalization, precision, clipping, warmup — each contributing a small margin of safety.

ZeRO & memory optimization
https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/zero/
Tue, 19 May 2026 00:00:00 +0000
In pure data parallelism, every GPU replica stores a complete copy of the 840GB training state — 430TB of redundant memory across the cluster. ZeRO eliminates that redundancy in three progressive stages, from sharding optimizer states to sharding everything.

Mixed precision training
https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/mixed-precision/
Tue, 19 May 2026 00:00:00 +0000
Every floating-point operation in training has a precision. More bits means more accuracy but more memory and slower compute.
Mixed precision uses different precisions for different parts of the computation — BF16 for speed where it’s safe, FP32 where it’s not.

What happens when you train a model from scratch?
https://wes.today/series/training/train-from-scratch/
Thu, 14 May 2026 00:00:00 +0000
You have a pile of text — trillions of tokens scraped from the internet, books, code repositories, scientific papers. You have a model architecture — a specific arrangement of transformer layers, attention heads, and feed-forward networks.
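
Several of the figures quoted in the summaries above follow from simple arithmetic on the stated Llama 3 70B numbers. The sketch below reproduces a few of them as a sanity check. It is illustrative only: the function name `series_figures` is made up for this example, the parameter count is rounded to 70 billion, and the 12 bytes per parameter (FP32 master weights plus AdamW's two FP32 moment estimates) is an assumption not stated in the summaries.

```python
# Back-of-the-envelope checks for figures quoted in the series summaries.
# Inputs are taken from the summaries unless marked as an assumption.

TOTAL_TOKENS = 15 * 10**12     # tokens processed over the whole run
TOKENS_PER_STEP = 4 * 10**6    # approximate global batch size in tokens
UNIQUE_TOKENS = 2 * 10**12     # hypothetical dataset size from the Epochs summary
D_MODEL = 8_192                # model dimension
FFN_HIDDEN = 28_672            # FFN hidden dimension
N_LAYERS = 80                  # transformer layers
N_HEADS = 64                   # attention (query) heads
N_KV_HEADS = 8                 # key-value heads under GQA
DP_REPLICAS = 512              # data-parallel replicas
N_PARAMS = 70 * 10**9          # assumption: parameter count rounded to 70B
BYTES_PER_PARAM = 12           # assumption: FP32 weights + two FP32 AdamW moments


def series_figures() -> dict:
    """Recompute a handful of headline numbers from the constants above."""
    # SwiGLU uses three weight matrices per layer, each d_model x ffn_hidden.
    ffn_params = 3 * D_MODEL * FFN_HIDDEN * N_LAYERS
    state_gb = N_PARAMS * BYTES_PER_PARAM / 1e9
    return {
        "optimizer_steps": TOTAL_TOKENS // TOKENS_PER_STEP,
        "epochs": TOTAL_TOKENS / UNIQUE_TOKENS,
        "kv_cache_reduction": N_HEADS // N_KV_HEADS,
        "ffn_params_billion": round(ffn_params / 1e9, 1),
        "training_state_gb": state_gb,
        "replicated_state_tb": round(state_gb * DP_REPLICAS / 1000),
    }


if __name__ == "__main__":
    for name, value in series_figures().items():
        print(f"{name}: {value}")
```

Under these assumptions the results line up with the summaries: 3.75 million optimizer steps, 7.5 epochs, an 8x smaller KV cache, about 56.4 billion FFN parameters, 840 GB of training state, and roughly 430 TB when that state is replicated across all 512 replicas.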