<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLM Training - How It Works on Wes Kennedy</title><link>https://wes.today/series/training/</link><description>Recent content in LLM Training - How It Works on Wes Kennedy</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wes.today/series/training/index.xml" rel="self" type="application/rss+xml"/><item><title>What is the training data?</title><link>https://wes.today/series/training/train-from-scratch/training-data/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-data/</guid><description>Training data for a model like Llama 3 70B is a massive, curated corpus — Meta reported using over 15 trillion tokens from a mix of publicly available sources.</description></item><item><title>What is the model architecture?</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/</guid><description>A model architecture is a blueprint — it defines how information flows from input tokens to output predictions, without saying anything about what the model knows.</description></item><item><title>Hardware &amp; scale</title><link>https://wes.today/series/training/train-from-scratch/hardware-and-scale/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/hardware-and-scale/</guid><description>Let&amp;rsquo;s ground this in the actual iron. 
We&amp;rsquo;re using a GB200 NVL72 rack as our reference, so here&amp;rsquo;s what&amp;rsquo;s physically sitting in the datacenter.</description></item><item><title>The training step</title><link>https://wes.today/series/training/train-from-scratch/training-step/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/</guid><description>One training step is where the model actually learns. It&amp;rsquo;s the atomic unit of progress — everything else is about how many steps you take and how you configure them.</description></item><item><title>The training loop</title><link>https://wes.today/series/training/train-from-scratch/training-loop/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-loop/</guid><description>The training step is what happens once. The training loop is what happens 3.75 million times.</description></item><item><title>Post-training</title><link>https://wes.today/series/training/train-from-scratch/post-training/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/post-training/</guid><description>Pretraining teaches the model to continue text. Post-training teaches it what kind of continuation is acceptable, useful, formatted, safe, and rewarded.</description></item><item><title>Evaluation</title><link>https://wes.today/series/training/train-from-scratch/evaluation/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/evaluation/</guid><description>Training loss tells you whether the model is getting better at predicting the next token in its training data. 
It does not tell you whether the model is getting better at being useful, safe, honest, robust, or non-contaminated.</description></item><item><title>What does the final training data actually look like?</title><link>https://wes.today/series/training/train-from-scratch/training-data/final-data-format/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-data/final-data-format/</guid><description>After all that filtering, deduplication, and cleaning, the data isn&amp;rsquo;t stored as raw text anymore. It&amp;rsquo;s been tokenized — converted from strings into sequences of integer IDs.</description></item><item><title>Why can't training data be pre-tokenized at the source?</title><link>https://wes.today/series/training/train-from-scratch/training-data/pre-tokenization/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-data/pre-tokenization/</guid><description>If Common Crawl is the backbone for everyone, why doesn&amp;rsquo;t Common Crawl just ship pre-tokenized binary files? Several reasons, and they compound.</description></item><item><title>Data mixture &amp; curation</title><link>https://wes.today/series/training/train-from-scratch/training-data/data-mixture/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-data/data-mixture/</guid><description>The optimizer updates the weights, but the data mixture decides what the gradients are usually about.</description></item><item><title>How do you decide the number of layers?</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/layer-count/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/layer-count/</guid><description>Short answer: you don&amp;rsquo;t calculate it from first principles. 
You inherit it from a body of empirical research, scaling laws, and ablation studies.</description></item><item><title>Vocabulary size: 128,000</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/vocabulary-size/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/vocabulary-size/</guid><description>The vocabulary is the set of tokens the model can recognize. The vocabulary size determines how many unique tokens exist in that lookup table.</description></item><item><title>Model dimension (d_model): 8,192</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/model-dimension/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/model-dimension/</guid><description>This is the width of the highway. Every token at every layer is a vector of exactly 8,192 floating-point numbers.</description></item><item><title>Attention heads: 64</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/attention-heads/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/attention-heads/</guid><description>Each layer&amp;rsquo;s attention mechanism runs 64 independent attention computations in parallel.</description></item><item><title>Key-value heads: 8 (Grouped Query Attention)</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/gqa/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/gqa/</guid><description>GQA shares Key-Value projections across groups of heads. 
Llama 3 has 8 KV heads — cutting the KV cache size by 8x compared to full MHA.</description></item><item><title>FFN hidden dimension: 28,672</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/</guid><description>Each layer has a feed-forward network that takes the 8,192-dimensional vector, expands it to 28,672 dimensions, applies SiLU activation, then projects it back.</description></item><item><title>RMSNorm, RoPE, and SiLU</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/rmsnorm-rope-silu/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/rmsnorm-rope-silu/</guid><description>These three are the infrastructure choices — less glamorous than attention heads or model dimensions, but each one solves a specific problem that would otherwise break training at scale.</description></item><item><title>Orchestration — Slurm vs Kubernetes</title><link>https://wes.today/series/training/train-from-scratch/hardware-and-scale/orchestration/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/hardware-and-scale/orchestration/</guid><description>Somebody has to actually launch the training job, allocate GPUs, handle failures, and manage the queue of researchers waiting for compute time.</description></item><item><title>Why 16,384 GPUs?</title><link>https://wes.today/series/training/train-from-scratch/hardware-and-scale/why-scale/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/hardware-and-scale/why-scale/</guid><description>A 70B model trained on 15 trillion tokens literally cannot be done at small scale in a reasonable 
timeframe. Scale is not a luxury. It&amp;rsquo;s the only way the math works.</description></item><item><title>The units of distributed training</title><link>https://wes.today/series/training/train-from-scratch/training-step/training-units/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/training-units/</guid><description>Before walking through the seven phases, we need a shared vocabulary for the units involved in distributed training.</description></item><item><title>Phase 1 — Data loading</title><link>https://wes.today/series/training/train-from-scratch/training-step/data-loading/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/data-loading/</guid><description>The simplest phase conceptually, but at 16,384 GPUs it becomes a distributed systems problem. Each data-parallel replica needs its micro-batch of token sequences, consuming millions of tokens per optimizer step.</description></item><item><title>Phase 2 — Forward pass</title><link>https://wes.today/series/training/train-from-scratch/training-step/forward-pass/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/forward-pass/</guid><description>The forward pass pushes a batch of tokens through all 80 layers to produce predictions. 
At 16,384 GPUs, the model isn&amp;rsquo;t sitting on one GPU — it&amp;rsquo;s sliced up across thousands of them using three types of parallelism simultaneously.</description></item><item><title>Phase 3 — Loss calculation</title><link>https://wes.today/series/training/train-from-scratch/training-step/loss-calculation/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/loss-calculation/</guid><description>The forward pass produced a prediction for every token position — a probability distribution over 128,000 tokens. The loss calculation is where you find out how wrong those predictions were.</description></item><item><title>Phase 4 — Backward pass</title><link>https://wes.today/series/training/train-from-scratch/training-step/backward-pass/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/backward-pass/</guid><description>The backward pass goes top-to-bottom through all 80 layers answering one question for every single parameter: how much did you contribute to the error, and in which direction?</description></item><item><title>Phase 5 — Gradient synchronization</title><link>https://wes.today/series/training/train-from-scratch/training-step/gradient-sync/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/gradient-sync/</guid><description>After the backward pass, every GPU has local gradients computed from its slice of the data. 
Before anyone updates weights, all 512 data-parallel replicas need to agree on a single set of gradients via all-reduce.</description></item><item><title>Phase 6 — Optimizer step</title><link>https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/</guid><description>The gradients are synced. Time to actually update the weights. This is the moment the model learns — and where AdamW&amp;rsquo;s per-parameter adaptive learning rates and 840GB of optimizer state earn their keep.</description></item><item><title>Phase 7 — Checkpointing</title><link>https://wes.today/series/training/train-from-scratch/training-step/checkpointing/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/checkpointing/</guid><description>Hardware fails. At 16,384 GPUs running 24/7 for weeks, it&amp;rsquo;s not if but how often. 
Checkpointing saves a complete snapshot of training state so you can recover from inevitable failures without starting over.</description></item><item><title>Epochs</title><link>https://wes.today/series/training/train-from-scratch/training-loop/epochs/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-loop/epochs/</guid><description>If your dataset has 2 trillion unique tokens and your training run processes 15 trillion tokens total, that&amp;rsquo;s roughly 7-8 epochs — the model sees every piece of training data about 7-8 times.</description></item><item><title>Batch size &amp; gradient accumulation</title><link>https://wes.today/series/training/train-from-scratch/training-loop/batch-size/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-loop/batch-size/</guid><description>Global batch size is the total number of tokens processed before the optimizer updates the weights. For Llama 3 70B, Meta reportedly used ~4 million tokens per step — roughly 500 sequences of 8,192 tokens.</description></item><item><title>Scaling laws &amp; compute economics</title><link>https://wes.today/series/training/train-from-scratch/training-loop/scaling-laws/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-loop/scaling-laws/</guid><description>Scaling laws determine how much capability you get for your compute budget, and they explain why model training is a resource allocation problem, not just an ML procedure.</description></item><item><title>Supervised fine-tuning (SFT)</title><link>https://wes.today/series/training/train-from-scratch/post-training/sft/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/post-training/sft/</guid><description>Demonstrations of desired behavior. 
The first step after pretraining that begins shaping the model from a text completer into an assistant.</description></item><item><title>Preference training (RLHF, DPO)</title><link>https://wes.today/series/training/train-from-scratch/post-training/preference-training/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/post-training/preference-training/</guid><description>SFT says &amp;lsquo;here&amp;rsquo;s what a good response looks like.&amp;rsquo; Preference training says &amp;lsquo;this response is better than that one — learn why.&amp;rsquo;</description></item><item><title>Reward hacking &amp; objective mismatch</title><link>https://wes.today/series/training/train-from-scratch/post-training/reward-hacking/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/post-training/reward-hacking/</guid><description>The model optimizes the signal, not the intent. This is the gap between what we measure and what we actually want.</description></item><item><title>Validation loss &amp; benchmarks</title><link>https://wes.today/series/training/train-from-scratch/evaluation/validation-benchmarks/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/evaluation/validation-benchmarks/</guid><description>The quantitative signals teams track during and after training — from validation loss curves to downstream benchmarks to human preference evaluations.</description></item><item><title>Contamination &amp; evaluation integrity</title><link>https://wes.today/series/training/train-from-scratch/evaluation/contamination/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/evaluation/contamination/</guid><description>If benchmark examples are in the training data, the model appears more capable than it really is. 
This is not a theoretical concern — it&amp;rsquo;s an active problem.</description></item><item><title>Why does the FFN hold 80% of the parameters?</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/ffn-parameters/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/ffn-parameters/</guid><description>The FFN accounts for about 81% of Llama 3 70B&amp;rsquo;s parameters — 56.4 billion out of 69.5 billion. SwiGLU added a third matrix while GQA shrank attention. The gap widened from both directions.</description></item><item><title>Mixture of Experts (MoE)</title><link>https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/moe/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/model-architecture/ffn-hidden-dimension/moe/</guid><description>MoE decouples total parameter count from per-token compute by routing each token through a subset of expert sub-networks. The architecture behind Mixtral, DeepSeek-V3, and reportedly GPT-4 &amp;ndash; with deep tradeoffs in memory, communication, and training stability.</description></item><item><title>Synthetic data &amp; distillation</title><link>https://wes.today/series/training/train-from-scratch/training-data/data-mixture/synthetic-data/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-data/data-mixture/synthetic-data/</guid><description>Synthetic data is not free data. It is a way of converting model capability, filtering, and compute into more training signal. 
For modern training at the frontier, it&amp;rsquo;s no longer optional.</description></item><item><title>Long-context training &amp; sequence packing</title><link>https://wes.today/series/training/train-from-scratch/training-data/data-mixture/long-context/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-data/data-mixture/long-context/</guid><description>Llama 3 70B pretrains on 8K-token sequences. Llama 3.1 extended the family to 128K context through additional long-context training. The gap is bridged by sequence packing, RoPE scaling, and a short but carefully tuned long-context fine-tuning phase.</description></item><item><title>The parallelism orchestration stack</title><link>https://wes.today/series/training/train-from-scratch/training-step/forward-pass/parallelism-stack/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/forward-pass/parallelism-stack/</guid><description>Slurm launches the job and allocates hardware. But Slurm doesn&amp;rsquo;t know what tensor parallelism is. The actual orchestration — which GPU holds which slice of which layer — is a software stack inside the training job.</description></item><item><title>The training objective: shifted tokens &amp; loss masking</title><link>https://wes.today/series/training/train-from-scratch/training-step/loss-calculation/training-objective/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/loss-calculation/training-objective/</guid><description>Most explanations say the model learns to predict the next word. 
That&amp;rsquo;s correct but papers over details that matter enormously — shifted token prediction, SFT loss masking, and why the three training stages shape different behaviors.</description></item><item><title>Activation checkpointing</title><link>https://wes.today/series/training/train-from-scratch/training-step/backward-pass/activation-checkpointing/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/backward-pass/activation-checkpointing/</guid><description>Throw away most activations during the forward pass, recompute them during the backward pass. The universal answer to the activation memory wall — trading compute for memory at a ratio that makes large-scale training possible.</description></item><item><title>What actually changes in the weights</title><link>https://wes.today/series/training/train-from-scratch/training-loop/epochs/weight-changes/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-loop/epochs/weight-changes/</guid><description>Training doesn&amp;rsquo;t insert records into a lookup table. The model&amp;rsquo;s knowledge lives in the geometry of its transformation space — distributed across millions of weights, each contributing a tiny piece to many different capabilities simultaneously.</description></item><item><title>Training stability: loss spikes, NaN, and precision</title><link>https://wes.today/series/training/train-from-scratch/training-loop/scaling-laws/training-stability/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-loop/scaling-laws/training-stability/</guid><description>A training run at scale is a delicate numerical process running for millions of steps across thousands of GPUs. 
Training stability is the emergent property of many interacting design choices — normalization, precision, clipping, warmup — each contributing a small margin of safety.</description></item><item><title>ZeRO &amp; memory optimization</title><link>https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/zero/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/zero/</guid><description>In pure data parallelism, every GPU replica stores a complete copy of the 840GB training state — 430TB of redundant memory across the cluster. ZeRO eliminates that redundancy in three progressive stages, from sharding optimizer states to sharding everything.</description></item><item><title>Mixed precision training</title><link>https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/mixed-precision/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/training-step/optimizer-step/mixed-precision/</guid><description>Every floating-point operation in training has a precision. More bits means more accuracy but more memory and slower compute. Mixed precision uses different precisions for different parts of the computation — BF16 for speed where it&amp;rsquo;s safe, FP32 where it&amp;rsquo;s not.</description></item><item><title>What happens when you train a model from scratch?</title><link>https://wes.today/series/training/train-from-scratch/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://wes.today/series/training/train-from-scratch/</guid><description>You have a pile of text — trillions of tokens scraped from the internet, books, code repositories, scientific papers. You have a model architecture — a specific arrangement of transformer layers, attention heads, and feed-forward networks.</description></item></channel></rss>