Meta trained Llama 3 70B on 16,384 H100 GPUs. That’s 2,048 nodes of 8 GPUs each (DGX H100 form factor), or in our GB200 NVL72 terms, ~227 racks. This isn’t a nice-to-have. Training a 70B model on 15 trillion tokens cannot be done at small scale in a reasonable timeframe.
Why this can’t run on a single rack: The math is unforgiving. Using the standard approximation of 6 × parameters × tokens for total training compute, Llama 3 70B requires roughly 6 × 70B × 15T = 6.3 × 10²⁴ FLOPs. A single GB200 NVL72 rack delivers ~180 petaFLOPS peak at FP8. Realistic MFU (Model FLOPs Utilization) for training is 30-50%, giving ~54-90 petaFLOPS of effective throughput. At the optimistic end: (6.3 × 10²⁴ FLOPs) / (9 × 10¹⁶ FLOP/s) ≈ 70 million seconds ≈ 2.2 years. At the conservative end: (6.3 × 10²⁴ FLOPs) / (5.4 × 10¹⁶ FLOP/s) ≈ 117 million seconds ≈ 3.7 years. Nobody is going to babysit a training run for years and hope nothing breaks.
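The single-rack arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted in the text (70B parameters, 15T tokens, ~180 petaFLOPS peak, 30-50% MFU); the function and variable names are illustrative, not taken from any library.

```python
# Back-of-envelope estimate of single-rack training time.
# All inputs are the figures quoted in the text; names are illustrative.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_flops(params: float, tokens: float) -> float:
    """Total training compute using the 6 * parameters * tokens approximation."""
    return 6.0 * params * tokens


def wall_clock_seconds(total_flops: float, peak_flops: float, mfu: float) -> float:
    """Time to finish if the hardware sustains a fraction `mfu` of its peak rate."""
    return total_flops / (peak_flops * mfu)


total = training_flops(params=70e9, tokens=15e12)  # about 6.3e24 FLOPs
rack_peak = 180e15                                 # ~180 petaFLOPS, as quoted in the text

for mfu in (0.50, 0.30):
    years = wall_clock_seconds(total, rack_peak, mfu) / SECONDS_PER_YEAR
    print(f"MFU {mfu:.0%}: {years:.1f} years on one rack")
```

Changing `rack_peak` or the MFU range is the quickest way to see how sensitive the multi-year result is to the hardware assumptions.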
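At 16,384 GPUs (~227 racks), the same compute is spread across all of them. With perfect linear scaling, the single-rack estimate of 2.2-3.7 years divided by ~227 comes to roughly 4-6 days; allowing for scaling losses and restarts after failures, the run plausibly lands at roughly 1-2 weeks. That’s consistent with what Meta reported for Llama 3 70B on H100s. Scale is not a luxury. It’s the only way the math works.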
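To see what dividing the work across the cluster implies, here is a rough sketch under the same assumptions as the text (72 GPUs per rack, ~180 petaFLOPS per rack, 30-50% MFU). It models ideal linear scaling only. The `scaling_efficiency` parameter is a hypothetical knob added for illustration, and the text does not give a value for it.

```python
# Ideal scale-out estimate for the full cluster.
# Figures come from the text; `scaling_efficiency` is a hypothetical knob.

GPUS = 16_384
GPUS_PER_RACK = 72             # GB200 NVL72
RACK_PEAK_FLOPS = 180e15       # per-rack FP8 peak quoted in the text
TOTAL_FLOPS = 6 * 70e9 * 15e12 # about 6.3e24 FLOPs
SECONDS_PER_DAY = 86_400

racks = GPUS / GPUS_PER_RACK   # about 227.6 racks


def ideal_days(mfu: float, scaling_efficiency: float = 1.0) -> float:
    """Days to finish, assuming throughput grows in proportion to rack count.

    scaling_efficiency = 1.0 means perfect linear scaling (an idealization).
    """
    sustained = racks * RACK_PEAK_FLOPS * mfu * scaling_efficiency
    return TOTAL_FLOPS / sustained / SECONDS_PER_DAY


for mfu in (0.50, 0.30):
    print(f"MFU {mfu:.0%}: {ideal_days(mfu):.1f} days at ideal scaling")
```

Under these assumptions the ideal figure is a few days, so anything longer has to come from factors this model leaves out, such as communication overhead, restarts after failures, or lower per-GPU throughput on other hardware.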
Performance Profile
- Total training compute: ~6.3 × 10²⁴ FLOPs (6 × 70B params × 15T tokens). This is the fundamental cost — every other number derives from dividing this across GPUs and time.
- Single-rack time: ~2-4 years at realistic utilization (30-50% MFU). Impractical — too long, too many failure opportunities.
- At 16,384 GPUs (~227 racks): 1-2 weeks. The only scale where the math works for a production training run.
- Per-step data consumption: ~537M tokens (32,768 tokens per GPU across 16,384 GPUs), or ~2 GB of raw input per step assuming about 4 bytes per token. Tiny relative to compute, but it must arrive on time across 2,048+ nodes every few seconds.
- Failure cost at scale: Any single GPU failure, network flap, or storage stall affects all 16,384 GPUs. Mean time between failures across the cluster is hours, not days — operational reliability is the dominant engineering challenge.
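The last two bullets can be sanity-checked with a short sketch. Two inputs here are assumptions and do not come from the text: 4 bytes per token (chosen because it reproduces the ~2 GB figure) and a per-GPU mean time between failures of 50,000 hours (a purely hypothetical value used to show how failure rates add up). The failure model assumes independent failures at a constant rate, which is a simplification.

```python
# Sanity checks for per-step input volume and cluster-wide failure rate.
# BYTES_PER_TOKEN and HYPOTHETICAL_GPU_MTBF_HOURS are assumptions, not source figures.

GPUS = 16_384
NODES = 2_048
TOKENS_PER_STEP = 537e6
BYTES_PER_TOKEN = 4                   # assumption: 32-bit token IDs
HYPOTHETICAL_GPU_MTBF_HOURS = 50_000  # hypothetical, for illustration only


def input_bytes_per_step(tokens: float, bytes_per_token: int) -> float:
    """Raw input volume consumed by one optimizer step."""
    return tokens * bytes_per_token


def cluster_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a job that stops when any one of n independent components fails.

    With independent, constant-rate failures the rates add, so MTBF divides by n.
    """
    return component_mtbf_hours / n_components


step_bytes = input_bytes_per_step(TOKENS_PER_STEP, BYTES_PER_TOKEN)
print(f"Input per step: {step_bytes / 1e9:.2f} GB")
print(f"Input per node per step: {step_bytes / NODES / 1e6:.2f} MB")
print(f"Cluster MTBF: {cluster_mtbf_hours(HYPOTHETICAL_GPU_MTBF_HOURS, GPUS):.1f} hours")
```

Even with a generous per-GPU figure, dividing by 16,384 pushes the cluster-wide interval between failures down to hours, while the per-node input volume stays small. That is why the bullets treat delivery timing and reliability, not bandwidth, as the hard part.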