This is where it gets interesting and directly connects to inference cost. Standard multi-head attention (MHA) gives each of the 64 heads its own Key and Value projections. That means during inference, the KV cache stores 64 separate K and V vectors per token per layer. We covered in the LLMs series how the KV cache is often the memory bottleneck during inference.
Grouped Query Attention (GQA) shares Key-Value projections across groups of heads. Llama 3 has 8 KV heads, so every 8 query heads share one KV head (64 / 8 = 8 heads per group). This cuts the KV cache size by 8x compared to full MHA.
Why 8 KV heads specifically? It’s a spectrum:
- MHA (64 KV heads): Maximum expressiveness, maximum KV cache memory
- GQA (8 KV heads): Minor quality loss, 8x KV cache reduction
- MQA (1 KV head): All heads share one KV projection. Maximum compression but measurable quality degradation on complex tasks
Meta tested these ratios and found 8 KV heads preserved nearly all of MHA’s quality while dramatically reducing inference memory. This is a training-time decision that optimizes for inference cost — a recurring theme. You’re choosing the architecture now, but the thing you’re optimizing for is how efficiently this model will serve millions of requests later.
The math: At full MHA, KV cache for one 8K-token sequence across 80 layers at FP16: 2 × 64 × 128 × 8192 × 80 × 2 bytes = ~21GB. With GQA at 8 KV heads: 2 × 8 × 128 × 8192 × 80 × 2 bytes = ~2.6GB. That’s the difference between serving 1 user per GPU and serving 8.
Performance Profile
- KV cache reduction: 8x reduction vs. full MHA (21GB → 2.6GB per 8K sequence). This is the dominant factor in inference serving density — more concurrent users per GPU.
- Training cost: GQA is slightly cheaper to train than full MHA because the KV projections have fewer parameters (8 KV heads vs. 64). Minor savings during training, massive savings during inference.
- Quality impact: Meta’s ablations showed GQA with 8 KV heads preserved >99% of full MHA quality on standard benchmarks. The quality gap widens at very long contexts and complex multi-hop reasoning tasks but remains small.
- Architecture decision with inference consequences: GQA is chosen at architecture design time (before training) specifically to optimize inference. A training-time decision that pays dividends at serving time.