128K × D"]:::hl --> E["Q,K,V Proj
D × D"]:::hl --> F["Attn Scores
T² × D"] --> G["FFN W_up
D × 4D"]:::hl --> H["FFN W_down
4D × D"]:::hl --> I["→ ×80"] --> J["Vocab Proj
D × 128K"]:::hl --> K[Output] classDef hl fill:#2d6a4f,stroke:#1b4332,color:#d8f3dc classDef default fill:#1a1a2e,stroke:#16213e,color:#e0e0e0 click A "/llms/what-happens/tokens/" click D "/llms/what-happens/embeddings/" click E "/llms/what-happens/embeddings/model-layers/attention-deep-dive/" click F "/llms/what-happens/embeddings/model-layers/attention-deep-dive/" click G "/llms/what-happens/embeddings/model-layers/ffn-deep-dive/" click H "/llms/what-happens/embeddings/model-layers/ffn-deep-dive/" click J "/llms/what-happens/embeddings/model-layers/final-vector-to-token/" click K "/llms/what-happens/embeddings/model-layers/final-vector-to-token/"
Highlighted: every operation where D (dimension) determines matrix size and cost
The embedding dimension (d_model) is a design choice made before training. Common values:
| Model | Dimensions | Parameters |
|---|---|---|
| Llama 3 8B | 4,096 | 8 billion |
| Llama 3 70B | 8,192 | 70 billion |
| Llama 3 405B | 16,384 | 405 billion |
| Claude 3 Opus | ~12,288 | (undisclosed) |
What more dimensions buy you: capacity, not accuracy. Each vector has more room to encode nuanced distinctions. With 256 dimensions, the model has to cram everything it knows about a token into 256 numbers — subtle differences between related concepts get lost. With 16,384 dimensions, it can represent much finer-grained relationships. Think of it like color depth: 8-bit color gives you 256 colors, 24-bit gives you 16 million. More dimensions don’t make individual representations “more correct” — they give the model more room to distinguish between things.
What more dimensions cost: The weight matrices in attention and feed-forward layers are roughly D × D in size. So compute scales as D² — doubling the dimension roughly quadruples the math per layer. Memory scales similarly: the weight matrices themselves are bigger, the activations (intermediate computation values) are bigger, and the KV cache (see KV cache) is bigger. Going from 4,096 to 16,384 dimensions is a 4x increase in dimension but approximately a 16x increase in per-layer compute and weight storage.
This is why scaling model size isn’t free — and why there’s an architecture design art to choosing the right balance of layers, dimensions, and attention heads for a given parameter budget.