LLMs: All the Rabbit Holes
What actually happens when you send a message to an LLM — from tokens to attention to output, one rabbit hole at a time.
Tier 1
What are vectors?
A vector is just a list of numbers. That's it. The list `[0.2, -1.5, 0.8]` is a three-dimensional vector. The word "dimensional" sounds intimidating, but all it means is how many numbers are in the...
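The idea fits in two lines of Python (the values are arbitrary):

```python
# A vector is just a list of numbers; its "dimension" is the length of the list.
v = [0.2, -1.5, 0.8]
dimension = len(v)   # 3, so this is a three-dimensional vector
```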
What is a token?
A token is the unit of text that an LLM actually works with. It's not a word, not a character — it's somewhere in between. The model has a fixed vocabulary of tokens (typically 30,000–100,000 of...
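A toy sketch of the idea, with an invented six-entry vocabulary (real tokenizers use learned vocabularies of 30,000–100,000 entries and more careful matching rules):

```python
# Toy illustration (not a real tokenizer): a fixed vocabulary maps
# chunks of text to integer IDs. Note the entries are not whole words.
vocab = {"un": 0, "believ": 1, "able": 2, " tacos": 3, "I": 4, " love": 5}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token matches: {text!r}")
    return ids

ids = toy_tokenize("unbelievable", vocab)   # one word, three tokens
```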
What are embeddings and how are they created?
After tokenization gives you a sequence of token IDs like `[40, 3021, 5765, 18510, 540]`, the model needs to convert each ID into a vector the neural network can work with. This is the embedding...
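A minimal sketch of the lookup, with toy sizes and random values standing in for learned weights:

```python
import numpy as np

# The embedding table is a (vocab_size, d_model) matrix of learned weights;
# converting a token ID to a vector is just indexing a row. Toy sizes here.
vocab_size, d_model = 100, 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = [40, 21, 76]                  # illustrative IDs, not real ones
embeddings = embedding_table[token_ids]   # shape (3, 8): one vector per token
```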
Prefill vs decode
When you send a message to an LLM, inference happens in two distinct phases. Understanding these two phases is key to understanding why LLMs perform the way they do.
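The two phases can be sketched as a loop, with a hypothetical `next_token` function standing in for a real forward pass:

```python
# Sketch of the two inference phases. `next_token` is a stand-in for a
# model call; here it just returns the context length so the loop runs.
def next_token(context):
    return len(context)   # placeholder for a real forward pass

prompt = [101, 42, 7, 13]       # token IDs from the user's message

# Prefill: the whole prompt is processed in one parallel pass.
context = list(prompt)

# Decode: tokens are generated one at a time, each new token appended to
# the context before predicting the next.
generated = []
for _ in range(3):
    tok = next_token(context)
    generated.append(tok)
    context.append(tok)
```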
How does "thinking" work?
There is no separate "thinking" module. The model doesn't have an internal scratchpad or a different mode of processing. Thinking is just **more token generation** — the model produces tokens that...
How do tool calls work?
Tool calling is not a special capability baked into the model architecture. The model doesn't "run code" or "access the internet." It generates text that happens to follow a specific structured...
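A sketch of the harness side: the model emits structured text, and ordinary software parses it and runs the tool. The JSON format below is invented for illustration; real providers each define their own.

```python
import json

# The model's output is just text that happens to be valid JSON.
model_output = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

# The harness, not the model, parses it and dispatches to real code.
call = json.loads(model_output)
tool_name = call["tool"]
args = call["arguments"]
```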
How does memory work?
LLMs have no persistent memory. Every conversation starts from zero — the model has its weights (fixed, trained knowledge) and whatever tokens are in the current context window. When the context is...
Tier 2
How tokenization actually works
Let's walk through what happens when you type "I love unbelievable tacos" and hit send.
What are weights?
A weight is a single number that the model can adjust to change its behavior. That's it. When people say a model has "405 billion parameters," they mean it has 405 billion individual numbers —...
Gradients and gradient updates (how weights get their values)
Training works by repeated trial and error, automated at massive scale. Here's the loop:
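The loop can be sketched with a single weight and plain gradient descent (a toy setup, not a real training run):

```python
# Guess, measure the error, nudge the weight downhill, repeat.
w = 0.0          # the model's single "weight"
target = 3.0     # the value we want it to predict
lr = 0.1         # learning rate: how big each nudge is

for _ in range(100):
    prediction = w
    loss = (prediction - target) ** 2        # squared error
    gradient = 2 * (prediction - target)     # d(loss)/dw
    w -= lr * gradient                       # the gradient update

# After enough iterations, w has converged very close to 3.0.
```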
"Directions in the space encode relationships"
This is one of the most surprising things about embeddings. Take the classic example: king - man + woman ≈ queen. What's actually happening in the vector math?
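A toy reconstruction in two dimensions, where the "royalty" and "gender" directions are built in by hand (real embeddings learn such directions; nothing here is from an actual model):

```python
import numpy as np

# axis 0 = "royalty" direction, axis 1 = "gender" direction
king  = np.array([1.0,  1.0])
man   = np.array([0.0,  1.0])
woman = np.array([0.0, -1.0])
queen = np.array([1.0, -1.0])

# Subtracting "man" removes the male direction; adding "woman" adds the
# female one; the royalty component is untouched.
result = king - man + woman
```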
What are model layers?
A layer is one complete round of transformation that every token vector passes through. If the model has 80 layers (typical for a large LLM), then each token vector gets transformed 80 times in...
How do layers transform vectors?
The actual operations are matrix multiplications. Every transformation in a layer — attention, feed-forward — boils down to multiplying a vector by a weight matrix to produce a new vector.
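In code, one such transformation is a single matrix–vector product (toy size, random values standing in for learned weights):

```python
import numpy as np

# One transformation step: multiply a token vector by a weight matrix.
# A real layer uses d_model in the thousands; 4 keeps it readable.
d_model = 4
rng = np.random.default_rng(1)
W = rng.standard_normal((d_model, d_model))   # learned weights
x = rng.standard_normal(d_model)              # a token vector

y = W @ x   # the transformed vector, same dimension as the input
```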
What are hidden states?
Hidden states are simply what the token vectors are called once they're inside the model. Before layer 1, they're called embeddings. After passing through one or more layers, they're called hidden...
What is a dot product?
Take two vectors of the same length. Multiply their elements in pairs, then add up the results. That's the entire operation:
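In Python, the whole operation is:

```python
# Dot product: multiply elements in pairs, then sum the results.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

score = dot([1.0, 2.0, 3.0], [4.0, -1.0, 2.0])   # 1*4 + 2*(-1) + 3*2
```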
KV cache and context memory costs
During prefill, every token at every layer produces three vectors: a **query** (Q), a **key** (K), and a **value** (V). The query is ephemeral — it's used once to compute attention scores against all...
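A back-of-envelope cost calculation, assuming fp16 storage and sizes typical of a large model (8,192-dim vectors, 80 layers); real configurations vary:

```python
# Per token, per layer, the cache stores one K and one V vector of
# d_model numbers each. The Q vector is ephemeral and never cached.
d_model   = 8192
layers    = 80
bytes_per = 2        # fp16: 2 bytes per number
tokens    = 4096     # context length

kv_bytes = tokens * layers * 2 * d_model * bytes_per   # 2 = K and V
kv_gib = kv_bytes / 2**30   # 10 GiB of cache for a 4k-token context
```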
Skills: abstraction over tool calls
A skill is not something the model does — it's something the **harness** (the software wrapping the model) does before and after the model runs.
Planning and multi-step execution
Planning is an extension of thinking combined with tool use. The model doesn't have a planner module — it generates a plan as text, then executes it step by step through the normal token generation loop.
Tier 3
Tokenization performance: where does it run and what's the bottleneck?
Tokenization runs on the **CPU**. Not the GPU. This surprises people because everything else in the LLM pipeline is GPU-bound, but tokenization is the wrong shape for GPU execution.
Dimension trade-offs: expressiveness vs. cost
The embedding dimension (`d_model`) is a design choice made before training. Common values:
Attention deep dive: what does it mean for a token to "pay attention"?
In practice, the model doesn't run attention once per layer — it runs it **multiple times in parallel** (typically 32–128 times), each with its own independent set of W_Q, W_K, W_V weight matrices....
FFN deep dive: the per-token thinking step
After attention has mixed information between tokens, each token's vector passes through the **feed-forward network (FFN)** — and this is the solo step. No cross-token interaction whatsoever. Every...
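The per-token independence is visible in a sketch: each vector goes through the same up-project / nonlinearity / down-project on its own (toy sizes, random weights, ReLU as a stand-in for whatever activation a real model uses):

```python
import numpy as np

# FFN sketch: expand the vector, apply a nonlinearity, project back down.
d_model, d_ff = 4, 16
rng = np.random.default_rng(3)
W1 = rng.standard_normal((d_ff, d_model))   # up-projection
W2 = rng.standard_normal((d_model, d_ff))   # down-projection

def ffn(x):
    return W2 @ np.maximum(W1 @ x, 0.0)     # up-project, ReLU, down-project

tokens = rng.standard_normal((5, d_model))       # 5 token vectors
out = np.stack([ffn(t) for t in tokens])         # each processed alone
```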
From final vector to predicted token
After 80 layers of attention and FFN, each token's vector has been transformed into a rich contextual representation. But the model needs to produce an actual token — a word (or subword) from its...
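A sketch of that final step: project the last hidden state onto the vocabulary to get one score (logit) per token, then softmax the scores into probabilities. Toy sizes, random values:

```python
import numpy as np

d_model, vocab_size = 8, 100
rng = np.random.default_rng(2)
unembedding = rng.standard_normal((vocab_size, d_model))
hidden = rng.standard_normal(d_model)   # final-layer vector, last position

logits = unembedding @ hidden           # one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax: scores -> probabilities

predicted_id = int(np.argmax(probs))    # greedy pick of the next token
```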
Sparse attention: skipping tokens you don't need
Full attention means every token attends to every other token — T² comparisons. Sparse attention says: most of those comparisons are wasted. Token 4,000 probably doesn't need to attend to token 12....
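One common sparsity pattern, a sliding window, can be sketched as a boolean mask over token pairs (toy sizes; real implementations never materialize the full mask):

```python
import numpy as np

# Each token attends only to itself and the W-1 tokens before it,
# instead of all T earlier tokens.
T, W = 8, 3
i = np.arange(T)[:, None]   # query positions
j = np.arange(T)[None, :]   # key positions
mask = (j <= i) & (i - j < W)   # True = this comparison is kept

kept = int(mask.sum())      # far fewer than the full T*T comparisons
```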
Attention approximations: breaking the T² barrier differently
Where sparse attention *skips* certain token pairs, approximation methods try to compute *something like* full attention but with less math. The goal is the same — avoid the T² bottleneck — but the...
What is quantization?
Quantization means storing numbers with fewer bits. Every weight, every cached K/V vector, every activation is a number that takes up space in memory. The question is: how precisely do you need to...
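A minimal symmetric int8 quantization sketch: map the floats into [-127, 127] with a single scale, then reconstruct. One scale per tensor is the simplest scheme; real quantizers use finer-grained scales:

```python
import numpy as np

weights = np.array([0.12, -1.5, 0.8, 0.03])   # illustrative values

scale = np.abs(weights).max() / 127.0          # one scale for the tensor
q = np.round(weights / scale).astype(np.int8)  # 1 byte each instead of 4+
restored = q.astype(np.float32) * scale        # close, but not exact
```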
Paged attention: virtual memory for KV cache
The problem paged attention solves isn't the size of the KV cache per se — it's **fragmentation** and **waste** when serving many concurrent requests.
KV cache offloading: trading latency for capacity
When the KV cache exceeds GPU HBM capacity, you have two choices: refuse the request, or move some of the cache to slower memory. Offloading chooses the second option — it creates a **memory...
MQA and GQA: reducing cache size at the architecture level
The problem: in standard Multi-Head Attention (MHA), each of the 64 heads has its own K and V projections, so the KV cache stores 64 sets of K and V per layer. Most of the KV cache cost comes from...
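The cache savings can be counted directly (assumed sizes: 64 query heads, head dimension 128; the 8-group GQA split is one common choice, not a universal one):

```python
# Numbers cached per layer, per token: one K and one V vector per KV head.
head_dim = 128
q_heads  = 64

def kv_numbers(kv_heads):
    return 2 * kv_heads * head_dim   # 2 = K and V

mha = kv_numbers(64)   # MHA: every query head has its own K/V
gqa = kv_numbers(8)    # GQA: groups of 8 query heads share one K/V pair
mqa = kv_numbers(1)    # MQA: all query heads share a single K/V pair
```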
Tier 4
Multi-head attention: how it works and what it costs
**How multi-head splits the work.** Say `d_model` = 8,192 and you have 64 attention heads. Instead of running one attention operation across all 8,192 dimensions, you split the vector into 64 chunks...
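The split itself is just a reshape — no information is lost, each head simply gets its own 128-dimension slice of the vector:

```python
import numpy as np

d_model, n_heads = 8192, 64
head_dim = d_model // n_heads              # 128 dimensions per head

x = np.arange(d_model, dtype=np.float32)   # a token vector (toy values)
per_head = x.reshape(n_heads, head_dim)    # shape (64, 128): one row per head
```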
How does the model know when to stop?
The model doesn't "decide" it has fully answered your question. It has no concept of completeness, correctness, or satisfaction. It stops because it predicts a special token.
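The mechanism can be sketched as a loop that halts on a special end-of-sequence ID, with a stand-in model function (the ID value and `fake_model` are invented for illustration):

```python
# Generation ends when the model predicts the end-of-sequence token —
# there is no separate "am I done?" check.
EOS_ID = 2

def fake_model(context):
    # Pretend model: emits three tokens, then the EOS token.
    return 7 if len(context) < 3 else EOS_ID

context, output = [], []
while True:
    tok = fake_model(context)
    if tok == EOS_ID:        # stopping is itself just a prediction
        break
    output.append(tok)
    context.append(tok)
```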