<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLMs: All the Rabbit Holes on Wes Kennedy</title><link>https://wes.today/llms/</link><description>Recent content in LLMs: All the Rabbit Holes on Wes Kennedy</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 13 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wes.today/llms/index.xml" rel="self" type="application/rss+xml"/><item><title>What are vectors?</title><link>https://wes.today/llms/what-happens/vectors/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/vectors/</guid><description>A vector is just a list of numbers. That&amp;rsquo;s it. The list &lt;code>[0.2, -1.5, 0.8]&lt;/code> is a three-dimensional vector. The word &amp;ldquo;dimensional&amp;rdquo; sounds intimidating, but all it means is how many numbers are in the&amp;hellip;</description></item><item><title>What is a token?</title><link>https://wes.today/llms/what-happens/tokens/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/</guid><description>A token is the unit of text that an LLM actually works with. It&amp;rsquo;s not a word, not a character — it&amp;rsquo;s somewhere in between. The model has a fixed vocabulary of tokens (typically 30,000–100,000 of&amp;hellip;</description></item><item><title>What are embeddings and how are they created?</title><link>https://wes.today/llms/what-happens/embeddings/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/</guid><description>After tokenization gives you a sequence of token IDs like &lt;code>[40, 3021, 5765, 18510, 540]&lt;/code>, the model needs to convert each ID into a vector the neural network can work with. 
This is the embedding&amp;hellip;</description></item><item><title>Prefill vs decode</title><link>https://wes.today/llms/what-happens/prefill-decode/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/</guid><description>When you send a message to an LLM, inference happens in two distinct phases. Understanding these two phases is key to understanding why LLMs perform the way they do.</description></item><item><title>How does "thinking" work?</title><link>https://wes.today/llms/what-happens/thinking/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/thinking/</guid><description>There is no separate &amp;ldquo;thinking&amp;rdquo; module. The model doesn&amp;rsquo;t have an internal scratchpad or a different mode of processing. Thinking is just &lt;strong>more token generation&lt;/strong> — the model produces tokens that&amp;hellip;</description></item><item><title>How do tool calls work?</title><link>https://wes.today/llms/what-happens/tool-calls/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tool-calls/</guid><description>Tool calling is not a special capability baked into the model architecture. The model doesn&amp;rsquo;t &amp;ldquo;run code&amp;rdquo; or &amp;ldquo;access the internet.&amp;rdquo; It generates text that happens to follow a specific structured&amp;hellip;</description></item><item><title>How does memory work?</title><link>https://wes.today/llms/what-happens/memory/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/memory/</guid><description>LLMs have no persistent memory. Every conversation starts from zero — the model has its weights (fixed, trained knowledge) and whatever tokens are in the current context window. 
When the context is&amp;hellip;</description></item><item><title>How tokenization actually works</title><link>https://wes.today/llms/what-happens/tokens/tokenization/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/tokenization/</guid><description>Let&amp;rsquo;s walk through what happens when you type &amp;ldquo;I love unbelievable tacos&amp;rdquo; and hit send.</description></item><item><title>What are weights?</title><link>https://wes.today/llms/what-happens/embeddings/weights/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/weights/</guid><description>A weight is a single number that the model can adjust to change its behavior. That&amp;rsquo;s it. When people say a model has &amp;ldquo;405 billion parameters,&amp;rdquo; they mean it has 405 billion individual numbers —&amp;hellip;</description></item><item><title>Gradients and gradient updates (how weights get their values)</title><link>https://wes.today/llms/what-happens/embeddings/gradients/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/gradients/</guid><description>Training works by repeated trial and error, automated at massive scale. Here&amp;rsquo;s the loop:</description></item><item><title>"Directions in the space encode relationships"</title><link>https://wes.today/llms/what-happens/embeddings/directions/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/directions/</guid><description>This is one of the most surprising things about embeddings. Take the classic example: king - man + woman ≈ queen. 
What&amp;rsquo;s actually happening in the vector math?</description></item><item><title>What are model layers?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/</guid><description>A layer is one complete round of transformation that every token vector passes through. If the model has 80 layers (typical for a large LLM), then each token vector gets transformed 80 times in&amp;hellip;</description></item><item><title>How do layers transform vectors?</title><link>https://wes.today/llms/what-happens/embeddings/layer-transforms/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/layer-transforms/</guid><description>The actual operations are matrix multiplications. Every transformation in a layer — attention, feed-forward — boils down to multiplying a vector by a weight matrix to produce a new vector.</description></item><item><title>What are hidden states?</title><link>https://wes.today/llms/what-happens/embeddings/hidden-states/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/hidden-states/</guid><description>Hidden states are simply what the token vectors are called once they&amp;rsquo;re inside the model. Before layer 1, they&amp;rsquo;re called embeddings. After passing through one or more layers, they&amp;rsquo;re called hidden&amp;hellip;</description></item><item><title>What is a dot product?</title><link>https://wes.today/llms/what-happens/vectors/dot-product/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/vectors/dot-product/</guid><description>Take two vectors of the same length. Multiply their elements in pairs, then add up the results. 
That&amp;rsquo;s the entire operation:</description></item><item><title>KV cache and context memory costs</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/</guid><description>During prefill, every token at every layer produces three vectors: a &lt;strong>query&lt;/strong> (Q), a &lt;strong>key&lt;/strong> (K), and a &lt;strong>value&lt;/strong> (V). The query is ephemeral — it&amp;rsquo;s used once to compute attention scores against all&amp;hellip;</description></item><item><title>Skills: abstraction over tool calls</title><link>https://wes.today/llms/what-happens/tool-calls/skills/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tool-calls/skills/</guid><description>A skill is not something the model does — it&amp;rsquo;s something the &lt;strong>harness&lt;/strong> (the software wrapping the model) does before and after the model runs.</description></item><item><title>Planning and multi-step execution</title><link>https://wes.today/llms/what-happens/thinking/planning/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/thinking/planning/</guid><description>Planning is an extension of thinking combined with tool use. The model doesn&amp;rsquo;t have a planner module — it generates a plan as text, then executes it step by step through the normal token generation loop.</description></item><item><title>Tokenization performance: where does it run and what's the bottleneck?</title><link>https://wes.today/llms/what-happens/tokens/tokenization/tokenization-perf/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/tokenization/tokenization-perf/</guid><description>Tokenization runs on the &lt;strong>CPU&lt;/strong>. Not the GPU. 
This surprises people because everything else in the LLM pipeline is GPU-bound, but tokenization is the wrong shape for GPU execution.</description></item><item><title>Dimension trade-offs: expressiveness vs. cost</title><link>https://wes.today/llms/what-happens/embeddings/layer-transforms/dimension-tradeoffs/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/layer-transforms/dimension-tradeoffs/</guid><description>The embedding dimension (&lt;code>d_model&lt;/code>) is a design choice made before training. Common values:</description></item><item><title>Attention deep dive: what does it mean for a token to "pay attention"?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/</guid><description>In practice, the model doesn&amp;rsquo;t run attention once per layer — it runs it &lt;strong>multiple times in parallel&lt;/strong> (typically 32-128 times), each with its own independent set of W_Q, W_K, W_V weight matrices&amp;hellip;.</description></item><item><title>FFN deep dive: the per-token thinking step</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/ffn-deep-dive/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/ffn-deep-dive/</guid><description>After attention has mixed information between tokens, each token&amp;rsquo;s vector passes through the &lt;strong>feed-forward network (FFN)&lt;/strong> — and this is the solo step. No cross-token interaction whatsoever. 
Every&amp;hellip;</description></item><item><title>From final vector to predicted token</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/</guid><description>After 80 layers of attention and FFN, each token&amp;rsquo;s vector has been transformed into a rich contextual representation. But the model needs to produce an actual token — a word (or subword) from its&amp;hellip;</description></item><item><title>Sparse attention: skipping tokens you don't need</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/sparse-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/sparse-attention/</guid><description>Full attention means every token attends to every other token — T² comparisons. Sparse attention says: most of those comparisons are wasted. Token 4,000 probably doesn&amp;rsquo;t need to attend to token 12&amp;hellip;.</description></item><item><title>Attention approximations: breaking the T² barrier differently</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/attention-approximations/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/attention-approximations/</guid><description>Where sparse attention &lt;em>skips&lt;/em> certain token pairs, approximation methods try to compute &lt;em>something like&lt;/em> full attention but with less math. 
The goal is the same — avoid the T² bottleneck — but the&amp;hellip;</description></item><item><title>What is quantization?</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/quantization/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/quantization/</guid><description>Quantization means storing numbers with fewer bits. Every weight, every cached K/V vector, every activation is a number that takes up space in memory. The question is: how precisely do you need to&amp;hellip;</description></item><item><title>Paged attention: virtual memory for KV cache</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/paged-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/paged-attention/</guid><description>The problem paged attention solves isn&amp;rsquo;t the size of the KV cache per se — it&amp;rsquo;s &lt;strong>fragmentation&lt;/strong> and &lt;strong>waste&lt;/strong> when serving many concurrent requests.</description></item><item><title>KV cache offloading: trading latency for capacity</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/kv-cache-offloading/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/kv-cache-offloading/</guid><description>When the KV cache exceeds GPU HBM capacity, you have two choices: refuse the request, or move some of the cache to slower memory. 
Offloading chooses the second option — it creates a memory&amp;hellip;</description></item><item><title>MQA and GQA: reducing cache size at the architecture level</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/mqa-gqa/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/mqa-gqa/</guid><description>The problem: in standard Multi-Head Attention (MHA), each of the 64 heads has its own K and V projections, so the KV cache stores 64 sets of K and V per layer. Most of the KV cache cost comes from&amp;hellip;</description></item><item><title>Multi-head attention: how it works and what it costs</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/multi-head-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/multi-head-attention/</guid><description>&lt;strong>How multi-head splits the work.&lt;/strong> Say &lt;code>d_model&lt;/code> = 8,192 and you have 64 attention heads. Instead of running one attention operation across all 8,192 dimensions, you split the vector into 64 chunks&amp;hellip;</description></item><item><title>How does the model know when to stop?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/stopping/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/stopping/</guid><description>The model doesn&amp;rsquo;t &amp;ldquo;decide&amp;rdquo; it has fully answered your question. It has no concept of completeness, correctness, or satisfaction. 
It stops because it predicts a special token.</description></item><item><title>What happens when you send a message to an LLM</title><link>https://wes.today/llms/what-happens/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/</guid><description>When you send a message to an LLM like ChatGPT or Claude, here&amp;rsquo;s what happens at a high level:</description></item></channel></rss>