<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Wes Kennedy</title><link>https://wes.today/</link><description>Recent content on Wes Kennedy</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wes.today/index.xml" rel="self" type="application/rss+xml"/><item><title>What are vectors?</title><link>https://wes.today/llms/what-happens/vectors/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/vectors/</guid><description>A vector is just a list of numbers. That&amp;rsquo;s it. The list &lt;code>[0.2, -1.5, 0.8]&lt;/code> is a three-dimensional vector. The word &amp;ldquo;dimensional&amp;rdquo; sounds intimidating, but all it means is how many numbers are in the&amp;hellip;</description></item><item><title>What is a token?</title><link>https://wes.today/llms/what-happens/tokens/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/</guid><description>A token is the unit of text that an LLM actually works with. It&amp;rsquo;s not a word, not a character — it&amp;rsquo;s somewhere in between. The model has a fixed vocabulary of tokens (typically 30,000–100,000 of&amp;hellip;</description></item><item><title>What are embeddings and how are they created?</title><link>https://wes.today/llms/what-happens/embeddings/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/</guid><description>After tokenization gives you a sequence of token IDs like &lt;code>[40, 3021, 5765, 18510, 540]&lt;/code>, the model needs to convert each ID into a vector the neural network can work with. 
This is the embedding&amp;hellip;</description></item><item><title>Prefill vs decode</title><link>https://wes.today/llms/what-happens/prefill-decode/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/</guid><description>When you send a message to an LLM, inference happens in two distinct phases. Understanding these two phases is key to understanding why LLMs perform the way they do.</description></item><item><title>How does "thinking" work?</title><link>https://wes.today/llms/what-happens/thinking/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/thinking/</guid><description>There is no separate &amp;ldquo;thinking&amp;rdquo; module. The model doesn&amp;rsquo;t have an internal scratchpad or a different mode of processing. Thinking is just &lt;strong>more token generation&lt;/strong> — the model produces tokens that&amp;hellip;</description></item><item><title>How do tool calls work?</title><link>https://wes.today/llms/what-happens/tool-calls/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tool-calls/</guid><description>Tool calling is not a special capability baked into the model architecture. The model doesn&amp;rsquo;t &amp;ldquo;run code&amp;rdquo; or &amp;ldquo;access the internet.&amp;rdquo; It generates text that happens to follow a specific structured&amp;hellip;</description></item><item><title>How does memory work?</title><link>https://wes.today/llms/what-happens/memory/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/memory/</guid><description>LLMs have no persistent memory. Every conversation starts from zero — the model has its weights (fixed, trained knowledge) and whatever tokens are in the current context window. 
When the context is&amp;hellip;</description></item><item><title>How tokenization actually works</title><link>https://wes.today/llms/what-happens/tokens/tokenization/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/tokenization/</guid><description>Let&amp;rsquo;s walk through what happens when you type &amp;ldquo;I love unbelievable tacos&amp;rdquo; and hit send.</description></item><item><title>What are weights?</title><link>https://wes.today/llms/what-happens/embeddings/weights/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/weights/</guid><description>A weight is a single number that the model can adjust to change its behavior. That&amp;rsquo;s it. When people say a model has &amp;ldquo;405 billion parameters,&amp;rdquo; they mean it has 405 billion individual numbers —&amp;hellip;</description></item><item><title>Gradients and gradient updates (how weights get their values)</title><link>https://wes.today/llms/what-happens/embeddings/gradients/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/gradients/</guid><description>Training works by repeated trial and error, automated at massive scale. Here&amp;rsquo;s the loop:</description></item><item><title>"Directions in the space encode relationships"</title><link>https://wes.today/llms/what-happens/embeddings/directions/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/directions/</guid><description>This is one of the most surprising things about embeddings. Take the classic example: king - man + woman ≈ queen. 
What&amp;rsquo;s actually happening in the vector math?</description></item><item><title>What are model layers?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/</guid><description>A layer is one complete round of transformation that every token vector passes through. If the model has 80 layers (typical for a large LLM), then each token vector gets transformed 80 times in&amp;hellip;</description></item><item><title>How do layers transform vectors?</title><link>https://wes.today/llms/what-happens/embeddings/layer-transforms/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/layer-transforms/</guid><description>The actual operations are matrix multiplications. Every transformation in a layer — attention, feed-forward — boils down to multiplying a vector by a weight matrix to produce a new vector.</description></item><item><title>What are hidden states?</title><link>https://wes.today/llms/what-happens/embeddings/hidden-states/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/hidden-states/</guid><description>Hidden states are simply what the token vectors are called once they&amp;rsquo;re inside the model. Before layer 1, they&amp;rsquo;re called embeddings. After passing through one or more layers, they&amp;rsquo;re called hidden&amp;hellip;</description></item><item><title>What is a dot product?</title><link>https://wes.today/llms/what-happens/vectors/dot-product/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/vectors/dot-product/</guid><description>Take two vectors of the same length. Multiply their elements in pairs, then add up the results. 
That&amp;rsquo;s the entire operation:</description></item><item><title>KV cache and context memory costs</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/</guid><description>During prefill, every token at every layer produces three vectors: a &lt;strong>query&lt;/strong> (Q), a &lt;strong>key&lt;/strong> (K), and a &lt;strong>value&lt;/strong> (V). The query is ephemeral — it&amp;rsquo;s used once to compute attention scores against all&amp;hellip;</description></item><item><title>Skills: abstraction over tool calls</title><link>https://wes.today/llms/what-happens/tool-calls/skills/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tool-calls/skills/</guid><description>A skill is not something the model does — it&amp;rsquo;s something the &lt;strong>harness&lt;/strong> (the software wrapping the model) does before and after the model runs.</description></item><item><title>Planning and multi-step execution</title><link>https://wes.today/llms/what-happens/thinking/planning/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/thinking/planning/</guid><description>Planning is an extension of thinking combined with tool use. The model doesn&amp;rsquo;t have a planner module — it generates a plan as text, then executes it step by step through the normal token generation loop.</description></item><item><title>Tokenization performance: where does it run and what's the bottleneck?</title><link>https://wes.today/llms/what-happens/tokens/tokenization/tokenization-perf/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/tokenization/tokenization-perf/</guid><description>Tokenization runs on the &lt;strong>CPU&lt;/strong>. Not the GPU. 
This surprises people because everything else in the LLM pipeline is GPU-bound, but tokenization is the wrong shape for GPU execution.</description></item><item><title>Dimension trade-offs: expressiveness vs. cost</title><link>https://wes.today/llms/what-happens/embeddings/layer-transforms/dimension-tradeoffs/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/layer-transforms/dimension-tradeoffs/</guid><description>The embedding dimension (&lt;code>d_model&lt;/code>) is a design choice made before training. Common values:</description></item><item><title>Attention deep dive: what does it mean for a token to "pay attention"?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/</guid><description>In practice, the model doesn&amp;rsquo;t run attention once per layer — it runs it &lt;strong>multiple times in parallel&lt;/strong> (typically 32-128 times), each with its own independent set of W_Q, W_K, W_V weight matrices&amp;hellip;.</description></item><item><title>FFN deep dive: the per-token thinking step</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/ffn-deep-dive/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/ffn-deep-dive/</guid><description>After attention has mixed information between tokens, each token&amp;rsquo;s vector passes through the &lt;strong>feed-forward network (FFN)&lt;/strong> — and this is the solo step. No cross-token interaction whatsoever. 
Every&amp;hellip;</description></item><item><title>From final vector to predicted token</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/</guid><description>After 80 layers of attention and FFN, each token&amp;rsquo;s vector has been transformed into a rich contextual representation. But the model needs to produce an actual token — a word (or subword) from its&amp;hellip;</description></item><item><title>Sparse attention: skipping tokens you don't need</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/sparse-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/sparse-attention/</guid><description>Full attention means every token attends to every other token — T² comparisons. Sparse attention says: most of those comparisons are wasted. Token 4,000 probably doesn&amp;rsquo;t need to attend to token 12&amp;hellip;.</description></item><item><title>Attention approximations: breaking the T² barrier differently</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/attention-approximations/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/attention-approximations/</guid><description>Where sparse attention &lt;em>skips&lt;/em> certain token pairs, approximation methods try to compute &lt;em>something like&lt;/em> full attention but with less math. 
The goal is the same — avoid the T² bottleneck — but the&amp;hellip;</description></item><item><title>What is quantization?</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/quantization/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/quantization/</guid><description>Quantization means storing numbers with fewer bits. Every weight, every cached K/V vector, every activation is a number that takes up space in memory. The question is: how precisely do you need to&amp;hellip;</description></item><item><title>Paged attention: virtual memory for KV cache</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/paged-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/paged-attention/</guid><description>The problem paged attention solves isn&amp;rsquo;t the size of the KV cache per se — it&amp;rsquo;s &lt;strong>fragmentation&lt;/strong> and &lt;strong>waste&lt;/strong> when serving many concurrent requests.</description></item><item><title>KV cache offloading: trading latency for capacity</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/kv-cache-offloading/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/kv-cache-offloading/</guid><description>When the KV cache exceeds GPU HBM capacity, you have two choices: refuse the request, or move some of the cache to slower memory. 
Offloading chooses the second option — it creates a memory&amp;hellip;</description></item><item><title>MQA and GQA: reducing cache size at the architecture level</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/mqa-gqa/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/mqa-gqa/</guid><description>The problem: in standard Multi-Head Attention (MHA), each of the 64 heads has its own K and V projections, so the KV cache stores 64 sets of K and V per layer. Most of the KV cache cost comes from&amp;hellip;</description></item><item><title>Multi-head attention: how it works and what it costs</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/multi-head-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/multi-head-attention/</guid><description>&lt;strong>How multi-head splits the work.&lt;/strong> Say &lt;code>d_model&lt;/code> = 8,192 and you have 64 attention heads. Instead of running one attention operation across all 8,192 dimensions, you split the vector into 64 chunks&amp;hellip;</description></item><item><title>How does the model know when to stop?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/stopping/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/stopping/</guid><description>The model doesn&amp;rsquo;t &amp;ldquo;decide&amp;rdquo; it has fully answered your question. It has no concept of completeness, correctness, or satisfaction. 
It stops because it predicts a special token.</description></item><item><title>Hello World</title><link>https://wes.today/posts/hello-world/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/posts/hello-world/</guid><description>&lt;p>I&amp;rsquo;ve spent a lot of time in the last decade+ maintaining a website/blog. I&amp;rsquo;ve been maintaining the website for different reasons over the years, at the beginning of my career I would share a lot of operational things that I ran into that I thought other people would find value in, then it shifted into a place where I would dig deep into technical concepts that are more complex, then it became somewhere where I would more or less be a corporate shill. Never once has it truly represented me as a whole person. So we&amp;rsquo;re going to give this thing a fresh go.&lt;/p></description></item><item><title>What happens when you send a message to an LLM</title><link>https://wes.today/llms/what-happens/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/</guid><description>When you send a message to an LLM like ChatGPT or Claude, here&amp;rsquo;s what happens at a high level:</description></item><item><title>About</title><link>https://wes.today/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://wes.today/about/</guid><description>&lt;p>I&amp;rsquo;m Wes Kennedy — a systems archaeologist by instinct, technical marketer by trade.&lt;/p>
&lt;p>My default mode is reverse-engineering: I walk into any system, whether it&amp;rsquo;s technical, mechanical, or organizational, and start reading the decisions frozen inside it. How it got here. Where it&amp;rsquo;s drifted. Where it&amp;rsquo;s headed. I&amp;rsquo;ve been doing this since before I had words for it, across farming, electrical work, woodworking, enterprise infrastructure, and AI datacenter architecture. The domain changes. The underlying work doesn&amp;rsquo;t.&lt;/p></description></item><item><title>Contact</title><link>https://wes.today/contact/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://wes.today/contact/</guid><description>&lt;p>No fancy forms, just email. &lt;code>wk&lt;/code> at this domain.&lt;/p>
&lt;p>Don&amp;rsquo;t try to sell me something; cold solicitations are a quick way to lose my attention and will result in me blocking your email address. Just like the roofers who come to my house after every bad storm trying to sell me a new roof, I won&amp;rsquo;t be using your services.&lt;/p>
&lt;p>Aside from that, I&amp;rsquo;m super friendly and will be happy to hear from you. I&amp;rsquo;m pretty swamped, so don&amp;rsquo;t expect a quick response.&lt;/p></description></item><item><title>Now</title><link>https://wes.today/now/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://wes.today/now/</guid><description>&lt;p>People wear more than one hat; we&amp;rsquo;re WAY more than just the labor we trade for capitalism points. Here are the things that have me occupied these days.&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Daddy&amp;rdquo; to an amazing 5 year old kid&lt;/li>
&lt;li>Husband to my wife&lt;/li>
&lt;li>Aspiring woodworker&lt;/li>
&lt;li>Work at NVIDIA as a Senior Technical Marketing Engineer - DGX Platforms&lt;/li>
&lt;li>Board Member &amp;amp; IT Director at The Juniper School&lt;/li>
&lt;li>Baron Bruno, Chaotic Good, D&amp;amp;D Campaign&lt;/li>
&lt;/ul>
&lt;h2 id="side-quests">Side Quests&lt;/h2>
&lt;p>This is a constantly growing list of side quests that can become exhausting at times. I&amp;rsquo;ve worked very hard to slim this down, but I can&amp;rsquo;t help myself sometimes.&lt;/p></description></item></channel></rss>