<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLMs: All the Rabbit Holes on Wes Kennedy</title><link>https://wes.today/llms/</link><description>Recent content in LLMs: All the Rabbit Holes on Wes Kennedy</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 13 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wes.today/llms/index.xml" rel="self" type="application/rss+xml"/><item><title>What are vectors?</title><link>https://wes.today/llms/what-happens/vectors/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/vectors/</guid><description>A vector is just a list of numbers. That&amp;rsquo;s it. The list &lt;code>[0.2, -1.5, 0.8]&lt;/code> is a three-dimensional vector. The word &amp;ldquo;dimensional&amp;rdquo; sounds intimidating, but all it means is how many numbers are in the&amp;hellip;</description></item><item><title>What is a token?</title><link>https://wes.today/llms/what-happens/tokens/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/</guid><description>A token is the unit of text that an LLM actually works with. It&amp;rsquo;s not a word, not a character — it&amp;rsquo;s somewhere in between. The model has a fixed vocabulary of tokens (typically 30,000–100,000 of&amp;hellip;</description></item><item><title>What are embeddings and how are they created?</title><link>https://wes.today/llms/what-happens/embeddings/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/</guid><description>After tokenization gives you a sequence of token IDs like &lt;code>[40, 3021, 5765, 18510, 540]&lt;/code>, the model needs to convert each ID into a vector the neural network can work with. 
This is the embedding&amp;hellip;</description></item><item><title>Prefill vs decode</title><link>https://wes.today/llms/what-happens/prefill-decode/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/</guid><description>When you send a message to an LLM, inference happens in two distinct phases. Understanding these two phases is key to understanding why LLMs perform the way they do.</description></item><item><title>How does "thinking" work?</title><link>https://wes.today/llms/what-happens/thinking/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/thinking/</guid><description>There is no separate &amp;ldquo;thinking&amp;rdquo; module. The model doesn&amp;rsquo;t have an internal scratchpad or a different mode of processing. Thinking is just &lt;strong>more token generation&lt;/strong> — the model produces tokens that&amp;hellip;</description></item><item><title>How do tool calls work?</title><link>https://wes.today/llms/what-happens/tool-calls/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tool-calls/</guid><description>Tool calling is not a special capability baked into the model architecture. The model doesn&amp;rsquo;t &amp;ldquo;run code&amp;rdquo; or &amp;ldquo;access the internet.&amp;rdquo; It generates text that happens to follow a specific structured&amp;hellip;</description></item><item><title>How does memory work?</title><link>https://wes.today/llms/what-happens/memory/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/memory/</guid><description>LLMs have no persistent memory. Every conversation starts from zero — the model has its weights (fixed, trained knowledge) and whatever tokens are in the current context window. 
When the context is&amp;hellip;</description></item><item><title>How tokenization actually works</title><link>https://wes.today/llms/what-happens/tokens/tokenization/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/tokenization/</guid><description>Let&amp;rsquo;s walk through what happens when you type &amp;ldquo;I love unbelievable tacos&amp;rdquo; and hit send.</description></item><item><title>What are weights?</title><link>https://wes.today/llms/what-happens/embeddings/weights/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/weights/</guid><description>A weight is a single number that the model can adjust to change its behavior. That&amp;rsquo;s it. When people say a model has &amp;ldquo;405 billion parameters,&amp;rdquo; they mean it has 405 billion individual numbers —&amp;hellip;</description></item><item><title>Gradients and gradient updates (how weights get their values)</title><link>https://wes.today/llms/what-happens/embeddings/gradients/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/gradients/</guid><description>Training works by repeated trial and error, automated at massive scale. Here&amp;rsquo;s the loop:</description></item><item><title>"Directions in the space encode relationships"</title><link>https://wes.today/llms/what-happens/embeddings/directions/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/directions/</guid><description>This is one of the most surprising things about embeddings. Take the classic example: king - man + woman ≈ queen. 
What&amp;rsquo;s actually happening in the vector math?</description></item><item><title>What are model layers?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/</guid><description>A layer is one complete round of transformation that every token vector passes through. If the model has 80 layers (typical for a large LLM), then each token vector gets transformed 80 times in&amp;hellip;</description></item><item><title>How do layers transform vectors?</title><link>https://wes.today/llms/what-happens/embeddings/layer-transforms/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/layer-transforms/</guid><description>The actual operations are matrix multiplications. Every transformation in a layer — attention, feed-forward — boils down to multiplying a vector by a weight matrix to produce a new vector.</description></item><item><title>What are hidden states?</title><link>https://wes.today/llms/what-happens/embeddings/hidden-states/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/hidden-states/</guid><description>Hidden states are simply what the token vectors are called once they&amp;rsquo;re inside the model. Before layer 1, they&amp;rsquo;re called embeddings. After passing through one or more layers, they&amp;rsquo;re called hidden&amp;hellip;</description></item><item><title>What is a dot product?</title><link>https://wes.today/llms/what-happens/vectors/dot-product/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/vectors/dot-product/</guid><description>Take two vectors of the same length. Multiply their elements in pairs, then add up the results. 
That&amp;rsquo;s the entire operation:</description></item><item><title>KV cache and context memory costs</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/</guid><description>During prefill, every token at every layer produces three vectors: a &lt;strong>query&lt;/strong> (Q), a &lt;strong>key&lt;/strong> (K), and a &lt;strong>value&lt;/strong> (V). The query is ephemeral — it&amp;rsquo;s used once to compute attention scores against all&amp;hellip;</description></item><item><title>Skills: abstraction over tool calls</title><link>https://wes.today/llms/what-happens/tool-calls/skills/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tool-calls/skills/</guid><description>A skill is not something the model does — it&amp;rsquo;s something the &lt;strong>harness&lt;/strong> (the software wrapping the model) does before and after the model runs.</description></item><item><title>Planning and multi-step execution</title><link>https://wes.today/llms/what-happens/thinking/planning/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/thinking/planning/</guid><description>Planning is an extension of thinking combined with tool use. The model doesn&amp;rsquo;t have a planner module — it generates a plan as text, then executes it step by step through the normal token generation loop.</description></item><item><title>Tokenization performance: where does it run and what's the bottleneck?</title><link>https://wes.today/llms/what-happens/tokens/tokenization/tokenization-perf/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/tokens/tokenization/tokenization-perf/</guid><description>Tokenization runs on the &lt;strong>CPU&lt;/strong>. Not the GPU. 
This surprises people because everything else in the LLM pipeline is GPU-bound, but tokenization is the wrong shape for GPU execution.</description></item><item><title>Dimension trade-offs: expressiveness vs. cost</title><link>https://wes.today/llms/what-happens/embeddings/layer-transforms/dimension-tradeoffs/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/layer-transforms/dimension-tradeoffs/</guid><description>The embedding dimension (&lt;code>d_model&lt;/code>) is a design choice made before training. Common values:</description></item><item><title>Attention deep dive: what does it mean for a token to "pay attention"?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/</guid><description>In practice, the model doesn&amp;rsquo;t run attention once per layer — it runs it &lt;strong>multiple times in parallel&lt;/strong> (typically 32-128 times), each with its own independent set of W_Q, W_K, W_V weight matrices&amp;hellip;.</description></item><item><title>FFN deep dive: the per-token thinking step</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/ffn-deep-dive/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/ffn-deep-dive/</guid><description>After attention has mixed information between tokens, each token&amp;rsquo;s vector passes through the &lt;strong>feed-forward network (FFN)&lt;/strong> — and this is the solo step. No cross-token interaction whatsoever. 
Every&amp;hellip;</description></item><item><title>From final vector to predicted token</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/</guid><description>After 80 layers of attention and FFN, each token&amp;rsquo;s vector has been transformed into a rich contextual representation. But the model needs to produce an actual token — a word (or subword) from its&amp;hellip;</description></item><item><title>Sparse attention: skipping tokens you don't need</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/sparse-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/sparse-attention/</guid><description>Full attention means every token attends to every other token — T² comparisons. Sparse attention says: most of those comparisons are wasted. Token 4,000 probably doesn&amp;rsquo;t need to attend to token 12&amp;hellip;.</description></item><item><title>Attention approximations: breaking the T² barrier differently</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/attention-approximations/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/attention-approximations/</guid><description>Where sparse attention &lt;em>skips&lt;/em> certain token pairs, approximation methods try to compute &lt;em>something like&lt;/em> full attention but with less math. 
The goal is the same — avoid the T² bottleneck — but the&amp;hellip;</description></item><item><title>What is quantization?</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/quantization/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/quantization/</guid><description>Quantization means storing numbers with fewer bits. Every weight, every cached K/V vector, every activation is a number that takes up space in memory. The question is: how precisely do you need to&amp;hellip;</description></item><item><title>Paged attention: virtual memory for KV cache</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/paged-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/paged-attention/</guid><description>The problem paged attention solves isn&amp;rsquo;t the size of the KV cache per se — it&amp;rsquo;s &lt;strong>fragmentation&lt;/strong> and &lt;strong>waste&lt;/strong> when serving many concurrent requests.</description></item><item><title>KV cache offloading: trading latency for capacity</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/kv-cache-offloading/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/kv-cache-offloading/</guid><description>When the KV cache exceeds GPU HBM capacity, you have two choices: refuse the request, or move some of the cache to slower memory. 
Offloading chooses the second option — it creates a memory&amp;hellip;</description></item><item><title>MQA and GQA: reducing cache size at the architecture level</title><link>https://wes.today/llms/what-happens/prefill-decode/kv-cache/mqa-gqa/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/prefill-decode/kv-cache/mqa-gqa/</guid><description>The problem: in standard Multi-Head Attention (MHA), each of the 64 heads has its own K and V projections, so the KV cache stores 64 sets of K and V per layer. Most of the KV cache cost comes from&amp;hellip;</description></item><item><title>Multi-head attention: how it works and what it costs</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/multi-head-attention/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/attention-deep-dive/multi-head-attention/</guid><description>&lt;strong>How multi-head splits the work.&lt;/strong> Say &lt;code>d_model&lt;/code> = 8,192 and you have 64 attention heads. Instead of running one attention operation across all 8,192 dimensions, you split the vector into 64 chunks&amp;hellip;</description></item><item><title>How does the model know when to stop?</title><link>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/stopping/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/embeddings/model-layers/final-vector-to-token/stopping/</guid><description>The model doesn&amp;rsquo;t &amp;ldquo;decide&amp;rdquo; it has fully answered your question. It has no concept of completeness, correctness, or satisfaction. 
It stops because it predicts a special token.</description></item><item><title>What happens when you send a message to an LLM</title><link>https://wes.today/llms/what-happens/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://wes.today/llms/what-happens/</guid><description>When you send a message to an LLM like ChatGPT or Claude, here&amp;rsquo;s what happens at a high level:</description></item></channel></rss>