```mermaid
graph LR
    A["Active tokens<br/>(recent)"]:::s1 --> HBM["GPU HBM<br/>(192 GB, ~8 TB/s)"]:::s1
    B["Warm tokens<br/>(medium-range)"]:::s2 --> CPU["CPU RAM<br/>(1-2 TB, ~200 GB/s)"]:::s2
    C["Cold tokens<br/>(distant past)"]:::s3 --> NVM["NVMe / Storage<br/>(many TB, ~10-50 GB/s)"]:::s3
    HBM -.->|"evict"| CPU
    CPU -.->|"evict"| NVM
    NVM -.->|"fetch"| CPU
    CPU -.->|"fetch"| HBM
    classDef s1 fill:#2d6a4f,stroke:#1b4332,color:#d8f3dc
    classDef s2 fill:#e76f51,stroke:#9c3a1a,color:#fff
    classDef s3 fill:#264653,stroke:#1a3340,color:#e0e0e0
    classDef default fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    click A "/llms/what-happens/tokens/"
    click B "/llms/what-happens/tokens/"
    click C "/llms/what-happens/tokens/"
```

🟢 HBM (fast, small) · 🟠 CPU RAM (medium) · 🔵 Storage (slow, large)

When the KV cache exceeds GPU HBM capacity, you have two choices: refuse the request, or move some of the cache to slower memory. Offloading chooses the second option: it creates a memory hierarchy for the KV cache, analogous to the CPU cache → RAM → disk hierarchy in traditional computing.

The tiers:

Tier | Medium   | Capacity       | Bandwidth   | Latency
Hot  | GPU HBM  | 192 GB (B200)  | ~8 TB/s     | ~ns
Warm | CPU DRAM | 1-2 TB typical | ~200 GB/s   | ~100 ns
Cold | NVMe SSD | many TB        | ~10-50 GB/s | ~10-100 µs
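To make the capacity pressure concrete, here is a back-of-envelope sizing calculation. The model dimensions below (a 70B-class model with grouped-query attention, FP16 K/V) are illustrative assumptions, not figures from this article; note also that in practice HBM must hold model weights alongside the cache.

```python
# Assumed model shape (illustrative, roughly 70B-class with GQA, FP16 K/V)
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

# Bytes per cached token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")  # 320 KiB

def tokens_that_fit(capacity_gb: float) -> int:
    """How many cached tokens a tier of the given capacity can hold."""
    return int(capacity_gb * 1024**3 / bytes_per_token)

# Tier capacities taken from the table above (1.5 TB / 8 TB are example points)
for tier, gb in [("HBM (192 GB)", 192), ("CPU RAM (1.5 TB)", 1536), ("NVMe (8 TB)", 8192)]:
    print(f"{tier:18s} ~{tokens_that_fit(gb):,} tokens")
```

At these assumed dimensions, even a full 192 GB of HBM tops out in the low hundreds of thousands of tokens once weights are accounted for, which is exactly the regime where the warm and cold tiers start to matter.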

How offloading works: The insight is that during decode, the new token probably doesn’t attend equally to all previous tokens. Attention scores typically concentrate on recent tokens and a handful of distant “anchor” tokens, with most of the middle getting near-zero attention. So you keep the most likely-to-be-attended K/V in HBM and evict the rest to CPU RAM or NVMe.
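A minimal sketch of that selection step, assuming the system tracks a running cumulative attention score per token position (the function and variable names here are illustrative, not from any particular serving stack):

```python
# Pick which token positions stay resident in HBM: the local window of
# recent tokens, plus the highest-cumulative-attention "anchor" tokens.
def resident_set(cum_attn: list[float], window: int, n_anchors: int) -> set[int]:
    n = len(cum_attn)
    recent = set(range(max(0, n - window), n))   # always keep recent tokens
    candidates = [i for i in range(n) if i not in recent]
    # among older tokens, keep the ones that historically draw the most attention
    anchors = sorted(candidates, key=lambda i: cum_attn[i], reverse=True)[:n_anchors]
    return recent | set(anchors)

scores = [0.9, 0.01, 0.02, 0.6, 0.01, 0.02, 0.05, 0.3]
keep = resident_set(scores, window=3, n_anchors=2)
# Positions 5-7 (recent) plus 0 and 3 (high attention) stay; the rest are evicted.
```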

When a new token's attention scores indicate it needs an evicted K/V pair, that pair gets fetched back into HBM. This is a cache miss: it stalls the computation while waiting for the data to arrive from a slower tier.
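The hit/miss control flow can be sketched as a toy two-tier lookup. Plain dicts stand in for HBM and CPU RAM here; a real system moves tensor pages over PCIe or NVLink, and the class and policy below are hypothetical simplifications:

```python
class TieredKVCache:
    """Toy two-tier KV cache: an 'hbm' dict backed by a 'cpu' dict.
    Models only the control flow (hit, miss, evict, promote)."""
    def __init__(self, hbm_capacity: int):
        self.hbm, self.cpu = {}, {}
        self.hbm_capacity = hbm_capacity
        self.misses = 0

    def put(self, pos: int, kv):
        if len(self.hbm) >= self.hbm_capacity:
            victim = min(self.hbm)        # naive policy: evict the oldest position
            self.cpu[victim] = self.hbm.pop(victim)
        self.hbm[pos] = kv

    def get(self, pos: int):
        if pos in self.hbm:               # cache hit: no stall
            return self.hbm[pos]
        self.misses += 1                  # cache miss: fetch from the slower tier
        kv = self.cpu.pop(pos)
        self.put(pos, kv)                 # promote back into HBM
        return kv

cache = TieredKVCache(hbm_capacity=2)
for pos in range(4):
    cache.put(pos, f"kv{pos}")
cache.get(0)                              # position 0 was evicted: miss + re-promotion
assert cache.misses == 1
```

The `evict the oldest position` line is where a real eviction policy (see the strategies below) would plug in.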

The performance trade-off is stark:

  • HBM read for a cache hit: microseconds
  • CPU RAM fetch for a cache miss: ~10-100× slower
  • NVMe fetch: ~1,000-10,000× slower than HBM

A single NVMe fetch is small in absolute time (microseconds), but attention may need many K/V pairs per step, and these stalls add up. The quality of the eviction policy, i.e. how well it predicts which K/V pairs will be needed, determines whether offloading is viable or cripplingly slow.

Eviction strategies:

  • LRU (least recently used): evict tokens that haven’t been attended to recently. Simple, but attention patterns aren’t always predictable from recency.
  • Attention-score based: track cumulative attention received by each token position. Tokens that consistently get low attention scores are evicted first. More accurate but requires tracking overhead.
  • Sliding window + landmarks: keep a local window in HBM, plus the first few tokens and any tokens flagged as high-attention during prefill. Evict everything else. This maps well to observed attention patterns.
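The last strategy is simple enough to sketch directly. This is a hypothetical implementation, assuming prefill has already flagged a set of high-attention landmark positions:

```python
def sliding_window_landmarks(n_tokens: int, window: int,
                             n_sink: int, landmarks: set[int]) -> set[int]:
    """Resident set for the sliding-window + landmarks policy:
    the first n_sink tokens, any prefill-flagged landmarks, and the
    local window of recent tokens. Everything else may be evicted."""
    sink = set(range(min(n_sink, n_tokens)))
    recent = set(range(max(0, n_tokens - window), n_tokens))
    return sink | recent | {i for i in landmarks if i < n_tokens}

keep = sliding_window_landmarks(n_tokens=1000, window=256, n_sink=4,
                                landmarks={42, 371})
# The first 4 tokens, positions 42 and 371, and positions 744-999 stay in HBM;
# the other ~738 positions are candidates for eviction.
```

One appeal of this policy is that the resident set is cheap to compute and mostly static per step, unlike score-based tracking.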

Performance profile: Offloading is fundamentally about trading latency for capacity. With everything in HBM, decode is ~28ms/token (bandwidth-bound). With a 50% hit rate on CPU RAM offloading, you might see ~40-60ms/token: slower, but the alternative was not being able to serve the request at all. The economics only make sense at long contexts where the KV cache exceeds HBM; for short contexts that fit in HBM, offloading adds overhead for no benefit. This is also where purpose-built KV cache storage tiers become relevant: storage that sits between NVMe and HBM in the bandwidth/latency hierarchy, optimized for the specific access patterns of KV cache reads.
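Those numbers can be sanity-checked with a simple expected-value model. The 28 ms baseline is the article's figure; the per-miss stall and fetches-per-step values below are illustrative assumptions, and the model pessimistically assumes misses serialize rather than overlap:

```python
def expected_ms_per_token(base_ms: float, hit_rate: float,
                          miss_penalty_ms: float, fetches_per_step: int) -> float:
    """Expected decode latency if each of `fetches_per_step` K/V lookups
    misses with probability (1 - hit_rate) and stalls for miss_penalty_ms.
    Worst case: assumes miss stalls serialize; real systems overlap fetches."""
    expected_misses = (1 - hit_rate) * fetches_per_step
    return base_ms + expected_misses * miss_penalty_ms

# 50% hit rate, ~1 ms stall per CPU-RAM miss, ~30 fetches/step (assumed values)
print(expected_ms_per_token(28.0, 0.50, 1.0, 30))  # 43.0 ms, inside the 40-60 ms range
```

Pushing the hit rate from 50% to 90% under the same assumptions drops the expected latency to ~31 ms, which is why eviction-policy quality dominates the economics.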