2026-06-30
The Core Claim
The KV cache is the single optimization that makes autoregressive decoding tractable: instead of recomputing every prior token's key/value projections at each step, the engine stores them once and appends one entry per token, so total projection work over a generation drops from quadratic to linear in sequence length [Source 57]. Because decode is memory-bandwidth-bound rather than compute-bound on GPUs [Source 72], the cache's residency in HBM, not raw FLOPs, sets the ceiling: vLLM's PagedAttention allocates that HBM dynamically to actual decode length, and the reported GPU KV cache size in tokens directly determines how many requests run concurrently [Source 2][Source 8].
Evidence
1. The cache exists to delete redundant recompute, not to save space for its own sake. Without it, generating token n requires re-projecting K and V for all n−1 prior tokens every step — pure waste, since those projections never change. The cache is an append-only log of K/V projections consumed by attention's GEMMs.
"You don't modify it during the LLM inference. You just append to it, with every processed token. The name of this K and V projections storage is KV cache." — [Source 57]
2. Decode is memory-bound, so the cache — not compute — is the bottleneck. A GPU has ~100× the compute of a CPU but only ~10× the memory bandwidth; single-token decode does little math per byte moved, so it stalls on KV reads. This is why bandwidth (480 GB/s VRAM) and cache residency dominate, and why the engineering target is keeping K/V resident and contiguous.
"GPUs have over sort of two orders of magnitude more compute than a CPU... But GPUs only have an order of magnitude more memory bandwidth than a CPU. So what that actually means is if you do things that are not compute intense, you will be memory bound" — [Source 72]
3. PagedAttention turns the cache from a fixed worst-case reservation into a dynamic allocation, raising throughput. Pre-allocating HBM for max sequence length strands memory; vLLM pages it by actual decode length, and the same paging lets multiple requests share identical K/V blocks (beam search, common prefixes).
"the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths" — [Source 2]
4. The cache's token capacity is a hard concurrency ceiling you can read off the logs. After model weights load, remaining HBM divided by per-token KV size yields the servable token pool; vLLM prints it and divides by per-request length to estimate concurrency (e.g. 15.70× at 40,960 tokens/request).
"The GPU KV cache size line reports the total number of tokens that can be stored in the GPU KV cache at once." — [Source 8]
5. Sharing the cache across the prefill/decode split is where the largest production wins come from. Disaggregated serving (LLM-D) routes prefill to high-memory GPUs and scales decode separately, with both phases reading the same KV cache for similar requests — yielding a 3× P90 latency improvement and a 57× improvement in time-to-first-token.
"the prefill can use high-memory GPUs, while the decode can scale separately, but both using the same KV cache for similar request" — [Source 14]
6. Prefix caching reuses the cache across requests, deleting repeated prefill. When every RAG query shares a ~2K-token system prompt, the KV states for that prefix are computed once and reused, skipping redundant prefill on a 32B model.
"this eliminates redundant prefill computation — saving 200-500ms per query on a 32B model" — [Source 35]
7. Quantizing the cache to FP8 trades precision for more resident tokens. Halving K/V byte-width nearly doubles the token pool from insight #4, directly increasing throughput and max context — vLLM supports fp8_e4m3 on both CUDA and ROCm.
"This optimization enables you to store more tokens in memory, leading to improved throughput and support for longer context windows." — [Source 42]
How It Works
flowchart LR
P[Prompt tokens] --> PF[Prefill: compute K,V for all tokens]
PF --> KV[(KV cache in HBM)]
KV --> AT[Attention GEMM]
AT --> TOK[Emit next token]
TOK --> AP[Append new K,V]
AP --> KV
KV --> CC[Concurrency = HBM pool / per-req KV]
Prefill populates the cache once for the whole prompt; each decode step then reads the resident cache, emits one token, and appends only that token's K/V, so the per-step cost is a bandwidth-bound read plus a small append, and the free HBM left after weights bounds how many requests can hold caches at once [Source 57][Source 8].
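A toy version of that loop makes the saving concrete. The sketch below is a single-head, single-layer attention in pure Python, assuming random weights and random input vectors standing in for token embeddings; it treats prompt and generated tokens uniformly and is not how any production engine is written. KVCache, decode_with_cache and decode_without_cache are names invented for this illustration.

```python
import math
import random


def matvec(w, x):
    # w is a list of rows (d_out x d_in); x is a vector of length d_in.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all available keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Append-only store of per-token K and V projections."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_with_cache(xs, wq, wk, wv):
    cache, outputs, projections = KVCache(), [], 0
    for x in xs:
        cache.append(matvec(wk, x), matvec(wv, x))  # project only the new token
        projections += 1
        outputs.append(attend(matvec(wq, x), cache.keys, cache.values))
    return outputs, projections


def decode_without_cache(xs, wq, wk, wv):
    outputs, projections = [], 0
    for t in range(len(xs)):
        keys = [matvec(wk, x) for x in xs[: t + 1]]    # re-project every prior K
        values = [matvec(wv, x) for x in xs[: t + 1]]  # and every prior V
        projections += t + 1
        outputs.append(attend(matvec(wq, xs[t]), keys, values))
    return outputs, projections


random.seed(0)
D = 4
wq, wk, wv = ([[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)] for _ in range(3))
xs = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

cached, n_cached = decode_with_cache(xs, wq, wk, wv)
naive, n_naive = decode_without_cache(xs, wq, wk, wv)
print(n_cached, n_naive)
```

Both paths produce the same attention outputs; only the count of K/V projections differs (one per token with the cache, 1 + 2 + ... + n without it). The attention read over the cache is still linear in sequence length per step, which is why bandwidth remains the constraint.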
What This Means in Practice
On a high-traffic stack, treat the KV cache as the capacity unit you provision and meter, exactly as you'd budget LCP/INP on the frontend. Stabilize the cacheable prefix: pin a fixed system prompt and stable chunk ordering so prefix caching actually hits; dynamically resizing context (varying retrieved-chunk count) invalidates the cached prefix and raises TTFT instead of lowering it [Source 35][Source 148]. Size --gpu-memory-utilization against the printed GPU KV cache size to set real concurrency rather than guessing [Source 8], and reach for FP8 KV (kv_cache_dtype=fp8_e4m3) before buying more cards when you need longer context or more concurrent users [Source 42]. Just as React 19 useTransition and Next.js streaming hide latency by not blocking on work already done, the KV cache and prefix reuse hide it by not recomputing work already done; the streaming TTFT a user feels is dominated by whether prefill was skipped.
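Why a stable prefix matters can be seen in a small simulation. The sketch below assumes a block-hash scheme in which each block's identity depends on every token before it; that is loosely in the spirit of block-level prefix caching, but it is not vLLM's implementation. BLOCK_SIZE, block_hashes and PrefixCache are hypothetical names, and the token ids are made up.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; illustrative, real engines use larger blocks


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so its identity depends on the whole prefix."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixCache:
    def __init__(self):
        self._seen = set()

    def lookup_and_insert(self, token_ids):
        """Return how many leading tokens were already cached, then cache this request."""
        hashes = block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h not in self._seen:
                break  # the first miss ends the reusable prefix
            hit_blocks += 1
        self._seen.update(hashes)
        return hit_blocks * BLOCK_SIZE


system_prompt = list(range(100, 108))  # 8 tokens, i.e. 2 full blocks
chunks_a = [1, 2, 3, 4]
chunks_b = [5, 6, 7, 8]

cache = PrefixCache()
first = cache.lookup_and_insert(system_prompt + chunks_a + [9])
second = cache.lookup_and_insert(system_prompt + chunks_b + [9])
reordered = cache.lookup_and_insert(chunks_b + system_prompt)
print(first, second, reordered)
```

The second request reuses the two system-prompt blocks even though its retrieved chunk differs. The third contains tokens that were all seen before, yet hits nothing, because moving content ahead of the system prompt changes every chained hash.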
Counter-Evidence / Limits
The cache is a speedup only while it stays resident in VRAM: when allocations spill K/V pages to GTT/system RAM over PCIe (~20 GB/s vs ~480 GB/s VRAM), the same mechanism inverts into a ~24× per-token penalty, pushing TTFT from ~50ms to 800–1200ms [Source 69]. Capacity tactics fight each other — speculative decoding's draft model and its own KV claim 1.5–3 GB that would otherwise hold concurrent requests' caches [Source 79], and shrinking context to save tokens can cost more latency than it saves by busting the prefix cache [Source 148]. The corpus is unanimous that the cache is foundational, but it disagrees on where the cache should live: on consumer AMD RDNA with no MIG/MPS isolation, the dominant advice is to stop co-locating and physically isolate the LLM's cache on a dedicated card rather than manage contention [Source 16][Source 147]. Finally, sharing a decrypted cache across workers in disaggregated serving is a real security surface — re-encrypting per decode step would erase the entire latency win, so isolation, not crypto on the hot path, is the mitigation [Source 36].
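The ~24× figure is just the ratio of the two bandwidths quoted above, and the same ratio applied to a ~50ms TTFT lands at the top of the 800–1200ms range. The sketch below extends that to partial spills under a deliberately crude assumption: per-token time is purely bandwidth-bound and the resident and spilled portions are read one after the other. spill_slowdown is a hypothetical helper, not a measured model.

```python
VRAM_GBPS = 480.0  # VRAM bandwidth quoted in the text
PCIE_GBPS = 20.0   # PCIe bandwidth quoted in the text


def spill_slowdown(spilled_fraction, vram_gbps=VRAM_GBPS, pcie_gbps=PCIE_GBPS):
    """Per-token slowdown when a fraction of the KV bytes is read over PCIe.

    Assumes reads are serialized and purely bandwidth-bound (a simplification).
    """
    resident_time = (1.0 - spilled_fraction) / vram_gbps
    spilled_time = spilled_fraction / pcie_gbps
    baseline_time = 1.0 / vram_gbps
    return (resident_time + spilled_time) / baseline_time


full_spill = spill_slowdown(1.0)
partial_spill = spill_slowdown(0.1)
spilled_ttft_ms = 50.0 * full_spill  # scaling the ~50ms resident figure

print(full_spill, partial_spill, spilled_ttft_ms)
```

Under these assumptions even a 10% spill costs roughly 3.3× per token, which is why the limit is better read as "keep everything resident" than as a gradual trade-off.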
Today's CEMENT brick
Execute-blind: Start a vLLM server (or inspect a running one) and grep the startup log for the two lines GPU KV cache size: N tokens and Maximum concurrency for M tokens per request: X. Before reading them, write down your predicted max concurrency: (VRAM − weights) / per-token-KV gives the token pool, and dividing that by M tokens per request gives the concurrency. Compare to the printed X: the gap is your real headroom for prefix caching and FP8 KV, and it tells you whether your next throughput win is a config flag or a hardware spend [Source 8][Source 42].
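The comparison step can be scripted. This is a sketch that assumes the two log lines follow the wording quoted above; exact formatting may differ between vLLM versions, so treat the regexes as a starting point. SAMPLE_LOG is an illustrative stand-in for a real startup log, and parse_kv_report and concurrency_gap are hypothetical helpers.

```python
import re

SIZE_RE = re.compile(r"GPU KV cache size:\s*([\d,]+)\s*tokens")
CONC_RE = re.compile(r"Maximum concurrency for\s+([\d,]+)\s+tokens per request:\s*([\d.]+)")

# Illustrative sample only; paste your own startup log here.
SAMPLE_LOG = """\
INFO GPU KV cache size: 643,232 tokens
INFO Maximum concurrency for 40,960 tokens per request: 15.70x
"""


def parse_kv_report(log_text: str) -> dict:
    """Extract the token pool and the reported concurrency from startup log text."""
    size = SIZE_RE.search(log_text)
    conc = CONC_RE.search(log_text)
    if size is None or conc is None:
        raise ValueError("expected KV cache lines not found in log")
    return {
        "pool_tokens": int(size.group(1).replace(",", "")),
        "tokens_per_request": int(conc.group(1).replace(",", "")),
        "reported_concurrency": float(conc.group(2)),
    }


def concurrency_gap(predicted: float, report: dict) -> float:
    """Positive means the engine reports more headroom than you predicted."""
    return report["reported_concurrency"] - predicted


report = parse_kv_report(SAMPLE_LOG)
my_prediction = 12.0  # write this down BEFORE looking at the log
print(report, concurrency_gap(my_prediction, report))
```

Dividing pool_tokens by tokens_per_request should reproduce the printed concurrency to two decimals, which is a quick check that you parsed the right lines.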
Sources
- vLLM inference frameworks
- Parallelism and Scaling — GPU KV cache size log
- LLM‑D Explained: Building Next‑Gen AI with LLMs, RAG & Kubernetes
- Self-Learning Q&A — CPU vs GPU reranking / KV eviction
- Self-Learning Q&A — disaggregated KV cache security & re-encryption cost
- Self-Learning Q&A — production AI topology, prefix caching & KV reuse
- Quantized KV Cache — FP8 KV Cache Overview
- tiny-vllm — Why KV cache exists
- Self-Learning Q&A — cross-instance KV spill to GTT latency
- Building Windsurf with Varun Mohan
- Self-Learning Q&A — speculative decoding draft-model KV overhead
- Self-Learning Q&A — token-budget optimizer vs prefix-cache invalidation
- Self-Learning Q&A — cross-encoder + embedding GPU partitioning