What the KV cache actually is
During autoregressive decoding, a transformer generates one token at a time. At each attention layer, the model has to compute a Query, a Key, and a Value tensor for every token in the sequence. The Query for the new token is fresh, but the Keys and Values for all the prior tokens are identical to what was computed on the previous step. Recomputing them every time would be wasteful: generating the 500th token would mean redoing the work for tokens 1 through 499, at every layer, for nothing.
The KV cache is the obvious fix. The model stores K and V from every past step in GPU memory, and each new token only needs one new K row and one new V row appended per layer. The attention operation then scores the new Query against the cached Keys with dot products, passes the scores through a softmax, and uses the resulting weights to take a weighted sum of the cached Values. The attention work for each new token stops being quadratic in sequence length and becomes linear, which is what makes streaming a large language model at interactive speeds possible at all.
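The append-and-attend loop can be sketched in a few lines. This is a minimal single-head, single-layer illustration in plain Python, not how any particular framework implements it; the names KVCache, decode_step_cached, and decode_step_uncached are made up for this example, and the weights are random. It checks that the cached path gives the same output as recomputing every K and V from scratch.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over all cached positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical per-layer cache: one K row and one V row per past token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step_cached(x_new, Wq, Wk, Wv, cache):
    # Only the new token is projected; older K/V rows come from the cache.
    cache.append(matvec(Wk, x_new), matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), cache.keys, cache.values)

def decode_step_uncached(xs, Wq, Wk, Wv):
    # Recomputes K and V for every token seen so far (the wasteful path).
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

random.seed(0)
d = 4  # toy hidden size

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
max_diff = 0.0
for t in range(len(tokens)):
    cached = decode_step_cached(tokens[t], Wq, Wk, Wv, cache)
    full = decode_step_uncached(tokens[: t + 1], Wq, Wk, Wv)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))
```

The uncached path performs t + 1 Key and Value projections at step t, while the cached path always performs one, which is the saving the cache buys.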
If you have ever wondered why the first token of a response feels slow and the rest feel quick, the KV cache is the reason. The first token requires a full prefill pass over the prompt. Every token after that is a cheap append.
Why decoder-only LLMs cache K and V
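The catch is memory. A rough sizing formula is 2 times n_layers times d_model times sequence_length times bytes_per_element, where the leading 2 covers both K and V and bytes_per_element is the storage width of each cached value (2 for 16-bit). The formula assumes standard multi-head attention, where every Query head has its own K and V head. Plugging in a 70 billion parameter Llama-class shape (80 layers, d_model of 8192) at a 128k context window in 16-bit precision gives roughly 340 GB for a single sequence, far more than 80 GB, and that is before counting the model weights. This is why inference servers obsess over KV cache layout the way databases obsess over page sizes.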
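As a worked version of the sizing formula, here is a small Python sketch. It generalises d_model to n_kv_heads times head_dim, which equals d_model under plain multi-head attention. The shape used (80 layers, 64 heads of dimension 128, and an 8-head grouped variant) is an assumption chosen to resemble a 70B Llama-class model; the function name is made up for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_element=2):
    # Leading 2 covers K and V; one row of each per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element

# Assumed 70B-like shape: d_model = 64 heads * 128 dims = 8192.
N_LAYERS, HEAD_DIM, SEQ_LEN = 80, 128, 128 * 1024
GIB = 2**30

mha_bytes = kv_cache_bytes(N_LAYERS, 64, HEAD_DIM, SEQ_LEN)  # one K/V head per Query head
gqa_bytes = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)   # 8 shared K/V heads

print(f"multi-head:    {mha_bytes / GIB:.0f} GiB per sequence")
print(f"grouped-query: {gqa_bytes / GIB:.0f} GiB per sequence")
```

Under these assumptions the 16-bit cache comes to 320 GiB per 128k-token sequence with full multi-head attention and 40 GiB with eight shared K/V heads.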
Three optimizations are now standard:
- PagedAttention, introduced in the vLLM paper (Kwon et al., 2023), treats the KV cache like operating-system virtual memory. The cache is split into fixed-size blocks that are allocated on demand, so fragmentation is nearly eliminated, and blocks can even be shared across requests that have a common prefix. The paper reports 2 to 4 times higher throughput at the same latency as prior serving systems, and the technique has been widely adopted in production inference stacks.
- Grouped-Query Attention, or GQA, splits the Query heads into groups and has each group share a single Key and Value head. The cache shrinks by the grouping factor with only a small quality cost. Llama 2 used it for its largest models, and it is standard across all sizes from Llama 3 onward.
- Multi-Query Attention, or MQA, is the extreme version: one K and V head shared across all Query heads. Cheaper still, slightly more quality loss.
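The paging idea in the first bullet can be illustrated with a toy block table. This is only a sketch of the bookkeeping, not vLLM's actual code or API: BlockAllocator, Sequence, and BLOCK_SIZE are hypothetical names, no tensors are stored, and only completely filled blocks are shared so that no copy-on-write step is needed.

```python
BLOCK_SIZE = 4  # tokens per block; an arbitrary illustrative value

class BlockAllocator:
    """Hands out physical block ids and tracks how many sequences use each."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free_blocks.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free_blocks.append(block)

class Sequence:
    """Maps logical block positions to physical blocks, like a page table."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def fork(self):
        # A new request reuses the full blocks of a common prefix.
        child = Sequence(self.allocator)
        full = self.num_tokens // BLOCK_SIZE
        child.block_table = [self.allocator.share(b) for b in self.block_table[:full]]
        child.num_tokens = full * BLOCK_SIZE
        return child

    def release_all(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0

allocator = BlockAllocator(num_blocks=16)
parent = Sequence(allocator)
for _ in range(10):
    parent.append_token()          # 10 tokens occupy 3 blocks

child = parent.fork()              # shares the 2 full blocks (8 tokens)
for _ in range(3):
    child.append_token()           # continues in 1 fresh block

blocks_in_use = len(allocator.refcount)
parent.release_all()
blocks_after_release = len(allocator.refcount)
```

Here the two sequences together occupy 4 physical blocks rather than 6, and the shared prefix blocks stay alive after the parent is released because the child still references them.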
The cache itself can also be quantized. Storing K and V in 8-bit or even 4-bit precision instead of 16-bit halves or quarters the footprint, at a small accuracy cost. See quantization for the same idea applied to the weights.
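To make the trade-off concrete, here is a minimal sketch of 8-bit quantization for one cached row, using a simple symmetric absolute-maximum scale. This scheme is an assumption chosen for illustration; real serving stacks use a variety of schemes (per-channel or per-group scales, different rounding), and the function names are made up for this example.

```python
def quantize_int8(row):
    # One shared scale per row maps the largest magnitude to 127.
    scale = max(abs(x) for x in row) / 127.0 or 1.0  # fall back to 1.0 for an all-zero row
    codes = [round(x / scale) for x in row]
    return codes, scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

row = [0.5, -1.27, 0.0, 0.31]          # toy K or V row
codes, scale = quantize_int8(row)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each element now needs one byte instead of two, plus one scale per row, and the rounding error per element is at most half the scale.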
Why KV cache matters for AI chatbots
For a chatbot operator, the KV cache is the difference between a chat that streams smoothly and a chat that stalls between tokens. Long conversations grow the cache linearly, so providers cap context not just because the model degrades past its trained range, but because the per-request memory footprint becomes the bottleneck for batch size, and batch size is what determines GPU economics.
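The link between context length and batch size can be shown with back-of-the-envelope arithmetic. Every number here is a hypothetical chosen for round results: an 80 GiB accelerator, 16 GiB of weights, and a cache of 128 KiB per token (32 layers, 8 K/V heads of dimension 128, 16-bit). The sketch also ignores activations and other overheads.

```python
GIB = 2**30

def max_concurrent_sequences(gpu_bytes, weight_bytes, bytes_per_token, context_len):
    # Whatever memory the weights leave free is divided among per-request caches.
    budget = gpu_bytes - weight_bytes
    return budget // (bytes_per_token * context_len)

bytes_per_token = 2 * 32 * 8 * 128 * 2   # K and V, 32 layers, 8 heads x 128 dims, 2 bytes

batch_8k = max_concurrent_sequences(80 * GIB, 16 * GIB, bytes_per_token, 8 * 1024)
batch_32k = max_concurrent_sequences(80 * GIB, 16 * GIB, bytes_per_token, 32 * 1024)
```

Under these assumptions, quadrupling the context cap from 8k to 32k tokens cuts the number of full-length conversations that fit on the device from 64 to 16.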
This is also why streaming response UX and KV cache behavior are linked. Streaming relies on the model producing one cached-decode step per token. If the cache is evicted partway through, the model has to recompute, and the stream visibly hitches.
Prompt caching: the KV cache as a product feature
The most interesting recent development is that providers have started exposing the KV cache as a billing line item. They call it prompt caching, and it is the same data structure with a different lifecycle: instead of throwing the cache away at the end of a request, the provider keeps the prefill state around long enough that a follow-up request with the same prefix can skip the prefill entirely.
- Anthropic prompt caching (launched 2024) bills cache writes at 1.25 times the base input price for the 5-minute tier, and cache reads at 0.1 times the base price. That is a 90 percent discount on the cached prefix on every reuse. The math works out such that a single cache hit already pays back the write surcharge.
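- OpenAI prompt caching is automatic on prompts of 1,024 tokens or more for GPT-4o-class and o-series models. Cached prefixes are billed at roughly 50 percent of the uncached input price, with no code changes required.
- Conceptually, both work the same way: match the longest previously seen prefix (for example by hashing it), look up its prefilled KV state, prefill only the tokens that follow, and then decode. The main difference is control: Anthropic caches up to breakpoints you mark explicitly in the request, while OpenAI picks the prefix automatically.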
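The break-even claim for the 1.25 and 0.1 multipliers can be checked with a few lines of arithmetic. This sketch counts only the shared prefix, in units of the base input price per token, and assumes every follow-up request arrives while the cache entry is still alive; the function names are made up for this example.

```python
def cached_prefix_cost(prefix_tokens, n_requests, write_mult=1.25, read_mult=0.10):
    # The first request pays the write surcharge; every later one pays the read rate.
    return prefix_tokens * (write_mult + read_mult * (n_requests - 1))

def uncached_prefix_cost(prefix_tokens, n_requests):
    # Without caching, the full prefix is billed at the base rate every time.
    return prefix_tokens * 1.0 * n_requests

for n in (1, 2, 10):
    print(n, cached_prefix_cost(10_000, n), uncached_prefix_cost(10_000, n))
```

With a single request, caching costs 25 percent more than not caching. From the second request onward it is cheaper, and the gap widens with every reuse.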
The wins are largest where the same long context appears on every turn: lengthy system prompts, role schemas, retrieval context that repeats across messages, few-shot examples, and tool definitions. ChatRaj's system prompt is structured so repeated parts hit prompt-cache rules at the provider, lowering per-message cost on long conversations.
The catch worth knowing: prompt caching only works if the cached prefix is bit-identical. A timestamp or a randomly ordered tool list at the start of the prompt invalidates the cache for everything after it. Put the volatile parts at the end.
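A toy comparison shows why ordering matters. The sketch below measures how many leading tokens two consecutive prompts share, which is the most a prefix cache could reuse. The token lists and helper names are invented for illustration, and real providers match on their own tokenization and minimum lengths.

```python
def shared_prefix_len(a, b):
    # Number of leading tokens that are identical in both prompts.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = ["You", "are", "a", "helpful", "assistant", "."]
TOOLS = ["tool:calc", "tool:search"]  # kept in a fixed order

def prompt_volatile_first(timestamp, question):
    return [timestamp] + SYSTEM + TOOLS + [question]

def prompt_volatile_last(timestamp, question):
    return SYSTEM + TOOLS + [timestamp, question]

reusable_first = shared_prefix_len(
    prompt_volatile_first("12:00:01", "hi"),
    prompt_volatile_first("12:00:02", "hi"),
)
reusable_last = shared_prefix_len(
    prompt_volatile_last("12:00:01", "hi"),
    prompt_volatile_last("12:00:02", "hi"),
)
```

With the timestamp first, the two prompts share no leading tokens. With it last, all eight system and tool tokens can be reused.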
Where the KV cache fits next to other terms
It is easy to conflate the KV cache with the context window. They are related but distinct: the context window is the budget for how many tokens the model can attend to, while the KV cache is where the model actually stores the per-layer state for those tokens. The window is the policy. The cache is the data structure that lives inside it.