6. Inference and Compression
1 Overview
When a user sends a request to an AI assistant, the system has a few hundred milliseconds to a few seconds to do a lot of things: receive the request, tokenize it, process the prompt, build the KV cache, generate the answer token by token, and stream the result back to the client. It looks like a single “model call.” But the real problem engineers face is resource coordination: how do you find an operable balance among latency, throughput, GPU memory, cost, and reliability?
That is the central question of this chapter: How do engineers run large language models efficiently under real-world constraints?
Inference is first a systems engineering problem, not just a model problem. Put the same base model behind different serving stacks, schedulers, cache strategies, and quantization paths, and you get very different user experiences and cost curves. A system can have excellent average throughput and still be slow on the first token. It can also show a high tokens/s number and still blow up its online SLA because of tail latency and OOMs.
Once you see that, LLM inference breaks cleanly into two basic stages:
- Prefill: process the input prompt and build the first round of hidden states and the KV cache.
- Decode: generate tokens step by step in the autoregressive loop.
Nearly every inference optimization that follows—KV cache, continuous batching, chunked prefill, PagedAttention, FlashAttention, speculative decoding, quantization, and P/D disaggregation—is, at bottom, an engineering trade-off around the different bottlenecks of these two stages.
After this chapter, you should be able to:
- explain the different resource bottlenecks of prefill and decode from a systems perspective;
- estimate KV cache size and reason about the relationship among context length, batch size, and GPU memory;
- understand what continuous batching, scheduling, admission control, prefix caching, and speculative decoding each solve;
- distinguish the role boundaries of FlashAttention, PagedAttention, FlashDecoding, kernel fusion, and tensor parallelism;
- make engineering trade-offs among weight quantization, activation quantization, KV quantization, pruning, distillation, and QAT;
- design a sensible inference architecture for different settings, such as chat, batch processing, long-context RAG, and single-GPU-constrained deployment;
- use failure-mode thinking to diagnose high TTFT, jittery generation speed, OOMs, quantization regressions, and unstable structured output.
2 Inference Flow and Performance Bottlenecks
- What stages does a live LLM request pass through?
- Why do prefill and decode have different physical bottlenecks?
- What do TTFT, generation speed, and throughput each measure, and why can they not be collapsed into one metric?
- When the first token is slow or generation stutters, what should you suspect first?
Inference is not one workload. It is a pipeline of distinct sub-stages. If you do not understand the bottlenecks of that pipeline first, every optimization that follows turns into guesswork.
2.1 How a Request Moves Through the Inference System
From a systems point of view, a request usually passes through five stages:
- Request ingress and queueing: the gateway receives the request and handles authentication, rate limiting, routing, and SLA tiering.
- Tokenization and preprocessing: tokenization, template expansion, schema compilation, and tool-parameter assembly.
- Prefill: feed the full prompt into the model in one shot, compute the representations at each layer, and write out the KV cache.
- Decode: the model generates output token by token in an autoregressive loop.
- Streaming the output back: the sampled result is serialized and returned to the client through SSE, WebSocket, or gRPC.
What engineers most often miss is this: the user’s “time to first token” is not determined by the model alone. Queueing, prefix-cache hits or misses, grammar compilation, cold starts, and gateway tail latency can all lengthen TTFT. By contrast, the output stage looks like “just emit a few more tokens,” but that is often where the system’s real decode bottleneck becomes most visible.
2.2 Prefill and Decode: One Model, Two Workloads
Let the input sequence length for prefill be \(L\). The system processes the full prompt in parallel at every layer, so there is a lot of matrix multiplication and attention work that can be done together. This stage has several characteristics:
- high parallelism along the token dimension;
- a large share of General Matrix-Matrix Multiplication (GEMM);
- large intermediate working sets for attention;
- many optimizations aimed at keeping the GPU Tensor Cores as full as possible.
Decode is completely different. Each step adds only one new query position, but it must read the full history in the KV cache, compute the distribution of the next token, and append the new K/V to the cache. This stage has its own characteristics:
- serial dependence along the sequence dimension;
- little incremental compute per step;
- a steadily growing amount of historical KV to read;
- user-perceived latency that is highly sensitive to single-step decode speed.
So prefill looks more like a compute problem, while decode looks more like a bandwidth and state-management problem. That is why many teams start by staring at model FLOPs and only later realize that the real online bottlenecks are KV cache, HBM bandwidth, and the scheduler.
2.3 Why Attention Gets More Expensive as Sequences Get Longer
Attention is expensive not just because “the formula is complicated,” but because every token has to interact with an ever-longer history.
Let the input sequence length be \(L\) and the hidden dimension be \(d\). In vanilla self-attention, we first construct:
\[ Q \in \mathbb{R}^{L \times d}, \qquad K \in \mathbb{R}^{L \times d}, \qquad V \in \mathbb{R}^{L \times d} \]
Then compute the relevance score matrix:
\[ S = QK^\top \in \mathbb{R}^{L \times L} \]
Each token \(i\) computes one similarity score against each token \(j\), so in total you compute roughly \(L^2\) scores. The compute complexity of just this part can be written as:
\[ O(L^2 d) \]
And the storage size of the score matrix itself is:
\[ O(L^2) \]
Then after softmax, you multiply by \(V\):
\[ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \]
This process still operates on an attention matrix of size \(L \times L\). So when sequence length grows from \(L\) to \(2L\), both the relevance computation and the intermediate-state size roughly grow by a factor of 4. That is why vanilla attention is commonly described as \(O(L^2)\).
Even with KV cache at inference time, the problem does not disappear. KV cache avoids recomputing K/V for historical tokens, but at decode step \(t\), the current query still has to compute attention against all prior historical K. So the read and scoring cost of single-step attention is still proportional to the history length:
\[ O(td) \]
If you continue generating a total of \(T\) tokens, then from decode alone, the cumulative attention-related cost is roughly:
\[ \sum_{t=1}^{T} O(td) = O(T^2 d) \]
More generally, if the prompt length is \(L_{\text{prompt}}\) and the subsequent generation length is \(T\), then the cumulative attention cost of the decode stage can be written as:
\[ \sum_{t=1}^{T} O\big((L_{\text{prompt}} + t)d\big) = O\big(TL_{\text{prompt}}d + T^2 d\big) \]
That is why once context gets long, decode quickly stops being “compute the next token” and turns into “keep reading a large slab of historical state.”
The engineering implications are straightforward:
- longer prompts increase prefill time;
- longer generated history increases the per-step cost of decode;
- longer context is not just slower—it also consumes more GPU memory, because KV cache grows linearly as well.
In other words, context length is not a “pure model-capability parameter.” It is a first-class systems budget.
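The scaling argument above can be checked with a few lines of arithmetic. This is a toy cost model (the function names and constant factors are illustrative, not a profiler):

```python
# Rough cost model for vanilla self-attention: the score matrix alone is
# O(L^2) in storage and O(L^2 d) in compute. Illustrative, not a benchmark.

def attention_score_flops(L: int, d: int) -> int:
    """FLOPs for S = Q K^T: L*L dot products of length d (~2d FLOPs each)."""
    return 2 * L * L * d

def score_matrix_elems(L: int) -> int:
    """Number of entries in the L x L score matrix."""
    return L * L

d = 128
for L in (1024, 2048):
    print(L, attention_score_flops(L, d), score_matrix_elems(L))

# Doubling L multiplies both the score FLOPs and the score storage by ~4.
assert attention_score_flops(2048, d) == 4 * attention_score_flops(1024, d)
assert score_matrix_elems(2048) == 4 * score_matrix_elems(1024)
```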
2.4 Viewing Inference Bottlenecks Through Compute and Bandwidth
The most common misdiagnosis in inference systems is to treat “slow” as one problem. It is not. There are at least two kinds of slow: slow to compute and slow to move data.
One of the most useful systems intuitions for telling which one a stage is closer to is its Arithmetic Intensity:
\[ AI = \frac{\text{FLOPs}}{\text{Bytes moved from memory}} \]
The formula is simple, but powerful. It asks: for each unit of data you move, how much computation do you actually do?
If arithmetic intensity is high, then once the data arrives, the chip has a lot of work to do with it. Such stages are more likely to hit the compute roof first. If arithmetic intensity is low, the system spends most of its time moving data around. Such stages are more likely to hit the bandwidth roof first.
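As a toy illustration, here is the arithmetic intensity of a prefill-like matrix-matrix multiply versus a decode-like batch-1 matrix-vector multiply, under the simplifying assumption that each operand is moved exactly once (FP16, 2 bytes/element; the function name is made up):

```python
# Toy arithmetic-intensity estimate (FLOPs per byte moved), assuming FP16
# operands and that each input/output matrix is moved exactly once.

def gemm_ai(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * m * n * k                            # one multiply-add per (m, n, k)
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Prefill-like: large matrix-matrix multiply -> very high intensity.
prefill_like = gemm_ai(4096, 4096, 4096)
# Decode-like: batch-1 matrix-vector multiply -> intensity near 1 FLOP/byte.
decode_like = gemm_ai(1, 4096, 4096)

print(round(prefill_like), round(decode_like, 2))
```

The three-orders-of-magnitude gap is the whole story: prefill can keep Tensor Cores busy between memory transfers, while batch-1 decode does barely one FLOP per byte it moves.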
Prefill processes the full input in one shot. Matrix multiplies are large, parallelism is strong, and Tensor Cores are easier to saturate. It looks more like a large construction project: once the materials are in place, computation can unfold at scale. Decode is different. Each decode step generates only one token. The compute gets smaller, but you still have to repeatedly read weights, access the KV cache, do sampling, and schedule work. So the system looks less like sustained heavy thinking and more like constant document lookup.
You can think of them as two completely different work styles:
- prefill is like reading through a long dialogue in full and only then getting into the zone;
- decode is like already being in the zone, but having to flip through old notes before every word.
That is why “inference optimization” is never one-size-fits-all.
In large-model inference, this difference almost determines the direction of every optimization that follows:
| Stage | Arithmetic intensity | More common bottleneck | Common optimizations |
|---|---|---|---|
| Long-prompt prefill | High | Compute, kernel efficiency | FlashAttention, kernel fusion, larger batches |
| Single-token decode | Low | HBM bandwidth, KV access | GQA/MQA, KV quantization, FlashDecoding |
| Long-context decode | Lower | Bandwidth + cache layout | PagedAttention, KV compression, prefix reuse |
The same word—acceleration—can mean very different things. If a method only helps large matrix multiplies, it usually improves prefill. If a method mainly reduces state reads or optimizes cache layout, it is more likely to improve decode. The two stages do not face the same bottleneck, so they should not be hit with the same hammer.
The value of the roofline model is not that it computes every kernel to two decimal places. Its value is that it gives you a fast diagnostic frame:
If current GPU utilization is low, is it because you are not computing enough, or because you are constantly waiting on data?
Once you ask the right question, debugging gets much faster. Otherwise it is easy to spend a week pushing in the wrong direction and end up with a 3% improvement.
2.5 TTFT, Generation Cadence, and System Throughput
The most common metrics in inference systems all sound like they measure “speed,” but they answer completely different questions.
- TTFT (Time To First Token): how long before the system starts speaking.
- ITL (Inter-Token Latency) / TPOT (Time Per Output Token): whether the system speaks at a stable rhythm.
- Throughput: how many requests, or how many output tokens, the system can serve per unit time.
A lot of online discussion gets confused because people mix these three goals together. From a systems point of view, they are not describing the same thing at all.
TTFT is often the sum of several parts:
\[ TTFT \approx T_{\text{queue}} + T_{\text{preprocess}} + T_{\text{prefill}} + T_{\text{decode},1} + T_{\text{serialize}} + T_{\text{network}} \]
In other words, time to first token is not just “how fast the model infers.” It first depends on whether the request was queued, whether preprocessing slowed it down, whether prefill was amplified by long context, whether the first decode step was dragged down by the current batch shape, and whether the first byte was flushed to the client quickly.
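The decomposition above can be turned into a simple latency budget. The stage names mirror the formula; the millisecond values are made up purely for illustration:

```python
# TTFT as a sum of stage latencies (ms). Stage names follow the decomposition
# above; the numbers are illustrative, not measurements of any system.

ttft_budget_ms = {
    "queue": 20.0,
    "preprocess": 5.0,
    "prefill": 180.0,
    "first_decode_step": 25.0,
    "serialize": 2.0,
    "network": 18.0,
}

ttft_ms = sum(ttft_budget_ms.values())
print(f"TTFT ~ {ttft_ms:.0f} ms")

# Even when prefill dominates, the model-side work is only part of the story:
model_share = ttft_budget_ms["prefill"] / ttft_ms
print(f"prefill share: {model_share:.0%}")
```

A budget like this makes the diagnosis concrete: if measured TTFT is far above the sum of the model-side terms, the regression is in queueing, preprocessing, or the network path, not in the model.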
ITL or TPOT is different. It is closer to the cadence of the decode subsystem, and can be roughly written as:
\[ TPOT \approx T_{\text{decode-step}} + T_{\text{sample}} + T_{\text{flush}} \]
What matters here is not the first word. It is whether the system can keep moving forward token by token at a stable pace. It is often shaped by the following factors:
- the efficiency of the decode kernel;
- High Bandwidth Memory (HBM) bandwidth, i.e., how fast data can move between GPU memory and the GPU compute units;
- the size and access pattern of the KV cache;
- the current batch shape;
- whether it is being interrupted by long prefills;
- the extra overhead introduced by constrained decoding or speculative decoding.
About GPU memory. You can picture one machine like this:
- The CPU is the manager in the office.
- The GPU is a factory built for large-scale parallel computation.
- Main memory (RAM) is the office’s large filing cabinet.
- GPU memory (VRAM/HBM) is the small filing cabinet next to the factory.
When the GPU is actually computing, what it wants most is to read data directly from GPU memory. That is because GPU memory is physically closer to the GPU and offers much higher bandwidth. If the data is still sitting in CPU memory, it has to be moved into GPU memory first before computation can run efficiently.
Throughput is a third thing again. It cares about this:
\[ \text{Throughput} \approx \frac{\text{processed requests or output tokens}}{\text{unit time}} \]
This is closer to “how much output the whole production line can ship in an hour” than to “whether one user feels this particular response is fast.”
So a system can absolutely show the following apparently contradictory—but in practice very common—behavior:
- high throughput, but slow first token;
- not-slow first token, but choppy generation.
The former optimizes total capacity. The latter optimizes startup speed. Both can be rational choices, but only if you know what you are optimizing for.
For chat products, TTFT often determines whether the system feels alive. For long-form generation, ITL more strongly determines whether it feels like fluent speech. For offline batch jobs or high-concurrency platforms, throughput and cost are often the primary targets.
A mature inference system does not stare at just one metric. It puts all three on the same scorecard, because together they describe three different engineering capabilities:
- how fast the system can start
- how steadily it can continue
- how cheaply it can scale
If you collapse those three into one, system design will drift off course very quickly.
2.6 Why Single-Request Decode Usually Hits Bandwidth First
When many people first encounter large-model inference, they naturally assume that the stronger the GPU, the more the problem looks like “is there enough compute?” But in single-request decode, reality is usually different. What you hit first is often not the FLOPs ceiling, but memory bandwidth.
Start with a rough but useful order-of-magnitude estimate. Assume:
- H100 SXM memory bandwidth is about 3.35 TB/s;
- a 70B BF16 model has about 140 GB of weights;
- current workload is single-request decode, batch size = 1;
- to build intuition, we roughly approximate “generating one token requires scanning the main weights once.”
Then the theoretical upper bound for single-token decode can be written approximately as:
\[ \text{tokens/s} \approx \frac{3.35\ \text{TB/s}}{140\ \text{GB}} \approx 24 \]
This estimate is not precise, and it is not describing the real throughput of any particular framework. Its value is that it quickly corrects your intuition about the decode bottleneck.
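The bound itself is one line of arithmetic. A minimal sketch, using the same simplifying assumption that generating one token requires streaming the full weights once:

```python
# Bandwidth-bound decode ceiling: tokens/s <= bandwidth / bytes read per token.
# Assumes batch size 1 and that each token scans the full weights once
# (KV cache reads and all other overheads ignored, as in the text above).

def decode_tokens_per_s_upper_bound(bandwidth_bytes_per_s: float,
                                    weight_bytes: float) -> float:
    return bandwidth_bytes_per_s / weight_bytes

h100_bw = 3.35e12          # ~3.35 TB/s HBM bandwidth (H100 SXM)
weights_70b_bf16 = 140e9   # ~140 GB of BF16 weights for a 70B model

print(round(decode_tokens_per_s_upper_bound(h100_bw, weights_70b_bf16)))  # ~24
```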
An H100 has enormous BF16 compute, of course. But in single-request decode, the system generates only one token at each step, so the compute itself is not that large. At the same time, it often still has to keep reading large amounts of weights and state from memory. So the system feels less like “it cannot compute fast enough” and more like “the data cannot be moved to where compute happens fast enough.”
A simple mental picture helps. You are not solving a hard problem. You are writing one word at a time, and before each word you have to go back to the archive room and pull out a large stack of documents. The slowness is not because the reasoning is too complex. It is because moving the documents has become the dominant cost.
This also explains why prefill and decode have clearly different optimization directions. Prefill processes the full prompt at once; matrix multiplies are larger and parallelism is stronger, so Tensor Cores are easier to saturate. It tends to be closer to a compute-bound problem. Decode, by contrast, generates tokens step by step; each step does relatively fragmented compute, but still repeatedly reads weights, accesses the KV cache, performs sampling, and runs scheduling logic. So it is more likely to appear as a bandwidth-bound or state-access-bound problem.
In other words, the key contradiction in single-request decode is usually not “is this GPU fast enough at math?” but “can this GPU keep delivering the required data to the compute units at a high enough rate?”
In real systems, actual speed is often even lower than this rough estimate, because you have not yet counted:
- KV cache reads;
- sampling and logits post-processing;
- scheduler and runtime overhead;
- kernel efficiency of the framework itself;
- network send and streaming flush;
- and all kinds of unavoidable system noise.
So those 24 tokens/s are more like a “physics intuition line” than a performance promise. The core conclusion is this:
Single-request decode is often first a memory-bandwidth problem, and only second a pure compute problem.
Once that diagnosis is right, many later engineering choices become natural. For example, GQA / MQA reduce bandwidth pressure by shrinking KV access volume; KV quantization reduces read cost by shrinking online state; mechanisms such as FlashDecoding, PagedAttention, and prefix reuse are also, at bottom, trying to improve the data-access path during each generation step.
3 KV Cache
This section answers several key questions first:
- What exactly does KV cache store, and why does it speed up decode?
- Why does KV cache balloon so quickly as sequence length grows?
- How do MHA, MQA, and GQA each affect KV cache?
- What kinds of problems do PagedAttention and KV quantization each solve?
- When a long prompt OOMs immediately, how should you converge on the issue in engineering practice?
KV cache lets decode avoid recomputing the full history at every step, but it also turns “context length” into a GPU-memory and bandwidth problem. Whether online inference scales depends to a large extent on how you allocate, reuse, compress, and protect this state.
3.1 Why Decode Must Rely on KV Cache
Without KV cache, when the model generates the \(t\)-th token, it has to recompute K and V for all earlier tokens. That cost is repetitive and expensive. KV cache works like this:
- During prefill, store the Key and Value for historical tokens at each layer;
- during decode, compute new Q/K/V only for the new token;
- attention reads the historical K/V directly, then appends the new K/V to the cache.
So the core value of KV cache is not “making attention simpler.” It is avoiding repeated computation of the history. That is why it is the default configuration for almost every decoder-only production service.
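The three steps above can be sketched for a single attention head with NumPy. This is an illustrative toy (random weights, one head, no layers or batching), not any framework's implementation:

```python
import numpy as np

# Minimal single-head decode loop with a KV cache: each step computes Q/K/V
# for the NEW token only, appends K/V to the cache, and attends over the full
# cached history. Shapes and weights are illustrative.

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))

def decode_step(x, K_cache, V_cache):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])      # append new K/V; history untouched
    V_cache = np.vstack([V_cache, v])
    scores = (q @ K_cache.T) / np.sqrt(d)  # attend over all cached history
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache, K_cache, V_cache

for t in range(5):
    x = rng.standard_normal((1, d))
    out, K_cache, V_cache = decode_step(x, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached K row per generated token
```

Note what is never recomputed: the cached K/V rows. Without the cache, every `decode_step` would have to re-project the entire history through `Wk` and `Wv`.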
3.2 How KV Cache Grows with Sequence Length
For a decoder-only model, the approximate KV cache size for a single token can be written as:
\[ KV_{\text{bytes/token}} = 2 \cdot N_{layers} \cdot H_{kv} \cdot d_{head} \cdot p \]
where:
- \(2\) accounts for K and V;
- \(N_{layers}\) is the number of layers;
- \(H_{kv}\) is the number of KV heads;
- \(d_{head}\) is the head dimension;
- \(p\) is the number of bytes per element, for example 2 bytes for FP16.
The total KV size is then approximately:
\[ KV_{\text{total}} = B \cdot L \cdot KV_{\text{bytes/token}} \]
where \(B\) is the number of active sequences and \(L\) is the current length of each sequence.
Plug in a common example:
- \(N_{layers}=80\)
- \(H_{kv}=8\)
- \(d_{head}=128\)
- \(p=2\) bytes (FP16)
Then:
\[ KV_{\text{bytes/token}} = 2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 = 327{,}680\ \text{bytes} \]
That is about 320 KiB / token. If the context length is 8K, then the KV cache for a single sequence is about 2.5 GiB. That is why “long context + large batch” so easily hits the GPU-memory ceiling before model weights do.
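The two formulas above translate directly into code. A small sketch with the example numbers (the helper names are made up):

```python
# KV-cache sizing from the formulas above:
#   bytes/token = 2 * N_layers * H_kv * d_head * p
#   total       = B * L * bytes/token

def kv_bytes_per_token(n_layers: int, h_kv: int, d_head: int,
                       bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * h_kv * d_head * bytes_per_elem

def kv_total_bytes(batch: int, seq_len: int, n_layers: int, h_kv: int,
                   d_head: int, bytes_per_elem: int = 2) -> int:
    return batch * seq_len * kv_bytes_per_token(n_layers, h_kv, d_head,
                                                bytes_per_elem)

per_tok = kv_bytes_per_token(80, 8, 128)          # 327,680 bytes ~= 320 KiB
per_seq_8k = kv_total_bytes(1, 8192, 80, 8, 128)  # ~= 2.5 GiB per sequence
print(per_tok, per_seq_8k / 2**30)
```

Scaling the batch term shows how fast this compounds: at 8K context, a batch of 16 such sequences already needs about 40 GiB of KV cache before any weights are counted.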
3.3 MHA, GQA, and MQA: Intuition, Mathematical Form, and KV Cache Cost
You can think of these three mechanisms as different answers to one question: Query has many “angles of view.” Do Key/Value, which hold historical information, also need a separate copy for every angle? That is the essential difference among MHA, MQA, and GQA.
Start with vanilla attention
In attention, the current token produces a Query, while historical tokens provide Key / Value. A rough intuition is:
- Q (Query): what I am trying to find right now
- K (Key): the “index label” for each historical piece of information
- V (Value): the actual content I want to retrieve from history
Attention does this:
- use the current \(Q\) to match against all historical \(K\);
- compute which historical information deserves attention right now;
- retrieve the corresponding \(V\) with weights.
The formula is the same one:
\[ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \]
Why are there multiple heads?
Because a single Query / Key / Value perspective is too narrow. So Transformers split attention into many heads. Each head is like a different angle of observation:
- one head may focus more on syntax;
- one may care more about long-range dependencies;
- one may care more about entity alignment;
- one may care more about positional relationships.
Multi-Head Attention (MHA): each head gets its own K/V
In MHA, each query head gets its own corresponding key/value head. If there are \(H\) attention heads in total, then:
- there are \(H\) query heads;
- there are \(H\) key heads;
- there are \(H\) value heads.
So every perspective stores its own copy of history.
This is the most fine-grained option, and the most expensive. What is it like? You understand a relationship from 32 different angles, and you build a full archive for each one. The upside is maximal information preservation. The downside is: the KV cache is the largest. Because at inference time, Q is not the expensive part; the historical K/V has to remain stored.
MQA (Multi-Query Attention): many Queries share one K/V
The key point of MQA is not “there is only one query.” It is this:
- there are still many query heads;
- but they share one common K/V set.
That means:
- the number of Query heads is still \(H\);
- but the number of Key / Value heads is only 1.
So:
\[ H_{kv}=1 \]
Many people can ask from different angles, but the historical archive is kept only once. That obviously slashes the KV cache: with MHA, you store one copy of history for each head; with MQA, you store it once, so both memory and bandwidth pressure drop sharply.
The problem is that different query heads clearly want to look at different things, yet they are forced to share the same historical representation. So it is cheaper, but more likely to lose quality, especially on complex tasks or long context.
GQA (Grouped-Query Attention): the compromise
GQA sits between MHA and MQA. It means:
- there are still many query heads;
- but not all Queries share one K/V;
- instead, they share in groups.
For example:
- 32 query heads
- 8 KV heads
Then every 4 query heads share one K/V group. That is:
\[ H = 32, \qquad H_{kv}=8 \]
So not everyone gets a private archive, but they are not all squeezed into one archive room either. You split them into several groups, and each group shares one copy. That gives you a compromise:
- lower memory and bandwidth cost than MHA;
- better expressive capacity than MQA.
That is why many serving models today prefer GQA.
Why is the key difference among the three really about KV cache?
Because at inference time, especially in decode:
- Q is computed on the fly for the current token;
- K/V for historical tokens must be cached.
In other words, what keeps growing with context length is K/V, not Q. So the essence of MHA, MQA, and GQA is not changing the attention formula. It is changing how many copies of historical state must be stored.
How to understand the reduction factor
If the number of query heads is \(H\) and the number of KV heads is \(H_{kv}\), then relative to MHA, the KV-cache reduction factor is approximately:
\[ \frac{H}{H_{kv}} \]
Why? Because under MHA by default \(H_{kv}=H\): every query head has its own K/V. If you reduce that to \(H_{kv}\) KV heads, the cache size shrinks roughly in proportion.
For example, with \(H = 32\):
- MHA: \(H_{kv}=32\), so the reduction factor is \(32/32=1\). No reduction.
- GQA: \(H_{kv}=8\), so the reduction factor is \(32/8=4\). The KV cache shrinks to about \(1/4\) of the original size.
- MQA: \(H_{kv}=1\), so the reduction factor is \(32/1=32\). The KV cache shrinks to about \(1/32\) of the original. That is why MQA saves the most.
Why does this matter especially for decode?
Because decode is often not bottlenecked by FLOPs first. It is bottlenecked by:
- HBM bandwidth
- KV cache reads
- GPU-memory capacity
And MHA / MQA / GQA directly affect all three. The fewer K/V copies you store:
- the smaller the GPU-memory footprint;
- the less history you need to read;
- the less bandwidth pressure decode faces;
- the easier it usually is to speed up single-step generation.
So choosing an attention variant is not an abstract architecture choice. It is a very real inference-systems decision.
Mathematical form of MHA / MQA / GQA
Notation. Let batch size be \(B\), sequence length \(L\), hidden dimension \(d_{\text{model}}\), query-head count \(H\), KV-head count \(H_{kv}\), and per-head dimension \(d_h\). Usually \(d_{\text{model}} = H \cdot d_h\). Input hidden states are \(X \in \mathbb{R}^{B \times L \times d_{\text{model}}}\).
MHA (Multi-Head Attention)
Each of the three projection matrices contains \(H\) heads:
\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]
\[ W_Q,\; W_K,\; W_V \in \mathbb{R}^{d_{\text{model}} \times (H\, d_h)} \]
After projection, reshape to: \(Q, K, V \in \mathbb{R}^{B \times L \times H \times d_h}\).
Each head performs attention independently, then the outputs are concatenated:
\[ \mathrm{Attn}^{(h)} = \mathrm{softmax}\!\left(\frac{Q^{(h)} {K^{(h)}}^\top}{\sqrt{d_h}}\right) V^{(h)}, \qquad Y = \mathrm{Concat}\!\big(\mathrm{Attn}^{(1)}, \dots, \mathrm{Attn}^{(H)}\big)\, W_O \]
MQA (Multi-Query Attention)
Q still has \(H\) heads, but K/V have only one head:
\[ \begin{aligned} W_Q &\in \mathbb{R}^{d_{\text{model}} \times (H\, d_h)} &\Longrightarrow\quad Q &\in \mathbb{R}^{B \times L \times H \times d_h} \\ W_K,\; W_V &\in \mathbb{R}^{d_{\text{model}} \times d_h} &\Longrightarrow\quad K, V &\in \mathbb{R}^{B \times L \times 1 \times d_h} \end{aligned} \]
At compute time, that single K/V group is broadcast to all query heads:
\[ \mathrm{Attn}^{(h)} = \mathrm{softmax}\!\left(\frac{Q^{(h)} K^\top}{\sqrt{d_h}}\right) V \]
The different Q heads still exist, but they share the same historical K/V.
GQA (Grouped-Query Attention)
A compromise between MHA and MQA: there are \(H_{kv}\) KV heads, and each is shared by \(g = H / H_{kv}\) query heads.
\[ \begin{aligned} W_Q &\in \mathbb{R}^{d_{\text{model}} \times (H\, d_h)} &\Longrightarrow\quad Q &\in \mathbb{R}^{B \times L \times H \times d_h} \\ W_K,\; W_V &\in \mathbb{R}^{d_{\text{model}} \times (H_{kv}\, d_h)} &\Longrightarrow\quad K, V &\in \mathbb{R}^{B \times L \times H_{kv} \times d_h} \end{aligned} \]
The \(h\)-th query head maps to KV head \(\phi(h) = \lfloor h / g \rfloor\):
\[ \mathrm{Attn}^{(h)} = \mathrm{softmax}\!\left(\frac{Q^{(h)} {K^{(\phi(h))}}^\top}{\sqrt{d_h}}\right) V^{(\phi(h))} \]
Many Q heads, fewer K/V heads; multiple Q heads map to the same KV head.
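The grouped sharing can be demonstrated with NumPy broadcasting: cache only \(H_{kv}\) K/V heads, then repeat each one across its \(g\) query heads. A toy sketch with illustrative shapes (no causal mask, random tensors):

```python
import numpy as np

# GQA sketch: H query heads, H_kv KV heads, each KV head shared by
# g = H / H_kv query heads. Query head h uses KV head h // g, which is
# exactly what np.repeat along the head axis produces.

B, L, H, H_kv, d_h = 2, 6, 32, 8, 16
g = H // H_kv  # 4 query heads per KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((B, L, H, d_h))
K = rng.standard_normal((B, L, H_kv, d_h))  # only H_kv heads are ever cached
V = rng.standard_normal((B, L, H_kv, d_h))

K_exp = np.repeat(K, g, axis=2)             # (B, L, H, d_h): broadcast to groups
V_exp = np.repeat(V, g, axis=2)

scores = np.einsum("blhd,bmhd->bhlm", Q, K_exp) / np.sqrt(d_h)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = np.einsum("bhlm,bmhd->blhd", w, V_exp)

print(out.shape)  # (2, 6, 32, 16): full H output heads from H_kv cached heads
```

The key observation is what gets stored versus what gets computed: `K` and `V` with \(H_{kv}\) heads are the cached state; the repeated views exist only transiently inside the kernel (real implementations broadcast without materializing the copy).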
| Variant | Number of KV heads | KV-cache scaling | Typical setting |
|---|---|---|---|
| MHA | \(H_{kv} = H\) | \(1\times\) | Highest training quality |
| GQA | \(1 < H_{kv} < H\) | \(H_{kv}/H\) | Llama 2/3, Qwen 2 |
| MQA | \(H_{kv} = 1\) | \(1/H\) | Extreme inference efficiency |
To sum up, the three attention mechanisms are really about how to trade attention expressivity against online KV cost.
- MHA: every query head has a corresponding K/V head. Quality is good, but KV cache is largest.
- MQA: all query heads share one K/V set. Decode is faster and KV cache is smallest, but quality may degrade.
- GQA: several query heads share one K/V set. It is a compromise among speed, GPU memory, and quality.
If the number of query heads is \(H\) and the number of KV heads is \(H_{kv}\), then relative to MHA, the KV-cache reduction factor is approximately:
\[ \frac{H}{H_{kv}} \]
That is why modern online serving stacks prefer GQA: it saves far more memory and bandwidth than MHA, but is usually more stable than MQA.
Example: how 32 Query heads map to KV heads
Use the most common example to see it directly:
- number of query heads: \(H = 32\)
- per-head dimension: \(d_h\)
- compare three schemes:
  - MHA: \(H_{kv} = 32\)
  - GQA: \(H_{kv} = 8\)
  - MQA: \(H_{kv} = 1\)
MHA: 32 Q heads map to 32 KV heads
Here the mapping is one-to-one:
\[ Q_1 \rightarrow KV_1,\quad Q_2 \rightarrow KV_2,\quad \dots,\quad Q_{32} \rightarrow KV_{32} \]
Q heads: Q01 Q02 Q03 Q04 Q05 Q06 Q07 Q08 Q09 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32
KV heads: K01 K02 K03 K04 K05 K06 K07 K08 K09 K10 K11 K12 K13 K14 K15 K16
K17 K18 K19 K20 K21 K22 K23 K24 K25 K26 K27 K28 K29 K30 K31 K32
mapping: Q01→K01, Q02→K02, ..., Q32→K32
At this point:
\[ H_{kv} = H = 32 \]
So the KV cache is not reduced at all. This is the baseline case.
GQA: 32 Q heads are split into 8 groups that share 8 KV heads
Assume:
\[ H = 32,\qquad H_{kv} = 8 \]
Then each KV head is shared by:
\[ g = \frac{H}{H_{kv}} = \frac{32}{8} = 4 \]
query heads.
So the mapping becomes:
Q01 Q02 Q03 Q04 → KV01
Q05 Q06 Q07 Q08 → KV02
Q09 Q10 Q11 Q12 → KV03
Q13 Q14 Q15 Q16 → KV04
Q17 Q18 Q19 Q20 → KV05
Q21 Q22 Q23 Q24 → KV06
Q25 Q26 Q27 Q28 → KV07
Q29 Q30 Q31 Q32 → KV08
Or, in a more compact diagram:
Q heads:
[Q01 Q02 Q03 Q04] [Q05 Q06 Q07 Q08] [Q09 Q10 Q11 Q12] [Q13 Q14 Q15 Q16]
[Q17 Q18 Q19 Q20] [Q21 Q22 Q23 Q24] [Q25 Q26 Q27 Q28] [Q29 Q30 Q31 Q32]
KV heads:
KV01 KV02 KV03 KV04
KV05 KV06 KV07 KV08
Mathematically, you can write the grouping map as:
\[ \phi(h) = \left\lfloor \frac{h}{g} \right\rfloor \qquad \text{where } g = \frac{H}{H_{kv}} \]
If indexing starts from 0:
- \(Q_0, Q_1, Q_2, Q_3\) use \(KV_0\)
- \(Q_4, Q_5, Q_6, Q_7\) use \(KV_1\)
- …
- \(Q_{28}, Q_{29}, Q_{30}, Q_{31}\) use \(KV_7\)
Then the KV-cache reduction factor relative to MHA is approximately:
\[ \frac{H}{H_{kv}} = \frac{32}{8} = 4 \]
So the KV cache becomes about \(1/4\) of the original.
MQA: all 32 Q heads share 1 KV head
This is the most extreme form of sharing:
\[ H = 32,\qquad H_{kv} = 1 \]
So all 32 query heads share the same K/V:
Q01 Q02 Q03 Q04 Q05 Q06 Q07 Q08
Q09 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24
Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32
↓
KV01
That is:
\[ Q_1, Q_2, \dots, Q_{32} \rightarrow KV_1 \]
Now the KV-cache reduction factor relative to MHA is approximately:
\[ \frac{H}{H_{kv}} = \frac{32}{1} = 32 \]
So the KV cache can shrink to about \(1/32\) of the original.
Put the three together
MHA:
Q01 → KV01
Q02 → KV02
...
Q32 → KV32
GQA (32Q, 8KV):
Q01 Q02 Q03 Q04 → KV01
Q05 Q06 Q07 Q08 → KV02
...
Q29 Q30 Q31 Q32 → KV08
MQA:
Q01 Q02 Q03 ... Q32 → KV01
If you look only at “how many copies of historical state are stored,” the difference becomes obvious:
- MHA: every perspective stores its own history
- GQA: every group of perspectives shares one history
- MQA: all perspectives share one history
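The three layouts above differ only in \(H_{kv}\), so both the per-token KV footprint and the grouping map \(\phi(h)\) can be computed directly. The sketch below is illustrative; the model shape (32 heads, head dim 128, 32 layers, fp16) is a hypothetical example, not a specific model.

```python
# Sketch (hypothetical model shape): per-token KV-cache size under MHA, GQA,
# and MQA, plus the grouping map phi(h) = floor(h / g) from the text.

def kv_bytes_per_token(h_kv, head_dim=128, layers=32, bytes_per_elem=2):
    # K and V each store h_kv * head_dim elements per layer, hence the 2x.
    return 2 * h_kv * head_dim * layers * bytes_per_elem

H = 32
for name, h_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    g = H // h_kv                               # query heads per KV head
    phi = {h: h // g for h in range(H)}         # Q head -> KV head
    print(name, kv_bytes_per_token(h_kv), "bytes/token,",
          f"Q0..Q{g - 1} -> KV0")
```

With these assumed shapes, MHA costs 512 KB of KV per token, GQA with 8 KV heads costs a quarter of that, and MQA a thirty-second, matching the \(H/H_{kv}\) reduction factors above.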
3.4 Why Naive KV Allocation Wastes GPU Memory
If you allocate each request’s KV cache as one large contiguous memory block, you immediately run into three problems:
- External fragmentation: long and short requests enter and leave alternately, so contiguous space becomes harder and harder to reuse;
- Internal waste: sequence length rarely aligns exactly with the allocation granularity;
- Difficult sharing: in scenarios such as beam search, parallel sampling, and prefix reuse, identical prefixes are hard to share across requests.
This directly hurts the maximum online batch size, usable GPU memory, and tail latency. So inference systems started borrowing the idea of paged allocation.
3.5 PagedAttention: Manage KV Cache Like Virtual Memory
PagedAttention does not change the mathematics of attention. It changes how KV cache is laid out in GPU memory. The problem it solves is GPU fragmentation.
Concretely, instead of reserving one large contiguous chunk of GPU memory for each request, it slices KV cache into many fixed-size blocks/pages, allocates them on demand, and then uses a block table to map “logically contiguous history” to “physically non-contiguous GPU blocks.”
It is like managing a warehouse. If every new customer gets one entire contiguous row of shelves reserved for them, many customers leave those shelves half empty, the warehouse gets chopped into awkward pieces, and then a large new customer cannot fit in at all. PagedAttention fixes this in a smarter way: split KV cache into many small fixed-size slots; whoever needs a few takes a few; they do not need to be adjacent; and a table records which slot holds which part of a sequence’s history.
This idea is borrowed directly from operating-system paged memory management. It brings several important benefits:
- lower fragmentation, with GPU-memory waste pushed close to the minimum;
- dynamic growth and reclamation of requests, with no need for large-scale relocation;
- easier prefix sharing and copy-on-write;
- better fit for the dynamic arrival and departure pattern of continuous batching.
This is easy to confuse with FlashAttention. They are not substitutes:
- PagedAttention solves the memory layout and lifecycle management of KV cache. How do you store it with less fragmentation and less wasted space?
- FlashAttention solves the IO efficiency of the attention kernel. How do you move less data while computing?
A modern serving stack will typically use both.
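The block-table idea can be shown in a few lines. This is a hypothetical toy, not vLLM's actual data structures; the block size of 16 is just an example value.

```python
# Sketch (hypothetical, not vLLM's actual API): a minimal block table that
# maps a sequence's logically contiguous token positions to physically
# non-contiguous fixed-size KV blocks, the core idea behind PagedAttention.

BLOCK_SIZE = 16

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position into (block id, offset).
        return self.table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

seq = BlockTable(free_blocks=range(100))
for _ in range(40):                     # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(len(seq.table), seq.physical_slot(17))
```

Note that per-sequence waste is bounded by less than one block, and freeing a finished sequence just returns its block ids to the pool, with no relocation.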
3.6 KV Quantization: Trade a Little Numeric Precision for More Context and Batch
KV cache can also be quantized. The most intuitive paths are:
- FP16 → INT8: theoretically close to 2× space savings;
- FP16 → 4-bit: theoretically close to 4× space savings.
For the system, this has two layers of value:
- it allows longer context or larger batch sizes;
- it reduces the number of bytes read from KV during decode, which relieves bandwidth pressure.
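The FP16 → INT8 path can be sketched as a simple round trip. This assumes plain per-tensor symmetric quantization for clarity; production systems usually quantize per channel or per block, with finer-grained scales.

```python
# Sketch (assumption: simple per-tensor symmetric quantization; real systems
# usually quantize per channel or per block): an FP16 -> INT8 KV round trip.
import numpy as np

def quantize_int8(x):
    scale = float(np.abs(x).max()) / 127.0 or 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(4, 128)).astype(np.float32)   # toy K or V slice
q, s = quantize_int8(kv)
err = np.abs(dequantize(q, s) - kv).max()
# Half the bytes of fp16, at the price of a small bounded rounding error.
print(q.nbytes, kv.astype(np.float16).nbytes, float(err))
```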
But the cost is real. Degradation often does not show up first in casual chat quality. It shows up first in settings like:
- long-document QA with long-range references;
- Needle-in-a-Haystack;
- code and math, where fine-grained symbols matter more;
- cases requiring precise entity binding and cross-paragraph localization.
So although weight quantization and KV quantization are both called “quantization,” they solve different problems. Weight quantization first solves “the model does not fit.” KV quantization first solves “context is too expensive.”
Needle-in-a-Haystack: you hide one very important sentence inside a long body of content and see whether the model can still find it exactly. It is especially useful for checking whether a model working over long documents, long conversations, or RAG context has really seen the key point.
3.7 In Extreme Long-Context Settings, A Bigger Window Alone Is Usually Not the Answer
Roughly how long are context windows now?
As of March 2026, what matters more in engineering is not “how long a model claims to support,” but “how long a default window is actually controllable.”
- Chat, assistants, ordinary tool calling: the common comfort zone is still 8K–32K. That is usually enough to cover the last few turns of dialogue and a small amount of tool output, while keeping TTFT, GPU-memory use, and per-request cost under control.
- Long-document QA, RAG, codebase analysis: 32K–128K is common. You can go higher, but KV cache, tail latency, and overall serving cost rise noticeably.
- Ultra-long-context settings: frontier models now reach 200K, 400K, and even 1M-class context windows. Anthropic’s Claude documentation explicitly mentions 200K and 1M-scale context capability, Google Gemini’s official documentation also treats 1M long context as a core feature, and OpenAI currently has a model with a 1.05M context window.
But note this: “it fits” does not mean “you should fill it by default.” Once the window gets very large, the problem is often no longer whether the model can see the content. It is whether the system can serve it stably, cheaply, and reliably. So in real systems, long context usually has to be designed together with better context engineering, retrieval, summarization, and external memory structures, not by simply pushing the window larger and larger.
The common classes of KV compression or retention strategy are roughly these:
- Quantization: the most general approach, and the easiest to integrate into existing stacks;
- Eviction / selective retention: keep only heavy hitters, the recent window, or tokens predicted to matter more;
- Sliding window / sink tokens: good for continuous streaming settings;
- Hierarchical memory: move part of the history out of the immediate context and retrieve it later through retrieval or external memory.
In practice, long-context problems are often solved not by “even larger raw context,” but by “better memory structure.” If your task truly requires precise cross-section citation, then RAG, hierarchical summarization, and retrieval compression are usually more controllable than brute-forcing the window.
3.8 Debugging Order When a Long Prompt OOMs
When a long prompt OOMs immediately, a more mature triage order is:
- reduce the input first: deduplicate, rerank, compress, summarize, pack context;
- confirm whether the model uses a GQA/MQA architecture;
- apply weight quantization first, to leave room for KV;
- then consider KV quantization;
- use PagedAttention to reduce fragmentation;
- enable prefix cache to avoid repeated prefill;
- introduce chunked prefill to avoid having one long prompt consume the entire token budget at once;
- if that is still not enough, consider sliding windows, selective retention, offload, or multi-GPU parallelism;
- only at the end ask: should this workload use a smaller model?
Behind this order is a very engineering-minded judgment: solve waste first, then solve shortage, and only then buy a bigger GPU.
4 Batching, Scheduling, and Request Orchestration
This section answers several key questions first:
- Why is batching the core lever of online inference?
- How does continuous batching differ from static batching?
- Why are admission control and chunked prefill often more important than “a larger batch”?
- What kinds of workloads do prefix caching, speculative decoding, and guided decoding each fit?
Online inference is not as simple as “throw requests at a GPU.” It is an ongoing scheduling problem. You are not optimizing a single request. You are optimizing how all requests share the same accelerator.
4.1 Batching
The essence of batching is not just “put a few requests together.” It is to amortize the fixed cost of each forward pass—weight reads, kernel launches, scheduler overhead—across more tokens. For autoregressive LLMs this matters even more: generation is iterative, one token at a time. If you serve only one request at a time, the GPU is easily underfilled because the work is too fragmented. Once you run multiple requests in parallel, the primary gain is usually not that each individual request gets faster, but that overall tokens/s rises, the GPU stays busier, and cost per token falls.
That is why batching almost always improves service throughput immediately:
- higher GPU utilization;
- weight reads and part of the kernel overhead are shared across more tokens;
- under the same hardware, the system can steadily produce more tokens/s.
But batching is never free money. For a single user, increasing batch size usually means two things: first, the request may wait while the system forms a batch; second, each later decode step must share the same scheduling round with more sequences. So what online systems actually seek is never “make the batch as large as possible.” It is to find the most profitable balance between throughput and perceived latency for the business.
4.2 Static Batching vs. Continuous Batching
Static batching looks more like traditional deep-learning inference: collect a batch of requests, then run that batch all the way through together. That feels natural in offline workloads, but in LLM serving it quickly exposes problems. The reason is simple: different requests have different prompt lengths, and even more different output lengths. Some sequences finish quickly. Others keep generating. If the system must wait until “the entire batch is done” before releasing the slots, then requests that finished early get held hostage by requests that finish late, and new arrivals can only keep waiting in queue.
The key change in continuous batching is to reduce the scheduling granularity from the request to the iteration. The system does not wait for the full batch to finish. After each decode iteration, it checks: who is done, who is still running, and which new requests can be admitted immediately. We call this iteration-level scheduling. More plainly: completed requests are removed immediately, new ones fill the gaps immediately, and together with ragged batching and chunked prefill, the GPU is kept in a state where “there is always work to do.”
Continuous batching is not just “a more dynamic batch.” It turns batching from a one-shot action into a continuously running online scheduler. The goal is not to assemble a pretty batch. The goal is to keep GPU idle time as close to zero as possible when facing real traffic made of requests with different lengths, arriving and finishing at arbitrary times.
Ragged batching: batch variable-length requests at their true lengths
The problem ragged batching solves is not “can requests be put into the same batch?” It is once they are in the same batch, do you still need a lot of wasted padding just to align shapes?
Suppose the request lengths in one batch are \(L_1, L_2, \dots, L_B\).
In a traditional padded batch, the system usually pads all sequences to the same maximum length:
\[ L_{\max} = \max_i L_i \]
So the total processing scale along the length dimension becomes roughly:
\[ B \cdot L_{\max} \]
But the number of tokens that actually carry meaning is only:
\[ \sum_{i=1}^{B} L_i \]
The gap between them,
\[ B \cdot L_{\max} - \sum_{i=1}^{B} L_i \]
is, in essence, the waste introduced by padding. The larger the spread in sequence lengths, the worse this waste gets.
The idea of ragged batching is: do not pad short sequences to match long ones just to “make things rectangular.” Instead, pack them at their true lengths, and separately record boundaries, offsets, or masks so the system knows which segment belongs to which request.
For example, if three requests have lengths 3, 4, and 5, a standard padded batch often processes a total length of 15; ragged batching processes only the 12 real tokens and uses boundary information to make sure different requests do not interfere with one another.
From a systems point of view, the benefit is direct: the GPU spends more time on real tokens and less time on empty slots.
That is why ragged batching often appears together with continuous batching. Continuous batching faces online requests that keep arriving and finishing and whose lengths are all different. If you still insist on forcing them into a neat rectangle, padding overhead will quickly eat into the budget that could otherwise serve real tokens. Ragged batching lets the system keep the gains of batching while reducing that alignment cost.
Of course, it is not a free lunch. Once you remove padding, you must maintain more carefully:
- the start and end position of each request inside the packed tensor;
- the corresponding attention mask or boundary index;
- and the mapping between different sequences and their states inside the batch.
So ragged batching is a classic engineering trade-off: use more complex batch descriptions and scheduling logic in exchange for less wasted compute and better throughput efficiency.
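The bookkeeping described above is small. The sketch below is a hypothetical helper that packs requests at their true lengths, records the boundary offsets, and reports the padding waste \(B \cdot L_{\max} - \sum_i L_i\) that a rectangular batch would have paid.

```python
# Sketch (hypothetical helper): pack variable-length requests at their true
# lengths and record offsets, instead of padding to the batch maximum.

def pack_ragged(lengths):
    offsets = [0]
    for L in lengths:
        offsets.append(offsets[-1] + L)   # request i spans offsets[i:i+1]
    padded = len(lengths) * max(lengths)  # B * L_max tokens if padded
    real = offsets[-1]                    # sum of L_i real tokens
    return offsets, padded - real         # boundaries, padding waste

offsets, waste = pack_ragged([3, 4, 5])
print(offsets, waste)   # [0, 3, 7, 12] 3
```

This reproduces the example in the text: lengths 3, 4, 5 would cost 15 padded slots, but only 12 real tokens are processed, and the offsets tell the kernel where each request's segment begins and ends.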
4.3 Admission Control
What continuous batching improves is utilization, not “the more requests the better.” Once the system gets close to the limits of GPU memory or token budget, the most dangerous failure mode is often not that the average gets a little slower. It is that the system enters an unstable state: queueing time grows rapidly, a small number of long prefills seize most of the budget, and if KV cache space runs short, preemption and recompute get triggered, driving end-to-end latency up even further. In vLLM, when KV cache space is insufficient for the current batch of requests, the system will preempt; and frequent preemption / recomputation directly hurts end-to-end performance.
So admission control is really answering one very practical question: the system is already busy—should this new request be let in right now?
Mature systems do not leave that question to luck. They make it an explicit policy. The most common control levers include:
- limiting the maximum number of active sequences;
- limiting total batched tokens;
- routing ultra-long prompts separately so they do not compete with ordinary short requests for the same budget;
- doing priority isolation under multi-tenant or multi-SLA settings.
vLLM’s tuning guidance says that if preemption happens frequently, you can increase the available KV space, or directly lower max_num_seqs and max_num_batched_tokens; the essence is to reduce how much the system tries to swallow at once. In production, learning to “accept a bit less” under pressure is usually healthier than letting all requests slow down, or start jittering, together.
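The levers listed above amount to a small policy function. The sketch below is a hypothetical policy, loosely inspired by limits like vLLM's max_num_seqs and max_num_batched_tokens; the thresholds are example values, not recommendations.

```python
# Sketch (hypothetical policy; thresholds are illustrative): decide whether
# to admit a new request given the current load and explicit budgets.

def admit(active_seqs, batched_tokens, req_tokens,
          max_seqs=256, max_tokens=8192, long_prompt=4096):
    if req_tokens > long_prompt:
        return "route-to-long-queue"   # keep ultra-long prompts off the
                                       # budget that short requests share
    if active_seqs + 1 > max_seqs:
        return "queue"                 # too many active sequences
    if batched_tokens + req_tokens > max_tokens:
        return "queue"                 # token budget would overflow
    return "admit"

print(admit(active_seqs=10, batched_tokens=7000, req_tokens=500))   # admit
print(admit(active_seqs=10, batched_tokens=8000, req_tokens=500))   # queue
print(admit(active_seqs=10, batched_tokens=0, req_tokens=5000))     # route
```

The point is not these particular numbers but that the refusal condition is explicit, so overload produces queueing rather than preemption storms.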
4.4 Chunked Prefill: Fix the Convoy Effect
The worst thing about a long prompt is not just that it is slow by itself. It tends to slow down everyone else too. If one ultra-long prefill consumes a large chunk of the token budget in one shot, the short requests behind it pile up like cars stuck behind a slow truck—this is the classic convoy effect.
Chunked prefill does not make long requests cheaper. It slices them into smaller prefill chunks so the scheduler can interleave those chunks with decode requests. The vLLM documentation is explicit about this mechanism: the system prioritizes decode; then, if max_num_batched_tokens still has remaining budget, it fills the remainder with prefills; if some prefill is too long to fit the current budget, the system automatically chunks it. As a result, an ultra-long prompt no longer monopolizes an entire scheduling round in one bite. It gets broken into smaller pieces that can be interleaved with other requests.
This scheduling style usually brings three kinds of benefit:
- better ITL / tail latency: decode gets priority, so short requests are less likely to be pinned behind long prefills;
- higher GPU utilization: compute-bound prefill and memory-bound decode mix more naturally in the same batch;
- less system-wide drag: a few ultra-long requests are less likely to block the entire request pool.
Of course, it has a cost. If chunks are too small, scheduling and management overhead rises; if chunks are too large, you drift back toward the old problem where “one long prefill blocks the road.” vLLM’s tuning guide is direct about that trade-off too: smaller max_num_batched_tokens usually helps ITL, while larger values help TTFT and overall throughput. In other words, chunked prefill is not an unconditional speedup switch. It is a scheduling tool for trading off fairness, throughput, and tail latency.
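One scheduling step of this style can be sketched as follows. This is a hypothetical simplification of the behavior the text describes, in which each decoding sequence needs one token of budget and prefills fill, and are chunked to fit, the remainder.

```python
# Sketch (hypothetical scheduler step, modeled on the behavior described in
# the text): decodes get budget first; remaining budget is filled with
# prefill chunks, and an oversized prefill is automatically chunked.

def schedule_step(decode_seqs, prefill_queue, max_num_batched_tokens=512):
    budget = max_num_batched_tokens - decode_seqs  # 1 token per decode seq
    chunks = []
    for req_id, remaining in prefill_queue:
        if budget <= 0:
            break
        take = min(remaining, budget)              # chunk if it doesn't fit
        chunks.append((req_id, take))
        budget -= take
    return chunks

# 100 decoding sequences plus a 2000-token prefill: the long prompt gets
# only the leftover 412 tokens this round instead of a whole round to itself.
print(schedule_step(100, [("long", 2000), ("short", 50)]))
```

Because decodes are budgeted first, the 100 active requests keep producing tokens every round while the long prefill is drained a chunk at a time.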
4.5 Prefix Caching: One of the Highest-ROI Online Speedups
The core idea of prefix caching is simple: when multiple requests share the same prefix, reuse the corresponding KV cache directly instead of recomputing prefill.
It is especially suited to these scenarios:
- the same system prompt;
- the same tool instructions and JSON schema;
- the same document prefix or knowledge package;
- templated RAG;
- long unchanged history in multi-turn sessions.
The benefits of prefix caching are usually direct: TTFT goes down, GPU utilization goes up, and cost goes down. But this is not purely a performance feature. It also touches isolation and security. In multi-tenant settings, if cache hits and misses are observable, they can become a side channel. So common production practices include:
- namespacing by tenant or trust group;
- introducing a cache salt for highly sensitive settings;
- disabling cross-tenant reuse for some workloads.
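Both the reuse and the isolation concern can be seen in a small sketch. This is a hypothetical structure; real systems typically hash fixed-size token blocks rather than whole prefixes, but the lookup logic is the same in spirit.

```python
# Sketch (hypothetical structure; real systems hash fixed-size token blocks):
# reuse KV for the longest cached prefix, and namespace keys by tenant so
# cache hits cannot leak across trust boundaries.
import hashlib

cache = {}   # (tenant, prefix_hash) -> KV handle

def prefix_key(tenant, tokens):
    h = hashlib.sha256(repr((tenant, tuple(tokens))).encode()).hexdigest()
    return (tenant, h)

def longest_cached_prefix(tenant, tokens):
    for n in range(len(tokens), 0, -1):       # longest match first
        if prefix_key(tenant, tokens[:n]) in cache:
            return n                          # prefill only tokens[n:]
    return 0

sys_prompt = list(range(100))                 # stands in for a system prompt
cache[prefix_key("tenant-a", sys_prompt)] = "kv-handle"
print(longest_cached_prefix("tenant-a", sys_prompt + [7, 8]))   # 100
print(longest_cached_prefix("tenant-b", sys_prompt + [7, 8]))   # 0
```

Tenant B misses even though the token content matches, which is exactly the namespacing behavior that closes the cross-tenant side channel.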
4.6 Speculative Decoding: Spend Extra Compute to Use Fewer Large-Model Steps
The basic idea of speculative decoding is:
- first let a faster small model, or a lightweight head, guess several tokens;
- then let the target large model verify them in one shot;
- if the draft guess is correct, the large model effectively skips several true decode steps.
Its engineering value is not “the answer gets better.” It is this: without changing the target distribution, reduce the number of expensive serial decode steps performed by the large model.
But speculative decoding only pays off when the following are true:
- the draft is cheap;
- the acceptance rate is high enough;
- the target model is genuinely decode-bound;
- the cost of verification does not swallow the savings.
So its winning condition can be written as:
\[ \text{draft cost} + \text{verify cost} < \text{saved target decode cost} \]
The failure modes are equally clear: low draft acceptance, a target GPU that is already saturated in another way, or a traffic pattern that does not fit speculative decoding will all make it slower instead.
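The draft-then-verify skeleton, in its simplest greedy form, looks like this. The `draft` and `target` callables are hypothetical stand-ins for real models; real implementations verify against the sampling distribution rather than greedy tokens.

```python
# Sketch (hypothetical toy models; greedy acceptance only): the draft/verify
# skeleton of speculative decoding. One target pass can yield several tokens.

def speculate(prompt, draft, target, k=4):
    guesses = draft(prompt, k)             # k cheap draft tokens
    verified = target(prompt, guesses)     # one target pass scores all k
    accepted = []
    for g, t in zip(guesses, verified):
        if g != t:                         # first mismatch: keep the
            accepted.append(t)             # target's token and stop
            break
        accepted.append(g)
    return accepted                        # always >= 1 token per target call

# Toy run: the draft gets 2 of 4 tokens right, so one expensive target call
# still produces 3 tokens instead of 1.
draft = lambda p, k: ["a", "b", "x", "y"]
target = lambda p, gs: ["a", "b", "c", "d"]
print(speculate([], draft, target))        # ['a', 'b', 'c']
```

The winning condition from the text is visible here: the scheme pays off only if `draft` is cheap and the prefix-match (acceptance) rate stays high.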
4.7 Guided Decoding: Reduce Pipeline Retries, Not Just “Make JSON Look Better”
When model output must satisfy JSON, tool parameters, SQL, or some fixed schema, prompt-only control is often not stable enough. The idea of guided decoding / constrained decoding is to restrict the set of legal sampled tokens directly to those allowed by a grammar or an FSM (finite-state machine).
Its systems significance is large:
- it reduces invalid output and application-layer parse failures;
- it cuts the loop of “generate first, validate later, retry again”;
- it makes downstream logic in tool-calling and agent systems more reliable.
Of course, it is not free. Complex grammars bring extra state tracking and token-filtering overhead. Under high QPS, if grammar compilation or state maintenance is not efficient enough, they will eat part of the performance gain. What is truly worth doing in engineering is not “put the strongest possible constraint on every output,” but apply constraints where invalid output is most likely to break the pipeline.
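The mechanism itself is simple to sketch: at each step, mask the vocabulary down to the tokens the grammar's current FSM state allows. The grammar and vocabulary below are hypothetical toys.

```python
# Sketch (hypothetical toy grammar and vocabulary): restrict sampling to
# tokens legal in the current FSM state, here a tiny "JSON-ish" machine.

FSM = {
    "start": {"{": "key"},
    "key":   {'"k"': "colon"},
    "colon": {":": "value"},
    "value": {"1": "end", '"v"': "end"},
    "end":   {"}": "done"},
}

def allowed_tokens(state):
    return set(FSM.get(state, {}))

def constrained_greedy(logits, state):
    # Pick the highest-scoring token among those the grammar allows,
    # ignoring everything else the model would prefer.
    legal = allowed_tokens(state)
    tok = max((t for t in logits if t in legal), key=logits.get)
    return tok, FSM[state][tok]

logits = {"{": 0.1, "hello": 0.9, '"k"': 0.3}   # the model prefers "hello"
tok, state = constrained_greedy(logits, "start")
print(tok, state)   # { key
```

Even though the model scores "hello" highest, only "{" is legal in the start state, so the output is guaranteed to stay inside the schema. The per-step cost is exactly the token-filtering overhead the text warns about.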
5 GPU Memory Management
This section answers several key questions first:
- What objects consume most GPU memory during inference?
- Why must large-model services treat memory as a budget to be planned ahead, rather than fixing runtime errors after they happen?
- How do memory pooling, paged blocks, and dynamic batching work together?
- When GPU memory approaches the limit, what should you compress first, and how do you preserve headroom?
Inference systems do not fail by “running out of compute” first. They fail by “running out of room” first. This section is not about model math. It is about how GPU memory gets filled up, why it suddenly blows up, and how engineers prevent that in advance. If memory management is done poorly, throughput, latency, and reliability all break together.
5.1 GPU Memory Is Mostly Spent on Three Things: Weights, KV, and Temporary Workspace
You can think of GPU memory as an expensive, fast, but limited workbench. During model inference, many things must sit on that bench:
- Model weights: determine whether the model can stay resident on the device;
- KV cache: grows dynamically with active sequence count and context length;
- Temporary workspace and runtime overhead: including activations, attention workspace, allocator overhead, CUDA graphs, communication buffers, LoRA adapters, sampling buffers, and so on.
How much has to be placed on that bench, and will it suddenly stop fitting halfway through? Answering those questions, and keeping them answered under load, is GPU memory management. Under short prompts and small batches, weights are often the largest component; under long context or large batches, KV quickly becomes the main character. What makes online systems suffer most is often the third category: it is less obvious than the first two, but under peak traffic it often turns “theoretically fits” into “actually OOMs.”
5.2 Why Large-Model Services Must Do Memory Budgeting in Advance
You cannot manage GPU memory by “just run it and see what happens.” A GPU-memory failure is not like an ordinary software bug. An ordinary bug might affect one request. Bad GPU-memory management directly causes OOMs, smaller batches, lower throughput, higher TTFT, worse p99 tail latency, and a jittery service overall. Mature teams do not leave “is there enough GPU memory?” to online trial and error. They budget before launch:
- how much is taken by weights;
- how much KV is needed under the target maximum batch, maximum context, and maximum output length;
- how much workspace and burst room must be reserved;
- whether peak usage is further raised by prefix cache, multi-LoRA, CUDA graphs, or communication buffers.
The point of memory budgeting is not just to avoid OOM. It is to give the scheduler a clear boundary. Without a budget, the scheduler cannot know when to reject a request, when to degrade service, or when to split the queue.
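A pre-launch budget of this kind fits in a few lines. All the numbers below (a 7B-class fp16 model, 8 KV heads, an 80 GiB card, a 4 GiB workspace guess) are illustrative assumptions, not measurements.

```python
# Sketch (all sizes are illustrative assumptions for a hypothetical
# 7B-class deployment): a pre-launch GPU-memory budget.

GiB = 1024 ** 3

def kv_bytes(seqs, ctx, layers=32, h_kv=8, head_dim=128, elem=2):
    # K and V per token per layer, times target batch and context.
    return seqs * ctx * layers * 2 * h_kv * head_dim * elem

weights = 14 * GiB                  # ~7B params in fp16 (assumed)
kv = kv_bytes(seqs=64, ctx=4096)    # target max batch x max context
workspace = 4 * GiB                 # activations, graphs, buffers (guess)
headroom = 0.10 * 80 * GiB          # keep ~10% of an 80 GiB card free

total = weights + kv + workspace + headroom
print(f"kv={kv / GiB:.1f} GiB, total={total / GiB:.1f} GiB, "
      f"fits={total < 80 * GiB}")
```

Under these assumptions KV alone is 32 GiB, more than double the weights, which is exactly why the budget, not the weight file, is what bounds max batch and max context.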
5.3 Pooling, Paged Blocks, and Fragmentation Control
What GPU memory management fears most is not “the total is insufficient.” It is “the total looks sufficient, but allocation still fails because of fragmentation and fluctuation.” That is why modern inference stacks almost always use:
- preallocated or pooled memory regions;
- block-level or page-level KV allocation;
- block sizes that are as stable as possible;
- as little large-scale movement and reallocation as possible.
Memory pooling reserves a large chunk of memory first, then reuses it repeatedly within that pool, minimizing temporary allocate/free/reallocate behavior.
The idea behind paged blocks / paged attention is: do not give every request one huge contiguous KV region. Instead, split KV into many fixed-size small blocks. Whichever request needs some takes some blocks and stitches them together. The analogy is standard storage bins in a warehouse: you no longer build a custom oddly shaped cabinet for every customer; you use standard boxes for all. The benefits are: easier reuse, easier reclamation, less fragmentation when long and short requests mix, and easier prefix sharing and copy-on-write.
The value of this mechanism is that it turns “frequent allocation and release” into “block-level reuse inside a fixed pool.” For online services where requests enter and leave dynamically and context lengths vary widely, that usually matters more to system stability than any single kernel optimization.
5.4 Why Dynamic Batching Needs Headroom
The great temptation of dynamic batching is to “fill the GPU to the brim.” But stable systems do not usually push GPU memory to 99%. The reason is simple:
- KV cache grows dynamically;
- new requests can arrive suddenly;
- long outputs or abnormally long prompts can break the average-case assumption;
- speculative decoding, guided decoding, multi-LoRA, and P/D migration can all require extra workspace.
So in practice, engineers deliberately keep memory headroom. This is not waste. It is room for jitter. The root cause of many p99 tail-latency problems is not “the GPU is too weak.” It is that the system packed it too tightly, so any fluctuation triggers preemption, swap, or OOM.
5.5 When GPU Memory Is Not Enough, What Should Be Prioritized?
An empirically effective order is:
- compress the input first: remove useless context; this usually gives the best return;
- then compress weights: weight-only quantization is often the first thing to land;
- then compress KV: especially in long-context workloads;
- if still insufficient, consider offload, disaggregated deployment, or multi-GPU parallelism;
- only then move to a larger GPU or a smaller model.
This order reflects an engineering principle: reduce waste first, then change the deployment shape, and only then make the more expensive hardware decision.
6 Kernel and GPU Optimization
This section answers several key questions first:
- Why is optimizing the attention kernel often worth more than “tuning batch size a little more”?
- What problems do FlashAttention, PagedAttention, and FlashDecoding each solve?
- Why is kernel fusion especially visible during decode?
- Why can tensor parallelism both scale models and introduce communication cost?
- How should inference frameworks be chosen—by what standards, rather than by hype?
Model structure by itself does not automatically become a high-performance service. Take the same model, swap in a different kernel, a different KV-management strategy, or a different parallel sharding scheme, and you can get completely different throughput, tail latency, and GPU-memory utilization. Paper-level architectural advantages only become real when the runtime actually realizes them.
Many inference optimizations are not changing whether the model can reason. They are changing how data flows across the GPU.
The most common sources of waste are threefold:
- compute that could have stayed on-chip, but instead keeps bouncing to HBM;
- several small operations that should have stayed fused, but are split into many kernels;
- work that one GPU could have handled, but after being sharded across many GPUs spends a large fraction of time communicating instead.
6.1 FlashAttention: Do Not Change Attention, Move Less Data
FlashAttention is most easily misunderstood as “approximate attention.” It is not. Its core is IO-aware exact attention: the mathematical result of attention does not change; what changes is the GPU implementation. Ordinary attention often materializes a large intermediate matrix, writes it back to HBM, and then reads it back again for the next step. FlashAttention instead organizes computation in tiles and tries to finish as much as possible in faster on-chip SRAM / shared memory, reducing traffic between HBM and on-chip cache.
FlashAttention mainly reduces HBM traffic. That is why it is especially valuable for long-prompt prefill. During prefill, attention work is large and intermediate state is large. If you can avoid moving a few large chunks of data, speed improves noticeably, and attention is more likely to be pulled from “slowed down by IO” back toward “closer to compute-bound.” Because it is still exact attention, it is also easier to land in production than many approximate methods.
You can think of FlashAttention like this: it is not a smarter attention mechanism. It is replacing “spread a giant sheet of scratch paper over the whole desk before you compute” with “compute in tiles, merge as you go, and avoid moving full pages back and forth.”
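The "compute in tiles, merge as you go" step can be made concrete with an online softmax. The sketch below is illustrative only (single head, numpy, no actual SRAM tiling); the point is that the full score matrix is never materialized, yet the result is exact.

```python
# Sketch (illustrative, single query, standard 1/sqrt(d) scaling): tiled
# attention with an online softmax. The full score vector over all keys is
# never held at once, yet the output is exact, which is the FlashAttention
# idea in miniature.
import numpy as np

def tiled_attention(q, K, V, tile=4):
    d = q.shape[-1]
    m, s, acc = -np.inf, 0.0, np.zeros_like(V[0])   # running max/sum/output
    for i in range(0, len(K), tile):
        scores = (K[i:i + tile] @ q) / np.sqrt(d)   # one tile of scores
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)              # rescale old partials
        p = np.exp(scores - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ V[i:i + tile]
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q = rng.normal(size=(64,))
K, V = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
w = np.exp(K @ q / 8 - (K @ q / 8).max())           # reference softmax
ref = (w / w.sum()) @ V
print(np.allclose(tiled_attention(q, K, V), ref))   # True
```

The running max and sum are exactly the extra scalars the real kernel keeps per tile so that partial results computed in on-chip memory can be merged without a second pass over HBM.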
6.2 Kernel Fusion: Take Steps That Naturally Belong Together and Do Them at Once
Many model computations are naturally consecutive:
- dequant immediately followed by matmul;
- norm immediately followed by linear;
- RoPE immediately followed by QKV projection;
- matmul immediately followed by activation or residual.
If each step writes its intermediate result back to HBM before the next step reads it again, you pay two extra costs: first, an extra GPU-memory read/write, and second, an extra kernel launch. The TensorRT-LLM documentation states the value of fusion clearly: fuse multiple operations into one kernel to reduce memory movement and the overhead of launching multiple GPU kernels.
This helps both prefill and decode, but it is especially visible in decode. The reason is simple: each decode step generates very few new tokens. The compute volume is small to begin with. In that regime, extra overhead from what look like “just a few small kernels” gets amplified along the single-token path. Put differently: in a large job, a few extra steps do not stand out; in a tiny job, they become expensive. Decode is the latter case.
6.2.1 PagedAttention and FlashDecoding: One Manages Storage, the Other Manages Computation
The two names sound similar, but they do not solve the same problem.
PagedAttention manages how KV cache is laid out in GPU memory. It borrows the paging idea from operating systems: each sequence no longer requires its K/V to occupy one large contiguous region of memory. Instead, KV cache is split into fixed-size blocks/pages, and a mapping table maps logically contiguous history to physically non-contiguous GPU blocks. The point is not primarily “attention becomes faster to compute.” The point is less GPU-memory waste, less fragmentation, and easier KV sharing and reuse. The vLLM paper summarizes it this way: K/V can live in non-contiguous paged memory, which drives KV-cache waste close to zero and raises throughput under the same latency by 2–4×.
FlashDecoding, by contrast, manages how the attention kernel is computed during decode. The awkward part of decode is that each step typically has only one new query token. So many attention kernels that are ideal for large-matrix parallelism suddenly become too narrow during decode, and GPU utilization drops. The core idea of Flash-Decoding is to add a parallel dimension when the query length is almost 1: split the sequence dimension of historical K/V so different SMs can process different segments of history at the same time, then merge the results correctly. CRFM describes it directly: during decode, it parallelizes along the keys/values sequence-length dimension, which leads to better GPU utilization at small batch sizes and long context, and can deliver significant speedups on very long sequences.
So the easiest distinction to remember is:
- PagedAttention: solves KV-cache storage and allocation;
- FlashDecoding: solves parallelism and utilization for decode attention.
If your setting is long context + small-to-medium batch + decode clearly dragged down by bandwidth and historical access, then these specialized decode kernels are often highly valuable. If your main problem is KV does not fit, fragmentation is severe, and active request count cannot rise, then memory-management optimizations such as PagedAttention usually deserve higher priority.
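The split-and-merge trick behind Flash-Decoding can also be sketched exactly. This is illustrative numpy, not a GPU kernel: history K/V is split along the sequence dimension, each chunk computes a partial softmax attention, and the partials merge exactly via a log-sum-exp style correction.

```python
# Sketch (illustrative): the Flash-Decoding idea. Each K/V chunk plays the
# role of one SM's work; the merge step is exact, not approximate.
import numpy as np

def partial(q, K, V):
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    p = np.exp(scores - m)
    return m, p.sum(), p @ V              # (local max, local sum, local acc)

def merged_attention(q, K, V, splits=4):
    parts = [partial(q, Kc, Vc)
             for Kc, Vc in zip(np.split(K, splits), np.split(V, splits))]
    m = max(p[0] for p in parts)          # global max across chunks
    s = sum(ps * np.exp(mi - m) for mi, ps, _ in parts)
    acc = sum(a * np.exp(mi - m) for mi, _, a in parts)
    return acc / s

rng = np.random.default_rng(1)
q = rng.normal(size=(64,))
K, V = rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
w = np.exp(K @ q / 8 - (K @ q / 8).max())
ref = (w / w.sum()) @ V
print(np.allclose(merged_attention(q, K, V), ref))   # True
```

With only one query token, the sequence dimension of the history is the only large axis left, and this split is how the kernel recovers enough parallel work to keep the SMs busy.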
6.3 Tensor Parallelism: Split Big Matrices Across GPUs, and Communication Comes With It
When a model does not fit on one GPU, or some layers are so large that one GPU is not fast enough, one of the most common tools is tensor parallelism (TP). The basic idea is to split the large matrices inside a layer across multiple GPUs, compute them in parallel, and then use all-reduce or all-gather when needed to stitch the results back together. The Megatron-LM paper introduced tensor parallelism precisely under the twin pressures of “GPU memory capacity is limited” and “large models are too slow.”
The benefits of TP are direct:
- you can fit larger models across multiple GPUs;
- some especially large operators can run faster in parallel.
But the costs are equally direct:
- every layer introduces cross-GPU communication;
- when batch size is small and the decode path is short, communication overhead becomes more visible;
- operational and debugging complexity both go up.
Megatron-LM explicitly points out that naive model parallelism runs into expensive cross-node communication and devices waiting on one another.
So TP is not “the more the better.” It is a classic systems trade-off: you spend more GPUs to gain capacity and parallelism, and in exchange you convert part of your runtime into synchronization and communication.
That is why in settings where decode is clearly bandwidth-bound and batch size is not large, over-sharding can actually be a bad trade: the operator is split up, but you did not gain enough parallel work to compensate for the communication.
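A shape-level sketch makes the trade-off concrete (NumPy arrays standing in for GPUs, with Python's `sum` standing in for the all-reduce) for a Megatron-style MLP split:

```python
import numpy as np

def mlp_tensor_parallel(x, w1, w2, n_gpus):
    # Megatron-style MLP sharding: w1 is split by output columns, w2 by
    # input rows. Each "GPU" computes its shard independently; the final
    # sum over partial outputs is the all-reduce every layer must pay.
    w1_shards = np.split(w1, n_gpus, axis=1)
    w2_shards = np.split(w2, n_gpus, axis=0)
    partials = [np.maximum(x @ a, 0.0) @ b       # local compute per GPU
                for a, b in zip(w1_shards, w2_shards)]
    return sum(partials)                          # <-- the all-reduce

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))     # activations: (batch, d_model)
w1 = rng.normal(size=(16, 32))
w2 = rng.normal(size=(32, 16))

single_gpu = np.maximum(x @ w1, 0.0) @ w2        # unsharded reference
sharded = mlp_tensor_parallel(x, w1, w2, n_gpus=4)
```

The column-then-row split order is chosen precisely so the elementwise activation needs no communication; only one all-reduce per MLP block remains, and that sum is the synchronization cost TP trades for capacity.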
6.4 Choosing a Framework by Bottleneck
Inference frameworks should not be chosen by “what everyone is using.” They should be chosen by “what your worst bottleneck is right now.”
If your main problem is:
- long context, high concurrency, KV-cache management, then you should care more about PagedAttention, continuous batching, KV reuse, and scheduler maturity; this is exactly where vLLM focuses.
- pushing NVIDIA GPUs to the limit with kernels, fusion, multi-GPU parallelism, and compile-time optimization, then you should care more about the compiler, plugins, runtime, and communication stack; TensorRT-LLM’s official documentation emphasizes engine compilation, layer fusion, plugins, and Python/C++ runtimes.
- the model is too large and must be split across multiple GPUs, then you have to think about communication cost, topology, and parallel strategy together, not just single-GPU benchmarks; that is the mindset Megatron-LM provides.
So framework selection is not really about allegiance. It is: understand your bottleneck first, then choose the runtime best suited to solve it.
7 Model Compression
This section answers several key questions first:
- What is model compression really compressing: weights, activations, KV, or the model itself?
- Why does naive quantization so often fail, and what exactly is the outlier problem?
- When are AWQ, SmoothQuant, GPTQ, and QAT each the better path?
- Why do pruning and sparsity so often look efficient on paper yet fail to run faster in practice?
- When is distillation more reasonable than continuing to squeeze a large model?
Compression is not one trick. It is a deployment path. The real question is not “can you compress it?” but “after compression, do business quality, hardware efficiency, and operational complexity improve together?”
7.1 The Four Main Compression Routes
In an inference context, compression includes at least four categories:
- Quantization: reduce numeric precision for weights, activations, or KV;
- Pruning and sparsity: make part of the parameters stop participating in computation;
- Distillation: transfer the behavior of a large model into a smaller one;
- Low-rank and adapter-based methods: reduce customization cost, rather than compressing the base model itself.
These routes do not solve the same problem:
- quantization first solves “it does not fit” and “bandwidth is too expensive”;
- pruning first asks whether effective compute can actually be reduced;
- distillation first solves “the model itself is too large”;
- LoRA-style methods first solve “it is too expensive to copy one full model per task.”
7.2 What Weight-Only Quantization, Activation Quantization, and KV Quantization Each Compress
| Type | What is compressed | Main benefit | Common setting | Main risk |
|---|---|---|---|---|
| Weight-only quantization | Model weights | Reduce resident GPU memory and system memory | Single-GPU-limited, self-hosted deployment | Quality can drop noticeably at very low bit widths |
| Activation quantization | Intermediate activations in the forward pass | Reduce bandwidth and operator cost | High-throughput inference, hardware-friendly paths | Sensitive to outliers; more complex to implement |
| KV quantization | KV cache | Free long-context GPU memory and reduce decode bandwidth | Long documents, long conversations, long RAG | Long-range precise recall degrades more easily |
In practice, the most common order is to apply weight quantization first, then decide based on the workload whether KV also needs to be compressed. Activation quantization typically appears in more aggressive, more hardware-bound optimization paths.
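As a minimal sketch of what weight-only (or KV) quantization buys, here is a symmetric per-channel INT8 round trip in NumPy; real stacks use fused kernels and finer group sizes, but the memory arithmetic is the same:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-channel INT8: each output channel (column) gets its
    # own scale, so storage drops 4x versus fp32 (2x versus fp16) at the
    # cost of a rounding error bounded by scale / 2.
    scale = np.abs(w).max(axis=0, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 128)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

memory_ratio = w.nbytes / q.nbytes      # 4x smaller payload than fp32
worst_error = np.abs(w - w_hat).max()   # bounded by half a quantization bucket
```

The same arithmetic applies to the KV cache: halving or quartering bytes per element directly multiplies how many tokens of history fit in the same GPU memory.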
7.3 Why Naive Quantization Often Fails: The Outlier Problem
In LLMs, the activation distribution in many layers is not uniform. A small number of channels produce very large outliers. If you do naive per-tensor quantization, those outliers stretch the entire quantization range wide, so the many ordinary values get squeezed into coarse buckets and the error rises quickly.
When quantization fails, it often does not first show up as a slight perplexity increase. It shows up earlier in tasks like:
- code and math;
- structured output;
- long-context retrieval;
- tasks where rare tokens must remain stable.
So modern LLM quantization methods are almost all answering the same question: how do you protect the important channels so that a few outliers do not ruin the entire range?
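The effect is easy to reproduce. In this toy setup (synthetic Gaussian activations with one channel scaled up 100x, the shape of outliers reported in LLMs), per-tensor INT8 quantization degrades the ordinary channels far more than per-channel scaling does:

```python
import numpy as np

def int8_roundtrip_error(x, per_channel):
    # Quantize to INT8 with one scale per channel or one scale for the
    # whole tensor, then measure error on the *ordinary* channels only.
    axis = 0 if per_channel else None
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    x_hat = np.clip(np.round(x / scale), -127, 127) * scale
    return np.abs((x - x_hat)[:, 1:]).mean()     # skip the outlier channel

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64)).astype(np.float32)
x[:, 0] *= 100.0                                 # one outlier channel

err_per_tensor = int8_roundtrip_error(x, per_channel=False)
err_per_channel = int8_roundtrip_error(x, per_channel=True)
# The single outlier stretches the shared range ~100x, so every ordinary
# value lands in a coarse bucket: err_per_tensor >> err_per_channel.
```

Per-channel (or per-group) scaling is the simplest form of "protecting the ordinary values"; AWQ and SmoothQuant are more refined answers to the same range-stretching problem.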
7.4 AWQ, SmoothQuant, GPTQ: Three Important PTQ Paths
The core of AWQ is “use activation statistics to protect important weight channels,” which makes it a strong fit for 4-bit weight-only quantization. It is especially attractive in local inference, single-machine GPU deployment, and edge-device settings.
The idea of SmoothQuant is to move part of the “activations are hard to quantize” problem into the weights through an equivalent transformation, smoothing activation outliers so hardware-friendly paths like W8A8 become practical. It is a better fit when you want a relatively complete INT8 inference stack.
GPTQ is a classic one-shot PTQ route that uses approximate second-order information to do more refined weight-quantization compensation. It is highly representative of 3/4-bit weight-only paths and fits deployment after offline calibration.
All three are PTQ, but they answer different questions:
- if you want to safely push a large model to low bit width, common in local or single-machine deployment: look at AWQ / GPTQ first;
- if you want a more standard W8A8 hardware path: look at SmoothQuant first;
- if you want the fastest, lowest-cost path to deployment: weight-only quantization is usually what lands first.
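SmoothQuant's equivalent transformation is small enough to show inline. This sketch uses a simplified smoothing factor (the square root of the per-channel activation maximum, an alpha = 0.5 flavor of the paper's formula) just to make the invariance visible:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
x[:, :4] *= 50.0                       # a few outlier activation channels
w = rng.normal(size=(64, 64))

# Divide each activation channel by s and multiply the matching weight
# row by s: the product x @ w is mathematically unchanged, but the
# activation outliers shrink, so a hardware-friendly W8A8 path can
# quantize the activations with far less error.
s = np.sqrt(np.abs(x).max(axis=0))     # simplified smoothing factor
x_smooth = x / s
w_smooth = w * s[:, None]

y_ref = x @ w
y_smooth = x_smooth @ w_smooth         # identical output, tamer activations
```

The difficulty has not vanished; part of it moved into the weights, which are typically easier to quantize because their distribution is flatter.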
7.5 PTQ and QAT: When Is the Training Cost Worth Paying?
PTQ (Post-Training Quantization) is fast, cheap, and deployment-friendly. Its downside is that at very low bit widths, quality is not always stable.
QAT (Quantization-Aware Training) includes quantization error during training or finetuning, so the model learns to adapt to quantization noise. Its advantages are:
- it is usually more stable at extremely low bit widths;
- it can recover quality lost by PTQ more easily;
- it is better at protecting structured output, long context, and certain high-value tasks.
Its costs are equally clear: it requires training resources, longer iteration cycles, and a deployment path more tightly bound to specific hardware or dtypes.
So QAT is usually worth it only when the following conditions hold together:
- PTQ is already hurting production quality;
- the model will be deployed for a long time and at large scale;
- the saved inference cost is large enough to amortize the training cost;
- you truly need that degree of compression, rather than just switching to a smaller model.
7.6 Pruning and Sparsity: Why “More Sparse” Does Not Automatically Mean “Faster”
Pruning can be divided roughly into:
- unstructured pruning: arbitrarily set some weights to zero;
- structured pruning / N:M sparsity: such as 2:4;
- architectural sparsity: such as MoE, where each token activates only part of the parameters.
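For concreteness, here is a toy magnitude-based 2:4 pruner: in every contiguous group of four weights, the two smallest magnitudes are zeroed. The fixed pattern is the whole point; it is what sparse tensor cores can exploit, whereas the same 50% of zeros scattered arbitrarily usually cannot be accelerated:

```python
import numpy as np

def prune_2_of_4(w):
    # N:M structured pruning with N=2, M=4: zero the 2 smallest-magnitude
    # weights in every group of 4 consecutive weights along the last axis.
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
w_sparse = prune_2_of_4(w)

sparsity = float((w_sparse == 0).mean())   # exactly 0.5, in a fixed pattern
```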
The problem is that hardware and kernels can usually exploit only certain sparsity patterns efficiently. So you get a common engineering trap: sparsity looks high in the paper, but there is little obvious speedup online.
Worse, pruning often has a “quality cliff.” It looks harmless up to a point, and then quality collapses abruptly after crossing a threshold. The reason is that what gets damaged is not average redundancy, but critical circuits, critical heads, or critical channels.
So whether pruning is worth it comes down to two questions:
- can your hardware actually exploit this kind of sparsity;
- can your workload tolerate this nonlinear degradation risk?
7.7 Distillation: Sometimes What You Should Compress Is Model Choice, Not Bit Width
If what you really need is lower TTFT, faster generation, and lower dollar cost, distillation is often more decisive than trying to squeeze a giant model to ever lower bit widths. Its essence is: transfer the behavior of a large model into a smaller, cheaper student model.
Common paths include:
- response distillation: learn the teacher’s outputs;
- logit distillation: learn the teacher’s distribution;
- trajectory distillation: learn intermediate reasoning or tool-use behavior.
A very practical engineering judgment is this: if request volume is large, the pattern is stable, the same error types repeat, and model size itself has already become the dominant cost source, then distillation is often worth more than piling on more test-time tricks for a large model.
7.8 GGUF and Local Inference: More Delivery Ecosystem Than Algorithm
A better way to understand GGUF is not as a standalone quantization algorithm, but as an important model packaging and distribution format in the local-inference ecosystem. It commonly appears together with llama.cpp, CPU/Mac deployment, and edge-device settings. Its engineering value is:
- easy distribution;
- friendly local and offline deployment;
- good fit for CPU / Apple Silicon settings.
So when someone asks “how do GPTQ, AWQ, and GGUF compare?”, the more engineering-minded answer is:
- GPTQ / AWQ are more like quantization methods;
- GGUF is more like a delivery format and runtime ecosystem;
- they solve problems at different layers.
8 Test-Time Scaling
This section answers several key questions first:
- What is the essential difference between test-time scaling and speculative decoding?
- What kinds of problems do temperature, top-p, beam search, self-consistency, and best-of-N each fit?
- When is it worth spending more compute at inference time, and when should you go back to training or distillation instead?
- Why does the presence of a verifier or PRM change whether you should do search?
Not every inference optimization is trying to be faster. There is a class of optimization that deliberately adds compute at inference time in exchange for greater reliability or higher answer quality.
8.1 What Is Test-Time Scaling
Test-time scaling means: without retraining the model, spend more compute at inference time to improve answer quality. This is completely different from speculative decoding.
- The goal of speculative decoding is speed; ideally it does not change the target distribution.
- The goal of test-time scaling is accuracy; it explicitly increases inference cost.
When a request is high-value, the cost of mistakes is high, and there is a reliable verifier, test-time scaling is often more economical than blindly switching to a larger model.
8.2 Sampling Strategy: Decide First Whether You Need Diversity or Stability
Many system problems do not require complex search. Thinking clearly about the sampling strategy solves half of them.
- Temperature: lower means more deterministic; higher means more diverse;
- top-k / top-p: control the candidate-token space;
- beam search: keep multiple high-probability prefixes and keep expanding them;
- best-of-N: sample multiple complete candidates and then choose one;
- self-consistency: vote across multiple reasoning paths;
- critique-revise: generate a draft first, then criticize and revise it.
Sampling methods
At each step, the model is only choosing from the distribution over the next token:
\[ p_t(v)=P(y_t=v\mid x,y_{<t})=\frac{\exp(z_t(v))}{\sum_{u\in V}\exp(z_t(u))} \]
So these methods fall into only three categories at bottom: change the shape of the distribution, truncate the candidate set, and generate multiple paths and then choose.
Temperature
This changes how sharp the distribution is:
\[ p_t^{(T)}(v)=\frac{\exp\!\bigl(z_t(v)/T\bigr)}{\sum_u \exp\!\bigl(z_t(u)/T\bigr)} \]
- \(T<1\): more stable; probability mass concentrates on head tokens;
- \(T>1\): more diffuse; probability spreads over more tokens.
Top-k / Top-p
- top-k: keep only the top \(k\) tokens—a hard truncation.
- top-p: keep only the smallest set of tokens whose cumulative probability reaches \(p\)—more flexible.
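A minimal sampler makes the two knobs concrete (plain NumPy; the five-token vocabulary and logits are made up for illustration):

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    # Temperature reshapes the distribution (T<1 sharpens, T>1 flattens);
    # top-p then keeps the smallest prefix of tokens, sorted by
    # probability, whose cumulative mass reaches p (the "nucleus").
    rng = rng or np.random.default_rng()
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]
    p_keep = p[nucleus] / p[nucleus].sum()
    return int(rng.choice(nucleus, p=p_keep))

logits = np.array([4.0, 3.0, 1.0, 0.5, -2.0])   # toy 5-token vocabulary
greedy = sample(logits, temperature=1e-6)       # T -> 0 reduces to argmax
```

Note how the knobs compose: temperature changes the shape of the distribution before top-p truncates it, so a high temperature with a small p can still be quite conservative.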
Beam Search
Keep multiple high-scoring prefixes during generation and continue expanding them. Good for low-entropy tasks like translation, ASR, and summarization, where high-probability sequences are often genuinely good sequences. But in open-ended dialogue, high probability often means only mediocre.
Best-of-N
Generate \(N\) complete answers first, then choose the best one—“write first, choose later,” rather than “filter while generating”:
\[ \hat{y}=\arg\max_i\; s(x,\,y^{(i)}) \]
Self-consistency
Sample multiple reasoning paths and vote on the final answer:
\[ \hat{a}=\arg\max_a \sum_{i=1}^{N} \mathbf{1}\!\bigl[a^{(i)}=a\bigr] \]
Good for reasoning tasks. Not good for JSON—the problem with JSON is not “which line of thought is more consistent,” but “do not produce the wrong format.”
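Self-consistency itself is just a vote; all of the extra compute goes into sampling the N paths. A sketch, with made-up answers standing in for the sampled chains:

```python
from collections import Counter

def self_consistency(final_answers):
    # Majority vote over the final answers of N sampled reasoning paths;
    # the chains of thought themselves are discarded -- only answers vote.
    return Counter(final_answers).most_common(1)[0][0]

# e.g. five sampled chains for one math problem ended at these answers:
answers = ["42", "42", "17", "42", "39"]
chosen = self_consistency(answers)
```

This also shows why it needs a discrete, comparable final answer: free-form text or JSON has nothing well-defined to vote on.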
Critique-revise
Draft → critique → revised draft. Good for long answers, analysis, and proposals—because the common failure mode of the first draft is not that it is totally wrong, but that it is incomplete, blunt, or not rigorous enough.
Do not choose sampling methods by asking which one is more advanced. Ask the more practical question first: for this task, is the bigger risk being wrong, or being flat? A common engineering mistake is to treat these methods as a hierarchy where “more complex is better.” It is not. For low-entropy tasks such as tool parameters, JSON, and class labels, low temperature or guided decoding is often the most effective. For math, reasoning, or proposal exploration, self-consistency, best-of-N, or critique-revise are more valuable. Beam search is useful in machine translation, speech, summarization, and other relatively low-entropy tasks where the scoring function is more stable, but it is not always the best choice in open dialogue, because it tends to push output toward high-probability but mediocre modes.
| Scenario | Core need | Recommended strategy |
|---|---|---|
| JSON, tool parameters, class labels | Stability—you fear being wrong | Low temperature / guided decoding / small candidate set |
| Math, reasoning, proposals, writing | Diversity—you fear being flat | top-p / best-of-N / self-consistency / critique-revise |
8.3 When to Spend More Compute at Inference Time
Cases where test-time scaling usually deserves consideration are:
- request volume is not enormous, but the value of each request is high;
- errors are concentrated in hard cases, rather than being a global bias across all requests;
- you have a verifier, judge, or business rule that can distinguish good from bad;
- your latency budget allows the difficult samples to take a slower path.
By contrast, if the following are true, you should more likely go back to training, distillation, or model switching:
- request volume is enormous, so the extra inference cost cannot be amortized;
- failure modes are highly repetitive;
- there is no reliable evaluator;
- all requests require stable improvement, not just rescue for high-value hard cases.
So a mature system often adopts a tiered path: easy requests go through single-pass inference, while difficult or high-value requests enter a more expensive test-time-scaling path.
8.4 PRM and Search: Once You Can Evaluate Intermediate States, Search Becomes More Worthwhile
A PRM fundamentally scores intermediate states, not just the final result. For example: is this problem decomposition correct; does this intermediate conclusion move in the right direction; has this subproblem already drifted off course; is this reasoning path converging, or wandering. That turns search from “blind search” into “guided search,” letting the system detect bad paths earlier, waste less compute on obviously wrong branches, and focus compute on more promising trajectories.
- If you only evaluate the final answer, best-of-N and voting are simpler;
- if you can evaluate intermediate steps, tree search, backtracking, and pruning start to have a real basis.
This also maps to a practical breadth–depth trade-off:
- when uncertainty is mainly about “choosing the wrong direction,” increasing breadth is more valuable;
- when, once the direction is right, what follows is a long chain of reasoning, increasing depth is more valuable.
9 Metrics, Monitoring, and Quality Gates
This section answers several key questions first:
- What metrics matter most for an inference service, and which are nice-looking but insufficient?
- Why does looking only at average latency almost always misdiagnose the state of a live system?
- Before compression or quantization goes live, how should quality gates be set?
- When users say “sometimes it’s fast, sometimes it’s slow,” how do you slice the problem to find the root cause?
The stability of an inference system comes from observability. Without metrics, logs, and slices, you have no idea whether you are optimizing the model, the scheduler, the cache, or fixing a problem that does not actually exist.
9.1 Four Core Metrics, and Why They Still Are Not Enough
The four most common core metrics are:
- TTFT;
- generation speed / TPOT;
- throughput;
- cost, such as cost per million tokens or cost per request.
But in production, those four are nowhere near enough. You usually also need:
- p95 / p99 TTFT;
- p95 / p99 generation speed;
- queue depth and request age;
- prefix-cache hit rate;
- speculative acceptance rate;
- counts of OOM / preemption / eviction;
- JSON / tool-call success rate;
- citation and retrieval accuracy for long-context tasks.
Averages are not enough because users experience the tail, not the mean.
9.2 Cost Should Be Broken Down by Request Path, Not Just Model Unit Price
Inference cost usually comes from three parts:
- model token cost;
- tool and retrieval call cost;
- amplification from retries, reranking, test-time scaling, and multi-turn exploration.
For a serving team, what matters more than “how much does this model cost per million tokens?” is:
- which kinds of requests are most expensive;
- whether the cost is in prefill or decode;
- whether it is driven by long context or multi-turn retries;
- whether it comes from the large model itself or from the surrounding retrieval and structured-output path.
Only when cost is broken down by request path do you know what to optimize.
9.3 How to Set Quality Gates After Quantization
The easiest mistake before shipping compression is to look at one average benchmark score. A more reliable gating method should at least break into four layers:
- basic capability: perplexity, standard benchmarks;
- main business tasks: QA, code, summarization, customer support, and so on;
- structured output: JSON validity rate, schema hit rate;
- long context and safety: Needle, cross-section citation, refusal boundaries, unauthorized tool use.
The usual gating strategy is:
- set hard thresholds first, and do not allow key tasks to regress;
- then set weighted scores, allowing small regressions on non-critical metrics;
- finally send shadow traffic or a canary and observe the real online distribution.
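A hedged sketch of that two-stage gate (the task names, floors, and weights are invented for illustration):

```python
def passes_gate(candidate, baseline, hard_floors,
                weights=None, max_weighted_drop=0.01):
    # Stage 1: hard per-task floors -- key tasks may not regress past them.
    for task, floor in hard_floors.items():
        if candidate[task] < floor:
            return False, f"hard floor violated: {task}"
    # Stage 2: weighted average drop -- small regressions on non-critical
    # metrics are tolerated as long as the overall drop stays bounded.
    weights = weights or {t: 1.0 for t in candidate}
    drop = sum(weights[t] * (baseline[t] - candidate[t]) for t in candidate)
    drop /= sum(weights.values())
    if drop > max_weighted_drop:
        return False, "weighted regression too large"
    return True, "ok"

baseline  = {"qa": 0.82, "json_valid": 0.99, "needle": 0.95}
quantized = {"qa": 0.81, "json_valid": 0.99, "needle": 0.90}
ok, reason = passes_gate(quantized, baseline,
                         hard_floors={"json_valid": 0.98, "needle": 0.92})
# the Needle score fell through its hard floor, so the candidate is
# rejected before shadow traffic ever runs
```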
9.4 Why Needle-in-a-Haystack Often Drops First
If, after quantization, you find that long-document retrieval or Needle metrics drop first, that is usually not surprising. These tests rely heavily on:
- long-range representations not being drowned in noise;
- subtle differences in attention remaining distinguishable;
- rare positions and rare tokens maintaining enough numerical precision.
Quantization—especially aggressive low-bit paths and KV quantization—happens to hurt exactly those abilities first. So if your workload depends heavily on precise long-document localization, quantization evaluation cannot look only at chat quality.
9.5 When Users Say “Latency Is Unstable,” How to Slice the Problem
“Sometimes it is fast, sometimes it is 5× slower” is usually not one problem. The most effective diagnosis is to slice requests along dimensions like:
- prompt-length buckets;
- output-length buckets;
- cache hit / miss;
- guided / non-guided;
- speculative on / off;
- tenant / adapter / model version;
- prefill pool / decode pool.
Once you slice that way, the real root cause often becomes obvious: long prompts mixed into the same queue, prefix-cache mismatches, cold adapter loading, or an autoscaler that failed to catch up.
10 Engineering Cases
This section answers several key questions first:
- What are the most common failure modes in online inference systems?
- When KV cache, the scheduler, quantization, or decode configuration goes wrong, what do the symptoms usually look like?
- What engineering lessons from those incidents are actually worth remembering?
Online problems rarely show up in the form “some theory was wrong.” They show up as user complaints, cost anomalies, exploding p99s, or quiet quality drift. The point of the cases below is not the stories themselves. It is to build transferable debugging intuition.
10.1 KV Cache Explodes: The System Is Not Fully Loaded, Yet OOM Happens Constantly
Symptoms:
- GPU utilization is not especially high, but OOM occurs frequently at peak traffic;
- users with long conversations fail at a noticeably higher rate;
- scaling out the cluster helps only slightly.
Root causes:
- the team estimated capacity only from “average prompt length,” rather than budgeting by maximum active sequences and maximum output length;
- prefix cache, multi-turn history, and long outputs together drove KV much higher;
- there was no headroom, so as soon as dynamic batching jittered, the system crossed the line.
Engineering lessons:
- KV cache is not background cost. It is the main cost of online state;
- capacity planning must budget the worst path, not the average path;
- any long-context product should go through memory-envelope testing before launch.
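The envelope itself is one multiplication, which is exactly why it should be computed before launch rather than discovered at peak traffic. A sketch, with hypothetical 7B-class GQA dimensions (32 layers, 8 KV heads of dim 128, fp16 KV):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, n_seqs,
                   bytes_per_elem=2):
    # 2 tensors (K and V) per layer, one head_dim vector per KV head
    # per token, for every active sequence.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * n_seqs * bytes_per_elem)

GIB = 2 ** 30
per_seq = kv_cache_bytes(32, 8, 128, seq_len=32_768, n_seqs=1) / GIB  # 4 GiB
# Budget the *worst* path: maximum active sequences at maximum length,
# not the average prompt.
peak = kv_cache_bytes(32, 8, 128, seq_len=32_768, n_seqs=64) / GIB    # 256 GiB
```

At these assumed dimensions, 64 concurrent sequences at full context already need 256 GiB of KV alone, before weights, which is why "average prompt length" capacity planning fails quietly and then all at once.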
10.2 Continuous Batching Improves Average Throughput, but p99 Explodes
Symptoms:
- average tokens/s increases;
- users keep complaining that first-token latency swings wildly;
- p99 TTFT suddenly spikes during traffic bursts.
Root causes:
- long prompts and short prompts are mixed in the same queue;
- admission control is too weak;
- chunked prefill was not enabled, so a few long prefills monopolized the entire prefill channel.
Engineering lessons:
- online systems should not optimize only for average throughput;
- fair scheduling and SLA tiering are often worth more than a larger batch;
- continuous batching without bucketing and admission control can easily turn a local optimum into a global worst case.
10.3 After Quantization, Chat Still Looks Fine, but RAG Quality Quietly Collapses
Symptoms:
- ordinary chat and summarization feel almost unchanged;
- in long-document QA, citations begin to drift and cross-section retrieval misses more often;
- Needle-style tests regress clearly.
Root causes:
- the team looked only at standard benchmarks and manual chat experience;
- the compressed path was KV or an overly low-bit route, which hurt long-range attention first;
- precise long-context citation was not included in the gating criteria.
Engineering lessons:
- quantization evaluation must align with the real workload;
- long context is not covered by “adding one benchmark”; you must test fine-grained abilities such as citation, localization, and entity binding;
- what breaks first is usually not average capability, but the most fragile edge capability.
10.4 Structured Output Is Unstable, and Retries Blow Up Both Latency and Cost
Symptoms:
- JSON parse failures are not frequent, but frequent enough to be painful;
- every failure triggers an application-layer retry;
- average latency and token cost both amplify with traffic.
Root causes:
- the system relies only on prompt instructions like “output valid JSON”;
- there is no guided decoding or strong schema validation;
- the retry policy is too permissive, turning small-probability errors into systemic cost.
Engineering lessons:
- for structured output problems, first ask how to prevent invalid tokens during sampling, not how to “try again a few more times” after the fact;
- retries are not free; they amplify every edge-case error into a cost problem for the whole system;
- in agent or tool settings, output validity is itself a performance problem.
10.5 TTFT Gets More Stable After P/D Disaggregation, but the Tail Starts Jittering
Symptoms:
- average TTFT improves;
- generation speed gets worse for some requests;
- cluster monitoring shows occasional idling or waiting in the decode pool.
Root causes:
- after separating prefill and decode, KV transfer and the connector became new bottlenecks;
- the boundary of tail-latency optimization turned into a cross-machine state-migration problem;
- KV migration cost and network jitter were not evaluated enough.
Engineering lessons:
- P/D disaggregation is an SLA tool, not free throughput;
- state migration itself becomes a systems problem;
- before introducing disaggregation, ask whether chunked prefill is already sufficient.
11 Chapter Summary
What this chapter is really trying to build is not a how-to for a particular framework, but an engineering lens for judging inference systems.
- Prefill and decode are not the same workload.
- KV cache is the most important online state in a serving system.
- Throughput, latency, GPU memory, cost, and reliability are coupled; no single-point optimum exists.
- Compression and kernel optimization only count as wins if the runtime can actually exploit them.
- In online systems, the real battle is usually not in averages, but in tail latency, failure modes, and observability.
From this angle, inference is not the cleanup step after training. It is itself one of the core production-engineering layers of modern AI systems.
11.1 How to Choose an Inference Framework: Choose by Bottleneck, Not Popularity
It is easier to judge common open-source stacks when you view them through the systems problem they solve:
| Framework | Stronger at | Better fit for | What to watch |
|---|---|---|---|
| vLLM | PagedAttention, continuous batching, APC, chunked prefill, structured outputs, LoRA | The default answer for general open-source model serving | The runtime is strong, but admission control, isolation, and SLA tiering still need to be designed by you |
| TensorRT-LLM | Deep optimization for the NVIDIA stack, IFB, paged KV, speculative decoding, guided decoding, multinode parallelism | Running squarely on the NVIDIA stack and chasing extreme performance | Higher engineering barrier, stronger hardware lock-in |
| SGLang | RadixAttention, structured output, PD disaggregation, multi-LoRA, rich quantization and KV-cache options | Agents, structured outputs, and complex prefix-reuse workloads | Fast-moving releases; you need to keep up |
| TGI | Tight Hugging Face ecosystem integration, mature deployment experience, good monitoring and streaming support | Teams already deeply dependent on the HF ecosystem | At this point it looks more like a maintenance-oriented solution; new projects usually evaluate vLLM and SGLang alongside it |
A more useful decision standard than “which one is fastest?” is:
- Is your main bottleneck TTFT, generation speed, throughput, structured output, or multi-tenancy and isolation?
- Are you explicitly committed to the NVIDIA stack?
- Do you need strong default performance, or more orchestration flexibility and cache-aware features?
- Can your team carry the cost of custom kernels, distributed debugging, and rapid upgrade cycles?
11.2 Question Recap
1. What is inference in a language model?
Turn user input into tokens, run prefill to build contextual representations and the KV cache, then decode output tokens autoregressively and return the result to the caller.
2. Why is LLM inference expensive?
Because you must not only keep huge model weights resident on hardware, but also keep reading weights and KV cache continuously under live traffic; as context, active-sequence count, and output length grow, GPU memory, bandwidth, and scheduling quickly become bottlenecks.
3. What is the difference between prefill and decode?
Prefill processes the full prompt at once, with high parallelism and high arithmetic intensity; decode adds one token at a time, but must read the full historical KV, making it serial and more often bandwidth-limited.
4. What is KV cache?
It is the per-layer cache of historical token K/V that avoids recomputing the full history at every decode step; it is the core online state of decoder-only serving.
5. Why does context length affect latency?
Because a longer prompt increases prefill compute, a longer history increases the amount of KV read on each decode step, and both also consume additional GPU memory.
6. Why does attention complexity rise so quickly?
Because token-to-token interactions grow quadratically with sequence length; even when inference uses the KV cache to avoid recomputing K/V, each decode step still reads cache roughly in proportion to the history length.
7. How does KV caching speed up generation?
It turns “recompute the full history every step” into “compute only the new token and then read the cached history,” converting a repeated-compute problem into a state read/write problem.
8. What is batching, and why does it matter?
It lets multiple requests share one forward pass, amortizing weight reads and kernel overhead, and it is the core mechanism for getting throughput in online serving. But if pushed too aggressively, it hurts single-request latency.
9. How does an inference system manage GPU memory?
First budget capacity for weights and KV, then use memory pools, paged KV allocation, headroom control, chunked prefill, quantization, and admission control to keep dynamic requests safely inside that budget.
10. Why is decode often the user-visible bottleneck?
Because it is a serial token-by-token path that cannot parallelize as fully as prefill, and each step must read both weights and historical KV, so bandwidth often drags it down.
11. When does prefix caching deliver the most value?
When many requests share the same system prompt, template, document prefix, or long history, prefix cache can skip the shared part of prefill and directly reduce TTFT.
12. When is speculative decoding worth using?
When the target model is clearly decode-bound, the draft model is cheap enough, and the acceptance rate is high enough. If acceptance is low or the main bottleneck is not decode, it may not pay off.
13. Why must quantization be evaluated together with hardware?
Because whether a quantization path actually delivers speed or capacity gains depends on kernel support, data layout, and the hardware execution path; lowering bit width on paper does not automatically make production faster.
14. How do you design a scalable LLM inference service?
Start by defining latency, throughput, context length, structured output, and isolation requirements from the business side; then design around request queues, schedulers, KV management, caches, decoding, monitoring, and degradation paths, rather than just picking a model.
15. What is the key to supporting million-scale concurrency?
Not a single extreme benchmark number, but layered queues, cache-aware routing, continuous batching, prefix reuse, GPU-memory budgeting, autoscaling, and disaggregated deployment when needed.
16. How do you balance throughput and latency?
Use bucketing, priority queues, batched-token limits, chunked prefill, and headroom control so the system can preserve p95/p99 under high throughput, rather than optimizing only for the average.
17. Which techniques are most effective at reducing cost?
It depends on the setting: short context often starts with weight quantization and batching; long context often starts with KV management and input compression; long-lived high-traffic workloads should more seriously consider distillation or model switching.
18. When should you compress, and when should you switch to a smaller model?
If the workload must retain the current model’s capability and is only deployment-constrained, compress first. If the model is clearly overprovisioned, traffic is large, and the task pattern is stable, a smaller model or distillation is usually the better economic choice.
12 References
- NVIDIA. H100 Tensor Core GPU Datasheet. 2023.
- Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. 2022.
- Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. 2023.
- Ainslie et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. 2023.
- vLLM Project. vLLM Documentation (Automatic Prefix Caching, Chunked Prefill, Structured Outputs, Disaggregated Prefilling). 2024.
- NVIDIA. TensorRT-LLM Documentation (In-Flight Batching, Paged KV Cache, Chunked Prefill, Guided Decoding, Speculative Decoding). 2024.
- SGLang Team. SGLang Documentation (RadixAttention, Structured Outputs, Quantized KV Cache, LoRA Serving, PD Disaggregation). 2024.
- Hugging Face. Text Generation Inference Documentation (Continuous Batching, Paged Attention, Guidance, Monitoring). 2024.
- Chen et al. Accelerating Large Language Model Decoding with Speculative Sampling. 2023.
- Cai et al. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. 2024.
- Wang et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. 2023.
- Zhang et al. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. 2023.
- Xiao et al. Efficient Streaming Language Models with Attention Sinks. 2023.
- Li et al. SnapKV: LLM Knows What You Are Looking For Before Generation. 2024.
- Liu et al. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. 2024.
- Hooper et al. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. 2024.
- Lin et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. 2024.
- Xiao et al. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. 2023.
- Frantar et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. 2023.
- PyTorch Blog. Quantization-Aware Training for Large Language Models with PyTorch. 2024.
- Frantar & Alistarh. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. 2023.