Appendix 1A. Mathematical Explanations of the LLM Taxonomy
1 Overview
This appendix expands the concepts listed in the handbook taxonomy into a math-first reference.
For each concept or method, I use the same four questions:
- What it is — a compact mathematical definition or formalization.
- What problem it solves — the engineering or statistical bottleneck it addresses.
- What came before it — the earlier baseline or historical precursor.
- Pros / cons — the main trade-offs.
Where the original taxonomy contained near-duplicates (for example FFN and MLP, or TTFT / TPOT / ITL), I merge them into one explanation to avoid repeating the same math.
2 Notation
Let a raw input string be (x \in \Sigma^*).
A tokenizer maps (x) to tokens (t_{1:m} \in V^m), where (V) is a discrete vocabulary.
An embedding matrix (E \in \mathbb{R}^{|V| \times d}) maps tokens to vectors (h_t \in \mathbb{R}^d).
For attention, with sequence length (n),
[ Q = HW_Q, \quad K = HW_K, \quad V = HW_V, ]
and a standard attention layer computes
[ \mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d_h}} + M \right) V, ]
where (M) is a mask or bias matrix and (d_h) is the per-head dimension.
For retrieval, let the corpus be (\mathcal{D} = \{d_1, \dots, d_N\}) and the query be (q).
For optimization, (\theta) denotes model parameters, and the loss is minimized over data ((x,y)).
3 Tokenization and data
3.1 Tokenization
What it is. A tokenizer is a map (T : \Sigma^* \to V^m) from raw text to a sequence of discrete symbols. The language model does not predict characters or words directly; it predicts token IDs (t_i). In a causal LM, the training objective is [ \mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{i=1}^{m} \log p_\theta(t_i \mid t_{<i}). ] Tokenization therefore defines the actual prediction alphabet.
What problem it solves. It makes the softmax and embedding tables finite and reusable. Character-level vocabularies are tiny but sequences become long; word-level vocabularies explode and suffer OOV failures. Tokenization sits between these extremes.
What came before it. Earlier systems used word-level vocabularies or pure character-level modeling. Word-level models had brittle OOV behavior; character models had long horizons and poor efficiency.
Pros / cons. Subword tokenization gives a good compression/coverage trade-off. The downside is that the chosen segmentation becomes part of the model’s inductive bias: bad segmentation hurts arithmetic, copy behavior, multilingual robustness, and context efficiency.
3.2 BPE / BBPE
What it is. Byte Pair Encoding (BPE) greedily merges the most frequent adjacent pair in the training corpus. If (c(a,b)) is the count of adjacent pair ((a,b)), BPE repeatedly adds [ (a^*, b^*) = \operatorname*{arg\,max}_{(a,b)} c(a,b) ] to the merge list. Byte-level BPE (BBPE) runs the same procedure over bytes (\{0, \dots, 255\}) rather than Unicode characters.
What problem it solves. BPE compresses frequent strings into reusable subwords, reducing sequence length while preserving open-vocabulary behavior. BBPE adds robustness to arbitrary scripts, corrupted text, and rare symbols, since every string can always be represented as bytes.
What came before it. Word-level tokenization and character-level tokenization were the main earlier choices. BPE improved the trade-off between vocabulary size and sequence length.
Pros / cons. BPE is simple, fast, and works well in practice. BBPE is especially robust across languages. The downside is greediness: merges optimize corpus frequency, not downstream likelihood directly, so domain jargon can still fragment badly.
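The greedy merge loop above can be sketched in a few lines. This is a minimal toy implementation, not a production tokenizer: it operates on a hypothetical word-frequency dictionary whose entries are whitespace-separated symbol sequences (initially single characters), and it ignores byte-level handling and special tokens.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Greedy BPE on a toy corpus: `words` maps a space-separated
    symbol sequence to its corpus frequency. Returns the merge list
    and the re-segmented corpus."""
    words = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency: c(a, b).
        pairs = Counter()
        for word, freq in words.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # arg max over c(a, b)
        merges.append(best)
        merged = "".join(best)
        # Apply the chosen merge everywhere in the corpus.
        new_words = {}
        for word, freq in words.items():
            syms = word.split()
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_words[" ".join(out)] = freq
        words = new_words
    return merges, words

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, segmented = bpe_train(corpus, 3)
```

On this toy corpus the first merges pick up the frequent suffix fragments ("es", then "est"), illustrating how BPE compresses recurring strings into reusable subwords.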
3.3 WordPiece
What it is. WordPiece is a subword tokenizer that chooses merges using a likelihood-style score rather than raw pair frequency. A useful mental model is that it prefers merges that increase the corpus likelihood most, often approximated by a score like [ \mathrm{score}(a,b) = \frac{c(ab)}{c(a)\,c(b)}. ]
What problem it solves. It tries to choose subwords that are statistically coherent, not just frequent. This can produce better segmentation for morphology and technical strings than naive pair counting.
What came before it. Character and word vocabularies came first; BPE is the closest precursor. WordPiece is a more likelihood-aware variant of the same subword idea.
Pros / cons. It often gives cleaner segmentations than plain BPE. The downside is that training and implementation are a bit less transparent than greedy BPE, and real-world gains are task- and corpus-dependent rather than universal.
3.4 Unigram LM
What it is. Unigram LM starts from a large candidate vocabulary (V) and assigns each token (v \in V) a probability (p(v)). For each string (x), there are many segmentations (z \in \mathcal{S}(x)), and the objective is [ p(x) = \sum_{z \in \mathcal{S}(x)} \prod_{u \in z} p(u). ] Training alternates between estimating token probabilities and pruning tokens that contribute little to likelihood.
What problem it solves. Instead of greedily building merges, it directly models segmentation as latent-variable inference. This makes it flexible and often more stable for multilingual or morphologically rich corpora.
What came before it. BPE and WordPiece both build a vocabulary incrementally; Unigram LM starts large and prunes.
Pros / cons. It is probabilistically cleaner and often produces good segmentations. The downside is higher training complexity and less intuitive behavior than merge-based tokenizers.
3.5 Vocabulary, OOV, and fragmentation rate
What it is. The vocabulary is the discrete set (V) of token IDs. OOV behavior is best understood not as “unrepresentable” in a strict byte-level system, but as poor representation efficiency: a string (s) breaks into too many sub-tokens. A simple fragmentation measure is [ \mathrm{frag}(s) = \frac{|T(s)|}{\#\text{words}(s)}, ] the average number of tokens per word (sometimes called fertility).
What problem it solves. This perspective diagnoses why some domains feel “hard” for a model even before training quality is considered: legal IDs, code symbols, chemistry strings, or biomedical terms may fragment heavily and waste context.
What came before it. Older word-level systems had hard OOV failures: unseen words mapped to an UNK token. Modern subword systems replaced hard OOV with soft fragmentation.
Pros / cons. Large vocabularies reduce fragmentation but increase embedding and softmax cost. Small vocabularies improve sharing and robustness but lengthen sequences and hurt exact-match behavior.
3.6 CBOW
What it is. Continuous Bag of Words (CBOW) is a word2vec objective that predicts a target token from surrounding context. If the context window around position (t) is (C_t), CBOW uses the average context embedding [ \bar{h}_t = \frac{1}{|C_t|} \sum_{j \in C_t} e(w_j) ] and predicts [ p(w_t \mid C_t) = \mathrm{softmax}(U \bar{h}_t). ]
What problem it solves. It learns dense lexical embeddings cheaply from local distributional statistics.
What came before it. Before neural embeddings, NLP relied on sparse count vectors such as bag-of-words or tf-idf. CBOW replaced sparse co-occurrence counts with dense learned vectors.
Pros / cons. It is simple and fast, and historically important. The downside is that it ignores word order inside the context window and does not model long-range compositional reasoning like modern Transformers.
3.7 Deduplication
What it is. Deduplication removes exact or near-duplicate examples from a corpus. Formally, given examples (x_i), one removes pairs with high similarity, [ \mathrm{sim}(x_i, x_j) \ge \tau, ] where similarity may be exact string match, MinHash/Jaccard, or embedding-based.
What problem it solves. Duplicates distort the empirical data distribution, increase memorization, and waste compute by repeatedly showing the model almost the same gradient signal.
What came before it. Raw web-scale crawls often trained directly on noisy repeated data.
Pros / cons. Dedup improves data diversity and can reduce memorization. The downside is that aggressive dedup may remove useful repeated structure such as code templates, formulas, or important canonical texts.
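A minimal sketch of similarity-based dedup, assuming Jaccard similarity over character 3-grams and a hypothetical threshold of 0.8. Real pipelines use MinHash/LSH so that not every pair must be compared; this toy version does the quadratic scan for clarity.

```python
def shingles(text, k=3):
    """Set of character k-grams of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity sim(a, b) = |A ∩ B| / |A ∪ B| over shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def dedup(examples, threshold=0.8):
    """Keep an example only if it is not a near-duplicate (sim >= threshold)
    of an already-kept one. O(n^2); MinHash/LSH makes this sub-quadratic."""
    kept = []
    for x in examples:
        if all(jaccard(x, y) < threshold for y in kept):
            kept.append(x)
    return kept

docs = ["the quick brown fox", "the quick brown fox!", "a completely different sentence"]
kept = dedup(docs)
```

Here the second document differs from the first by one character, so its shingle overlap is far above the threshold and it is dropped.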
3.8 Contamination
What it is. Contamination is train–test leakage. If an evaluation item (e) or a near-duplicate of it appears in training, then the estimate of generalization is biased upward. In statistical terms, evaluation is no longer on an independent sample from the true test distribution.
What problem it solves. The concept exists to diagnose fake benchmark gains. A model that has already seen the benchmark is not demonstrating genuine out-of-sample performance.
What came before it. Classical ML already worried about train/test leakage, but web-scale LLM training made the problem much larger because public benchmarks are often present in crawled corpora.
Pros / cons. Contamination checks increase trustworthiness. The downside is that exact detection is hard at scale, especially for paraphrases, translated variants, or benchmark-derived training data.
3.9 Packing
What it is. Packing concatenates multiple shorter sequences into one longer training sequence so that the context window is filled efficiently. If the context length is (L), the goal is to partition short examples into bins of total length at most (L), minimizing padding waste.
What problem it solves. Without packing, many training tokens are padding, which wastes GPU throughput and reduces tokens-per-second.
What came before it. Naive batching with one example per context window and large padding overhead.
Pros / cons. Packing improves hardware utilization substantially. The downside is implementation complexity: boundaries between packed examples must be masked correctly so the model does not attend across unrelated samples.
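The bin-packing view above can be made concrete with a first-fit-decreasing heuristic. This is an illustrative sketch, not any particular trainer's implementation: it packs example token-lengths into bins of capacity (L) and returns the grouping; the boundary-masking step is assumed to happen downstream.

```python
def pack(lengths, context_len):
    """First-fit-decreasing packing of example lengths into bins of
    capacity `context_len`, reducing padding waste. Returns one list
    of example indices per packed context window."""
    bins = []  # each bin: [remaining_capacity, [example indices]]
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for i in order:
        need = lengths[i]
        for b in bins:
            if b[0] >= need:          # fits in an existing bin
                b[0] -= need
                b[1].append(i)
                break
        else:                          # open a new bin
            bins.append([context_len - need, [i]])
    return [b[1] for b in bins]

groups = pack([500, 300, 700, 200], context_len=1024)
```

Four short examples fit into two 1024-token windows instead of four, and padding drops from 2396 tokens to 348.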
4 Core architecture
4.1 Encoder-only, decoder-only, encoder-decoder, and PrefixLM
What it is.
- Encoder-only: bidirectional self-attention over (x), useful for representation learning.
- Decoder-only: causal self-attention with [ p(x) = \prod_{t=1}^{n} p(x_t \mid x_{<t}). ]
- Encoder-decoder: conditional generation [ p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x). ]
- PrefixLM: a hybrid mask where a prefix attends bidirectionally, but continuation tokens decode causally.
What problem it solves. Different tasks require different information flow. Representation tasks want bidirectionality; generation tasks want causal structure; conditional generation wants both an encoded source and an autoregressive target.
What came before it. RNN encoders/decoders and sequence-to-sequence attention models came before Transformer variants. PrefixLM is a later compromise between encoder-style context access and decoder-style generation.
Pros / cons. Decoder-only models unify many tasks under one objective and dominate LLMs. Encoder-decoder models remain strong for translation/summarization. PrefixLM is elegant but less standardized than pure decoder-only training.
4.2 RNN
What it is. A recurrent neural network updates a hidden state recursively: [ h_t = f_\theta(h_{t-1}, x_t), \quad y_t = g_\theta(h_t). ] LSTMs and GRUs modify (f_\theta) with gating to mitigate vanishing gradients.
What problem it solves. It models sequences with parameter sharing across time and was the dominant pre-Transformer architecture for speech, translation, and language modeling.
What came before it. Simpler Markov models and n-gram language models represented only finite context and relied on count smoothing rather than learned hidden state.
Pros / cons. RNNs are sequentially cheap in memory and conceptually natural. Their main drawback is poor parallelization over time and difficulty modeling very long dependencies compared with self-attention.
4.3 Attention mask / causal attention
What it is. In attention, [ A = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d_h}} + M \right), ] the mask (M) controls which positions may interact. In causal attention, [ M_{ij} = \begin{cases} 0 & j \le i,\\ -\infty & j > i. \end{cases} ] This forbids looking into the future.
What problem it solves. It enforces the statistical factorization required for next-token prediction and prevents training-time information leakage.
What came before it. RNNs enforced causality by architecture. In Transformers, the mask makes causality explicit.
Pros / cons. Masks are flexible and make many architectures possible. The downside is that the quadratic attention matrix still exists conceptually, so long contexts remain expensive unless further tricks are added.
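The causal mask and its effect on the softmax can be sketched directly. This is a toy, loop-based version (stdlib only, no tensor library) that builds the (M_{ij}) matrix from the definition above and applies a masked row-wise softmax; real kernels fuse these steps.

```python
import math

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf: position i may not see the future."""
    return [[0.0 if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask); exp(-inf) contributes 0."""
    out = []
    for row_s, row_m in zip(scores, mask):
        logits = [s + m for s, m in zip(row_s, row_m)]
        mx = max(logits)
        exps = [math.exp(v - mx) for v in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Uniform scores: attention spreads evenly over the visible prefix only.
n = 3
A = masked_softmax([[0.0] * n for _ in range(n)], causal_mask(n))
```

With uniform scores, row (i) assigns weight (1/(i+1)) to each of its (i+1) visible positions and exactly zero to all future positions.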
4.4 CLM
What it is. Causal Language Modeling optimizes [ \mathcal{L}_{\mathrm{CLM}} = -\sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t}). ] It is the canonical decoder-only pretraining objective.
What problem it solves. It trains one model to do open-ended generation, prompting, in-context learning, and chat-style continuation under a single autoregressive rule.
What came before it. N-gram LMs and RNN LMs also used next-token prediction. The difference is scale and the Transformer architecture.
Pros / cons. CLM is simple and general. Its downside is that it only conditions left-to-right during training, so tasks needing full bidirectional evidence sometimes benefit from encoder objectives.
4.5 MLM
What it is. Masked Language Modeling hides a subset (M) of positions and predicts them using both left and right context: [ \mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p_\theta(x_i \mid x_{\setminus M}). ]
What problem it solves. It learns bidirectional contextual representations well, which is useful for classification, retrieval, tagging, and general text understanding.
What came before it. Earlier representation learning used static embeddings or recurrent encoders. MLM gave Transformers a powerful self-supervised encoder objective.
Pros / cons. MLM is excellent for encoders. The downside is a train–test mismatch for generation: the model is not trained as a direct left-to-right generator.
4.6 S2S
What it is. Sequence-to-sequence learning models conditional generation: [ p_\theta(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x). ] The encoder reads (x), and the decoder generates (y) while attending to encoder states.
What problem it solves. It separates source understanding from target generation, which is ideal for translation, summarization, paraphrase, and many supervised transduction tasks.
What came before it. RNN encoder–decoder models with attention were the immediate precursor. Transformer seq2seq replaced recurrence with self-attention and cross-attention.
Pros / cons. S2S is structurally clean for conditional generation. The downside is that serving two stacks (encoder and decoder) can be less operationally uniform than one decoder-only model.
4.7 EOS
What it is. The End-of-Sequence token is a special symbol (t_{\mathrm{EOS}} \in V) such that generation stops when the model emits it. Formally, the generated length (T) is the first index with (t_T = t_{\mathrm{EOS}}).
What problem it solves. It gives the model a learned stopping condition instead of requiring fixed-length outputs.
What came before it. Earlier systems often used maximum length only, or external stop rules.
Pros / cons. EOS is simple and differentiable within the autoregressive framework. The downside is that models can under-generate or over-generate if EOS calibration is poor.
4.8 KV cache
What it is. During decode, each layer stores past keys and values so they do not need to be recomputed. For layer (\ell), token (t), and (H_{kv}) KV heads, the cache stores tensors of roughly [ 2 \times H_{kv} \times d_h \times t ] elements per layer, where the factor 2 accounts for keys and values.
What problem it solves. Without caching, each new token would recompute all previous K/V states, making autoregressive decoding prohibitively expensive.
What came before it. Naive recomputation of the full prefix at every step.
Pros / cons. KV caching is essential for fast decode. The downside is that it shifts serving from pure compute to a memory/bandwidth problem and heavily constrains concurrency.
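The per-layer count above multiplies out to a concrete memory budget. A minimal estimator, assuming fp16/bf16 storage (2 bytes per element) and a hypothetical 7B-class configuration (32 layers, head dimension 128, 4096-token context):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache footprint for one sequence.
    The leading 2 covers keys and values; fp16/bf16 => 2 bytes/element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# MHA-style cache (32 KV heads) vs. GQA-style cache (8 KV heads):
mha_bytes = kv_cache_bytes(32, 32, 128, 4096)
gqa_bytes = kv_cache_bytes(32, 8, 128, 4096)
```

For this configuration the full-MHA cache is 2 GiB per 4096-token sequence, and reducing (H_{kv}) from 32 to 8 shrinks it by exactly 4x, which is why the cache size rather than raw FLOPs often caps serving concurrency.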
4.9 MHA / MQA / GQA
What it is. Let (H_q) be the number of query heads and (H_{kv}) the number of KV heads.
- MHA: (H_{kv}=H_q).
- MQA: (H_{kv}=1).
- GQA: (1 < H_{kv} < H_q), with groups of query heads sharing KV heads. KV-cache memory scales linearly with (H_{kv}), not (H_q).
What problem it solves. MQA and GQA reduce KV-cache size and decode bandwidth while preserving most of the quality of full multi-head attention.
What came before it. Standard multi-head attention gave each head its own K and V. That was expressive but expensive for long-context inference.
Pros / cons. MQA is the most memory-efficient; GQA gives a better quality/efficiency compromise; MHA is the most expressive. The trade-off is always quality versus memory and bandwidth.
4.10 FFN / MLP
What it is. In a Transformer block, the feed-forward network acts independently on each token: [ \mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2. ] Many papers use MLP and FFN almost interchangeably for this sub-layer.
What problem it solves. Attention mixes information across positions; the FFN mixes information across features within each position, increasing representational capacity.
What came before it. Earlier sequence models also used per-step MLPs, but the Transformer standardized the “attention + position-wise FFN” decomposition.
Pros / cons. FFNs are simple, expressive, and highly optimized on hardware. The downside is high compute and memory cost, which is why MoE and gated variants became attractive.
4.11 ReLU / GELU / SwiGLU
What it is.
- ReLU: (\mathrm{ReLU}(x) = \max(0, x)).
- GELU: (\mathrm{GELU}(x) = x\,\Phi(x)), where (\Phi) is the standard normal CDF.
- SwiGLU: a gated FFN, often [ \mathrm{SwiGLU}(x) = (xW_v) \odot \mathrm{swish}(xW_g), ] with (\mathrm{swish}(u) = u \cdot \sigma(u)), followed by an output projection.
What problem it solves. These define the nonlinearity inside FFNs. GELU smooths ReLU’s hard threshold; SwiGLU adds multiplicative gating, which often improves parameter efficiency.
What came before it. ReLU dominated older deep nets; GELU and gated units arrived later in Transformers.
Pros / cons. ReLU is cheap but less smooth. GELU is usually better behaved in Transformers. SwiGLU is strong in modern LLMs but costs extra parameters and compute.
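The three nonlinearities above are short enough to write out exactly. A scalar/elementwise sketch using the exact GELU ((x\,\Phi(x)) via the error function), not the tanh approximation some implementations use:

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), Phi = standard normal CDF via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    # swish(u) = u * sigmoid(u)
    return x / (1.0 + math.exp(-x))

def swiglu(x_v, x_g):
    """Elementwise gated unit on pre-projected vectors: value * swish(gate).
    The projections xW_v and xW_g are assumed to have been applied already."""
    return [v * swish(g) for v, g in zip(x_v, x_g)]
```

Note how GELU smooths the ReLU kink: it is strictly positive-sloped for large |x| yet passes a small negative signal near zero, while SwiGLU's gate lets one projection multiplicatively modulate the other.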
4.12 LN / RMSNorm
What it is.
- LayerNorm: [ \mathrm{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta. ]
- RMSNorm: [ \mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d} \sum_{i} x_i^2 + \epsilon}}. ] RMSNorm removes the mean-subtraction term.
What problem it solves. Normalization stabilizes optimization and keeps activation scales controlled through depth.
What came before it. BatchNorm works well in vision but is awkward for variable-length autoregressive sequence models. LayerNorm became the standard for Transformers.
Pros / cons. LayerNorm is robust and familiar. RMSNorm is simpler and often slightly cheaper, but it changes the inductive bias because it does not center activations.
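The difference between the two formulas is a single subtraction, which a direct sketch makes visible (plain lists rather than tensors; (\gamma), (\beta), and (\epsilon) as in the definitions above):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x): center by the mean, scale by the standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm(x): no mean subtraction, scale by the root-mean-square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]
```

With unit (\gamma) and zero (\beta), LayerNorm output always has (approximately) zero mean, whereas RMSNorm only rescales the vector, preserving its direction; that is exactly the change of inductive bias mentioned above.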
4.13 RoPE
What it is. Rotary Position Embedding rotates each 2D pair of query/key coordinates by an angle depending on position (p): [ R_\theta(p) = \begin{bmatrix} \cos(\theta p) & -\sin(\theta p)\\ \sin(\theta p) & \cos(\theta p) \end{bmatrix}. ] Applying this blockwise to (Q) and (K) makes the attention score depend naturally on relative position differences.
What problem it solves. It injects positional information into attention without adding learned absolute position vectors, and it extrapolates better than many older positional embeddings.
What came before it. Sinusoidal absolute encodings and learned absolute embeddings.
Pros / cons. RoPE is elegant and widely effective. The downside is that very long-context extrapolation can degrade, which motivated scaling and interpolation tricks.
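The relative-position property can be checked numerically. A minimal sketch applying the blockwise 2D rotation to a vector, assuming the common frequency schedule (\theta_k = \mathrm{base}^{-2k/d}) with base 10000:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of `vec` by position-dependent
    angles theta = base**(-2k/d) for pair index k (here k = i // 2)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # = base**(-2k/d) with k = i // 2
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each 2D rotation is norm-preserving and rotations compose additively, the score (\langle \mathrm{rope}(q, m), \mathrm{rope}(k, n)\rangle) depends only on the offset (m - n): shifting both positions by the same amount leaves it unchanged.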
4.14 ALiBi
What it is. ALiBi (Attention with Linear Biases) adds a head-specific distance penalty directly to attention scores: [ B^{(h)}_{ij} = -m_h\,(i - j), \qquad j \le i. ] The farther a token is in the past, the larger the negative bias.
What problem it solves. It encourages recency while allowing the model to extrapolate to longer contexts than those seen during training.
What came before it. Absolute position embeddings and rotary encodings.
Pros / cons. ALiBi is simple and extrapolates well. The downside is that its bias is rigidly linear, which may be less expressive than richer positional schemes.
4.15 PI
What it is. Position Interpolation (PI) rescales positions from a longer inference context into the range seen in training. A common idea is [ p' = p \cdot \frac{L_{\mathrm{train}}}{L_{\mathrm{test}}}, ] so that a test-time position (p) is “compressed” into the trained interval.
What problem it solves. RoPE-like models often fail when positions exceed the trained range. Interpolation extends usable context without full retraining.
What came before it. The baseline was simply applying the original positional scheme beyond training length and hoping it generalizes.
Pros / cons. It is much cheaper than retraining from scratch. The downside is that it is a hack around a mismatch, so quality usually degrades as extrapolation grows.
4.16 DCA
What it is. Dual Chunk Attention partitions a long sequence into chunks and combines at least two attention patterns: local within-chunk attention and a coarser cross-chunk interaction. If chunk size is (c) and sequence length is (n), the goal is to replace (O(n^2)) attention with something closer to (O(nc)) plus a smaller cross-chunk term.
What problem it solves. Full attention becomes too expensive for very long contexts. Chunked schemes preserve most relevant local interactions while still allowing some long-range communication.
What came before it. Full self-attention, sliding-window attention, and various sparse attention patterns.
Pros / cons. DCA can extend context length at lower cost. The downside is approximation: some token-to-token interactions are no longer modeled exactly.
4.17 MoE
What it is. A Mixture of Experts block has multiple expert FFNs (E_1, \dots, E_M) and a router (g(x)). For top-(k) routing, [ y = \sum_{e \in \mathrm{TopK}(g(x),\, k)} \omega_e(x)\,E_e(x), ] where (\omega_e(x)) are normalized routing weights.
What problem it solves. It increases model capacity without activating all parameters on every token. Total parameters can be huge while per-token compute stays relatively modest.
What came before it. Dense FFNs in every layer activated all parameters on every token.
Pros / cons. MoE gives strong scaling efficiency. The costs are routing instability, load balancing complexity, and more difficult distributed serving.
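Top-(k) routing is easy to sketch concretely. A toy version in which "experts" are plain scalar functions and the router's logits are given directly; real MoE layers compute (g(x)) from the token, batch tokens per expert, and add load-balancing losses, none of which is shown here:

```python
import math

def top_k_route(logits, k):
    """Pick the k experts with the highest router logits and
    softmax-normalize the weights over the selected experts only."""
    idx = sorted(range(len(logits)), key=lambda e: -logits[e])[:k]
    mx = max(logits[e] for e in idx)
    exps = {e: math.exp(logits[e] - mx) for e in idx}
    z = sum(exps.values())
    return {e: exps[e] / z for e in idx}  # normalized omega_e

def moe_forward(x, experts, logits, k=2):
    """y = sum over selected experts of omega_e * E_e(x)."""
    weights = top_k_route(logits, k)
    return sum(w * experts[e](x) for e, w in weights.items())

experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
y = moe_forward(3.0, experts, logits=[0.0, 1.0, -1.0], k=2)
```

Only two of the three experts run for this token; the third expert's parameters contribute capacity to the model without costing compute here, which is the whole point of sparse activation.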
4.18 DeepNorm
What it is. DeepNorm is a depth-aware residual scaling scheme. The exact constants depend on architecture, but the principle is [ x_{\ell+1} = \mathrm{LN}\!\left( \alpha_L\, x_\ell + F_\ell(x_\ell) \right), ] with (\alpha_L) chosen as a function of network depth (L) so gradients remain stable in very deep Transformers.
What problem it solves. Naively deep Transformers become numerically unstable; residual branches can explode or vanish.
What came before it. Standard post-norm and pre-norm Transformers, often with carefully tuned learning rates and warmups.
Pros / cons. It helps push depth much further. The downside is that it is less conceptually simple than standard normalization layouts and is mostly useful in very deep regimes.
4.19 ReZero
What it is. ReZero introduces a learnable scalar (\alpha_\ell) on each residual branch: [ x_{\ell+1} = x_\ell + \alpha_\ell\,F(x_\ell), \qquad \alpha_\ell = 0 \text{ at initialization}. ] At initialization the network behaves nearly like the identity map.
What problem it solves. It stabilizes early optimization by preventing large residual perturbations before useful features are learned.
What came before it. Standard residual connections with fixed coefficient 1.
Pros / cons. It is minimal and elegant. The downside is that it is not the dominant recipe in modern LLMs, where pre-norm and RMSNorm-style designs are more common.
5 Training, adaptation, and alignment
5.1 Pretraining and CPT
What it is. Pretraining usually means minimizing CLM or MLM loss on a broad corpus. Continued pretraining (CPT, or mid-training) resumes the same objective on a more targeted distribution: [ \min_\theta\; \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{target}}}\!\big[ \mathcal{L}_{\mathrm{CLM}}(x; \theta) \big]. ]
What problem it solves. Pretraining builds general language competence. CPT shifts the model’s prior toward a domain, style, language mix, or context-length regime without needing full retraining.
What came before it. Domain adaptation used to rely more heavily on supervised finetuning or feature engineering after generic pretraining.
Pros / cons. CPT is powerful for domain knowledge and tokenizer/domain mismatch. The downside is distribution drift: if overdone, it can damage general capabilities or destabilize later finetuning.
5.2 SFT and masking
What it is. Supervised Fine-Tuning trains on instruction or dialogue pairs ((x,y)) using [ \mathcal{L}_{\mathrm{SFT}} = -\sum_{t \in \mathcal{M}} \log p_\theta(y_t \mid x, y_{<t}), ] where (\mathcal{M}) often includes only assistant tokens. This “loss masking” means user tokens are part of the context but do not receive gradient.
What problem it solves. It teaches response format, role behavior, tool-call schema usage, refusal style, and instruction following.
What came before it. Pure pretraining could continue text but was weak at following explicit instructions or chat conventions.
Pros / cons. SFT is cheap and high-leverage. The downside is that heavy SFT can overfit style or compress the model’s behavior too aggressively if the supervision distribution is narrow.
5.3 PEFT / LoRA
What it is. Parameter-efficient finetuning updates only a small subset of parameters. In LoRA, instead of changing a weight matrix (W), one learns a low-rank update [ W' = W + BA, ] where (A \in \mathbb{R}^{r \times d_{\mathrm{in}}}), (B \in \mathbb{R}^{d_{\mathrm{out}} \times r}), and (r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})).
What problem it solves. Full finetuning is expensive in memory, storage, and deployment. LoRA gives task adaptation with a much smaller trainable footprint.
What came before it. Full-model finetuning, prefix tuning, and adapter layers.
Pros / cons. PEFT is cheap and operationally convenient. The downside is that low-rank updates may be insufficient for large distribution shifts, and many adapters can complicate deployment and regression testing.
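Two facts about LoRA are worth making concrete: the forward pass never needs to materialize the product (BA), and the trainable footprint shrinks from (d_{\mathrm{out}} d_{\mathrm{in}}) to (r(d_{\mathrm{in}} + d_{\mathrm{out}})). A toy sketch with plain nested lists (the 4096-dimensional example is a hypothetical, typical attention-projection size):

```python
def matvec(M, v):
    """Dense matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    """W' x = W x + B (A x): apply the rank-r update without forming BA."""
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))   # r-dim bottleneck, then back up
    return [b + l for b, l in zip(base, low)]

def lora_param_counts(d_in, d_out, r):
    """Trainable parameters: full finetune vs. LoRA adapter."""
    return d_out * d_in, r * (d_in + d_out)

# Tiny example: W = I (2x2), rank-1 update with A (1x2), B (2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]
B = [[2.0], [0.0]]
y = lora_forward(W, A, B, [1.0, 1.0])

full, lora = lora_param_counts(4096, 4096, r=8)
```

For a square 4096-dimensional projection at rank 8, the adapter trains 256x fewer parameters than a full update of (W).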
5.4 DPO / ORPO
What it is. Direct Preference Optimization (DPO) learns from preference pairs ((y^+, y^-)) by increasing the relative log-probability of preferred outputs versus rejected ones, typically with a reference model: [ \mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right). ] ORPO is a related offline objective using an odds-ratio style preference term together with language-model loss.
What problem it solves. It aligns style and preferences without running online RL rollouts.
What came before it. RLHF with explicit reward models and PPO was the standard preference-alignment pipeline.
Pros / cons. DPO/ORPO are simpler and cheaper than full RLHF. The downside is that they still depend on the quality of preference data and cannot directly optimize long-horizon interactive behavior.
5.5 RLHF / PPO
What it is. Reinforcement Learning from Human Feedback learns a policy (\pi_\theta) that maximizes reward while staying near a reference policy: [ \max_\theta\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[ r_\phi(x,y) \big] - \beta\,\mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big). ] PPO optimizes a clipped surrogate objective using policy ratios [ \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. ]
What problem it solves. RLHF can push behavior beyond imitation, especially when reward depends on multi-step interaction, safety, or nuanced human preference.
What came before it. SFT and supervised preference classification without policy optimization.
Pros / cons. RLHF is powerful but expensive, unstable, and reward-sensitive. Poor reward models produce reward hacking rather than genuinely better behavior.
5.6 GRPO
What it is. Group Relative Policy Optimization is an RL-style update that normalizes rewards within a sampled group. If one samples (K) outputs with rewards (r_1, \dots, r_K), a simple group-relative advantage is [ A_i = \frac{r_i - \mathrm{mean}(r_{1:K})}{\mathrm{std}(r_{1:K})}, ] and policy updates use (A_i) instead of a separately learned critic.
What problem it solves. It reduces the engineering burden of value-function learning and can work well when a verifier gives relative quality signals.
What came before it. PPO-style actor–critic RL with explicit value estimation.
Pros / cons. It is simpler than full actor–critic RL. The downside is that it still depends critically on reward design, and group-normalized rewards are not a cure for bad verifiers.
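The group-relative advantage is just a within-group z-score. A minimal sketch, with a small (\epsilon) in the denominator (an assumption to guard against zero-variance groups, as when every sampled answer gets the same reward):

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Z-score rewards within one sampled group of K rollouts:
    A_i = (r_i - mean(r)) / std(r). No learned critic is needed."""
    K = len(rewards)
    mean = sum(rewards) / K
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / K)
    return [(r - mean) / (std + eps) for r in rewards]

# Binary verifier rewards over K = 4 sampled answers to one prompt.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct samples get positive advantage and incorrect ones negative, with the advantages summing to zero within the group; a group where the verifier scores everything identically yields no learning signal at all, which is one reason verifier quality still matters.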
5.7 Verifier / RM / PRM / ORM
What it is.
- A reward model (RM) learns a scalar (r_\phi(x,y)).
- An outcome reward model (ORM) scores final outputs only.
- A process reward model (PRM) scores intermediate reasoning steps (z_{1:T}), for example [ r_{\mathrm{PRM}}(x, z_{1:T}) = \sum_{t=1}^{T} r_t(x, z_{\le t}). ]
- A verifier is the broader class: unit tests, rule checks, symbolic solvers, or learned judges.
What problem it solves. Generation quality often cannot be read off log-probability alone. Verifiers give an external correctness or preference signal.
What came before it. Pure next-token likelihood and human inspection.
Pros / cons. Verifiers are central to reliability and test-time scaling. The downside is that learned judges can be noisy or gameable, while deterministic verifiers usually cover only narrow domains.
5.8 CoT
What it is. Chain-of-Thought can be formalized as introducing a latent reasoning trace (z): [ p(y \mid x) = \sum_z p(y \mid z, x)\,p(z \mid x). ] In practice we sample or generate an explicit (z) in text and then condition the answer on it.
What problem it solves. It helps the model decompose complex tasks into intermediate steps rather than mapping input to output in one jump.
What came before it. Direct answer-only prompting or single-step classification/generation.
Pros / cons. CoT can improve reasoning and make errors easier to inspect. The downside is verbosity, possible hallucinated reasoning, and the fact that visible reasoning text is not guaranteed to reflect the model’s true internal computation.
5.9 Test-time scaling
What it is. Test-time scaling spends extra inference compute to improve answers without changing weights. A generic form is [ y_1, \dots, y_K \sim \pi_\theta(\cdot \mid x), \qquad y^* = \operatorname*{arg\,max}_{i} s(x, y_i), ] where (s) may be a verifier score, unit test outcome, or reranker.
What problem it solves. It improves reliability when training is fixed and retraining is expensive.
What came before it. Single-sample greedy or temperature decoding.
Pros / cons. It often gives fast gains. The downside is higher inference cost and dependence on a trustworthy scoring function.
5.10 MTP
What it is. Multi-Token Prediction trains the model to predict several future tokens per position: [ \mathcal{L}_{\mathrm{MTP}} = -\sum_{t=1}^{n} \sum_{k=1}^{K} \lambda_k \log p_\theta(x_{t+k} \mid x_{\le t}). ]
What problem it solves. Standard CLM predicts only one next token per hidden state. MTP encourages representations that carry information about several future steps and can support faster decoding designs.
What came before it. Ordinary next-token CLM with (K=1).
Pros / cons. It may improve efficiency or yield richer predictive structure. The downside is objective mismatch with standard autoregressive generation and added training complexity.
5.11 PPL
What it is. Perplexity on tokens (t_{1:n}) is [ \mathrm{PPL} = \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \log p(t_i \mid t_{<i}) \right). ]
What problem it solves. It gives a simple intrinsic metric for next-token prediction quality.
What came before it. Cross-entropy is the underlying quantity; perplexity is just its exponential, making the scale more interpretable.
Pros / cons. PPL is easy to compute and compare within a fixed tokenizer/eval setup. The downside is that lower perplexity does not always translate to better instruction following, reasoning, or tool use.
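The formula is a one-liner given per-token log-probabilities, and a quick sanity check makes the scale interpretable:

```python
import math

def perplexity(logprobs):
    """exp of the mean negative log-probability of the tokens."""
    n = len(logprobs)
    return math.exp(-sum(logprobs) / n)

# If every token is predicted with probability 1/4, PPL is exactly 4:
# the model is "as confused as" a uniform choice over 4 options per step.
ppl = perplexity([math.log(0.25)] * 10)
```

This also shows why perplexities are only comparable within a fixed tokenizer: the same text segmented into more, easier tokens can yield a lower PPL without the model being any better.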
6 Systems and serving
6.1 FLOPs
What it is. FLOPs counts floating-point operations. For a matrix multiplication of (A \in \mathbb{R}^{m \times k}) with (B \in \mathbb{R}^{k \times n}), a rough count is [ 2mkn. ]
What problem it solves. It estimates compute cost and helps reason about scaling laws and hardware requirements.
What came before it. Wall-clock timing alone, without separating algorithmic cost from hardware effects.
Pros / cons. FLOPs are useful for first-order comparison. The downside is that actual latency can be dominated by memory bandwidth, kernel launch overhead, or communication rather than arithmetic count alone.
6.2 GEMM / GEMV / HBM
What it is.
- GEMM: matrix–matrix multiply, typically high arithmetic intensity.
- GEMV: matrix–vector multiply, lower arithmetic intensity.
- HBM: high-bandwidth GPU memory, often the bottleneck when moving KV cache or weights.
What problem it solves. These concepts explain why prefill and decode behave differently on hardware. Prefill is often dominated by large GEMMs; decode at small batch sizes often looks more like repeated GEMV-style or memory-bound operations.
What came before it. Cruder descriptions such as “the model is slow.”
Pros / cons. This decomposition is excellent for diagnosis. The downside is that real kernels are fused and more complex than textbook GEMM/GEMV categories, so the mental model is approximate.
6.3 DDP / TP
What it is.
- Distributed Data Parallel (DDP): each worker holds a full copy of parameters and computes gradients on different mini-batches, then all-reduces: [ g = \frac{1}{N} \sum_{i=1}^{N} g_i. ]
- Tensor Parallelism (TP): split large tensors across devices, e.g. (W = [W_1, \dots, W_P]), and combine partial results during forward/backward passes.
What problem it solves. DDP scales batch throughput; TP lets a model fit or execute when single-device memory is insufficient.
What came before it. Single-GPU training/inference, and earlier model-parallel schemes with coarser sharding.
Pros / cons. DDP is conceptually simple but requires each replica to fit in memory. TP fits larger models but adds communication overhead and can hurt latency.
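The DDP all-reduce can be illustrated without a distributed framework. A toy sketch in which each "worker" is just a list of per-parameter gradients:

```python
def all_reduce_mean(worker_grads):
    """Simulated DDP gradient all-reduce: average per-parameter
    gradients across N workers, g = (1/N) * sum_i g_i.
    `worker_grads` is a list of equal-length gradient vectors."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]
```

In a real system this averaging happens in a collective communication call (e.g. ring all-reduce) overlapped with the backward pass, but the arithmetic is exactly this.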
6.4 Prefill vs decode
What it is. Prefill processes the input prompt in parallel. Decode then generates output one token at a time. Roughly, prefill computes attention over all prompt tokens, while decode reuses the KV cache and appends one new step.
What problem it solves. This distinction explains why different optimizations matter at different times: prefill often wants larger efficient matrix operations; decode wants low-latency memory movement and scheduling.
What came before it. Treating inference latency as one undifferentiated number.
Pros / cons. Separating the two gives much better performance diagnosis. The downside is that some optimizations improve one phase while hurting the other, so global tuning becomes a multi-objective problem.
6.5 TTFT / TPOT / ITL
What it is.
- TTFT: time to first token.
- TPOT: time per output token.
- ITL: inter-token latency. A useful decomposition of end-to-end latency is [ \text{latency} \approx \text{queueing} + \text{TTFT} + (n_{\text{out}} - 1) \cdot \text{TPOT}. ]
What problem it solves. These are user-facing latency metrics. TTFT measures responsiveness; TPOT/ITL measure streaming smoothness.
What came before it. Aggregate end-to-end latency without separating prompt processing from generation speed.
Pros / cons. They are operationally meaningful. The downside is that optimizing one can worsen another if scheduling is not carefully designed.
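These metrics fall directly out of per-token timestamps. A minimal sketch, assuming the client or server records an arrival time for each streamed token:

```python
def latency_metrics(request_start: float, token_times: list) -> tuple:
    """TTFT = first-token time minus request start; ITLs are the gaps
    between consecutive tokens; TPOT here is reported as the mean ITL."""
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(itls) / len(itls) if itls else 0.0
    return ttft, tpot
```

In practice one reports percentiles (p50/p99) of these quantities rather than means, since tail behavior is what users notice.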
6.6 FlashAttention
What it is. FlashAttention computes exact softmax attention [ \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V ] in tiles, keeping partial softmax statistics on-chip rather than materializing the full (QK^\top) matrix in HBM.
What problem it solves. Standard attention moves too much data to and from GPU memory. FlashAttention reduces memory traffic while preserving exact results.
What came before it. Naive attention kernels that explicitly wrote large intermediate matrices to HBM.
Pros / cons. It gives major speed and memory wins, especially in prefill and long-context workloads. The downside is implementation complexity and the fact that decode can still be bottlenecked elsewhere.
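The core numerical trick is the online softmax: partial results can be rescaled as new tiles arrive, so the full score vector never needs to exist at once. A single-query, pure-Python sketch of the numerics (the real kernel fuses this into on-chip tiles and handles batching, heads, and masking):

```python
import math

def attention_one_query(q, keys, values, tile=2):
    """Exact softmax attention for one query, computed tile by tile
    with running softmax statistics (max m, normalizer l)."""
    d = len(q)
    m = float("-inf")           # running max of scores seen so far
    l = 0.0                     # running softmax normalizer
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        ks = keys[start:start + tile]
        vs = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
        m_new = max(m, max(scores))
        # Rescale earlier contributions to the new reference max.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - m_new)
            l += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

Each tile updates a running max, normalizer, and weighted accumulator; rescaling by (\exp(m - m_{\text{new}})) keeps earlier tiles consistent, so the result matches the naive computation exactly.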
6.7 PagedAttention
What it is. PagedAttention stores KV cache in fixed-size blocks and uses a page table to map logical token positions to physical blocks in memory. It is the inference analogue of virtual memory ideas.
What problem it solves. Variable-length requests fragment KV-cache memory badly. Paging allows reuse, compaction, and preemption.
What came before it. Monolithic contiguous KV allocations per request.
Pros / cons. It is a major enabler for high-concurrency serving. The downside is more complex memory management and scheduler design.
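The page-table indirection is the whole idea. A minimal sketch of the logical-to-physical mapping (block size and table layout are illustrative):

```python
def logical_to_physical(position: int, page_table: list, block_size: int) -> tuple:
    """Map a logical token position in a request's KV cache to a
    (physical_block, offset) pair via the request's page table,
    exactly as virtual memory maps pages to frames."""
    block_idx, offset = divmod(position, block_size)
    return page_table[block_idx], offset
```

Because physical blocks need not be contiguous, freed blocks from finished requests can be handed to new requests immediately, and shared prefixes can point multiple page tables at the same physical blocks.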
6.8 Continuous batching
What it is. Instead of waiting to form static batches, the server dynamically inserts requests into an active batch whenever slots free up. Formally, the active set (B_t) changes over decode time.
What problem it solves. Static batching wastes slots and increases queueing delay under real traffic.
What came before it. Fixed batches formed at batch boundaries.
Pros / cons. It usually improves throughput and tail latency. The downside is more complicated scheduling, fairness, and memory accounting.
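A toy decode-loop simulation shows the effect: freed slots are refilled immediately rather than at batch boundaries. The step model and names are illustrative:

```python
from collections import deque

def continuous_batching(requests, max_slots):
    """Toy decode loop: the active batch B_t is refilled from the wait
    queue whenever a slot frees up. `requests` maps request id to the
    number of tokens it must generate. Returns total decode steps."""
    queue = deque(requests.items())
    active = {}
    steps = 0
    while queue or active:
        while queue and len(active) < max_slots:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed; refilled next iteration
    return steps
```

With `{"a": 3, "b": 1, "c": 2}` and two slots this finishes in 3 decode steps, whereas static batches formed at batch boundaries would take 5.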
6.9 Chunked prefill
What it is. A long prompt of length (n) is split into chunks of size (c), and prefill proceeds chunk by chunk instead of forcing one giant prefill step.
What problem it solves. Very long prompts can block shorter interactive requests and create head-of-line latency spikes.
What came before it. Full-prompt prefill as a single monolithic step.
Pros / cons. It improves fairness and p99 latency. The downside is scheduler complexity and possible slight throughput loss if chunking is too aggressive.
6.10 Prefix caching
What it is. If many requests share the same prefix (p) (system prompt, template, code context), cache the prefix KV once and reuse it for all requests with that exact prefix.
What problem it solves. Recomputing the same prefill over and over wastes compute and inflates TTFT.
What came before it. Full recomputation of shared prompts on every request.
Pros / cons. It can dramatically reduce TTFT in templated workloads. The downside is cache invalidation: the cache key must include model version, adapter, tokenizer, and prompt template.
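The invalidation point deserves code: the cache key must cover everything that changes the computed KV. A minimal sketch of key construction (field names are illustrative):

```python
import hashlib
import json

def prefix_cache_key(model_version, adapter, tokenizer_id, prefix_token_ids):
    """Cache key for a shared prefix's KV entries. Anything that changes
    the computed KV (model weights, adapter, tokenizer) must be part of
    the key, or stale entries will be served silently."""
    payload = json.dumps(
        {"model": model_version, "adapter": adapter,
         "tokenizer": tokenizer_id, "prefix": prefix_token_ids},
        sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Hashing a canonical serialization (sorted keys) ensures the same logical prefix always maps to the same key, while any change to model, adapter, or tokenizer produces a different one.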
6.11 Speculative decoding
What it is. A fast draft model proposes several tokens, and a slower target model verifies them. If the target agrees with the draft prefix, multiple tokens are accepted in one expensive step.
What problem it solves. It reduces the number of slow target-model decode steps.
What came before it. Pure target-model autoregressive decoding one token at a time.
Pros / cons. It can produce real speedups when draft acceptance is high. The downside is that gains vanish if the draft model is weak, if batching is unfavorable, or if verification overhead dominates.
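A greedy-verification sketch of one speculative round. Note this is a simplification: the production scheme uses rejection sampling to preserve the target's sampling distribution; here both models are treated as deterministic next-token functions, which are illustrative stand-ins:

```python
def speculative_step(draft_next, target_next, context, k):
    """One speculative round: the draft proposes k tokens, the target
    accepts the longest agreeing prefix, then emits one token of its
    own. `draft_next`/`target_next` map a context tuple to a token."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(tuple(ctx))
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(tuple(ctx)) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(tuple(ctx)))  # target's own token
    return accepted
```

The payoff: one target verification pass can yield up to (k + 1) tokens, so the expected speedup scales with the draft acceptance rate.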
6.12 Guided decoding
What it is. At step (t), instead of allowing all vocabulary tokens (V), guided decoding restricts sampling to a valid set (V_t^{\text{valid}} \subseteq V) defined by a grammar, parser state, or JSON schema: [ p'(v) \propto p(v)\,\mathbb{1}[v \in V_t^{\text{valid}}]. ]
What problem it solves. It enforces syntactic validity for structured outputs, especially tool calls and JSON.
What came before it. Free-form text generation followed by post-hoc parsing and repair.
Pros / cons. It greatly improves validity. The downside is that it does not solve semantic correctness: a valid JSON object can still contain the wrong tool or wrong arguments.
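The masking rule (p'(v) \propto p(v)\,\mathbb{1}[v \in V_t^{\text{valid}}]) is a one-liner over logits. A minimal sketch; the hard part in practice is computing the valid set from a grammar or schema, which is elided here:

```python
import math

def guided_sample_greedy(logits, valid_ids):
    """Greedy guided decoding: argmax restricted to the valid set."""
    return max(valid_ids, key=lambda v: logits[v])

def guided_probs(logits, valid_ids):
    """Renormalized next-token probabilities over the valid set only,
    i.e. p'(v) proportional to exp(logit_v) * 1[v in valid_ids]."""
    zs = {v: math.exp(logits[v]) for v in valid_ids}
    total = sum(zs.values())
    return {v: z / total for v, z in zs.items()}
```

Masking before the softmax (rather than filtering sampled tokens after the fact) guarantees every emitted token is valid without rejection loops.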
6.13 Admission control / SRPT
What it is.
- Admission control: limit active requests so memory and latency remain stable.
- SRPT (Shortest Remaining Processing Time): prioritize requests with the least remaining work; in queueing theory this minimizes mean completion time under ideal assumptions.
What problem it solves. Without request gating, decode workloads can oversubscribe KV memory and collapse tail latency.
What came before it. FIFO queues and uncontrolled concurrency.
Pros / cons. These tools stabilize production systems. The downside is fairness: aggressively favoring short jobs can starve long requests unless the scheduler is modified.
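SRPT itself is just a sort key over estimated remaining work. A sketch, with the starvation caveat above left to the caller:

```python
def srpt_order(requests):
    """Order (request_id, remaining_tokens) pairs shortest-remaining-first.
    For decode, remaining work is often estimated as max_tokens minus
    tokens generated so far."""
    return [rid for rid, remaining in sorted(requests, key=lambda r: r[1])]
```

A common production variant adds an aging term to the key so long-running requests eventually rise in priority instead of starving.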
6.14 Quantization / weight-only quant / KV quant / pruning / sparsity / distillation
What it is.
- Quantization: replace high-precision values by low-bit approximations, e.g. [ \hat{w} = s(q - z), \quad q \in \{0, \dots, 2^b - 1\}. ]
- Weight-only quantization: quantize weights but keep activations relatively high precision.
- KV quantization: quantize the cache.
- Pruning / sparsity: apply a binary mask (m) so (W' = m \odot W), often with (\|m\|_0) constrained.
- Distillation: train a smaller student on teacher outputs, often via KL: [ \mathcal{L}_{\text{distill}} = \mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\text{student}}\big). ]
What problem it solves. These are the main compression tools for reducing memory, latency, or serving cost.
What came before it. Full-precision dense models with no teacher compression.
Pros / cons. Compression improves deployability. The downside is quality loss, calibration drift, and sometimes hardware-specific implementation burden.
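The affine rule (\hat{w} = s(q - z)) in a round trip, for one weight group. A minimal sketch of asymmetric uniform quantization (per-group scaling, clipping, and rounding choices vary by method):

```python
def quantize(ws, bits=8):
    """Asymmetric uniform quantization: w ~ s*(q - z), with the scale s
    and zero point z derived from the group's min/max range."""
    lo, hi = min(ws), max(ws)
    qmax = 2 ** bits - 1
    s = (hi - lo) / qmax if hi > lo else 1.0
    z = round(-lo / s)
    qs = [max(0, min(qmax, round(w / s) + z)) for w in ws]
    return qs, s, z

def dequantize(qs, s, z):
    """Recover approximate weights from integer codes."""
    return [s * (q - z) for q in qs]
```

The reconstruction error per weight is bounded by roughly one quantization step (s), which is why narrower per-group ranges (smaller (s)) preserve quality better.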
6.15 AWQ
What it is. Activation-aware Weight Quantization chooses weight scales using activation statistics, aiming to minimize the output error actually seen by the network, not just raw weight error. A good mental objective is [ \min_{\hat{W}} \|A(W - \hat{W})\|_F^2, ] where (A) represents representative activations.
What problem it solves. Naive low-bit quantization can destroy important channels, especially those amplified by activation outliers.
What came before it. Simpler post-training quantization that optimized weight reconstruction more uniformly, with less attention to activation sensitivity.
Pros / cons. AWQ often preserves quality better at low bits. The downside is calibration complexity and dependence on representative activation samples.
7 Retrieval and agents
7.1 BM25
What it is. BM25 is a sparse retrieval score: [ \mathrm{BM25}(q,d) = \sum_{w \in q} \mathrm{IDF}(w)\, \frac{f(w,d)\,(k_1 + 1)}{f(w,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}. ] Here (f(w,d)) is term frequency, (|d|) is document length, and (\mathrm{avgdl}) is the average document length in the corpus.
What problem it solves. It ranks documents by exact-term relevance while correcting for document length.
What came before it. tf-idf and earlier probabilistic retrieval formulas.
Pros / cons. BM25 is extremely strong for IDs, names, codes, and exact keywords. The downside is weak semantic matching when relevant documents use different wording.
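The scoring formula, directly as code. A minimal sketch over pre-tokenized documents; the (k_1) and (b) defaults follow common practice, and the (+1) inside the log keeps IDF non-negative for very common terms:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook BM25 over tokenized documents. `corpus` is the list of
    all tokenized docs, used for the IDF and average document length."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for w in query_terms:
        df = sum(1 for d in corpus if w in d)               # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = doc_terms.count(w)                              # term frequency
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```

Production systems precompute document frequencies and use an inverted index rather than scanning the corpus per query, but the scoring arithmetic is this.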
7.2 MRR / nDCG
What it is.
- MRR: [ \mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}, ] where (\mathrm{rank}_i) is the rank of the first relevant result for query (i).
- DCG: [ \mathrm{DCG@}k = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j + 1)}, ] and (\mathrm{nDCG@}k = \mathrm{DCG@}k / \mathrm{IDCG@}k), where (\mathrm{IDCG@}k) is the DCG of the ideal ranking.
What problem it solves. These metrics evaluate retrieval ranking quality, not just binary recall.
What came before it. Precision/recall alone, without considering how early relevant items appear.
Pros / cons. MRR is simple and emphasizes the first hit. nDCG handles graded relevance well. The downside is that they require good relevance labels and can miss downstream answer quality effects.
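Both metrics fit in a few lines. A minimal sketch, assuming 1-based ranks (with `None` for "no relevant result retrieved") and graded relevance labels:

```python
import math

def mrr(first_relevant_ranks):
    """Mean reciprocal rank over queries; ranks are 1-based,
    None if no relevant result was retrieved for that query."""
    return sum(0.0 if r is None else 1.0 / r
               for r in first_relevant_ranks) / len(first_relevant_ranks)

def dcg(rels, k):
    """DCG@k with graded relevance; position j is discounted by log2(j+1)."""
    return sum((2 ** rel - 1) / math.log2(j + 2)
               for j, rel in enumerate(rels[:k]))

def ndcg(rels, k):
    """nDCG@k: DCG of this ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0
```

`rels` here is the relevance grade of each retrieved item in rank order, so a perfectly sorted ranking scores exactly 1.0.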
7.3 RAG
What it is. Retrieval-Augmented Generation conditions the model on retrieved documents: [ p(y \mid x, \mathcal{D}) \approx p_\theta(y \mid x, d_{1:k}), ] where (d_{1:k}) are the top retrieved items. A more probabilistic view is [ p(y \mid x) = \sum_{d \in \mathcal{D}} p(d \mid x)\, p(y \mid x, d), ] approximated by top-(k) retrieval.
What problem it solves. It injects external knowledge at inference time without changing model weights.
What came before it. Pure parametric generation from the model’s internal weights, or fully symbolic retrieval pipelines without neural generation.
Pros / cons. RAG improves freshness, grounding, and domain adaptation. The downside is that retrieval errors, stale indices, or prompt injection can still corrupt the answer.
7.4 Hybrid retrieval
What it is. Hybrid retrieval combines sparse and dense retrieval signals, for example [ s(q,d) = \lambda\, s_{\text{dense}}(q,d) + (1 - \lambda)\, s_{\text{sparse}}(q,d). ]
What problem it solves. Dense retrieval handles semantic similarity; sparse retrieval handles exact term matching. Hybrid retrieval captures both.
What came before it. Pure BM25 or pure embedding retrieval.
Pros / cons. It is usually the strongest default. The downside is more tuning, more infrastructure, and the need to calibrate heterogeneous scores.
7.5 Reranking / cross-encoder
What it is. A reranker applies a stronger scoring function (s_\theta(q,d)) to a small candidate set. A cross-encoder reads query and document jointly: [ s_\theta(q,d) = f_\theta([q; d]). ]
What problem it solves. Initial retrieval retrieves broad candidates efficiently but imperfectly. A stronger second-stage model can reorder them more precisely.
What came before it. Single-stage retrieval only.
Pros / cons. Cross-encoders often produce the biggest relevance jump. The downside is latency and cost: joint scoring is much slower than approximate nearest-neighbor retrieval.
7.6 Agent
What it is. An agent is a policy operating in a loop: [ a_t \sim \pi(s_t), \quad o_t = \mathcal{E}(a_t), \quad s_{t+1} = u(s_t, a_t, o_t), ] where (s_t) is state, (a_t) is an action or tool call, (o_t) is an observation returned by the environment (\mathcal{E}), and (u) updates state.
What problem it solves. Some tasks require iterative decision-making under partial information rather than one-shot generation.
What came before it. Single-prompt inference and deterministic workflows.
Pros / cons. Agents are flexible and can use tools adaptively. The downside is variance, cost, and the need for explicit budgets, state, and termination criteria.
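The loop above, with the budget and termination criteria made explicit. A minimal sketch in which the policy, environment, and "stop" action are illustrative stand-ins:

```python
def run_agent(policy, env_step, state, max_steps=10):
    """Minimal agent loop: a_t ~ pi(s_t), o_t = env(a_t), with an
    explicit step budget and a termination action ("stop").
    State update u(s, a, o) here is simply appending to history."""
    trace = []
    for _ in range(max_steps):
        action = policy(state)
        if action == "stop":
            break
        obs = env_step(action)
        trace.append((action, obs))
        state = state + [(action, obs)]
    return state, trace
```

The budget (`max_steps`) and the explicit stop condition are exactly the controls the Pros / cons above calls for; without them, variance in the policy becomes unbounded cost.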
7.7 Workflow vs agent
What it is. A workflow is a fixed directed graph of steps (G=(V,E)). An agent instead chooses the next step dynamically via a policy (\pi(a_t \mid s_t)).
What problem it solves. This distinction clarifies when autonomy is useful. If the step graph is known in advance, an agent may add variance without value.
What came before it. Traditional software workflows, business process engines, and deterministic orchestration systems.
Pros / cons. Workflows are testable and stable; agents are adaptive. The cost of agentic flexibility is harder debugging and a larger safety surface.
7.8 ReAct
What it is. ReAct interleaves reasoning traces (r_t) with actions (a_t): [ (r_t, a_t) \sim \pi(s_t), \quad o_t = \mathcal{E}(a_t), \quad s_{t+1} = u(s_t, r_t, a_t, o_t). ]
What problem it solves. It prevents the model from doing all “thinking” before seeing external evidence; instead it reasons, acts, observes, and revises.
What came before it. Pure Chain-of-Thought without tool interaction, or tool calls without explicit intermediate reasoning.
Pros / cons. ReAct is natural for search, research, and debugging tasks. The downside is that the model can overthink, overact, or get stuck in loops unless budgets and validators exist.
7.9 Tool calling
What it is. Tool calling turns part of generation into structured function invocation. Formally, a tool is a typed map [ f : \mathcal{X} \to \mathcal{Y} \cup \mathcal{E}, ] where (\mathcal{E}) is a structured error space. The model predicts a tool name and arguments; the runtime executes and returns observations.
What problem it solves. LLM parameters are not a trustworthy source of live external state. Tool calling lets the model read or modify the real world through APIs.
What came before it. Pure text generation and fragile regex-based function extraction.
Pros / cons. Tool calling is the foundation of production agents. The downside is that the interface contract and runtime validation matter as much as the model itself.
7.10 Tool contract
What it is. A tool contract is a formal schema for inputs, outputs, side effects, timeouts, and errors. One can view it as a pair of schemas ((S_{\text{in}}, S_{\text{out}})) plus semantics on retries and failures.
What problem it solves. Without a contract, the model guesses argument types and the runtime has no principled way to validate or retry.
What came before it. Informal tool descriptions embedded in prompts, often without strict runtime validation.
Pros / cons. Strong contracts improve reliability, observability, and safe retries. The downside is design effort: tool interfaces must be deliberately engineered, not improvised.
7.11 MCP
What it is. Model Context Protocol is a standard integration layer for exposing tools and resources to agent runtimes with consistent schemas and discovery mechanisms. Abstractly, it standardizes part of the map [ \text{agent runtime} \to \{\text{tools},\ \text{resources},\ \text{prompts}\}. ]
What problem it solves. It reduces one-off integration work and makes tool/resource discovery more uniform across runtimes.
What came before it. Ad hoc tool adapters and provider-specific interfaces.
Pros / cons. MCP improves interoperability. The downside is conceptual confusion: it is an integration standard, not a safety boundary by itself.
7.12 ACI
What it is. Agent–Computer Interface is the formal boundary between the agent and the world: observation space (\mathcal{O}), action space (\mathcal{A}), schemas, affordances, and UX. In control terms, it is the interface through which the policy touches the environment.
What problem it solves. It gives a systems lens for agent design: poor action/observation affordances create failure even if the model is strong.
What came before it. Less explicit notions of “tools” or “UI automation” without a unifying interface concept.
Pros / cons. ACI helps reason about capability and failure modes precisely. The downside is that it is a design abstraction, not a single algorithm one can simply plug in.
7.13 Allowlist
What it is. An allowlist is an application-enforced subset (A_{\text{allow}}(u) \subseteq \mathcal{A}) of actions available to a user, tenant, or run context.
What problem it solves. It constrains blast radius. The model may request any action, but the runtime executes only permitted ones.
What came before it. Relying on prompt instructions like “do not call this tool.”
Pros / cons. Runtime-enforced allowlists are simple and effective. The downside is maintenance burden and the risk of operational drift if permissions are not versioned and audited.
7.14 Idempotency
What it is. A write action with request ID (r) is idempotent if repeating it has the same effect as doing it once: [ f(f(s,r),r) = f(s,r). ]
What problem it solves. Retries are unavoidable under network failure and timeout. Without idempotency, retries can duplicate side effects such as refunds, emails, or ticket creation.
What came before it. Blind retries over non-idempotent endpoints.
Pros / cons. Idempotency makes retry logic safe. The downside is implementation complexity: the server must store and honor request IDs or deduplicate effects correctly.
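Server-side, idempotency usually means storing results keyed by request ID. A minimal sketch; a real implementation also needs expiry and must handle concurrent duplicates:

```python
class IdempotentExecutor:
    """Deduplicate side effects by request id: repeating a write with
    the same id returns the stored result instead of re-executing,
    so f(f(s, r), r) = f(s, r)."""

    def __init__(self):
        self._results = {}

    def execute(self, request_id, effect):
        # Run the side effect only on first sight of this request id.
        if request_id not in self._results:
            self._results[request_id] = effect()
        return self._results[request_id]
```

The agent-side counterpart is to generate the request ID once per logical action (not per retry), so retries after a timeout reuse the same ID.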
7.15 STM / LTM
What it is.
- Short-term memory (STM): task-local state (m_t^{\text{stm}}).
- Long-term memory (LTM): persistent store (M^{\text{ltm}}), usually queried by retrieval [ R(q, M^{\text{ltm}}). ] Together, agent state may be written as (s_t = (m_t^{\text{stm}}, M^{\text{ltm}})).
What problem it solves. It separates immediate working state from cross-session knowledge or preferences.
What came before it. Dumping all history into the prompt every time.
Pros / cons. This separation controls context growth and supports persistence. The downside is memory quality: stale or incorrectly retrieved memories can become a new source of error.
7.16 Prompt injection
What it is. Prompt injection is an adversarial string (\delta) inserted into untrusted context (documents, webpages, emails, tool outputs) so that the conditional model behavior shifts toward forbidden actions or policy violations: [ \delta^* = \arg\max_{\delta} \Pr\big(\text{policy violation} \mid x, \delta\big). ]
What problem it solves. The concept names a major real-world security failure mode in agentic systems: retrieved text is not just “data”; it can act like hostile control input.
What came before it. Jailbreaks through direct user prompts were already known, but RAG and tool use made indirect attacks much more important.
Pros / cons. There is no upside to prompt injection itself; the useful takeaway is defensive design. The main limitation is that mitigation is layered rather than absolute: isolate untrusted text, separate instructions from sources, enforce permissions in the runtime, and validate actions externally.
8 Coverage map
The table below maps source terms from the original taxonomy to the section that explains them here.
| Source term | Covered in this appendix |
|---|---|
| ACI | ACI |
| ALiBi | ALiBi |
| AWQ | AWQ |
| BBPE | BPE / BBPE |
| BM25 | BM25 |
| CBOW | CBOW |
| CLM | CLM |
| CoT | CoT |
| CPT | Pretraining and CPT |
| DCA | DCA |
| DDP | DDP / TP |
| DeepNorm | DeepNorm |
| DPO | DPO / ORPO |
| EOS | EOS |
| FFN | FFN / MLP |
| FLOPs | FLOPs |
| GEMM | GEMM / GEMV / HBM |
| GEMV | GEMM / GEMV / HBM |
| GELU | ReLU / GELU / SwiGLU |
| GQA | MHA / MQA / GQA |
| GRPO | GRPO |
| HBM | GEMM / GEMV / HBM |
| ITL | TTFT / TPOT / ITL |
| KV | KV cache |
| LN | LN / RMSNorm |
| LoRA | PEFT / LoRA |
| LTM | STM / LTM |
| MCP | MCP |
| MLM | MLM |
| MHA | MHA / MQA / GQA |
| MLP | FFN / MLP |
| MoE | MoE |
| MQA | MHA / MQA / GQA |
| MRR | MRR / nDCG |
| MTP | MTP |
| nDCG | MRR / nDCG |
| OOV | Vocabulary, OOV, and fragmentation rate |
| ORPO | DPO / ORPO |
| PEFT | PEFT / LoRA |
| PI | PI |
| PPL | PPL |
| PPO | RLHF / PPO |
| PRM | Verifier / RM / PRM / ORM |
| RAG | RAG |
| ReAct | ReAct |
| ReLU | ReLU / GELU / SwiGLU |
| ReZero | ReZero |
| RNN | RNN |
| RoPE | RoPE |
| RMSNorm | LN / RMSNorm |
| RLHF | RLHF / PPO |
| S2S | S2S |
| SFT | SFT and masking |
| SRPT | Admission control / SRPT |
| STM | STM / LTM |
| SwiGLU | ReLU / GELU / SwiGLU |
| TTFT | TTFT / TPOT / ITL |
| TP | DDP / TP |
| TPOT | TTFT / TPOT / ITL |
| Vocab | Vocabulary, OOV, and fragmentation rate |
| Tokenization | Tokenization |
| BPE / BBPE | BPE / BBPE |
| WordPiece / Unigram LM | WordPiece; Unigram LM |
| Fragmentation rate | Vocabulary, OOV, and fragmentation rate |
| Deduplication | Deduplication |
| Contamination | Contamination |
| Packing | Packing |
| Attention mask / causal attention | Attention mask / causal attention |
| KV cache | KV cache |
| MHA vs MQA vs GQA | MHA / MQA / GQA |
| Pretraining (CLM) | CLM; Pretraining and CPT |
| CPT (mid-training) | Pretraining and CPT |
| Masking (SFT) | SFT and masking |
| PEFT / LoRA | PEFT / LoRA |
| DPO / ORPO | DPO / ORPO |
| RLHF / PPO | RLHF / PPO |
| Verifier / RM | Verifier / RM / PRM / ORM |
| PRM vs ORM | Verifier / RM / PRM / ORM |
| Test-time scaling | Test-time scaling |
| Prefill vs decode | Prefill vs decode |
| TTFT / TPOT / ITL | TTFT / TPOT / ITL |
| FlashAttention | FlashAttention |
| PagedAttention | PagedAttention |
| Continuous batching | Continuous batching |
| Chunked prefill | Chunked prefill |
| Prefix caching | Prefix caching |
| Speculative decoding | Speculative decoding |
| Guided decoding | Guided decoding |
| Admission control | Admission control / SRPT |
| Agent | Agent |
| Workflow vs agent | Workflow vs agent |
| Tool calling | Tool calling |
| Tool contract | Tool contract |
| Allowlist | Allowlist |
| Idempotency | Idempotency |
| Hybrid retrieval | Hybrid retrieval |
| Reranking / cross-encoder | Reranking / cross-encoder |
| Prompt injection | Prompt injection |
| encoder-only | Encoder-only, decoder-only, encoder-decoder, and PrefixLM |
| decoder-only | Encoder-only, decoder-only, encoder-decoder, and PrefixLM |
| encoder-decoder | Encoder-only, decoder-only, encoder-decoder, and PrefixLM |
| PrefixLM | Encoder-only, decoder-only, encoder-decoder, and PrefixLM |
| quantization | Quantization / weight-only quant / KV quant / pruning / sparsity / distillation |
| weight-only quant | Quantization / weight-only quant / KV quant / pruning / sparsity / distillation |
| KV quant | Quantization / weight-only quant / KV quant / pruning / sparsity / distillation |
| pruning / sparsity | Quantization / weight-only quant / KV quant / pruning / sparsity / distillation |
| distillation | Quantization / weight-only quant / KV quant / pruning / sparsity / distillation |