Appendix 1. Taxonomy

Overview

This appendix collects technical terms and acronyms used across the handbook.

The intent is not to be exhaustive, but to provide quick “interview-grade” refreshers.

Acronyms

| Acronym | Expansion | Meaning (one line) |
| --- | --- | --- |
| ACI | Agent–Computer Interface | The boundary between an agent and the systems it can observe/act on (tools, permissions, schemas, UX). |
| ALiBi | Attention with Linear Biases | Adds a linear distance bias in attention to encourage recency and enable long-context extrapolation. |
| AWQ | Activation-aware Weight Quantization | Weight quantization that accounts for activation outliers to reduce quality loss. |
| BBPE | Byte-level BPE | BPE performed over bytes (not Unicode characters), improving multilingual robustness and coverage. |
| BM25 | Best Matching 25 | Classic sparse retrieval ranking function; strong for exact terms (IDs, names) and complements embeddings. |
| CBOW | Continuous Bag of Words | word2vec variant that predicts a target token from its surrounding context tokens. |
| CLM | Causal Language Modeling | Next-token prediction objective with a causal (masked) attention pattern. |
| CoT | Chain-of-Thought | A reasoning-style response pattern; often used for test-time scaling and verification pipelines. |
| CPT | Continued Pretraining | Mid-training that continues next-token training on a target distribution (domain/context priors). |
| DCA | Dual Chunk Attention | Long-context attention scheme using within-chunk + cross-chunk attention. |
| DDP | Distributed Data Parallel | Training strategy that replicates model weights and splits batches across workers. |
| DeepNorm | Deep residual scaling | Residual scaling scheme to stabilize very deep Transformers. |
| DPO | Direct Preference Optimization | Offline alignment on preference pairs (chosen vs rejected) without online RL rollouts. |
| EOS | End of Sequence | Special token signaling sequence termination. |
| FFN | Feed-Forward Network | Per-token MLP block in Transformers (often the main compute cost). |
| FLOPs | Floating-Point Operations | Used to estimate compute cost; often paired with the roofline model. |
| GELU | Gaussian Error Linear Unit | Smooth activation widely used in Transformers. |
| GEMM | General Matrix Multiply | Dominant kernel type during prefill (compute-heavy). |
| GEMV | General Matrix-Vector Multiply | Common shape during decode at small batch sizes (often bandwidth/overhead bound). |
| GQA | Grouped-Query Attention | Shares K/V across groups of query heads to reduce KV cache size with limited quality loss. |
| GRPO | Group Relative Policy Optimization | RL-style update using within-group reward normalization (often avoids a learned critic). |
| HBM | High Bandwidth Memory | GPU memory; decode often becomes HBM-bandwidth bound. |
| ITL | Inter-Token Latency | Latency per generated token; effectively decode speed. |
| KV | Key/Value | Attention cache tensors stored to avoid recomputing past states during decode. |
| LN | Layer Normalization | Normalization applied per token over features to stabilize training. |
| LoRA | Low-Rank Adaptation | PEFT method adding low-rank adapter matrices to update behavior with small trainable deltas. |
| LTM | Long-Term Memory | Persistent storage for agent systems (often implemented with retrieval). |
| MCP | Model Context Protocol | Standard protocol for advertising tools/resources to agent runtimes with consistent schemas (integration layer). |
| MHA | Multi-Head Attention | Attention with separate Q/K/V projections per head (highest KV cost). |
| MLM | Masked Language Modeling | Objective predicting masked tokens using both left and right context (encoder-style). |
| MLP | Multi-Layer Perceptron | Another name for the FFN blocks in Transformers. |
| MoE | Mixture of Experts | Routes tokens to a subset of expert FFNs for higher capacity at lower activated compute. |
| MQA | Multi-Query Attention | Shares K/V across all query heads (minimal KV cache; sometimes a quality hit). |
| MRR | Mean Reciprocal Rank | Retrieval metric emphasizing ranking the first relevant result early (see the sketch after this table). |
| MTP | Multi-Token Prediction | Objective predicting multiple future tokens per step to improve efficiency (model-family dependent). |
| nDCG | normalized Discounted Cumulative Gain | Ranking metric for graded relevance that rewards good ordering near the top (see the sketch after this table). |
| OOV | Out of Vocabulary | Tokens/strings not represented in the tokenizer vocabulary (a tokenization failure mode). |
| ORPO | Odds Ratio Preference Optimization | Offline preference-optimization alignment method, related to DPO-style approaches. |
| PEFT | Parameter-Efficient Fine-Tuning | Fine-tuning methods that update small parameter subsets (e.g., LoRA) instead of full weights. |
| PI | Position Interpolation | Long-context trick that rescales/interpolates positions into the trained range. |
| PPL | Perplexity | Standard LM metric; lower is better for next-token modeling (within a fixed eval setup). |
| PPO | Proximal Policy Optimization | Online RL algorithm used in some RLHF pipelines. |
| PRM | Process Reward Model | Scores intermediate reasoning steps (dense supervision); often used to reduce reward hacking. |
| RAG | Retrieval-Augmented Generation | Retrieve relevant context from a corpus, then generate conditioned on that context. |
| ReAct | Reason + Act | Agent pattern combining reasoning steps with tool calls and observations. |
| ReLU | Rectified Linear Unit | Simple activation function, historically common in FFNs. |
| ReZero | Residual with learnable scale | Stabilizes training by initializing the residual-branch scale at zero. |
| RLHF | Reinforcement Learning from Human Feedback | Alignment approach combining reward modeling and RL (often PPO) to optimize for preferences. |
| RMSNorm | Root Mean Square Normalization | LayerNorm variant that normalizes by RMS (no mean subtraction); common in modern LLMs. |
| RNN | Recurrent Neural Network | Sequence-model family predating Transformers; recurrence ideas resurface in some long-context variants. |
| RoPE | Rotary Position Embedding | Injects positions via complex/2D rotations applied to Q/K so relative shifts emerge naturally. |
| S2S | Sequence-to-Sequence | Encoder-decoder paradigm for conditional generation (translation/summarization). |
| SFT | Supervised Fine-Tuning | Fine-tuning on instruction/chat data (often with user-token loss masking). |
| SRPT | Shortest Remaining Processing Time | Scheduling heuristic; the intuition behind prioritizing short requests to cut tail latency. |
| STM | Short-Term Memory | Session-local memory/state in agent systems. |
| SwiGLU | Swish-Gated Linear Unit | FFN variant using Swish gating; common in modern LLMs. |
| TP | Tensor Parallelism | Splits model tensors across GPUs for training/inference of large models. |
| TPOT | Time per Output Token | Decode-speed metric; closely related to ITL. |
| TTFT | Time to First Token | User-perceived latency, dominated by queueing + prefill. |
| Vocab | Vocabulary | The tokenizer's discrete token set; affects context efficiency and OOV behavior. |
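
Since MRR and nDCG reward different things (rank of the first hit vs graded ordering quality), here is a minimal Python sketch of both under their standard definitions; the toy relevance labels are invented for illustration:

```python
import math

def mrr(ranked_relevant: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for hits in ranked_relevant:
        for rank, is_rel in enumerate(hits, start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevant)

def ndcg(gains: list[int], k: int) -> float:
    """nDCG@k for one query: DCG of the ranking divided by DCG of the ideal ranking."""
    def dcg(gs: list[int]) -> float:
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Toy example: graded relevance of the top-4 retrieved docs (3 = most relevant).
print(mrr([[False, True, False]]))  # first relevant at rank 2 -> 0.5
print(ndcg([0, 3, 1, 0], k=4))      # imperfect ordering -> ~0.66, below 1.0
```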

Interview glossary (what / why / common confusion)

These are the terms that come up most often in interviews for applied LLM work.

Tokenization & data

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Tokenization | Mapping raw text into discrete tokens. | Determines context efficiency, OOV behavior, and even arithmetic brittleness. | "Tokenization is a preprocessing detail." In practice it sets the unit the model predicts. |
| BPE / BBPE | Subword tokenizer that merges frequent pairs; BBPE runs merges over bytes. | BBPE improves coverage across scripts and weird text; token count drives compute and cost. | Confusing "byte-level" with "character-level"; BBPE still yields subword tokens. |
| WordPiece / Unigram LM | Alternative subword tokenizers (likelihood-based merging vs pruning from a large vocabulary). | Different token boundaries affect fragmentation rate and domain-jargon handling. | Treating these as interchangeable; switching tokenizers changes metrics and behavior. |
| OOV | Strings a tokenizer can't represent as intended (or that fragment badly). | Causes high fragmentation and poor exact-match behavior in domain settings. | Thinking OOV only exists for word-level tokenizers; subword tokenizers can still fragment badly. |
| Fragmentation rate | Average sub-tokens per word/entity (see the sketch after this table). | High fragmentation wastes context and hurts latency/cost. | Confusing it with vocabulary size alone; it's about how your domain text breaks up. |
| Deduplication | Removing duplicates/near-duplicates from training data. | Reduces memorization and improves effective dataset diversity. | Mixing up train-train dedup (memorization) with train-test contamination (benchmark leakage). |
| Contamination | Train/test leakage: eval items or close variants appear in training data. | Invalidates benchmark results and can hide real regressions. | "Our scores are high, so we're good." Contamination can produce fake wins. |
| Packing | Concatenating sequences to fill context windows during training. | Improves tokens/GPU and throughput. | Confusing packing with "data mixing"; packing changes batching, not the underlying distribution. |
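
To make fragmentation rate concrete, a minimal sketch; `tokenize` is a stand-in for whatever tokenizer you actually use, and the Hugging Face usage in the comments is one assumed option:

```python
def fragmentation_rate(words: list[str], tokenize) -> float:
    """Average number of sub-tokens per word/entity under a given tokenizer."""
    token_counts = [len(tokenize(w)) for w in words]
    return sum(token_counts) / len(token_counts)

# Usage with a Hugging Face tokenizer (assumed; model name illustrative):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("gpt2")
#   rate = fragmentation_rate(domain_entities, lambda w: tok.encode(w))
# A rate near 1.0 means entities survive as single tokens; a much higher rate
# means your domain vocabulary is being shredded into many pieces.
```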

Core architecture

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Attention mask / causal attention | Bias/mask that restricts attention (e.g., causal masking prevents looking ahead). | Defines information flow (CLM vs MLM) and prevents training-time leakage. | Thinking "causal" is only for generation; it's a training constraint too. |
| KV cache | Cached per-layer keys/values for past tokens during decode (see the sizing sketch after this table). | Dominates memory/bandwidth in decode and often sets max concurrency. | Assuming attention cost is only \(O(n^2)\); in serving, KV traffic often dominates. |
| MHA vs MQA vs GQA | KV-sharing schemes: MHA keeps per-head KV; MQA shares one KV across all heads; GQA shares by groups. | Trades quality for a smaller KV cache and faster decode. | Treating "number of heads" as always "number of KV heads"; in GQA/MQA they differ. |
| RoPE | Rotary positional encoding applied to \(Q, K\). | Impacts long-context behavior; many long-context tricks are RoPE scaling variants. | Treating RoPE as a minor embedding choice; it can make or break long context. |
| MoE | Routes tokens to a subset of expert FFNs. | Higher capacity at lower activated compute; adds routing + serving complexity. | Equating total params with activated params; cost tracks activated compute. |
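
To size the KV cache concretely, a back-of-the-envelope sketch of the standard formula (two tensors, K and V, per layer); the 7B-class dimensions below are illustrative, not tied to any specific model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class config in fp16: 32 layers, head_dim 128, 8k context.
mha = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=1)  # 32 KV heads (MHA)
gqa = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)   # 8 KV heads (GQA)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4.0 vs 1.0 GiB
```

Per request, GQA with 8 KV heads cuts this cache 4x versus MHA, which is exactly why it raises max concurrency at a given memory budget.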

Training stages & alignment

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Pretraining (CLM) | Next-token prediction on broad corpora. | Sets core capabilities; later stages mostly shape behavior. | Expecting pretraining to guarantee truthfulness or instruction following. |
| CPT (mid-training) | Continued CLM training on a narrower distribution (domain/long-context priors). | Best lever for knowledge/vocab/context priors; can cause a stability gap. | Using CPT to fix format/style problems (that's usually SFT/DPO territory). |
| SFT | Supervised instruction/chat fine-tuning (often with loss masking). | Teaches roles, formats, tool schemas, and policy style. | "More SFT always helps." It can overfit style or regress capabilities. |
| Masking (SFT) | Applying the loss only on assistant tokens, ignoring user/prompt tokens (see the sketch after this table). | Prevents learning to imitate user prompts; stabilizes chat behavior. | Confusing it with attention masking; this is about where gradients apply. |
| PEFT / LoRA | Fine-tuning small adapter parameters instead of full weights. | Enables cheap iteration and multi-tenant serving patterns. | Assuming adapters don't need strict regression tests; they can interfere. |
| DPO / ORPO | Offline preference optimization on chosen/rejected pairs. | Often cheaper than RLHF and good for preference/style alignment. | Treating preference optimization as a substitute for correctness verifiers. |
| RLHF / PPO | Online RL with reward modeling (often PPO + KL control). | Can push behavior beyond SFT when rewards are well specified. | "RLHF is just a reward model." You also need rollouts, KL control, and eval gates. |
| GRPO | RL-style update using within-group reward normalization (often avoids a critic). | Useful for reasoning-style RL with verifier signals. | Treating it as a magic drop-in; reward design and verifier quality still dominate. |
| Verifier / RM | Deterministic checks or models that score correctness/quality. | Central to reliability (best-of-N, self-training, agentic RL). | Equating verifiers with LLM judges; deterministic checks such as unit tests are often better. |
| PRM vs ORM | Process reward models score steps; outcome reward models score final outputs. | PRMs can reduce reward hacking but are expensive and can overfit reasoning style. | Assuming PRM is always better; ORM is often cheaper and more robust. |
| Test-time scaling | Spending more inference compute (sampling, revision, verifier reranking). | Often the fastest path to reliability gains without retraining. | Calling any sampling "test-time scaling"; it needs scoring + selection. |
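
A minimal sketch of SFT loss masking, assuming the common convention that label -100 is ignored by the loss (PyTorch's `CrossEntropyLoss` default `ignore_index`); recovering assistant spans from your chat template is left abstract:

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def build_labels(input_ids: list[int],
                 assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Copy input_ids as labels, masking everything outside assistant spans.

    assistant_spans: [start, end) token-index ranges covering assistant replies,
    as recovered from your chat template. Gradients then flow only through
    assistant tokens; user/system tokens still provide attention context.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Example: a 10-token sequence where tokens 6..9 are the assistant reply.
ids = [101, 5, 6, 7, 8, 9, 42, 43, 44, 102]
print(build_labels(ids, [(6, 10)]))  # [-100]*6 + [42, 43, 44, 102]
```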

Inference & serving systems

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Prefill vs decode | Prefill processes the prompt; decode generates tokens sequentially. | Prefill is often compute-bound; decode is often bandwidth/scheduling bound. | Treating "latency" as one thing; TTFT and TPOT have different bottlenecks. |
| TTFT / TPOT / ITL | Time to first token vs time per output token / inter-token latency. | The primary user-facing performance metrics; they drive different optimizations. | Optimizing TTFT while making TPOT worse (or vice versa) without noticing. |
| FlashAttention | Fused attention kernel that reduces HBM traffic. | Key prefill/long-context speedup; improves utilization. | Assuming it fixes all decode issues; decode can still be KV/scheduler bound. |
| PagedAttention | KV cache stored in fixed-size blocks with a page table. | Reduces fragmentation; enables preemption and high concurrency. | Thinking it's just an implementation detail; it changes scheduling feasibility. |
| Continuous batching | Inserting requests into in-flight batches as slots free up. | Improves throughput and tail latency under load. | Equating it with static batching; continuous batching changes latency dynamics. |
| Chunked prefill | Splitting long prompts into chunks to reduce head-of-line blocking. | Prevents long prompts from starving shorter interactive requests. | Treating it as a quality change; it's mainly a scheduling/latency lever. |
| Prefix caching | Caching KV for shared prompt prefixes (system prompts, templates). | Reduces TTFT and prefill cost for repeated prefixes. | Assuming it's always safe; cache keys must include model + template + adapter versions. |
| Speculative decoding | A draft model proposes tokens; the target model verifies and accepts (see the speedup sketch after this table). | Trades compute for fewer slow decode steps when acceptance is high. | Thinking it always helps; acceptance rate and batching regime decide the wins. |
| Guided decoding | Enforcing a grammar/JSON schema during decoding. | Dramatically improves tool-call reliability and structured outputs. | Expecting it to fix wrong-tool selection; it mainly fixes validity/format. |
| Admission control | Limiting active requests to avoid KV OOM and tail-latency collapse. | Prevents overload meltdowns and stabilizes p99. | Trying to "just batch more" without controlling concurrency. |
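
To see why acceptance rate decides speculative-decoding wins, a simplified model, assuming each drafted token is accepted i.i.d. with probability `alpha` (the idealization used in the common analysis of the technique):

```python
def expected_tokens_per_target_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Simplified i.i.d. model: gamma tokens are drafted per pass and each is
    accepted independently with probability alpha; on rejection the target
    model still emits one corrected token. This yields the closed form
    (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_target_step(alpha, gamma=4), 2))
# 0.5 -> 1.94, 0.8 -> 3.36, 0.95 -> 4.52: low acceptance barely beats plain
# decoding once draft cost is added; high acceptance approaches gamma + 1.
```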

Agents & RAG

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Agent | A loop over state → action (tool call) → observation → state update, until a stop condition. | Makes autonomy, budgets, and explicit stop reasons first-class. | Calling any chatbot with tools an agent; true agents adapt their plan. |
| Workflow vs agent | Predefined orchestration vs model-directed steps. | Workflows are testable; agents are flexible but higher variance. | Overbuilding agents when a deterministic workflow or RAG + a single call works. |
| Tool calling | Structured invocation of external functions/APIs with schemas and retries. | Determines production reliability (tool choice, arguments, sequencing). | Treating tool calling as prompting; it's an interface contract + runtime enforcement. |
| Tool contract | Strict tool input/output schema + error semantics. | Enables testing, safe retries, and governance. | Assuming the model enforces contracts; the runtime must validate and enforce. |
| MCP | Standard protocol for exposing tools/resources to agent runtimes. | Reduces integration friction; standardizes schemas and tool discovery. | Treating MCP as a safety boundary; safety lives in allowlists/permissions + runtime policy. |
| Allowlist | Application-enforced list of permitted tools/actions. | Prevents tool escalation and contains the blast radius. | Delegating tool permissions to the model; enforcement must live outside the model. |
| Idempotency | Retry safety for write actions: same request ID → same effect (see the sketch after this table). | Required for robust retries/timeouts in production. | Thinking retries are free; without idempotency you create duplicate side effects. |
| RAG | Retrieve context, then generate conditioned on it. | Improves grounding without retraining weights. | Expecting RAG to guarantee correctness; retrieval quality and citations still matter. |
| Hybrid retrieval | Combining sparse (BM25) and dense (embedding) retrieval. | Improves recall across query types (exact terms + semantics). | Using dense-only retrieval and missing exact-term matches (IDs, names). |
| Reranking / cross-encoder | A stronger ranker that scores (query, doc) pairs jointly. | Often the biggest relevance boost, but adds latency/cost. | Adding reranking without budgets; it can blow up latency. |
| Prompt injection | Retrieved/untrusted text that tries to override instructions. | A common real-world failure mode; mitigate with isolation and strict tool policies. | Treating it as jailbreaking only; it often arrives via retrieved corpora. |
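
A minimal sketch of the idempotency pattern for write tools; the in-memory store and the `refund` action are illustrative stand-ins for a durable store and a real API:

```python
import uuid

_results: dict[str, dict] = {}  # illustrative; use a durable store in production

def execute_write(request_id: str, action: str, args: dict) -> dict:
    """Execute a write action at most once per request_id.

    A retry with the same request_id returns the recorded result instead of
    re-running the side effect, so timeouts + retries cannot double-charge,
    double-post, or double-send.
    """
    if request_id in _results:
        return _results[request_id]  # replay the stored result, not the effect
    result = {"status": "ok", "action": action, "args": args}  # stand-in side effect
    _results[request_id] = result
    return result

# The agent runtime mints one ID per logical action and reuses it on retry:
rid = str(uuid.uuid4())
first = execute_write(rid, "refund", {"order": "A-123", "amount": 10})
retry = execute_write(rid, "refund", {"order": "A-123", "amount": 10})
assert first is retry  # the retry saw the stored result; no duplicate refund
```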

Minimal taxonomy (by workflow)

| Category | Examples |
| --- | --- |
| Tokenization | BPE/BBPE, WordPiece, Unigram LM, fragmentation rate |
| Core architecture | encoder-only, decoder-only, encoder-decoder, PrefixLM, MoE |
| Training phases | pretraining, CPT, SFT, PEFT, preference optimization, RLHF |
| Alignment methods | DPO/ORPO, PPO/RLHF, agentic RL (GRPO-style), verifiers |
| Inference systems | prefill vs decode, KV cache, paged KV, continuous batching, speculative decoding |
| Compression | weight-only quant, KV quant, pruning/sparsity, distillation |
| Applications | RAG, tool use, agents (ReAct, planner–executor), evaluation/observability, MCP |