Appendix 1. Taxonomy
Overview
This appendix collects technical terms and acronyms used across the handbook.
The intent is not to be exhaustive, but to provide quick “interview-grade” refreshers.
Acronyms
| Acronym | Expansion | Meaning (one line) |
|---|---|---|
| ACI | Agent–Computer Interface | The boundary between an agent and the systems it can observe/act on (tools, permissions, schemas, UX). |
| ALiBi | Attention with Linear Biases | Adds a distance-proportional linear penalty to attention scores, encouraging recency and enabling long-context extrapolation. |
| AWQ | Activation-aware Weight Quantization | Weight quantization that accounts for activation outliers to reduce quality loss. |
| BBPE | Byte-level BPE | BPE performed over bytes (not Unicode characters), improving multilingual robustness and coverage. |
| BM25 | Best Matching 25 | Classic sparse retrieval ranking function; strong for exact terms (IDs, names) and complements embeddings. |
| CBOW | Continuous Bag of Words | word2vec variant that predicts a target token from its surrounding context tokens. |
| CLM | Causal Language Modeling | Next-token prediction objective with a causal (masked) attention pattern. |
| CoT | Chain-of-Thought | A reasoning-style response pattern; often used for test-time scaling and verification pipelines. |
| CPT | Continued Pretraining | Mid-training by continuing next-token training on a target distribution (domain/context priors). |
| DCA | Dual Chunk Attention | Long-context attention scheme using within-chunk + cross-chunk attention. |
| DDP | Distributed Data Parallel | Training strategy that replicates model weights and splits batches across workers. |
| DeepNorm | Deep residual scaling | Residual scaling scheme to stabilize very deep Transformers. |
| DPO | Direct Preference Optimization | Offline alignment on preference pairs (chosen vs rejected) without online RL rollouts. |
| EOS | End of Sequence | Special token signaling sequence termination. |
| FFN | Feed-Forward Network | Per-token MLP block in Transformers (often the main compute cost). |
| FLOPs | Floating Point Operations | Used to estimate compute cost; often paired with the roofline model. |
| GEMM | General Matrix Multiply | Dominant kernel type during prefill (compute-heavy). |
| GEMV | General Matrix-Vector Multiply | Common shape during decode at small batch sizes (often bandwidth/overhead heavy; see the worked example after this table). |
| GELU | Gaussian Error Linear Unit | Smooth activation widely used in Transformers. |
| GQA | Grouped-Query Attention | Shares K/V across groups of query heads to reduce KV cache size with limited quality loss. |
| GRPO | Group Relative Policy Optimization | RL-style update using within-group reward normalization (often avoids a learned critic). |
| HBM | High Bandwidth Memory | GPU memory; decode often becomes HBM-bandwidth bound. |
| ITL | Inter-Token Latency | Latency between consecutive generated tokens; effectively decode speed. |
| KV | Key/Value | Attention cache tensors stored to avoid recomputing past states during decode. |
| LN | Layer Normalization | Normalization applied per token over features to stabilize training. |
| LoRA | Low-Rank Adaptation | PEFT method adding low-rank adapter matrices to update behavior with small trainable deltas. |
| LTM | Long-Term Memory | Persistent storage for agent systems (often implemented with retrieval). |
| MCP | Model Context Protocol | Standard protocol for advertising tools/resources to agent runtimes with consistent schemas (integration layer). |
| MLM | Masked Language Modeling | Objective predicting masked tokens using both left and right context (encoder-style). |
| MHA | Multi-Head Attention | Attention with separate Q/K/V projections per head (higher KV cost). |
| MLP | Multi-Layer Perceptron | Another name for FFN blocks in Transformers. |
| MoE | Mixture of Experts | Routing tokens to a subset of expert FFNs for higher capacity at lower activated compute. |
| MQA | Multi-Query Attention | Shares K/V across all query heads (minimal KV cache; sometimes quality hit). |
| MRR | Mean Reciprocal Rank | Retrieval metric emphasizing ranking the first relevant result early. |
| MTP | Multi-Token Prediction | Objective predicting multiple future tokens per step to improve efficiency (model-family dependent). |
| nDCG | normalized Discounted Cumulative Gain | Ranking metric for graded relevance that rewards good ordering near the top. |
| OOV | Out Of Vocabulary | Tokens/strings not represented in the tokenizer vocabulary (a tokenization failure mode). |
| ORPO | Odds Ratio Preference Optimization | Preference-optimization alignment method (offline), related to DPO-style approaches. |
| PEFT | Parameter-Efficient Fine-Tuning | Fine-tuning methods that update small parameter subsets (e.g., LoRA) instead of full weights. |
| PI | Position Interpolation | Long-context trick that rescales/interpolates positions into the trained range. |
| PPL | Perplexity | Standard LM metric; lower is better for next-token modeling (within a fixed eval setup). |
| PPO | Proximal Policy Optimization | Online RL algorithm used in some RLHF pipelines. |
| PRM | Process Reward Model | Scores intermediate reasoning steps (dense supervision), often used to reduce reward hacking. |
| RAG | Retrieval-Augmented Generation | Retrieve relevant context from a corpus, then generate conditioned on that context. |
| ReAct | Reason + Act | Agent pattern combining reasoning steps with tool calls and observations. |
| ReLU | Rectified Linear Unit | Simple activation function, common historically in FFNs. |
| ReZero | Residual with learnable scale | Stabilizes training by initializing residual branch scale at zero. |
| RNN | Recurrent Neural Network | Sequence model family predating Transformers; recurrence gives an implicit notion of position (relevant to some positional-encoding variants). |
| RoPE | Rotary Position Embedding | Injects positions via complex/2D rotations applied to Q/K so relative shifts emerge naturally. |
| RMSNorm | Root Mean Square Norm | LayerNorm variant that normalizes by RMS (no mean subtraction); common in modern LLMs. |
| RLHF | Reinforcement Learning from Human Feedback | Alignment approach combining reward modeling and RL (often PPO) to optimize preferences. |
| S2S | Seq2Seq | Encoder-decoder training objective for conditional generation (translation/summarization). |
| SFT | Supervised Fine-Tuning | Fine-tuning on instruction/chat data (often with user-token masking). |
| SRPT | Shortest Remaining Processing Time | Scheduling heuristic; used as intuition for prioritizing short requests for tail latency. |
| STM | Short-Term Memory | Session-local memory/state in agent systems. |
| SwiGLU | Swish-Gated Linear Unit | FFN variant using swish gating; common in modern LLMs. |
| TTFT | Time To First Token | User-perceived latency dominated by queueing + prefill. |
| TP | Tensor Parallelism | Split model tensors across GPUs for training/inference of large models. |
| TPOT | Time Per Output Token | Decode speed metric; closely related to ITL. |
| Vocab | Vocabulary | The tokenizer’s discrete token set; impacts context efficiency and OOV behavior. |
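To make the FLOPs / GEMV / HBM rows concrete, here is a back-of-the-envelope roofline check for batch-1 decode. It is a sketch under assumed hardware numbers (peak TFLOPs, HBM bandwidth) and an assumed 7B-parameter model; none of these are measured specs.

```python
# Back-of-the-envelope check: is batch-1 decode compute-bound or bandwidth-bound?
# All hardware numbers below are illustrative assumptions, not measured specs.

def decode_roofline(params_billion: float,
                    bytes_per_param: float = 2.0,   # fp16/bf16 weights
                    peak_tflops: float = 300.0,     # assumed peak compute
                    hbm_tbps: float = 2.0):         # assumed HBM bandwidth (TB/s)
    flops_per_token = 2.0 * params_billion * 1e9             # ~2 FLOPs per weight (GEMV)
    bytes_per_token = params_billion * 1e9 * bytes_per_param  # every weight read once
    compute_ms = flops_per_token / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_per_token / (hbm_tbps * 1e12) * 1e3
    bound = "bandwidth-bound" if memory_ms > compute_ms else "compute-bound"
    return compute_ms, memory_ms, bound

if __name__ == "__main__":
    compute_ms, memory_ms, bound = decode_roofline(params_billion=7.0)
    print(f"compute ~{compute_ms:.3f} ms/token, weight reads ~{memory_ms:.2f} ms/token -> {bound}")
```

With these assumed numbers, reading the weights takes far longer than the arithmetic, which is why ITL at small batch sizes tends to track HBM bandwidth rather than peak FLOPs.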
Interview glossary (what / why / common confusion)
These are the terms that come up most often in interviews for applied LLM work.
Tokenization & data
| Term | What it is | Why it matters | Common confusion |
|---|---|---|---|
| Tokenization | Mapping raw text into discrete tokens. | Determines context efficiency, OOV behavior, and even arithmetic brittleness. | “Tokenization is a preprocessing detail.” In practice it sets the unit the model predicts. |
| BPE / BBPE | Subword tokenizer that merges frequent pairs; BBPE runs merges over bytes. | BBPE improves coverage across scripts and weird text; token count drives compute and cost. | Confusing “byte-level” with “character-level”; BBPE still yields subword tokens. |
| WordPiece / Unigram LM | Alternative subword tokenizers (likelihood-based merge vs prune-from-large-vocab). | Different token boundaries affect fragmentation rate and domain jargon handling. | Treating these as interchangeable; switching tokenizers changes metrics and behavior. |
| OOV | Strings that a tokenizer can’t represent as intended (or fragment badly). | Causes high fragmentation and poor exact-match behavior in domain settings. | Thinking OOV only exists for word-level tokenizers; subword can still fragment badly. |
| Fragmentation rate | Average sub-tokens per word/entity. | High fragmentation wastes context and hurts latency/cost. | Confusing it with vocabulary size alone; it’s about how your domain text breaks up (a measurement sketch follows this table). |
| Deduplication | Removing duplicates/near-duplicates in training data. | Reduces memorization and improves effective dataset diversity. | Mixing up train-train dedup (memorization) with train-test contamination (benchmark leakage). |
| Contamination | Train/test leakage: eval items or close variants appear in training. | Invalidates benchmark results and can hide real regressions. | “Our scores are high so we’re good.” Contamination can produce fake wins. |
| Packing | Concatenating sequences to fill context windows during training. | Improves tokens/GPU and throughput. | Confusing packing with “data mixing”; packing changes batching, not the underlying distribution. |
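Fragmentation rate is easy to measure directly. Below is a minimal sketch, assuming the Hugging Face `transformers` package and using `gpt2` purely as a placeholder tokenizer; counting sub-tokens per whitespace-delimited word is a rough proxy, since whitespace and leading-space handling differ across tokenizers.

```python
# Minimal sketch: estimate fragmentation rate (average sub-tokens per
# whitespace-delimited word) on sample domain text.
# Assumes the Hugging Face `transformers` package; "gpt2" is a placeholder.
from transformers import AutoTokenizer

def fragmentation_rate(texts, tokenizer) -> float:
    words, pieces = 0, 0
    for text in texts:
        for word in text.split():
            words += 1
            pieces += len(tokenizer.tokenize(word))
    return pieces / max(words, 1)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    sample = ["Order 8842-XZ shipped via ACME-Express to Fribourg."]
    print(f"~{fragmentation_rate(sample, tok):.2f} sub-tokens per word")
```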
Core architecture
| Term | What it is | Why it matters | Common confusion |
|---|---|---|---|
| Attention mask / causal attention | Bias/mask that restricts attention (e.g., causal prevents looking ahead). | Defines information flow (CLM vs MLM) and prevents training-time leakage. | Thinking “causal” is only for generation; it’s a training constraint too. |
| KV cache | Cached per-layer keys/values for past tokens during decode. | Dominates memory/bandwidth in decode and often sets max concurrency. | Assuming attention cost is only \(O(n^2)\); in serving, KV traffic often dominates. |
| MHA vs MQA vs GQA | KV sharing schemes: MHA has per-head KV; MQA shares all; GQA shares by groups. | Trades a little quality for a smaller KV cache and faster decode. | Assuming the number of query heads equals the number of KV heads; in GQA/MQA they differ (see the sketch after this table). |
| RoPE | Rotary positional encoding applied to \(Q,K\). | Impacts long-context behavior; many long-context tricks are RoPE scaling variants. | Treating RoPE as a minor embedding choice; it can make/break long context. |
| MoE | Routes tokens to a subset of expert FFNs. | Higher capacity at lower activated compute; adds routing + serving complexity. | Equating total params with activated params; cost tracks activated compute. |
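To make the KV cache and GQA rows concrete, here is a minimal sketch that estimates KV-cache size per sequence; the layer, head, and context-length values are illustrative assumptions, not any particular model's configuration.

```python
# Minimal sketch: KV-cache size per sequence. Layer/head/context numbers are
# illustrative assumptions, not a specific model's configuration.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:  # fp16/bf16
    # K and V (factor of 2), each of shape [n_kv_heads, seq_len, head_dim] per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    # Same hidden size and 32 query heads in both cases; only the KV heads change.
    mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
    print(f"MHA: {mha / 2**30:.1f} GiB, GQA (8 KV heads): {gqa / 2**30:.1f} GiB per sequence")
```

In this toy configuration the MHA cache is 4 GiB per 8K-token sequence, while grouping 32 query heads down to 8 KV heads cuts it to 1 GiB, which is the main reason GQA helps serving concurrency.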
Training stages & alignment
| Term | What it is | Why it matters | Common confusion |
|---|---|---|---|
| Pretraining (CLM) | Next-token prediction on broad corpora. | Sets core capabilities; later stages mostly shape behavior. | Expecting pretraining to guarantee truth or instruction following. |
| CPT (mid-training) | Continue CLM training on a narrower distribution (domain/long-context priors). | Best lever for knowledge/vocab/context priors; can cause a stability gap. | Using CPT to fix format/style problems (that’s usually SFT/DPO). |
| SFT | Supervised instruction/chat fine-tuning (often with loss masking). | Teaches roles, formats, tool schemas, and policy style. | “More SFT always helps.” It can overfit style or regress capabilities. |
| Masking (SFT) | Apply loss only on assistant tokens (ignore user/prompt); a minimal sketch follows this table. | Prevents learning to imitate user prompts; stabilizes chat behavior. | Confusing it with attention masking; this is about where gradients apply. |
| PEFT / LoRA | Fine-tune small adapter parameters instead of full weights. | Enables cheap iteration and multi-tenant serving patterns. | Assuming adapters don’t need strict regression tests; they can interfere. |
| DPO / ORPO | Offline preference optimization on chosen/rejected pairs. | Often cheaper than RLHF and good for preference/style alignment. | Treating preference optimization as a substitute for correctness verifiers. |
| RLHF / PPO | Online RL with reward modeling (often PPO + KL control). | Can push behavior beyond SFT when rewards are well-specified. | “RLHF is just a reward model.” You also need rollouts, KL control, and eval gates. |
| GRPO | RL-style update using within-group reward normalization (often avoids a critic). | Useful for reasoning-style RL with verifier signals. | Treating it as a magic drop-in; reward design and verifier quality still dominate. |
| Verifier / RM | Deterministic checks or models that score correctness/quality. | Central to reliability (best-of-N, self-training, agentic RL). | Confusing verifiers with LLM judges only; unit tests are often better. |
| PRM vs ORM | Process reward models score steps; outcome reward models score final outputs. | PRM can reduce hacking, but is expensive and can overfit reasoning style. | Assuming PRM is always better; ORM is often cheaper and more robust. |
| Test-time scaling | Spend more inference compute (sampling, revision, verifier reranking). | Often fastest path to reliability gains without retraining. | Calling any sampling “test-time scaling”; it needs scoring + selection. |
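A minimal sketch of SFT loss masking, assuming a PyTorch-style setup in which label positions set to -100 are skipped by cross-entropy; the token ids and role spans are toy values, not a real chat template.

```python
# Minimal sketch of SFT loss masking: gradients flow only through assistant
# tokens. Uses PyTorch's convention that label -100 is ignored by cross-entropy.
import torch

IGNORE_INDEX = -100

def build_labels(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """assistant_mask is 1 where a token belongs to an assistant turn, else 0."""
    labels = input_ids.clone()
    labels[assistant_mask == 0] = IGNORE_INDEX  # user/system tokens contribute no loss
    return labels

if __name__ == "__main__":
    input_ids = torch.tensor([[101, 102, 103, 201, 202, 203, 204]])  # toy ids
    assistant_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1]])           # last 4 = assistant
    print(build_labels(input_ids, assistant_mask))
    # -> tensor([[-100, -100, -100, 201, 202, 203, 204]])
```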
Inference & serving systems
| Term | What it is | Why it matters | Common confusion |
|---|---|---|---|
| Prefill vs decode | Prefill processes the prompt; decode generates tokens sequentially. | Prefill is often compute-bound; decode is often bandwidth/scheduling bound. | Treating “latency” as one thing; TTFT and TPOT have different bottlenecks. |
| TTFT / TPOT / ITL | Time to first token vs time per output token / inter-token latency. | Primary user-facing performance metrics; drive different optimizations. | Optimizing TTFT and making TPOT worse (or vice versa) without noticing. |
| FlashAttention | Fused attention that reduces HBM traffic. | Key prefill/long-context speedup; improves utilization. | Assuming it fixes all decode issues; decode can still be KV/scheduler bound. |
| PagedAttention | KV cache stored in fixed-size blocks with a page table. | Reduces fragmentation; enables preemption and high concurrency. | Thinking it’s just an implementation detail; it changes scheduling feasibility. |
| Continuous batching | Insert requests into in-flight batches as slots free up. | Improves throughput and tail latency under load. | Equating it to static batching; continuous batching changes latency dynamics. |
| Chunked prefill | Split long prompts into chunks to reduce head-of-line blocking. | Prevents long prompts from starving shorter interactive requests. | Treating it as a quality change; it’s mainly a scheduling/latency lever. |
| Prefix caching | Cache KV for shared prompt prefixes (system prompts, templates). | Reduces TTFT and prefill cost for repeated prefixes. | Assuming it’s always safe; cache keys must include model + template + adapter versions (see the key sketch after this table). |
| Speculative decoding | Draft model proposes tokens; target model verifies/accepts. | Trades compute for fewer slow decode steps when acceptance is high. | Thinking it always helps; acceptance rate and batching regime decide wins. |
| Guided decoding | Enforce a grammar/JSON schema during decoding. | Dramatically improves tool-call reliability and structured outputs. | Expecting it to fix wrong-tool selection; it mainly fixes validity/format. |
| Admission control | Limit active requests to avoid KV OOM and tail-latency collapse. | Prevents overload meltdowns and stabilizes p99. | Trying to “just batch more” without controlling concurrency. |
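As a sketch of the prefix-caching caveat above, here is one way to build a cache key that includes the model, chat-template, and adapter versions, so stale KV is never reused across mismatched configurations; all identifiers are hypothetical placeholders.

```python
# Minimal sketch of a prefix-cache key: cached KV is reusable only when
# everything that shaped it matches. All identifiers are hypothetical.
import hashlib

def prefix_cache_key(model_id: str, template_version: str,
                     adapter_id: str, prefix_text: str) -> str:
    payload = "\x1f".join([model_id, template_version, adapter_id, prefix_text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    key = prefix_cache_key(
        model_id="acme-chat-v3",             # placeholder identifiers
        template_version="chat-template-7",
        adapter_id="support-lora-12",
        prefix_text="You are a helpful support assistant.",
    )
    print(key[:16], "...")
```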
Agents & RAG
| Term | What it is | Why it matters | Common confusion |
|---|---|---|---|
| Agent | Loop over state → action (tool call) → observation → state update until stop (see the loop sketch after this table). | Explains autonomy, budgets, and explicit stop reasons. | Calling any chatbot with tools an agent; true agents adapt their plan. |
| Workflow vs agent | Predefined orchestration vs model-directed steps. | Workflows are testable; agents are flexible but higher variance. | Overbuilding agents when a deterministic workflow or RAG + single call works. |
| Tool calling | Structured invocation of external functions/APIs with schemas and retries. | Determines production reliability (tool choice, args, sequencing). | Treating tool calling as prompting; it’s an interface contract + runtime enforcement. |
| Tool contract | Strict tool input/output schema + error semantics. | Enables testing, safe retries, and governance. | Assuming the model enforces contracts; the runtime must validate and enforce. |
| MCP | Standard protocol for exposing tools/resources to agent runtimes. | Reduces integration friction; standardizes schemas and tool discovery. | Treating MCP as a safety boundary; safety lives in allowlists/permissions + runtime policy. |
| Allowlist | Application-enforced list of permitted tools/actions. | Prevents tool escalation and contains blast radius. | Delegating tool permissions to the model; enforcement must be outside the model. |
| Idempotency | Retry safety for write actions (same request ID → same effect). | Required for robust retries/timeouts in production. | Thinking retries are free; without idempotency you create duplicate side effects. |
| RAG | Retrieve context then generate conditioned on it. | Improves grounding without retraining weights. | Expecting RAG to guarantee correctness; retrieval quality and citations still matter. |
| Hybrid retrieval | Combine sparse (BM25) + dense (embeddings). | Improves recall across query types (exact terms + semantics). | Using dense-only and missing exact-term matches (IDs, names). |
| Reranking / cross-encoder | Stronger ranker scoring (query, doc) jointly. | Often biggest relevance boost, but adds latency/cost. | Adding reranking without budgets; it can blow up latency. |
| Prompt injection | Retrieved/untrusted text tries to override instructions. | Common real-world failure mode; mitigate by isolation and strict tool policies. | Treating it as jailbreak only; it often arrives via retrieved corpora. |
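A minimal agent-loop sketch tying several of these rows together: a hard step budget, an application-enforced allowlist, and explicit stop reasons. `call_model` and `call_tool` are hypothetical callables, and the action format is an assumption rather than any specific framework's API.

```python
# Minimal agent-loop sketch: allowlisted tools, a hard step budget, and an
# explicit stop reason. `call_model` / `call_tool` are hypothetical callables;
# the action dict format is an assumption, not a specific framework's API.
ALLOWED_TOOLS = {"search_docs", "get_order_status"}  # enforced by the runtime, not the model
MAX_STEPS = 5

def run_agent(task: str, call_model, call_tool):
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = call_model(history)             # e.g. {"type": "tool", "tool": ..., "args": ...}
        if action["type"] == "final":
            return action["answer"], "finished"
        if action["tool"] not in ALLOWED_TOOLS:  # allowlist check lives outside the model
            return None, f"blocked_tool:{action['tool']}"
        observation = call_tool(action["tool"], action["args"])
        history.append({"role": "tool", "content": observation})
    return None, "budget_exhausted"              # explicit stop reason
```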
Minimal taxonomy (by workflow)
| Category | Examples |
|---|---|
| Tokenization | BPE/BBPE, WordPiece, Unigram LM, fragmentation rate |
| Core architecture | encoder-only, decoder-only, encoder-decoder, PrefixLM, MoE |
| Training phases | pretraining, CPT, SFT, PEFT, preference optimization, RLHF |
| Alignment methods | DPO/ORPO, PPO/RLHF, agentic RL (GRPO-style), verifiers |
| Inference systems | prefill vs decode, KV cache, paged KV, continuous batching, speculative decoding |
| Compression | weight-only quant, KV quant, pruning/sparsity, distillation |
| Applications | RAG, tool use, agents (ReAct, planner–executor), evaluation/observability, MCP |