Appendix 1. Taxonomy

Overview

This appendix collects technical terms and acronyms used across the handbook.

The intent is not to be exhaustive, but to provide quick “interview-grade” refreshers.

Acronyms

| Acronym | Expansion | Meaning (one line) |
| --- | --- | --- |
| ACI | Agent–Computer Interface | The boundary between an agent and the systems it can observe/act on (tools, permissions, schemas, UX). |
| ALiBi | Attention with Linear Biases | Adds a linear distance bias in attention to encourage recency and enable long-context extrapolation. |
| AWQ | Activation-aware Weight Quantization | Weight quantization that accounts for activation outliers to reduce quality loss. |
| BBPE | Byte-level BPE | BPE performed over bytes (not Unicode characters), improving multilingual robustness and coverage. |
| BM25 | Best Matching 25 | Classic sparse retrieval ranking function; strong for exact terms (IDs, names) and complements embeddings. |
| CBOW | Continuous Bag of Words | word2vec variant that predicts a target token from its surrounding context tokens. |
| CLM | Causal Language Modeling | Next-token prediction objective with a causal (masked) attention pattern. |
| CoT | Chain-of-Thought | A reasoning-style response pattern; often used for test-time scaling and verification pipelines. |
| CPT | Continued Pretraining | Mid-training that continues next-token training on a target distribution (domain/context priors). |
| DCA | Dual Chunk Attention | Long-context attention scheme using within-chunk + cross-chunk attention. |
| DDP | Distributed Data Parallel | Training strategy that replicates model weights and splits batches across workers. |
| DeepNorm | Deep residual scaling | Residual scaling scheme to stabilize very deep Transformers. |
| DPO | Direct Preference Optimization | Offline alignment on preference pairs (chosen vs rejected) without online RL rollouts. |
| EOS | End of Sequence | Special token signaling sequence termination. |
| FFN | Feed-Forward Network | Per-token MLP block in Transformers (often the main compute cost). |
| FLOPs | Floating-Point Operations | Used to estimate compute cost; often paired with the roofline model. |
| GELU | Gaussian Error Linear Unit | Smooth activation widely used in Transformers. |
| GEMM | General Matrix Multiply | Dominant kernel type during prefill (compute-heavy). |
| GEMV | General Matrix-Vector Multiply | Common shape during decode at small batch sizes (often bandwidth/overhead bound). |
| GQA | Grouped-Query Attention | Shares K/V across groups of query heads to reduce KV cache size with limited quality loss. |
| GRPO | Group Relative Policy Optimization | RL-style update using within-group reward normalization (often avoids a learned critic). |
| HBM | High Bandwidth Memory | GPU memory; decode often becomes HBM-bandwidth bound. |
| ITL | Inter-Token Latency | Latency per generated token; effectively decode speed. |
| KV | Key/Value | Attention cache tensors stored to avoid recomputing past states during decode. |
| LN | Layer Normalization | Normalization applied per token over features to stabilize training. |
| LoRA | Low-Rank Adaptation | PEFT method adding low-rank adapter matrices to update behavior with small trainable deltas. |
| LTM | Long-Term Memory | Persistent storage for agent systems (often implemented with retrieval). |
| MCP | Model Context Protocol | Standard protocol for advertising tools/resources to agent runtimes with consistent schemas (integration layer). |
| MHA | Multi-Head Attention | Attention with separate Q/K/V projections per head (highest KV cost). |
| MLM | Masked Language Modeling | Objective predicting masked tokens using both left and right context (encoder-style). |
| MLP | Multi-Layer Perceptron | Another name for the FFN blocks in Transformers. |
| MoE | Mixture of Experts | Routes tokens to a subset of expert FFNs for higher capacity at lower activated compute. |
| MQA | Multi-Query Attention | Shares K/V across all query heads (minimal KV cache; sometimes a quality hit). |
| MRR | Mean Reciprocal Rank | Retrieval metric emphasizing ranking the first relevant result early (see the sketch after this table). |
| MTP | Multi-Token Prediction | Objective predicting multiple future tokens per step to improve efficiency (model-family dependent). |
| nDCG | normalized Discounted Cumulative Gain | Ranking metric for graded relevance that rewards good ordering near the top (see the sketch after this table). |
| OOV | Out of Vocabulary | Tokens/strings not represented in the tokenizer vocabulary (a tokenization failure mode). |
| ORPO | Odds Ratio Preference Optimization | Offline preference-optimization alignment method, related to DPO-style approaches. |
| PEFT | Parameter-Efficient Fine-Tuning | Fine-tuning methods that update small parameter subsets (e.g., LoRA) instead of full weights. |
| PI | Position Interpolation | Long-context trick that rescales/interpolates positions into the trained range. |
| PPL | Perplexity | Standard LM metric; lower is better for next-token modeling (within a fixed eval setup). |
| PPO | Proximal Policy Optimization | Online RL algorithm used in some RLHF pipelines. |
| PRM | Process Reward Model | Scores intermediate reasoning steps (dense supervision); often used to reduce reward hacking. |
| RAG | Retrieval-Augmented Generation | Retrieve relevant context from a corpus, then generate conditioned on that context. |
| ReAct | Reason + Act | Agent pattern combining reasoning steps with tool calls and observations. |
| ReLU | Rectified Linear Unit | Simple activation function, historically common in FFNs. |
| ReZero | Residual with learnable scale | Stabilizes training by initializing the residual-branch scale at zero. |
| RLHF | Reinforcement Learning from Human Feedback | Alignment approach combining reward modeling and RL (often PPO) to optimize for preferences. |
| RMSNorm | Root Mean Square Normalization | LayerNorm variant that normalizes by RMS (no mean subtraction); common in modern LLMs. |
| RNN | Recurrent Neural Network | Sequence-model family predating Transformers; recurrence ideas resurface in some long-context variants. |
| RoPE | Rotary Position Embedding | Injects positions via complex/2D rotations applied to Q/K so relative shifts emerge naturally. |
| S2S | Sequence-to-Sequence | Encoder-decoder paradigm for conditional generation (translation/summarization). |
| SFT | Supervised Fine-Tuning | Fine-tuning on instruction/chat data (often with user-token loss masking). |
| SRPT | Shortest Remaining Processing Time | Scheduling heuristic; the intuition behind prioritizing short requests to cut tail latency. |
| STM | Short-Term Memory | Session-local memory/state in agent systems. |
| SwiGLU | Swish-Gated Linear Unit | FFN variant using Swish gating; common in modern LLMs. |
| TP | Tensor Parallelism | Splits model tensors across GPUs for training/inference of large models. |
| TPOT | Time per Output Token | Decode-speed metric; closely related to ITL. |
| TTFT | Time to First Token | User-perceived latency, dominated by queueing + prefill. |
| Vocab | Vocabulary | The tokenizer's discrete token set; affects context efficiency and OOV behavior. |
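
Since MRR and nDCG reward different things (rank of the first hit vs graded ordering quality), here is a minimal Python sketch of both under their standard definitions; the toy relevance labels are invented for illustration:

```python
import math

def mrr(ranked_relevant: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for hits in ranked_relevant:
        for rank, is_rel in enumerate(hits, start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevant)

def ndcg(gains: list[int], k: int) -> float:
    """nDCG@k for one query: DCG of the ranking divided by DCG of the ideal ranking."""
    def dcg(gs: list[int]) -> float:
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Toy example: graded relevance of the top-4 retrieved docs (3 = most relevant).
print(mrr([[False, True, False]]))  # first relevant at rank 2 -> 0.5
print(ndcg([0, 3, 1, 0], k=4))      # imperfect ordering -> ~0.66, below 1.0
```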

Interview glossary (what / why / common confusion)

These are the terms that come up most often in interviews for applied LLM work.

Tokenization & data

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Tokenization | Mapping raw text into discrete tokens. | Determines context efficiency, OOV behavior, and even arithmetic brittleness. | "Tokenization is a preprocessing detail." In practice it sets the unit the model predicts. |
| BPE / BBPE | Subword tokenizer that merges frequent pairs; BBPE runs merges over bytes. | BBPE improves coverage across scripts and weird text; token count drives compute and cost. | Confusing "byte-level" with "character-level"; BBPE still yields subword tokens. |
| WordPiece / Unigram LM | Alternative subword tokenizers (likelihood-based merging vs pruning from a large vocabulary). | Different token boundaries affect fragmentation rate and domain-jargon handling. | Treating these as interchangeable; switching tokenizers changes metrics and behavior. |
| OOV | Strings a tokenizer can't represent as intended (or that fragment badly). | Causes high fragmentation and poor exact-match behavior in domain settings. | Thinking OOV only exists for word-level tokenizers; subword tokenizers can still fragment badly. |
| Fragmentation rate | Average sub-tokens per word/entity (see the sketch after this table). | High fragmentation wastes context and hurts latency/cost. | Confusing it with vocabulary size alone; it's about how your domain text breaks up. |
| Deduplication | Removing duplicates/near-duplicates from training data. | Reduces memorization and improves effective dataset diversity. | Mixing up train-train dedup (memorization) with train-test contamination (benchmark leakage). |
| Contamination | Train/test leakage: eval items or close variants appear in training data. | Invalidates benchmark results and can hide real regressions. | "Our scores are high, so we're good." Contamination can produce fake wins. |
| Packing | Concatenating sequences to fill context windows during training. | Improves tokens/GPU and throughput. | Confusing packing with "data mixing"; packing changes batching, not the underlying distribution. |
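
To make fragmentation rate concrete, a minimal sketch; `tokenize` is a stand-in for whatever tokenizer you actually use, and the Hugging Face usage in the comments is one assumed option:

```python
def fragmentation_rate(words: list[str], tokenize) -> float:
    """Average number of sub-tokens per word/entity under a given tokenizer."""
    token_counts = [len(tokenize(w)) for w in words]
    return sum(token_counts) / len(token_counts)

# Usage with a Hugging Face tokenizer (assumed; model name illustrative):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("gpt2")
#   rate = fragmentation_rate(domain_entities, lambda w: tok.encode(w))
# A rate near 1.0 means entities survive as single tokens; a much higher rate
# means your domain vocabulary is being shredded into many pieces.
```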

Core architecture

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Attention mask / causal attention | Bias/mask that restricts attention (e.g., causal masking prevents looking ahead). | Defines information flow (CLM vs MLM) and prevents training-time leakage. | Thinking "causal" is only for generation; it's a training constraint too. |
| KV cache | Cached per-layer keys/values for past tokens during decode (see the sizing sketch after this table). | Dominates memory/bandwidth in decode and often sets max concurrency. | Assuming attention cost is only \(O(n^2)\); in serving, KV traffic often dominates. |
| MHA vs MQA vs GQA | KV-sharing schemes: MHA keeps per-head KV; MQA shares one KV across all heads; GQA shares by groups. | Trades quality for a smaller KV cache and faster decode. | Treating "number of heads" as always "number of KV heads"; in GQA/MQA they differ. |
| RoPE | Rotary positional encoding applied to \(Q, K\). | Impacts long-context behavior; many long-context tricks are RoPE scaling variants. | Treating RoPE as a minor embedding choice; it can make or break long context. |
| MoE | Routes tokens to a subset of expert FFNs. | Higher capacity at lower activated compute; adds routing + serving complexity. | Equating total params with activated params; cost tracks activated compute. |
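
To size the KV cache concretely, a back-of-the-envelope sketch of the standard formula (two tensors, K and V, per layer); the 7B-class dimensions below are illustrative, not tied to any specific model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class config in fp16: 32 layers, head_dim 128, 8k context.
mha = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=1)  # 32 KV heads (MHA)
gqa = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)   # 8 KV heads (GQA)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4.0 vs 1.0 GiB
```

Per request, GQA with 8 KV heads cuts this cache 4x versus MHA, which is exactly why it raises max concurrency at a given memory budget.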

Training stages & alignment

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Pretraining (CLM) | Next-token prediction on broad corpora. | Sets core capabilities; later stages mostly shape behavior. | Expecting pretraining to guarantee truthfulness or instruction following. |
| CPT (mid-training) | Continued CLM training on a narrower distribution (domain/long-context priors). | Best lever for knowledge/vocab/context priors; can cause a stability gap. | Using CPT to fix format/style problems (that's usually SFT/DPO territory). |
| SFT | Supervised instruction/chat fine-tuning (often with loss masking). | Teaches roles, formats, tool schemas, and policy style. | "More SFT always helps." It can overfit style or regress capabilities. |
| Masking (SFT) | Applying the loss only on assistant tokens, ignoring user/prompt tokens (see the sketch after this table). | Prevents learning to imitate user prompts; stabilizes chat behavior. | Confusing it with attention masking; this is about where gradients apply. |
| PEFT / LoRA | Fine-tuning small adapter parameters instead of full weights. | Enables cheap iteration and multi-tenant serving patterns. | Assuming adapters don't need strict regression tests; they can interfere. |
| DPO / ORPO | Offline preference optimization on chosen/rejected pairs. | Often cheaper than RLHF and good for preference/style alignment. | Treating preference optimization as a substitute for correctness verifiers. |
| RLHF / PPO | Online RL with reward modeling (often PPO + KL control). | Can push behavior beyond SFT when rewards are well specified. | "RLHF is just a reward model." You also need rollouts, KL control, and eval gates. |
| GRPO | RL-style update using within-group reward normalization (often avoids a critic). | Useful for reasoning-style RL with verifier signals. | Treating it as a magic drop-in; reward design and verifier quality still dominate. |
| Verifier / RM | Deterministic checks or models that score correctness/quality. | Central to reliability (best-of-N, self-training, agentic RL). | Equating verifiers with LLM judges; deterministic checks such as unit tests are often better. |
| PRM vs ORM | Process reward models score steps; outcome reward models score final outputs. | PRMs can reduce reward hacking but are expensive and can overfit reasoning style. | Assuming PRM is always better; ORM is often cheaper and more robust. |
| Test-time scaling | Spending more inference compute (sampling, revision, verifier reranking). | Often the fastest path to reliability gains without retraining. | Calling any sampling "test-time scaling"; it needs scoring + selection. |
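
A minimal sketch of SFT loss masking, assuming the common convention that label -100 is ignored by the loss (PyTorch's `CrossEntropyLoss` default `ignore_index`); recovering assistant spans from your chat template is left abstract:

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def build_labels(input_ids: list[int],
                 assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Copy input_ids as labels, masking everything outside assistant spans.

    assistant_spans: [start, end) token-index ranges covering assistant replies,
    as recovered from your chat template. Gradients then flow only through
    assistant tokens; user/system tokens still provide attention context.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Example: a 10-token sequence where tokens 6..9 are the assistant reply.
ids = [101, 5, 6, 7, 8, 9, 42, 43, 44, 102]
print(build_labels(ids, [(6, 10)]))  # [-100]*6 + [42, 43, 44, 102]
```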

Inference & serving systems

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Prefill vs decode | Prefill processes the prompt; decode generates tokens sequentially. | Prefill is often compute-bound; decode is often bandwidth/scheduling bound. | Treating "latency" as one thing; TTFT and TPOT have different bottlenecks. |
| TTFT / TPOT / ITL | Time to first token vs time per output token / inter-token latency. | The primary user-facing performance metrics; they drive different optimizations. | Optimizing TTFT while making TPOT worse (or vice versa) without noticing. |
| FlashAttention | Fused attention kernel that reduces HBM traffic. | Key prefill/long-context speedup; improves utilization. | Assuming it fixes all decode issues; decode can still be KV/scheduler bound. |
| PagedAttention | KV cache stored in fixed-size blocks with a page table. | Reduces fragmentation; enables preemption and high concurrency. | Thinking it's just an implementation detail; it changes scheduling feasibility. |
| Continuous batching | Inserting requests into in-flight batches as slots free up. | Improves throughput and tail latency under load. | Equating it with static batching; continuous batching changes latency dynamics. |
| Chunked prefill | Splitting long prompts into chunks to reduce head-of-line blocking. | Prevents long prompts from starving shorter interactive requests. | Treating it as a quality change; it's mainly a scheduling/latency lever. |
| Prefix caching | Caching KV for shared prompt prefixes (system prompts, templates). | Reduces TTFT and prefill cost for repeated prefixes. | Assuming it's always safe; cache keys must include model + template + adapter versions. |
| Speculative decoding | A draft model proposes tokens; the target model verifies and accepts (see the speedup sketch after this table). | Trades compute for fewer slow decode steps when acceptance is high. | Thinking it always helps; acceptance rate and batching regime decide the wins. |
| Guided decoding | Enforcing a grammar/JSON schema during decoding. | Dramatically improves tool-call reliability and structured outputs. | Expecting it to fix wrong-tool selection; it mainly fixes validity/format. |
| Admission control | Limiting active requests to avoid KV OOM and tail-latency collapse. | Prevents overload meltdowns and stabilizes p99. | Trying to "just batch more" without controlling concurrency. |
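
To see why acceptance rate decides speculative-decoding wins, a simplified model, assuming each drafted token is accepted i.i.d. with probability `alpha` (the idealization used in the common analysis of the technique):

```python
def expected_tokens_per_target_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Simplified i.i.d. model: gamma tokens are drafted per pass and each is
    accepted independently with probability alpha; on rejection the target
    model still emits one corrected token. This yields the closed form
    (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_target_step(alpha, gamma=4), 2))
# 0.5 -> 1.94, 0.8 -> 3.36, 0.95 -> 4.52: low acceptance barely beats plain
# decoding once draft cost is added; high acceptance approaches gamma + 1.
```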

Agents & RAG

| Term | What it is | Why it matters | Common confusion |
| --- | --- | --- | --- |
| Agent | A loop over state → action (tool call) → observation → state update, until a stop condition. | Makes autonomy, budgets, and explicit stop reasons first-class. | Calling any chatbot with tools an agent; true agents adapt their plan. |
| Workflow vs agent | Predefined orchestration vs model-directed steps. | Workflows are testable; agents are flexible but higher variance. | Overbuilding agents when a deterministic workflow or RAG + a single call works. |
| Tool calling | Structured invocation of external functions/APIs with schemas and retries. | Determines production reliability (tool choice, arguments, sequencing). | Treating tool calling as prompting; it's an interface contract + runtime enforcement. |
| Tool contract | Strict tool input/output schema + error semantics. | Enables testing, safe retries, and governance. | Assuming the model enforces contracts; the runtime must validate and enforce. |
| MCP | Standard protocol for exposing tools/resources to agent runtimes. | Reduces integration friction; standardizes schemas and tool discovery. | Treating MCP as a safety boundary; safety lives in allowlists/permissions + runtime policy. |
| Allowlist | Application-enforced list of permitted tools/actions. | Prevents tool escalation and contains the blast radius. | Delegating tool permissions to the model; enforcement must live outside the model. |
| Idempotency | Retry safety for write actions: same request ID → same effect (see the sketch after this table). | Required for robust retries/timeouts in production. | Thinking retries are free; without idempotency you create duplicate side effects. |
| RAG | Retrieve context, then generate conditioned on it. | Improves grounding without retraining weights. | Expecting RAG to guarantee correctness; retrieval quality and citations still matter. |
| Hybrid retrieval | Combining sparse (BM25) and dense (embedding) retrieval. | Improves recall across query types (exact terms + semantics). | Using dense-only retrieval and missing exact-term matches (IDs, names). |
| Reranking / cross-encoder | A stronger ranker that scores (query, doc) pairs jointly. | Often the biggest relevance boost, but adds latency/cost. | Adding reranking without budgets; it can blow up latency. |
| Prompt injection | Retrieved/untrusted text that tries to override instructions. | A common real-world failure mode; mitigate with isolation and strict tool policies. | Treating it as jailbreaking only; it often arrives via retrieved corpora. |
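
A minimal sketch of the idempotency pattern for write tools; the in-memory store and the `refund` action are illustrative stand-ins for a durable store and a real API:

```python
import uuid

_results: dict[str, dict] = {}  # illustrative; use a durable store in production

def execute_write(request_id: str, action: str, args: dict) -> dict:
    """Execute a write action at most once per request_id.

    A retry with the same request_id returns the recorded result instead of
    re-running the side effect, so timeouts + retries cannot double-charge,
    double-post, or double-send.
    """
    if request_id in _results:
        return _results[request_id]  # replay the stored result, not the effect
    result = {"status": "ok", "action": action, "args": args}  # stand-in side effect
    _results[request_id] = result
    return result

# The agent runtime mints one ID per logical action and reuses it on retry:
rid = str(uuid.uuid4())
first = execute_write(rid, "refund", {"order": "A-123", "amount": 10})
retry = execute_write(rid, "refund", {"order": "A-123", "amount": 10})
assert first is retry  # the retry saw the stored result; no duplicate refund
```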

Minimal taxonomy (by workflow)

| Category | Examples |
| --- | --- |
| Tokenization | BPE/BBPE, WordPiece, Unigram LM, fragmentation rate |
| Core architecture | encoder-only, decoder-only, encoder-decoder, PrefixLM, MoE |
| Training phases | pretraining, CPT, SFT, PEFT, preference optimization, RLHF |
| Alignment methods | DPO/ORPO, PPO/RLHF, agentic RL (GRPO-style), verifiers |
| Inference systems | prefill vs decode, KV cache, paged KV, continuous batching, speculative decoding |
| Compression | weight-only quant, KV quant, pruning/sparsity, distillation |
| Applications | RAG, tool use, agents (ReAct, planner–executor), evaluation/observability, MCP |