Appendix 2. AI Engineering Interview Questions
Sources & Attribution
Questions in this database are curated, verified, and cross-referenced from the following sources:
Academic & Research Papers: Vaswani et al. 2017 (Attention Is All You Need), Sennrich et al. 2016 (BPE), Devlin et al. 2019 (BERT), Radford et al. 2018/2019 (GPT/GPT-2), Brown et al. 2020 (GPT-3), Touvron et al. 2023 (LLaMA), Hu et al. 2021 (LoRA), Ouyang et al. 2022 (InstructGPT/RLHF), Rafailov et al. 2023 (DPO), Shao et al. 2024 (GRPO), Dao et al. 2022/2023 (FlashAttention 1 & 2), Hoffmann et al. 2022 (Chinchilla), Kaplan et al. 2020 (Scaling Laws), Kwon et al. 2023 (PagedAttention/vLLM), Dettmers et al. 2023 (QLoRA), Lin et al. 2024 (AWQ), Xiao et al. 2023 (SmoothQuant), Gu & Dao 2023 (Mamba), DeepSeek-AI 2024/2025 (DeepSeek V2/V3/R1).
Interview Platforms & Question Banks: MLStack.cafe (59 LLMs Interview Questions, 42 ChatGPT Interview Questions, 38 NLP Interview Questions — mlstack.cafe/interview-questions/llms), Awesome Generative AI Guide by Aishwarya Naresh Reganti (60 Common GenAI Interview Questions — github.com/aishwaryanr/awesome-generative-ai-guide/interview_prep/60_gen_ai_questions.md).
Community Forums & Reports: Reddit r/MachineLearning, r/LocalLLaMA, r/learnmachinelearning; Hacker News; Blind (FAANG interview threads); LeetCode Discuss.
Industry Guides & Blogs: Chip Huyen — “Designing Machine Learning Systems” (O’Reilly 2022) and blog (huyenchip.com); Anthropic — “Building Effective Agents” (2024, anthropic.com); OpenAI Cookbook (github.com/openai/openai-cookbook); HuggingFace NLP Course (huggingface.co/learn/nlp-course); Stanford CS324 — Large Language Models (stanford-cs324.github.io); Full Stack Deep Learning — LLM Bootcamp 2023 (fullstackdeeplearning.com); Jay Alammar — “The Illustrated Transformer” (jalammar.github.io/illustrated-transformer); Lilian Weng — “LLM Powered Autonomous Agents” (lilianweng.github.io); Sebastian Raschka — “Build a Large Language Model From Scratch” (2024).
Book Chapters (this handbook): Each section maps to the corresponding chapter of the AI Engineering Handbook (Chapters 1–8).
Questions are tagged: [conceptual] [tradeoff] [design] [debug] [coding] [estimation]
Chapter 1 — Foundations
1.1 Tokenization
- Why do modern LLMs use subword tokenization instead of word-level or character-level? [conceptual]
- Compare BPE, WordPiece, and Unigram LM tokenizers — when would you choose each? [tradeoff]
- What is byte-level BPE (BBPE) and why is it better for multilingual models? [conceptual]
- How does vocabulary size trade off compute vs quality? [tradeoff]
- Your biomedical domain terms are splitting into 8+ subtokens — what do you do, and what can go wrong? [debug]
- What is tokenizer “fragmentation rate” and why does it matter for cost and latency? [conceptual]
- A colleague swapped the tokenizer between training and serving — what breaks? [debug]
- Why can’t you directly compare perplexity across models with different tokenizers? [conceptual]
- How does tokenization affect arithmetic and numerical reasoning in LLMs? [conceptual]
- If you extend a tokenizer mid-training, what are the risks and how do you mitigate them? [tradeoff]
- What is the relationship between token count and API cost? How would you optimize for a cost-sensitive deployment? [estimation]
- How does SentencePiece differ from HuggingFace tokenizers in handling whitespace and unknown characters? [conceptual]
- You’re building a multilingual model (English, Chinese, Arabic, code). Walk through your tokenizer design decisions — vocab size, algorithm, byte fallback, special tokens. [design]
- A model that handles code well suddenly produces wrong outputs when users paste markdown with backtick fences — what’s the likely tokenizer issue? [debug]
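The token-count/API-cost estimation question above reduces to simple arithmetic. A minimal sketch, with purely illustrative prices and workload numbers (not any provider's real rates):

```python
def monthly_token_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                       price_in_per_m, price_out_per_m, days=30):
    """Back-of-the-envelope API cost: monthly tokens x price per 1M tokens."""
    in_tok = requests_per_day * avg_input_tokens * days
    out_tok = requests_per_day * avg_output_tokens * days
    return in_tok / 1e6 * price_in_per_m + out_tok / 1e6 * price_out_per_m

# Hypothetical workload: 100K requests/day, 1.5K input / 500 output tokens,
# $3 / $15 per 1M input/output tokens (illustrative prices only).
cost = monthly_token_cost(100_000, 1_500, 500, 3.0, 15.0)
print(f"${cost:,.0f}/month")  # $36,000/month
```

Note that output tokens dominate cost here despite being a third of the volume, which is why verbosity control and prompt caching are common cost levers.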
1.2 Embeddings
- Why can’t the model just use token IDs as numbers — why do we need embeddings? [conceptual]
- Explain the embedding lookup — how is it equivalent to a one-hot × matrix multiply? [conceptual]
- What is “weight tying” (input embedding = output softmax weights) and why do it? [tradeoff]
- How do word/token embeddings differ from sentence embeddings? [conceptual]
- Compare CBOW and Skip-gram — when does Skip-gram outperform? [tradeoff]
- How does FastText handle out-of-vocabulary words? [conceptual]
- What is negative sampling and why is it needed to train word2vec at scale? [conceptual]
- How would you evaluate the quality of learned embeddings? [design]
- Explain the concept of contextual embeddings. How do BERT-style models differ from static word2vec/GloVe embeddings? [conceptual]
- What is the “anisotropy problem” in embedding spaces — why do modern LLM embeddings cluster in a narrow cone, and what can you do about it? [conceptual]
- How do cross-modal embeddings (e.g., CLIP) align text and image in a shared space? What is contrastive loss? [conceptual]
- You’re choosing an embedding model for a production RAG system — what dimensions do you evaluate (dimensionality, latency, domain coverage, MTEB scores)? [design]
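The lookup-equals-matmul question above can be verified in a few lines of NumPy (toy sizes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 4
E = rng.standard_normal((vocab, d))  # embedding matrix: one row per token ID

token_id = 7
one_hot = np.zeros(vocab)
one_hot[token_id] = 1.0

# Row indexing and one-hot @ matrix yield the same vector; the lookup is
# just the O(d) shortcut for the O(vocab * d) matrix multiply.
assert np.allclose(E[token_id], one_hot @ E)
```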
1.3 Attention
- Explain scaled dot-product attention step by step. Why divide by √d_k? [conceptual]
- What is the causal mask and why is it necessary for decoder-only LLMs? [conceptual]
- Why does vanilla self-attention scale as O(L²) in sequence length? What are the practical consequences? [conceptual]
- Explain multi-head attention (MHA). Why use multiple heads instead of one big head? [conceptual]
- What is cross-attention and where is it used? [conceptual]
- Compare MHA, MQA, and GQA — how does each affect the KV cache and quality? [tradeoff]
- Explain the KV cache. Why do we need it during autoregressive decoding? [conceptual]
- How does FlashAttention improve performance without changing the attention math? [conceptual]
- What is sparse attention (e.g., sliding window)? What do you gain and lose? [tradeoff]
- How can you turn vanilla softmax attention into “linear attention”? What’s the tradeoff? [conceptual]
- Why does long-context prefill dominate time-to-first-token (TTFT)? [conceptual]
- You doubled the context window from 4K to 128K — what breaks and what gets expensive? [estimation]
- Derive the memory cost of the KV cache for a single transformer layer as a function of batch size, sequence length, number of KV heads, and head dimension. [estimation]
- What is Multi-head Latent Attention (MLA) as used in DeepSeek V2? How does it compress the KV cache more aggressively than GQA? [conceptual]
- Explain Ring Attention and its Striped Attention variant. How do they distribute long sequences across multiple GPUs for training? [conceptual]
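The per-layer KV cache derivation asked for above can be checked numerically. A minimal sketch, assuming K and V are cached in FP16 (2 bytes per element):

```python
def kv_bytes_per_layer(batch, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V each store batch * seq_len * n_kv_heads * head_dim elements,
    hence the leading factor of 2."""
    return 2 * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Example: batch=1, 8K context, 8 KV heads (GQA), head_dim=128, FP16.
per_layer = kv_bytes_per_layer(1, 8192, 8, 128)
print(per_layer / 2**20, "MiB per layer")  # 32.0 MiB
```

Multiplying by the layer count gives the whole-model figure, which is why GQA (fewer KV heads) and MLA (latent compression) attack this term directly.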
1.4 FFN, Activations, Residuals, Normalization
- Why do transformers need the FFN/MLP if attention already mixes information across tokens? [conceptual]
- Compare ReLU, GELU, Swish, and SwiGLU — why are gated activations preferred in modern LLMs? [tradeoff]
- What is the “dying ReLU” problem? [conceptual]
- Why do residual (skip) connections help train deep networks? [conceptual]
- Compare pre-norm vs post-norm transformers — which is more stable at depth and why? [tradeoff]
- Compare LayerNorm and RMSNorm. Why do modern LLMs favor RMSNorm? [tradeoff]
- What is DeepNorm? When would you need it? [conceptual]
- What happens if you remove all residual connections from a 70-layer transformer? [debug]
- The FFN typically has a “hidden dimension” 4× the model dimension (or 8/3× with SwiGLU). Why this ratio? What is the memory/compute implication? [conceptual]
- What is the “residual stream” view of transformers? How does it help reason about feature circuits and superposition? [conceptual]
1.5 Positional Encoding
- Why does a transformer need positional encoding? What happens without it? [conceptual]
- Explain sinusoidal positional encoding. Why sin/cos with different frequencies? [conceptual]
- What is RoPE (Rotary Position Embedding)? Give the one-line intuition. [conceptual]
- Compare absolute, relative, RoPE, and ALiBi — what’s the one-sentence difference? [tradeoff]
- What is “length extrapolation” and why do some position schemes degrade at longer contexts? [conceptual]
- How do NTK-aware scaling and YaRN extend RoPE to longer contexts? [conceptual]
- Your model was trained at 8K context but users need 128K — what are your options? [design]
- Why do ALiBi-style models claim to need no positional embeddings at all? What are they actually doing? [conceptual]
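For the sinusoidal question above, the original Vaswani et al. formulation (PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos of the same angle) is short enough to implement directly; a minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Each dimension pair (2i, 2i+1) oscillates at its own frequency,
    from wavelength 2*pi up to 10000*2*pi, so positions get unique codes."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dims: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)
# Position 0 is all (sin 0, cos 0) = (0, 1) pairs, and values stay in [-1, 1].
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```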
1.6 Decoding
- Compare greedy search, beam search, top-k, top-p (nucleus), and temperature sampling. [conceptual]
- What does temperature do mathematically? What fails at very low T vs very high T? [conceptual]
- Why can beam search reduce diversity and produce bland outputs? [conceptual]
- Top-k vs top-p — which is more robust across different prompts and why? [tradeoff]
- Explain “best-of-N” sampling. When would you use it vs single-sample generation? [tradeoff]
- What is self-consistency / majority voting and when is it useful? [conceptual]
- What are “stop sequences” and why do they matter for tool calls and structured outputs? [conceptual]
- How do repetition penalties work? What can they accidentally break? [debug]
- A user complains the model is “too repetitive” — what decoding knobs do you check first? [debug]
- Explain min_p sampling. How does it differ from top-p and why has it gained popularity in open-source inference? [conceptual]
- How does guided/constrained decoding (e.g., Outlines, LMFE) enforce a grammar or JSON schema during generation? What are the runtime costs? [conceptual]
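The temperature and top-p questions above both come down to a few lines of math over the logits. A minimal sketch (NumPy, illustrative only, not any engine's actual sampler):

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=np.random.default_rng(0)):
    """Temperature divides logits before softmax (low T -> greedy, high T ->
    uniform); top-p keeps the smallest set of tokens whose cumulative
    probability reaches p, then renormalizes over that nucleus."""
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending prob
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # the nucleus
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

tok = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_p=0.9)
```

min_p differs in that it keeps tokens whose probability is at least `min_p` times the top token's probability, so the cutoff adapts to how peaked the distribution is.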
1.7 Architecture
- Compare encoder-only, decoder-only, encoder-decoder, and PrefixLM architectures. [conceptual]
- Why are most modern chat LLMs decoder-only? [conceptual]
- When would you prefer encoder-only (BERT-style) over decoder-only? [tradeoff]
- What is Mixture of Experts (MoE)? Why does routing matter? [conceptual]
- In MoE, what is load balancing and why can it fail? [debug]
- What architectural choice tends to dominate inference memory at decode time? [conceptual]
- Give an overview of state-space models (Mamba). What bottleneck do they target? [conceptual]
- Compare transformers vs SSMs vs recurrent hybrids (RWKV) for long-context tasks. [tradeoff]
- What are diffusion language models and when might they outperform autoregressive generation? [conceptual]
- Explain the “modern decoder recipe”: RoPE + RMSNorm + SwiGLU + GQA. Why do LLaMA, Mistral, Qwen, and most open-source models converge on this pattern? [conceptual]
- What is “parallel attention + FFN” as used in PaLM? What is the compute saving vs the quality tradeoff? [tradeoff]
- What is Multi-Token Prediction (MTP) as used in DeepSeek V3? How does predicting multiple future tokens improve training signal and enable speculative decoding? [conceptual]
Chapter 2 — Pretraining
2.1 Data Pipeline & Quality
- Walk through a pretraining data pipeline: sources → filtering → dedup → tokenization → mixture. [design]
- Why is “data quality > data quantity” the most important pretraining principle? [conceptual]
- What filters would you apply to a Common Crawl dataset before pretraining? [design]
- Explain train-train deduplication vs train-test contamination — why does each matter? [conceptual]
- How would you build a pipeline that detects and prevents benchmark contamination? [design]
- You found that your model memorized chunks of training data verbatim — how do you diagnose and fix this? [debug]
- What data governance practices matter for pretraining (licensing, PII, audit trails)? [design]
- How do you handle PII in training data at scale? [design]
- What are common deduplication strategies (exact hash, MinHash, embedding similarity)? [conceptual]
- You need to create a classifier that filters toxic content from your pretraining corpus. What architecture do you use, and how do you handle false positives removing valuable data? [design]
- What is the role of “quality classifiers” (e.g., trained on Wikipedia vs random web) in pretraining data pipelines? How did Llama 3 approach this? [conceptual]
2.2 Data Mixture & Distribution
- Why does the training data mixture matter? Give an example where adding more code data hurts chat quality. [tradeoff]
- How would you design a data mixture for a general-purpose LLM? [design]
- You added domain-specific data and domain performance improved but general benchmarks dropped — what happened? [debug]
- What is “curriculum learning” in the context of pretraining? [conceptual]
- How do you estimate the marginal value of adding a new data source to the mixture? What proxy metrics can you track before running a full pretrain? [design]
2.3 Compute & Scaling
- What are the main cost drivers of pretraining (parameters, context, tokens, precision)? [conceptual]
- Explain the Chinchilla scaling laws — what is the compute-optimal ratio of parameters to tokens? [conceptual]
- Back-of-the-envelope: how many FLOPs to train a 70B model on 2T tokens? [estimation]
- Doubling model parameters roughly does what to FLOPs per token? [estimation]
- Explain data parallelism, tensor parallelism, and pipeline parallelism. When do you use each? [conceptual]
- What is ZeRO / FSDP and how does it reduce memory? [conceptual]
- What is activation checkpointing and what’s the tradeoff? [tradeoff]
- Explain mixed-precision training (BF16/FP16). Why does BF16 dominate for LLMs? [tradeoff]
- What is the “roofline model” and how does it help reason about GPU utilization? [conceptual]
- Estimate the minimum number of H100 GPUs and wall-clock time needed to pretrain a 7B-parameter model on 1T tokens. State your assumptions. [estimation]
- What is sequence parallelism and how does it complement tensor parallelism for long-context training? [conceptual]
- Explain the difference between Megatron-LM’s 3D parallelism and DeepSpeed ZeRO-3. When do you use each? [tradeoff]
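The FLOPs estimation questions above all start from the standard ~6ND approximation (roughly 2 FLOPs per parameter per token forward, 4 backward, ignoring the attention seq-length term). A worked sketch, with the utilization figure an assumption for illustration:

```python
def train_flops(n_params, n_tokens):
    """Chinchilla-style estimate: total training compute ~ 6 * N * D."""
    return 6 * n_params * n_tokens

flops = train_flops(70e9, 2e12)
print(f"{flops:.2e} FLOPs")  # 8.40e+23

# Rough GPU-time sanity check, assuming 40% utilization of an H100's
# ~990 TFLOPS BF16 (the 40% MFU figure is an assumption, not a spec).
gpu_seconds = flops / (990e12 * 0.40)
print(gpu_seconds / 86400 / 30, "H100-months of compute")
```

Doubling parameters doubles FLOPs per token under this model, which is the quick answer to the second estimation question.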
2.4 Training Recipe & Monitoring
- What’s a typical learning rate schedule for LLM pretraining? [conceptual]
- Loss spiked 3× mid-run then slowly recovered — how do you triage? [debug]
- What are the top 5 failure modes in pretraining (instability, contamination, memorization, distribution shift, safety regression)? [conceptual]
- How do you set up a “probe set” for monitoring quality during pretraining? [design]
- Your training loss is decreasing but downstream eval scores plateaued — what’s wrong? [debug]
- When do you decide to stop pretraining? What signals do you look at? [tradeoff]
- What is gradient norm monitoring? How does a sudden gradient norm spike indicate a training issue, and what should you do about it? [debug]
2.5 Evaluation & Downstream Impact
- What does perplexity measure and what are its limitations? [conceptual]
- Lower perplexity doesn’t always mean better instruction following — why? [conceptual]
- How do pretraining choices constrain mid-training, SFT, and inference? [tradeoff]
- Your base model is weak at code — should you fix it in pretraining or later? [tradeoff]
- You pretrained two models with identical architectures but different data mixtures. Model A has lower perplexity but Model B scores higher on downstream tasks. How do you explain this? [debug]
Chapter 3 — Mid-Training / Continued Pretraining (CPT)
3.1 When & Why
- What is CPT and how does it differ from pretraining and SFT? [conceptual]
- When is CPT the right tool vs SFT vs prompt engineering? [tradeoff]
- A model has high perplexity on legal documents but good general performance — which training stage fixes this? [tradeoff]
- CPT uses the same next-token objective as pretraining but on a different distribution — what does that mean in practice? [conceptual]
- A startup wants to specialize an open-source 7B model for medical Q&A. They have 50K clinical notes. Should they do CPT, SFT, or both? Walk through the decision. [design]
3.2 Data & Mixture
- What is “general replay” and why is it critical for CPT? [conceptual]
- What replay ratio should you start with for CPT (e.g., 80/20)? How do you tune it? [design]
- What is document packing in CPT and why does it matter for throughput? [conceptual]
- Should you use curriculum learning during CPT (start with more replay, then anneal)? [tradeoff]
- How does data formatting differ between CPT and SFT? Why does packing documents without attention masking across document boundaries leak information? [conceptual]
3.3 The Stability Gap
- What is the “stability gap” in CPT and why does it happen? [conceptual]
- General benchmarks dropped sharply after starting CPT — walk through your debugging playbook. [debug]
- What role does learning rate play in CPT stability? [conceptual]
- How do regression gates work in CPT monitoring? [design]
- You ran CPT for 20B tokens and MMLU dropped 5 points. After continued training for 50B more tokens it partially recovered. Explain this “U-shaped” curve and how to set stopping criteria. [debug]
3.4 Tokenizer Extension
- When should you extend the tokenizer during CPT? [tradeoff]
- How do you initialize embeddings for new tokens added to the vocabulary? [conceptual]
- What is the “undertrained token” problem and how do you mitigate it? [debug]
3.5 Training Topology
- Compare the “three topologies”: packing (CPT), masking (SFT), rollouts (RL) — which tokens contribute gradients in each? [conceptual]
- What is the difference between CPT, SFT, DPO, and RL in terms of data construction and loss application? [conceptual]
3.6 CPT → RL Compatibility
- How can mid-training improve RL scaling (e.g., OctoThinker)? [conceptual]
- Given a product requirement, how do you decide between prompt → SFT → DPO → RL → CPT → distill? [design]
Chapter 4 — Post-Training
4.1 SFT (Supervised Fine-Tuning)
- What is SFT and when is it the first lever to pull? [conceptual]
- Describe the four shapes of SFT data: single-turn, multi-turn, tool-use trajectory, safety demonstration. [conceptual]
- What is the “chat template trap”? How can a whitespace change break your model? [debug]
- What is completion-only loss (user-token masking) and why does it improve SFT? [conceptual]
- Write the masked SFT loss formula. What does m_t = 0 vs m_t = 1 mean? [conceptual]
- Your SFT model echoes the user’s prompt back — what went wrong? [debug]
- Too many refusal examples in SFT data caused the model to refuse everything — what’s this called and how do you fix it? [debug]
- How do you ensure train-inference consistency with chat templates? [design]
- Your SFT model follows instructions but is too verbose — what do you do? [debug]
- You SFT’d on 10K examples and got great benchmark scores but users complain the model “sounds robotic.” What’s likely in your data, and how do you fix it? [debug]
- How many SFT examples do you typically need? Discuss the LIMA paper’s finding that “less is more” for alignment vs. the need for diversity. [tradeoff]
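The masked-loss question above has a one-line answer: average the negative log-likelihood only over tokens where m_t = 1. A toy NumPy sketch of completion-only loss:

```python
import numpy as np

def masked_nll(token_logprobs, loss_mask):
    """Completion-only SFT loss: mean NLL over assistant tokens only.
    m_t = 1 keeps a token's loss term; m_t = 0 (prompt/user tokens) drops it."""
    lp = np.asarray(token_logprobs, dtype=float)
    m = np.asarray(loss_mask, dtype=float)
    return -(lp * m).sum() / m.sum()

# Toy sequence: 3 prompt tokens (masked out), 2 completion tokens (kept).
lp   = [-0.1, -0.5, -0.2, -1.0, -2.0]
mask = [0, 0, 0, 1, 1]
print(masked_nll(lp, mask))  # 1.5 = (1.0 + 2.0) / 2
```

Without the mask, easy-to-predict prompt tokens dilute the gradient signal, which is one reason unmasked SFT models tend to echo the user's prompt.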
4.2 PEFT / LoRA
- Explain LoRA — what is the low-rank decomposition and why is it cheaper than full fine-tuning? [conceptual]
- LoRA rank r = 16 on a 4096×4096 weight matrix — how many trainable parameters vs full fine-tuning? [estimation]
- What is QLoRA and how does it reduce memory further? [conceptual]
- Where should you attach LoRA adapters (attention vs MLP) and why? [tradeoff]
- Merged adapter (single-tenant) vs online stacking (multi-tenant) — when do you use each? [tradeoff]
- What is adapter multi-tenancy? How do you serve one base model + 100 LoRA adapters? [design]
- A routing bug sent Tenant A’s request through Tenant B’s adapter — what’s the impact? [debug]
- How do you do regression testing per-adapter in a multi-tenant system? [design]
- Compare LoRA with other PEFT methods: prefix tuning, adapters (Houlsby), IA3. When does LoRA win and when might alternatives be better? [tradeoff]
- You trained a LoRA adapter on top of LLaMA-3-8B. The base model provider released a minor patch — can you reuse your adapter? What are the risks?  [debug]
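The r = 16 estimation question above is pure arithmetic: LoRA replaces a d_in × d_out weight update with two thin factors, A (d_in × r) and B (r × d_out). A worked sketch:

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one LoRA adapter: A (d_in x r) + B (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096                     # 16,777,216 params in the frozen matrix
lora = lora_params(4096, 4096, r=16)   # 131,072 trainable params
print(lora, f"= {lora / full:.2%} of full fine-tuning")  # ~0.78%
```

The ratio 2r/d (here 32/4096) is the quick mental shortcut for square matrices.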
4.3 Preference Alignment (DPO / RLHF / PPO)
- What is preference learning? Why is “which is better” easier to label than “write the perfect answer”? [conceptual]
- Write the KL-regularized reward maximization objective. What does β control? [conceptual]
- Explain the full RLHF pipeline: preference labels → reward model → PPO. [conceptual]
- What is reward hacking? Give a concrete example. [conceptual]
- Why is full RLHF expensive as a system (not just a training job)? [tradeoff]
- Explain DPO in one sentence — how does it avoid the full RL loop? [conceptual]
- Compare DPO vs ORPO — when do you start with which? [tradeoff]
- What is the main limitation of DPO/ORPO (hint: data coverage)? [conceptual]
- “More preferred” does not always mean “more correct” — why does this matter? [conceptual]
- Your DPO-tuned model has higher preference win rate but lower factual accuracy — what happened? [debug]
- Explain iterative DPO / online DPO. Why is generating on-policy data during training better than fixed offline preference pairs? [conceptual]
- What is Constitutional AI (CAI)? How does it reduce reliance on human labelers? [conceptual]
- You run RLHF and the model learns to be very sycophantic (“Great question!”). What went wrong in the reward model and how do you fix it? [debug]
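For the DPO question above, the per-pair loss is short enough to compute by hand: the negative log-sigmoid of β times the policy-vs-reference log-probability margin between chosen and rejected responses. A toy sketch with made-up sequence log-probs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO per-pair loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy (vs the frozen reference) prefers the chosen
    response over the rejected one. beta plays the KL-coefficient role
    from the RLHF objective."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does:
# margin > 0, so the loss sits below log(2) ~ 0.693.
print(dpo_loss(-10.0, -14.0, -11.0, -13.0))
```

This is why DPO needs no reward model or rollouts at train time: the "reward" is implicit in the two log-prob ratios.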
4.4 Tool Use & RAG Training
- Why isn’t prompting alone sufficient for reliable tool use in production? [conceptual]
- Describe the tool-use interaction chain: select tool → fill arguments → consume output → answer. [conceptual]
- What is the difference between tool underuse (hallucinating) and tool overuse (unnecessary calls)? [debug]
- What is constrained decoding for tool calls? Why is schema validity different from semantic correctness? [tradeoff]
- How is RAG a special case of tool use? [conceptual]
- Your agent hallucinates answers instead of calling the retrieval tool — how do you fix this? [debug]
- How do you construct training data for multi-turn tool use? Discuss synthetic rollout generation, filtering by success, and the cold-start problem. [design]
4.5 Reasoning & Agentic RL
- Compare outcome reward models (ORM) vs process reward models (PRM). [tradeoff]
- What are STaR/ReST-style self-training loops? Why do they help with trajectory data scarcity? [conceptual]
- Explain GRPO — how does it avoid a learned critic and save memory? [conceptual]
- Why does GRPO matter for training 70B+ reasoning models? [tradeoff]
- What is reward hacking in reasoning RL? How does it manifest differently from chat alignment? [debug]
- Your reasoning model got longer and longer outputs over RL training — what’s happening? [debug]
- Compare Dr. GRPO, GSPO, DAPO, LUFFY — what knobs does each tune? [conceptual]
- Explain the “Aha moment” from DeepSeek R1-Zero. What emergent behaviors appeared during RL training without SFT, and why is this significant? [conceptual]
- What is the difference between chain-of-thought prompting, chain-of-thought fine-tuning, and training a reasoning model with RL? When do you use each? [tradeoff]
4.6 Distillation
- Black-box vs white-box distillation — when do you use each? [tradeoff]
- Write the standard KD loss (KL divergence between teacher and student distributions). [conceptual]
- Your distilled student matches the teacher on benchmarks but fails on safety — what went wrong? [debug]
- How do you manage teacher version drift when the teacher is a closed API? [design]
- What is the practical order: black-box first → white-box refinement? [design]
- DeepSeek R1 distilled reasoning from R1 (671B MoE) into dense 1.5B–70B student models. What data was used, what was the training recipe, and how close did students get? [conceptual]
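For the KD-loss question above, the standard form is the KL divergence between temperature-softened teacher and student distributions, scaled by T². A toy NumPy sketch over one position's logits:

```python
import numpy as np

def kd_kl(teacher_logits, student_logits, T=2.0):
    """Forward KL(teacher || student) on temperature-softened softmaxes,
    scaled by T^2 so gradient magnitudes stay comparable as T grows
    (the convention from Hinton et al.'s distillation paper)."""
    def softmax(z):
        z = np.asarray(z, dtype=float) / T
        e = np.exp(z - z.max())
        return e / e.sum()
    p, q = softmax(teacher_logits), softmax(student_logits)
    return T * T * float(np.sum(p * (np.log(p) - np.log(q))))

loss = kd_kl([4.0, 1.0, 0.5], [3.0, 2.0, 0.0])
# Identical logits give zero KL; any mismatch gives a positive loss.
assert kd_kl([4.0, 1.0, 0.5], [4.0, 1.0, 0.5]) < 1e-12 and loss > 0
```

In white-box distillation this term is summed over sequence positions and usually mixed with the ordinary cross-entropy on hard labels.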
4.7 Post-Training System Design
- Design a “Medical Scribe” that knows rare drug names (CPT) but refuses to prescribe (alignment + eval). [design]
- Fixed compute budget: DPO (training) or test-time scaling (inference)? How do you decide? [tradeoff]
- Your SFT model formats tool calls wrong — add more data or switch to constrained decoding? [debug]
- Your agent fails on multi-step tool chaining — what do you change in data, reward, and search? [design]
- You’re the ML lead for a legal-tech startup. You have: an open-source 8B model, 100K contract documents, 5K labeled QA pairs, and a 2-person team. Design the full post-training pipeline from CPT to production. [design]
Chapter 5 — Common Models & Benchmarks
5.1 Architecture Families
- Compare BERT, T5, GPT, LLaMA, Mistral, Qwen, GLM, and DeepSeek — what’s the core architecture choice in each? [conceptual]
- What is the “modern decoder recipe” (RoPE + RMSNorm + SwiGLU + GQA)? Which models use it? [conceptual]
- PaLM uses parallel transformer sublayers — what does that mean and why? [conceptual]
- What is Multi-head Latent Attention (MLA) in DeepSeek V2/V3? [conceptual]
- DeepSeek V3 uses Multi-Token Prediction (MTP) — what is it and why does it help? [conceptual]
- How does DeepSeek R1 train reasoning without SFT (R1-Zero)? [conceptual]
- What is sliding-window attention in Mistral and what does it sacrifice? [tradeoff]
- Mixtral is an MoE: 8 experts, top-2 routing, 47B total / ~13B active — why is this efficient? [conceptual]
- Trace the evolution: GPT-2 → GPT-3 → GPT-3.5 → GPT-4. What changed at each step (scale, RLHF, multimodality, MoE)? [conceptual]
- Compare LLaMA 1 vs LLaMA 2 vs LLaMA 3. What were the key training recipe changes (not just scale) at each generation? [conceptual]
- What is Gemma/Gemini’s approach to long context (up to 1M tokens)? How does it differ from the RoPE-extension approach? [conceptual]
5.2 Evaluation & Benchmarks
- Name a good minimal benchmark suite for a general-purpose LLM (knowledge, math, code, chat, safety). [design]
- What is pass@k in coding benchmarks and how does it differ from pass@1? [conceptual]
- MMLU vs MMLU-Pro — what’s the difference? [conceptual]
- What is the Chatbot Arena (LMSYS) and why is Elo-based ranking useful? [conceptual]
- Your model scores 90% on MMLU but fails on real customer queries — what’s the gap? [debug]
- What are common evaluation pitfalls (prompting format, temperature, contamination, scoring method)? [conceptual]
- LLM-as-a-judge: what biases exist and how do you mitigate them? [tradeoff]
- When would you use human preference tests over automatic benchmarks? [tradeoff]
- How do you evaluate long-context models (needle-in-a-haystack, LongBench)? [design]
- What is TruthfulQA and what failure mode does it target? [conceptual]
- How do you evaluate multimodal models (MMMU, MathVista)? [conceptual]
- How do you detect benchmark contamination (data leakage) in a trained model? [design]
- You need to evaluate a model’s safety: what benchmarks and red-team methods do you use? [design]
- What is the difference between “static” benchmarks (MMLU, HumanEval) and “dynamic” benchmarks (Chatbot Arena, LiveCodeBench)? Why are dynamic benchmarks becoming more important? [conceptual]
- Design an internal evaluation suite for a customer-facing chatbot. What dimensions do you test (accuracy, safety, latency, cost, style)? How do you weight them? [design]
Chapter 6 — Inference & Compression
6.1 Inference Physics
- Explain the two phases of LLM inference: prefill vs decode. Why do they have different bottlenecks? [conceptual]
- Prefill is compute-bound; decode is memory-bandwidth-bound — explain why. [conceptual]
- What is arithmetic intensity and how does it determine whether a workload is compute-bound or memory-bound? [conceptual]
- What is TTFT and what dominates it? What is TPOT/ITL and what dominates it? [conceptual]
- TTFT is too high — walk through your debugging steps. [debug]
- TPOT got worse after a config change — what do you check? [debug]
- An H100 has ~3.35 TB/s memory bandwidth and ~990 TFLOPS BF16. A 70B model in BF16 is ~140GB. What’s the theoretical maximum tokens/second during decode (batch size = 1)? What’s the bottleneck? [estimation]
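The H100 estimation question above is a one-liner once you see that at batch size 1, every decoded token must stream all model weights from HBM once:

```python
def decode_tokens_per_sec(weight_bytes, bandwidth_bytes_per_sec):
    """Batch-1 decode upper bound: tokens/s <= bandwidth / model size.
    Ignores the KV cache and activations, so the real number is lower."""
    return bandwidth_bytes_per_sec / weight_bytes

tps = decode_tokens_per_sec(140e9, 3.35e12)  # 70B params in BF16, one H100
print(f"~{tps:.1f} tokens/s upper bound")    # ~23.9 -> memory-bandwidth-bound
```

The compute side (~990 TFLOPS vs the ~2 FLOPs/param/token needed) is nowhere near saturated at this rate, which is exactly why batching and weight quantization lift decode throughput.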
6.2 KV Cache
- Estimate KV cache size: given layers=80, KV heads=8, head dim=128, FP16 — how many bytes per token? [estimation]
- At 8K context, how much KV memory per sequence from that calculation? [estimation]
- How does GQA reduce KV cache vs MHA? Give the scaling factor. [conceptual]
- What is PagedAttention? How is it like virtual memory? [conceptual]
- Why does naive contiguous KV allocation waste memory? [conceptual]
- What happens when you quantize the KV cache (FP16 → INT8)? What tasks degrade first? [tradeoff]
- Your model runs out of memory on long prompts — walk through your mitigation steps. [debug]
- Compare KV cache compression techniques: quantization, eviction (H2O, StreamingLLM-style sink tokens), sliding window, cross-layer sharing. What works best at extreme context lengths? [tradeoff]
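The two estimation questions at the top of this section can be worked end-to-end; a sketch assuming FP16 (2 bytes per element) for both K and V:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Factor of 2 for storing both K and V at every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(80, 8, 128)  # 327,680 bytes = 320 KiB/token
per_seq = per_tok * 8192                  # 8K-token sequence
print(per_tok / 1024, "KiB/token;", per_seq / 2**30, "GiB per 8K sequence")
```

At 320 KiB/token an 8K sequence costs 2.5 GiB, so a modest batch of long sequences can dwarf the weights themselves, motivating GQA, KV quantization, and PagedAttention.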
6.3 System Optimizations
- What is continuous batching (in-flight batching) and how does it differ from static batching? [conceptual]
- Why does admission control matter for continuous batching? [conceptual]
- What is chunked prefill and how does it solve the “convoy effect”? [conceptual]
- What is prefix caching (prompt caching)? When does it give big wins? [conceptual]
- Explain speculative decoding: draft model → target verify. What’s the win condition? [conceptual]
- What is guided/constrained decoding and why does it matter for JSON/tool outputs? [conceptual]
- Compare vLLM, TGI, TensorRT-LLM, and SGLang — what are the key architectural differences and when would you choose each? [tradeoff]
6.4 Kernel Optimizations
- How does FlashAttention reduce HBM traffic? [conceptual]
- What is kernel fusion and why does it help both prefill and decode? [conceptual]
- FlashAttention vs PagedAttention — they solve different problems. Explain. [conceptual]
- What is FlashDecoding and how does it parallelize the decode phase across the KV sequence dimension? [conceptual]
6.5 Serving Architecture
- Why disaggregate prefill and decode onto separate fleets (P/D split)? [design]
- What is Multi-LoRA serving (the “Bento” pattern)? [design]
- How do you batch requests by adapter_id in a Multi-LoRA system? [design]
- What is “tenant bleed” in multi-adapter serving and how do you prevent it? [debug]
- How would you design autoscaling for an LLM inference service? What metrics should trigger scale-up vs. scale-down (queue depth, GPU utilization, p99 latency)? [design]
6.6 Compression
- Compare weight-only quantization, activation quantization, and KV quantization — when do you use each? [tradeoff]
- What is the “outlier problem” in LLM quantization? Why does naive quantization fail? [conceptual]
- What are AWQ and SmoothQuant? What problem does each solve? [conceptual]
- Compare unstructured pruning, structured pruning (N:M sparsity), and architectural sparsity (MoE). [tradeoff]
- What is knowledge distillation for inference? Compare response distillation vs logit distillation. [tradeoff]
- Why can pruning introduce non-linear “quality cliffs”? [conceptual]
- Compare GPTQ, AWQ, and GGML/GGUF quantization. Which is post-training, which is calibration-based? When do you use each? [tradeoff]
- What is quantization-aware training (QAT)? How does it differ from post-training quantization (PTQ) and when is the extra cost worth it? [tradeoff]
6.7 Test-Time Scaling
- What is test-time scaling? How does it improve reliability without retraining? [conceptual]
- Compare best-of-N, self-consistency voting, critique-revise loops, and tree search (MCTS). [tradeoff]
- When would you spend compute on test-time scaling vs retraining? [tradeoff]
- Explain how process reward models (PRMs) enable tree search at inference time. What is the tradeoff between breadth (more candidates) and depth (more reasoning steps)? [conceptual]
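The self-consistency variant compared above is mechanically trivial, which is part of its appeal. A minimal sketch of the voting step, assuming the N chains have already been sampled and their final answers extracted:

```python
from collections import Counter

def self_consistency(answers):
    """Self-consistency voting: sample N chains of thought at nonzero
    temperature, extract each chain's final answer, return the mode."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains for a math problem yield these final answers:
print(self_consistency(["42", "42", "41", "42", "7"]))  # 42
```

All the cost is in generating the N samples; best-of-N differs only in replacing the vote with a reward-model argmax over candidates.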
6.8 Metrics & Evaluation
- What are the four core inference metrics (TTFT, TPOT, throughput, $/1M tokens)? [conceptual]
- How do you set up quality regression gates after quantization? [design]
- You quantized your model and needle-in-a-haystack accuracy dropped — why? [debug]
- Your users report “variable latency” — sometimes fast, sometimes 5× slower for similar prompts. Diagnose the issue across batching, caching, and queue depth. [debug]
6.9 System Design Drills
- Why does increasing batch size improve throughput but hurt per-request latency? [conceptual]
- Design an inference system for a latency-sensitive chat product. [design]
- Design an inference system for batch-mode document processing (throughput > latency). [design]
- Your service handles 10K requests/sec with long RAG contexts — what architecture do you propose? [design]
- You need to serve a 70B model but only have a single A100 80GB. Walk through the options (quantization, offloading, smaller model, distillation). [design]
- Compare cloud inference (OpenAI API, Bedrock, Vertex AI) vs. self-hosted (vLLM on GPU, Ollama on Mac) — what factors determine the right choice? [tradeoff]
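For the 70B-on-one-A100 drill, the first step is always the memory arithmetic. A rough sketch (the 10% overhead factor is an illustrative assumption for runtime buffers, and KV cache is excluded here, which matters: the INT8 case technically fits the weights but leaves almost no headroom for cache):

```python
def serving_memory_gb(params_b, bytes_per_param, kv_gb=0.0, overhead_frac=0.10):
    """Rough VRAM estimate: weights + KV cache, plus a runtime overhead factor.
    params_b is the parameter count in billions, so weights come out in GB."""
    weights_gb = params_b * bytes_per_param
    return (weights_gb + kv_gb) * (1 + overhead_frac)

A100_GB = 80
for name, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    need = serving_memory_gb(70, bpp)
    verdict = "fits" if need < A100_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB -> {verdict} on one A100 80GB")
```

This is why the practical answers to the drill are INT4 quantization (AWQ/GPTQ), CPU offloading at a large latency cost, or switching to a smaller or distilled model.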
Chapter 7 — Applications / Agents
7.1 Agent Fundamentals
- What is an “agentic application”? How does it differ from a single LLM call? [conceptual]
- Workflow vs. agent — when do you use each? Give a decision rule. [tradeoff]
- Describe the agent control loop: Sense → Think → Act → Check → Update State → Repeat. [conceptual]
- What are common agent products (enterprise search, support triage, deep research, coding agent, ops automation)? [conceptual]
- When should you NOT use an agent? [tradeoff]
- What is the “bitter lesson” for agents — why do simple designs with strong models outperform complex frameworks? [conceptual]
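The Sense → Think → Act → Check → Update State loop above fits in a few lines of code, which is worth internalizing before an interview. A minimal sketch where `policy` is a hypothetical stand-in for the LLM call (returning either a tool request or a final answer) and the iteration cap doubles as a cost cap:

```python
def agent_loop(goal, policy, tools, check, max_iters=8):
    """Minimal agent control loop. `policy(state)` returns either
    ("tool", name, args) or ("final", answer)."""
    state = {"goal": goal, "observations": []}
    for _ in range(max_iters):                       # hard cap = cost control
        action = policy(state)                       # Think
        if action[0] == "final":
            return action[1]
        _, name, args = action
        obs = tools[name](**args)                    # Act
        if not check(obs):                           # Check before trusting
            obs = {"error": "validation failed", "raw": obs}
        state["observations"].append((name, obs))    # Update state
    return None                                      # budget exhausted: escalate

# Toy run: one lookup, then answer from the observation.
tools = {"lookup": lambda key: {"value": 7}}
def policy(state):
    if not state["observations"]:
        return ("tool", "lookup", {"key": "x"})
    return ("final", state["observations"][-1][1]["value"])

print(agent_loop("find x", policy, tools, check=lambda o: "value" in o))
```

Everything a production framework adds (routing, retries, guardrails, observability) hangs off one of these five steps, which is a useful way to structure design answers.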
7.2 Architecture Components
- What is the “policy” in an agent system (model + routing + decoding settings)? [conceptual]
- What are output contracts and why are they critical for production agents? [design]
- Explain the validate → retry → repair prompting pattern. [design]
- What belongs in agent state (structured facts) vs. context (token payload per turn)? [design]
- When do you use an explicit planner vs. implicit planning in the policy prompt? [tradeoff]
- What are the common orchestration patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer? [conceptual]
- Explain the “ReAct” pattern (Reason + Act). How is it different from pure chain-of-thought? What are its failure modes? [conceptual]
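The validate → retry → repair pattern is concrete enough to sketch. Here the contract is “valid JSON passing a validator,” and on failure the error is fed back as a repair instruction; `generate` is a hypothetical placeholder for one LLM call:

```python
import json

def validated_call(generate, validate, max_retries=2):
    """validate -> retry -> repair: call the model, check the output contract,
    and on failure feed the error back so the model can repair its output."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            out = json.loads(raw)
            validate(out)                  # raises ValueError on contract breach
            return out
        except (json.JSONDecodeError, ValueError) as e:
            feedback = f"Your last output was invalid ({e}). Return only valid JSON."
    raise RuntimeError("output contract not met after retries")

# Toy model: first attempt has trailing prose, second attempt is clean JSON.
attempts = iter(['{"amount": 10} thanks!', '{"amount": 10}'])
def fake_model(feedback):
    return next(attempts)
def validate(out):
    if "amount" not in out:
        raise ValueError("missing 'amount'")

result = validated_call(fake_model, validate)
print(result)   # {'amount': 10}
```

The key design point is that validation is deterministic code outside the model: the model only sees the failure description, never gets to judge its own output.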
7.3 Tool Interface Design
- What makes a good tool contract (schema, validation, timeouts, retries, idempotency, observability)? [design]
- What is MCP (Model Context Protocol) and how does it relate to tool serving? [conceptual]
- What is idempotency for write tools and why is it critical? [conceptual]
- How do you risk-rate tools (read_low → write_medium → write_high)? [design]
- Your agent made a duplicate refund because the tool wasn’t idempotent — what’s the fix? [debug]
- How do you enforce tool permissions outside the model (allowlists, authz)? [design]
- How do you handle tool errors gracefully? Should the error message go back to the model, and what are the risks of that? [design]
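The standard fix for the duplicate-refund scenario is an idempotency key: the caller attaches a stable key per *intent*, and the tool replays the stored result for repeated keys instead of repeating the side effect. A sketch (class and field names are illustrative; a real implementation would persist `_seen` with a TTL):

```python
import uuid

class RefundTool:
    """A write tool made idempotent via caller-supplied idempotency keys."""
    def __init__(self):
        self._seen = {}            # idempotency_key -> previous result
        self.refunds_issued = 0
    def refund(self, order_id, amount, idempotency_key):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]   # replay, no new side effect
        self.refunds_issued += 1                 # the real side effect
        result = {"order_id": order_id, "amount": amount, "status": "refunded"}
        self._seen[idempotency_key] = result
        return result

tool = RefundTool()
key = str(uuid.uuid4())                  # generated once per refund intent
r1 = tool.refund("A-17", 25.0, key)
r2 = tool.refund("A-17", 25.0, key)      # retry after a timeout or agent re-loop
print(tool.refunds_issued)               # 1 -- the duplicate was absorbed
```

The subtle interview point: the key must be derived from the intent (order + reason), not generated fresh per call, or retries defeat the mechanism.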
7.4 RAG Pipeline
- Walk through a RAG pipeline: ingest → chunk → embed → index → retrieve → rerank → generate. [design]
- Why does chunk size matter? What happens if chunks are too small or too big? [tradeoff]
- Compare dense retrieval (embeddings), sparse retrieval (BM25), and hybrid — when do you use each? [tradeoff]
- What is a cross-encoder reranker? Why is it better than bi-encoder similarity but more expensive? [tradeoff]
- How do you evaluate retrieval quality (recall@k, MRR, nDCG)? [conceptual]
- Your RAG system retrieves the right documents but the model ignores them — what’s wrong? [debug]
- How do you defend against prompt injection through retrieved documents? [design]
- What are query rewriting and multi-hop retrieval? When is each needed? [conceptual]
- How do you handle stale indexes in a production RAG system? [design]
- Compare vector databases (Pinecone, Weaviate, Qdrant, pgvector) — what are the tradeoffs? [tradeoff]
- You built a RAG system for a 10M-document legal corpus. Retrieval latency is acceptable but answer quality is poor. Walk through a systematic diagnosis: is it chunking, the embedding model, retrieval, reranking, or generation? [debug]
- What is “agentic RAG”? How does it differ from single-shot retrieve-and-generate (e.g., iterative retrieval, query decomposition, self-reflection on retrieval quality)? [conceptual]
- How do you handle tables, images, and structured data in a RAG pipeline? [design]
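Recall@k and MRR come up constantly in retrieval-evaluation questions and are short enough to write from memory. A self-contained sketch with toy document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(queries):
    """Mean reciprocal rank of the first relevant doc per query (0 if absent)."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        rr = 0.0
        for rank, doc in enumerate(ranked_ids, start=1):
            if doc in relevant_ids:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(queries)

queries = [
    (["d3", "d1", "d9"], {"d1"}),   # first relevant doc at rank 2 -> RR 0.5
    (["d7", "d2", "d4"], {"d7"}),   # rank 1 -> RR 1.0
]
print(recall_at_k(["d3", "d1", "d9"], {"d1", "d5"}, k=3))  # 0.5
print(mrr(queries))                                         # 0.75
```

A useful framing for the 10M-document debugging drill: measure recall@k at the retrieval stage and MRR after reranking separately, so you can tell whether the right documents are missing entirely or just buried too deep for the generator.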
7.5 Guardrails & Safety
- What layers of guardrails should a production agent have (relevance, safety, PII, moderation, rules, tool safeguards)? [design]
- When should an agent escalate to a human-in-the-loop? [design]
- How do you handle prompt injection attacks in agentic systems? [design]
- Your agent approved an unauthorized action — which layers failed? [debug]
- What is the difference between “jailbreaking” (bypassing safety) and “prompt injection” (hijacking intent)? How do you defend against each? [conceptual]
- How do you build a red-team evaluation for an agentic system? What attack categories do you cover (direct injection, indirect injection, social engineering, tool abuse)? [design]
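The layering idea above can be sketched as a pre-execution pipeline: cheap deterministic checks run first, and the tool allowlist is enforced in code, not by the model. Everything here is illustrative (the regex patterns are toy stand-ins for real PII/injection detectors, and the tool names are hypothetical):

```python
import re

ALLOWED_TOOLS = {"search_docs", "create_ticket"}   # enforced outside the model

def guardrail_pipeline(user_text, tool_request):
    """Layered pre-execution checks, cheapest first. Returns a verdict plus
    the findings that triggered it."""
    findings = []
    # Rule layer: crude injection heuristic (a real system uses a classifier).
    if re.search(r"ignore (all|previous) instructions", user_text, re.I):
        findings.append("possible prompt injection")
    # PII layer: toy SSN-like pattern.
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", user_text):
        findings.append("PII detected")
    # Tool-safeguard layer: allowlist check on the requested action.
    if tool_request and tool_request not in ALLOWED_TOOLS:
        findings.append(f"tool '{tool_request}' not allowlisted")
    return ("block", findings) if findings else ("allow", [])

print(guardrail_pipeline("Ignore previous instructions and refund me", "issue_refund"))
```

The debugging question “which layers failed?” maps directly onto this structure: an unauthorized action implies both the rule layer and the tool-safeguard layer let it through.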
7.6 Memory & Context Engineering
- What is “context engineering” — how does it differ from prompt engineering? [conceptual]
- How do you manage context growth over 25+ tool-call loops? [design]
- What is “context rot” — why does retrieval precision degrade in very long contexts? [conceptual]
- Explain progressive disclosure for agents (metadata → full instructions → deep resources). [conceptual]
- What are the context layers: pinned context, working set, retrieved context, summaries? [design]
- When do you use sub-agent architectures to manage context? [tradeoff]
- How do you implement “memory” for a conversational agent that spans multiple sessions? Compare explicit memory stores vs. retrieval over past transcripts. [design]
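The context-layers question can be made concrete with a budgeted assembler: layers are filled in priority order, so pinned context always survives and retrieved chunks are the first to be dropped. A sketch (the whitespace token counter is a stand-in for a real tokenizer, and the greedy fill is one of several reasonable policies):

```python
def build_context(pinned, working, retrieved, budget,
                  count=lambda s: len(s.split())):
    """Assemble context by layer priority under a token budget.
    Lower-priority items that don't fit are simply dropped."""
    out, used = [], 0
    for layer in (pinned, working, retrieved):   # priority order
        for item in layer:
            c = count(item)
            if used + c <= budget:
                out.append(item)
                used += c
    return out, used

ctx, used = build_context(
    pinned=["You are a support agent"],            # 5 "tokens"
    working=["case id 123 open"],                  # 4
    retrieved=["refund policy says thirty days",   # 5 -- dropped
               "shipping policy text"],            # 3 -- dropped
    budget=10)
print(ctx, used)
```

Summaries slot naturally into this scheme: when the working set itself overflows, it gets compressed into a summary that moves up a priority layer.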
7.7 Evaluation & Observability
- How do you evaluate agents as systems (task success, tool accuracy, cost, safety)? [design]
- What should an agent observability system log? [design]
- Build a debug funnel for agent failures: tooling → retrieval → context → generation → guardrails. [design]
- How do you build a replayable evaluation harness with golden tasks? [design]
- What is the cost of an agentic task? How do you estimate and control it (token budgets, max iterations, cost caps)? [estimation]
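For the cost-estimation question, a concrete mechanism is a budget guard checked each loop iteration, with a max-iteration cap as a backstop. A sketch with illustrative prices (not any vendor's):

```python
class CostBudget:
    """Token-budget guard for an agent loop: accumulate spend per step and
    stop before the cap is exceeded."""
    def __init__(self, cap_usd, price_in_per_m=1.0, price_out_per_m=3.0):
        self.cap = cap_usd
        self.pi, self.po = price_in_per_m, price_out_per_m
        self.spent = 0.0
    def charge(self, tokens_in, tokens_out):
        self.spent += tokens_in / 1e6 * self.pi + tokens_out / 1e6 * self.po
        return self.spent
    def exhausted(self):
        return self.spent >= self.cap

budget = CostBudget(cap_usd=0.05)
steps = 0
while not budget.exhausted() and steps < 50:        # iteration cap as backstop
    # One tool-loop iteration: context grows, so input tokens dominate.
    budget.charge(tokens_in=8000, tokens_out=500)
    steps += 1
print(steps, round(budget.spent, 4))
```

The example also shows why agent costs surprise people: at roughly a cent per iteration, re-sending a growing context dominates, so a five-cent cap buys only a handful of loops.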
7.8 Skills & Procedures
- What are “Agent Skills” and how do they differ from tools and prompts? [conceptual]
- Explain progressive disclosure levels for skills (metadata → full SKILL.md → linked resources). [conceptual]
- How do Skills and MCP fit together in an agent system? [design]
7.9 Agent System Design Drills
- Design a “news research agent” that finds, verifies, and summarizes news on a topic with citations. [design]
- Design a customer support triage agent with routing, tool use, and human handoff. [design]
- Design a coding/debugging agent that loops with repo search + tests as verification. [design]
- How do you prevent “search → fetch everything” tool spam in an agent? [design]
- Your agent’s latency is 30 seconds per query — where do you look first? [debug]
- Design an “AI data analyst” agent that takes a natural-language question, writes SQL, executes it, interprets the results, and visualizes them. What tools does it need? What guardrails? [design]
- You’re building a multi-agent system where a “planner” agent delegates to “executor” agents. How do you handle failure recovery when an executor fails mid-task? [design]
Chapter 8 — Taxonomy / Cross-Cutting Concepts
8.1 Key Distinctions
- Capability vs. behavior — where does each come from in the LLM lifecycle? [conceptual]
- Latency vs. throughput — when do you optimize for each? [tradeoff]
- Compute-bound vs. memory-bound — how do you tell which regime you’re in? [conceptual]
- Dense models vs. MoE — what are the pros and cons of each? [tradeoff]
- Online RL (PPO) vs. offline preference learning (DPO) — when do you choose each? [tradeoff]
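The compute-bound vs. memory-bound distinction has a standard quantitative answer: compare arithmetic intensity (FLOPs per byte moved) against the hardware ridge point. A roofline sketch with approximate A100 numbers (peak specs rounded for illustration):

```python
def regime(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline check: above the ridge point you are compute-bound,
    below it you are memory-bound."""
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw        # FLOP/byte where compute == bandwidth
    return "compute-bound" if intensity > ridge else "memory-bound"

A100_FLOPS = 312e12   # ~FP16 tensor-core peak, FLOP/s
A100_BW = 2.0e12      # ~HBM bandwidth, bytes/s (ridge ~156 FLOP/byte)

# Batch-1 decode on a 7B FP16 model: each token does ~2 FLOPs/param and
# reads ~2 bytes/param -> intensity ~1 FLOP/byte.
print(regime(2 * 7e9, 2 * 7e9, A100_FLOPS, A100_BW))
# Large-batch prefill reuses each weight across ~512 tokens -> ~512 FLOP/byte.
print(regime(2 * 7e9 * 512, 2 * 7e9, A100_FLOPS, A100_BW))
```

This single calculation explains most of Chapter 6: batching, speculative decoding, and KV-cache compression all exist to move decode up the roofline.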
8.2 Common Confusions
- “Tokenization is just preprocessing” — why is this wrong? [conceptual]
- “Lower perplexity = better model” — when does this break? [conceptual]
- “KV cache is just an O(n²) attention concern” — what else dominates? [conceptual]
- “Byte-level BPE = character-level tokenization” — what’s the difference? [conceptual]
- “Pack more data = better model” — when does packing hurt? [conceptual]
- “RAG solves hallucination” — why is this only partially true? [conceptual]
- “Bigger context window = better performance” — when does longer context actually hurt? [conceptual]
- “Quantized models are always worse” — when can a quantized model outperform a larger full-precision model? [tradeoff]
8.3 End-to-End Design
- Trace a single user query through the full stack: tokenize → embed → transform → decode → serve → return. [design]
- How do training choices (architecture, tokenizer, context length) constrain inference? [tradeoff]
- You’re building an LLM-powered product from scratch for a vertical domain — walk through the full-stack decisions. [design]
- Compare the engineering tradeoffs: latency vs. quality, compute vs. data, dense vs. MoE, tool use vs. reasoning. [tradeoff]
8.4 Cross-Chapter Integration (Advanced)
- How does tokenizer choice in pretraining affect KV cache size at inference? Trace the connection through vocab size → embedding dimension → KV heads → memory per token. [tradeoff]
- You trained a model with 4K context but need 128K for RAG at inference. Trace the full solution path through positional-encoding extension, KV cache management, chunked prefill, and retrieval pipeline design. [design]
- Compare the cost of improving a model’s coding ability via: (a) adding code data in pretraining, (b) code-focused CPT, (c) code SFT, (d) code RL with execution feedback. When is each appropriate? [tradeoff]
- A model produces incorrect tool-call JSON. Diagnose whether the problem is SFT data quality, tokenizer artifacts, the decoding strategy, or the tool schema definition. How do you tell them apart? [debug]
- Your production LLM system costs $0.50 per query; the business requires $0.05. Walk through every lever: model size, quantization, batching, caching, distillation, reducing output tokens, moving to MoE. [estimation]
- You’re evaluating three competing LLMs for deployment. Model A is the largest and highest quality, Model B is medium-sized with fast inference, Model C is the smallest and cheapest. How do you structure the evaluation to make a recommendation for an enterprise customer-support use case? [design]
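The memory-per-token trace asked for in 8.4 reduces to one formula: KV cache per token = 2 (K and V) × layers × KV heads × head dim × bytes per element. A sketch using a Llama-2-7B-like configuration as the worked example (config values from the published model card; treat them as illustrative):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache footprint per token: K and V, for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-2-7B-like MHA config: 32 layers, 32 KV heads, head_dim 128, FP16.
mha = kv_bytes_per_token(32, 32, 128)
# The same model with GQA (8 KV heads) shrinks the cache 4x per token.
gqa = kv_bytes_per_token(32, 8, 128)
print(mha / 1024, gqa / 1024)   # KiB per token: 512.0 vs 128.0
```

This is also where the tokenizer enters the trace: a larger vocabulary compresses text into fewer tokens, so the same document occupies proportionally less KV cache even though the per-token cost is unchanged.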
Total: 368 questions
- Conceptual: ~150
- Tradeoff: ~85
- Design: ~70
- Debug: ~45
- Estimation: ~15
- Coding: ~5 (referenced in chapter code files)
References
Foundational Papers
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Attention Is All You Need | Vaswani et al. | 2017 | Ch 1 (Attention, Architecture) |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al. | 2019 | Ch 1, Ch 5 |
| Language Models are Unsupervised Multitask Learners (GPT-2) | Radford et al. | 2019 | Ch 1, Ch 2 |
| Language Models are Few-Shot Learners (GPT-3) | Brown et al. | 2020 | Ch 1, Ch 2, Ch 5 |
| Neural Machine Translation of Rare Words with Subword Units (BPE) | Sennrich, Haddow, Birch | 2016 | Ch 1 (Tokenization) |
| Efficient Estimation of Word Representations in Vector Space (word2vec) | Mikolov et al. | 2013 | Ch 1 (Embeddings) |
| GloVe: Global Vectors for Word Representation | Pennington, Socher, Manning | 2014 | Ch 1 (Embeddings) |
Scaling & Pretraining
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Scaling Laws for Neural Language Models | Kaplan et al. | 2020 | Ch 2 |
| Training Compute-Optimal Large Language Models (Chinchilla) | Hoffmann et al. | 2022 | Ch 2 |
| The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Gao et al. | 2020 | Ch 2 |
| Deduplicating Training Data Makes Language Models Better | Lee et al. | 2022 | Ch 2 |
| LLaMA: Open and Efficient Foundation Language Models | Touvron et al. | 2023 | Ch 2, Ch 5 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Touvron et al. | 2023 | Ch 2, Ch 4, Ch 5 |
Post-Training & Alignment
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Training Language Models to Follow Instructions with Human Feedback (InstructGPT) | Ouyang et al. | 2022 | Ch 4 (RLHF) |
| Direct Preference Optimization (DPO) | Rafailov et al. | 2023 | Ch 4 (Preference Alignment) |
| LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2021 | Ch 4 (PEFT) |
| QLoRA: Efficient Finetuning of Quantized LLMs | Dettmers et al. | 2023 | Ch 4 (PEFT) |
| LIMA: Less Is More for Alignment | Zhou et al. | 2023 | Ch 4 (SFT) |
| Constitutional AI: Harmlessness from AI Feedback | Bai et al. | 2022 | Ch 4 (Alignment) |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (introduces GRPO) | Shao et al. | 2024 | Ch 4 (Reasoning RL) |
| STaR: Bootstrapping Reasoning With Reasoning (Self-Taught Reasoner) | Zelikman et al. | 2022 | Ch 4 (Reasoning) |
Architecture & Models
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) | Su et al. | 2021 | Ch 1 (Positional Encoding), Ch 5 |
| Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi) | Press et al. | 2022 | Ch 1, Ch 5 |
| GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Ainslie et al. | 2023 | Ch 1 (Attention), Ch 6 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu & Dao | 2023 | Ch 1 (Architecture) |
| Mixtral of Experts | Mistral AI | 2024 | Ch 5 |
| DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | DeepSeek-AI | 2024 | Ch 5 |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-AI | 2025 | Ch 4, Ch 5 |
Inference & Compression
| Paper | Authors | Year | Relevant Chapters |
|---|---|---|---|
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Dao et al. | 2022 | Ch 1, Ch 6 |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | 2023 | Ch 6 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) | Kwon et al. | 2023 | Ch 6 |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | Lin et al. | 2024 | Ch 6 (Compression) |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Xiao et al. | 2023 | Ch 6 (Compression) |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | Frantar et al. | 2023 | Ch 6 (Compression) |
| Fast Inference from Transformers via Speculative Decoding | Leviathan, Kalman, Matias | 2023 | Ch 6 (System Optimizations) |
Applications & Agents
| Paper / Resource | Authors | Year | Relevant Chapters |
|---|---|---|---|
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Lewis et al. | 2020 | Ch 7 (RAG) |
| ReAct: Synergizing Reasoning and Acting in Language Models | Yao et al. | 2023 | Ch 7 (Agents) |
| Toolformer: Language Models Can Teach Themselves to Use Tools | Schick et al. | 2023 | Ch 4, Ch 7 |
| Building Effective Agents (blog) | Anthropic | 2024 | Ch 7 |
| LLM Powered Autonomous Agents (blog) | Lilian Weng | 2023 | Ch 7 |
Interview Resources & Question Banks
| Resource | Link | Notes |
|---|---|---|
| MLStack.cafe — LLMs Interview Questions (59 Qs) | mlstack.cafe/interview-questions/llms | Transformer architecture, attention, transfer learning, alignment |
| MLStack.cafe — ChatGPT Interview Questions (42 Qs) | mlstack.cafe/interview-questions/chatgpt | RLHF, tokenization, context handling, evaluation |
| MLStack.cafe — NLP Interview Questions (38 Qs) | mlstack.cafe/interview-questions/nlp | Positional encoding, encoder-decoder, CNNs vs transformers |
| Awesome Generative AI Guide — 60 Common GenAI Interview Qs | github.com/aishwaryanr/awesome-generative-ai-guide | Generative models, LLMs, embeddings, multimodal, training & evaluation |
| HuggingFace NLP Course | huggingface.co/learn/nlp-course | Ch 1–4 foundations |
| Stanford CS324 — Large Language Models | stanford-cs324.github.io/winter2022 | Training, evaluation, societal impact |
| Full Stack Deep Learning — LLM Bootcamp 2023 | fullstackdeeplearning.com/llm-bootcamp | End-to-end LLM application development |
| Chip Huyen — Designing ML Systems (O’Reilly 2022) | huyenchip.com | Training, serving, evaluation, data distribution |
| Jay Alammar — The Illustrated Transformer | jalammar.github.io/illustrated-transformer | Visual guide to transformer architecture |
| Sebastian Raschka — Build a Large Language Model From Scratch (2024) | manning.com | Architecture, pretraining, SFT, alignment |
| UC Berkeley CS294/194-196 — Large Language Model Agents | rdi.berkeley.edu/llm-agents/f24 | Agentic systems, tool use, planning |