Appendix 2. AI Engineering Interview Questions

Sources & Attribution

Questions in this database are curated, verified, and cross-referenced from the following sources:

Academic & Research Papers: Vaswani et al. 2017 (Attention Is All You Need), Sennrich et al. 2016 (BPE), Devlin et al. 2019 (BERT), Radford et al. 2018/2019 (GPT/GPT-2), Brown et al. 2020 (GPT-3), Touvron et al. 2023 (LLaMA), Hu et al. 2021 (LoRA), Ouyang et al. 2022 (InstructGPT/RLHF), Rafailov et al. 2023 (DPO), Shao et al. 2024 (GRPO), Dao et al. 2022/2023 (FlashAttention 1 & 2), Hoffmann et al. 2022 (Chinchilla), Kaplan et al. 2020 (Scaling Laws), Kwon et al. 2023 (PagedAttention/vLLM), Dettmers et al. 2023 (QLoRA), Lin et al. 2024 (AWQ), Xiao et al. 2023 (SmoothQuant), Gu & Dao 2023 (Mamba), DeepSeek-AI 2024/2025 (DeepSeek V2/V3/R1).

Interview Platforms & Question Banks: MLStack.cafe (59 LLMs Interview Questions, 42 ChatGPT Interview Questions, 38 NLP Interview Questions — mlstack.cafe/interview-questions/llms), Awesome Generative AI Guide by Aishwarya Naresh Reganti (60 Common GenAI Interview Questions — github.com/aishwaryanr/awesome-generative-ai-guide/interview_prep/60_gen_ai_questions.md).

Community Forums & Reports: Reddit r/MachineLearning, r/LocalLLaMA, r/learnmachinelearning; Hacker News; Blind (FAANG interview threads); LeetCode Discuss.

Industry Guides & Blogs: Chip Huyen — “Designing Machine Learning Systems” (O’Reilly 2022) and blog (huyenchip.com); Anthropic — “Building Effective Agents” (2024, anthropic.com); OpenAI Cookbook (github.com/openai/openai-cookbook); HuggingFace NLP Course (huggingface.co/learn/nlp-course); Stanford CS324 — Large Language Models (stanford-cs324.github.io); Full Stack Deep Learning — LLM Bootcamp 2023 (fullstackdeeplearning.com); Jay Alammar — “The Illustrated Transformer” (jalammar.github.io/illustrated-transformer); Lilian Weng — “LLM Powered Autonomous Agents” (lilianweng.github.io); Sebastian Raschka — “Build a Large Language Model From Scratch” (2024).

Book Chapters (this handbook): Each section maps to the corresponding chapter of the AI Engineering Handbook (Chapters 1–8).

Questions are tagged: [conceptual] [tradeoff] [design] [debug] [coding] [estimation]


Chapter 1 — Foundations

1.1 Tokenization

  1. Why do modern LLMs use subword tokenization instead of word-level or character-level? [conceptual]
  2. Compare BPE, WordPiece, and Unigram LM tokenizers — when would you choose each? [tradeoff]
  3. What is byte-level BPE (BBPE) and why is it better for multilingual models? [conceptual]
  4. How does vocabulary size trade off compute vs quality? [tradeoff]
  5. Your biomedical domain terms are splitting into 8+ subtokens — what do you do, and what can go wrong? [debug]
  6. What is tokenizer “fragmentation rate” and why does it matter for cost and latency? [conceptual]
  7. A colleague swapped the tokenizer between training and serving — what breaks? [debug]
  8. Why can’t you directly compare perplexity across models with different tokenizers? [conceptual]
  9. How does tokenization affect arithmetic and numerical reasoning in LLMs? [conceptual]
  10. If you extend a tokenizer mid-training, what are the risks and how do you mitigate them? [tradeoff]
  11. What is the relationship between token count and API cost? How would you optimize for a cost-sensitive deployment? [estimation]
  12. How does SentencePiece differ from HuggingFace tokenizers in handling whitespace and unknown characters? [conceptual]
  13. You’re building a multilingual model (English, Chinese, Arabic, code). Walk through your tokenizer design decisions — vocab size, algorithm, byte fallback, special tokens. [design]
  14. A model that handles code well suddenly produces wrong outputs when users paste markdown with backtick fences — what’s the likely tokenizer issue? [debug]
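Questions 1–3 above hinge on how BPE builds a vocabulary from merge rules. A toy, pure-Python sketch of BPE training steps (real tokenizers pretokenize and work per word with end-of-word markers; this flat-character version is illustrative only):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(2):  # learn two merges: (l,o) then (lo,w)
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

The frequent stem "low" becomes a single token after two merges, while rarer suffixes stay split — exactly the subword behavior question 1 asks about.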

1.2 Embeddings

  1. Why can’t the model just use token IDs as numbers — why do we need embeddings? [conceptual]
  2. Explain the embedding lookup — how is it equivalent to a one-hot × matrix multiply? [conceptual]
  3. What is “weight tying” (input embedding = output softmax weights) and why do it? [tradeoff]
  4. How do word/token embeddings differ from sentence embeddings? [conceptual]
  5. Compare CBOW and Skip-gram — when does Skip-gram outperform? [tradeoff]
  6. How does FastText handle out-of-vocabulary words? [conceptual]
  7. What is negative sampling and why is it needed to train word2vec at scale? [conceptual]
  8. How would you evaluate the quality of learned embeddings? [design]
  9. Explain the concept of contextual embeddings. How do BERT-style models differ from static word2vec/GloVe embeddings? [conceptual]
  10. What is the “anisotropy problem” in embedding spaces — why do modern LLM embeddings cluster in a narrow cone, and what can you do about it? [conceptual]
  11. How do cross-modal embeddings (e.g., CLIP) align text and image in a shared space? What is contrastive loss? [conceptual]
  12. You’re choosing an embedding model for a production RAG system — what dimensions do you evaluate (dimensionality, latency, domain coverage, MTEB scores)? [design]
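For question 2, the lookup/one-hot equivalence can be checked directly (NumPy, toy dimensions assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4
E = rng.normal(size=(vocab_size, d_model))  # embedding matrix

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# One-hot x matrix selects row `token_id` of E -- identical to a lookup.
assert np.allclose(one_hot @ E, E[token_id])
```

In practice frameworks implement the lookup as an indexed gather, since multiplying by a mostly-zero vector wastes compute.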

1.3 Attention

  1. Explain scaled dot-product attention step by step. Why divide by √d_k? [conceptual]
  2. What is the causal mask and why is it necessary for decoder-only LLMs? [conceptual]
  3. Why does vanilla self-attention scale as O(L²) in sequence length? What are the practical consequences? [conceptual]
  4. Explain multi-head attention (MHA). Why use multiple heads instead of one big head? [conceptual]
  5. What is cross-attention and where is it used? [conceptual]
  6. Compare MHA, MQA, and GQA — how does each affect the KV cache and quality? [tradeoff]
  7. Explain the KV cache. Why do we need it during autoregressive decoding? [conceptual]
  8. How does FlashAttention improve performance without changing the attention math? [conceptual]
  9. What is sparse attention (e.g., sliding window)? What do you gain and lose? [tradeoff]
  10. How can you turn vanilla softmax attention into “linear attention”? What’s the tradeoff? [conceptual]
  11. Why does long-context prefill dominate time-to-first-token (TTFT)? [conceptual]
  12. You doubled the context window from 4K to 128K — what breaks and what gets expensive? [estimation]
  13. Derive the memory cost of the KV cache for a single transformer layer as a function of batch size, sequence length, number of KV heads, and head dimension. [estimation]
  14. What is Multi-head Latent Attention (MLA) as used in DeepSeek V2? How does it compress the KV cache more aggressively than GQA? [conceptual]
  15. Explain Ring Attention (and its Striped Attention variant). How does it distribute long sequences across multiple GPUs for training? [conceptual]
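A sketch of the derivation in question 13, written as a formula you can evaluate (the example config is assumed, roughly LLaMA-2-70B-like):

```python
def kv_cache_bytes(batch, seq_len, n_kv_heads, head_dim, n_layers, dtype_bytes=2):
    """Per layer we store K and V, each of shape [batch, seq, n_kv_heads, head_dim];
    the leading 2 accounts for both tensors."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * dtype_bytes

# Assumed config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
gib = kv_cache_bytes(1, 8192, 8, 128, 80) / 2**30
print(f"{gib:.2f} GiB per sequence at 8K context")  # 2.50 GiB
```

Note the formula scales with KV heads, not query heads — which is exactly why GQA/MQA (question 6) shrink the cache.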

1.4 FFN, Activations, Residuals, Normalization

  1. Why do transformers need the FFN/MLP if attention already mixes information across tokens? [conceptual]
  2. Compare ReLU, GELU, Swish, and SwiGLU — why are gated activations preferred in modern LLMs? [tradeoff]
  3. What is the “dying ReLU” problem? [conceptual]
  4. Why do residual (skip) connections help train deep networks? [conceptual]
  5. Compare pre-norm vs post-norm transformers — which is more stable at depth and why? [tradeoff]
  6. Compare LayerNorm and RMSNorm. Why do modern LLMs favor RMSNorm? [tradeoff]
  7. What is DeepNorm? When would you need it? [conceptual]
  8. What happens if you remove all residual connections from a 70-layer transformer? [debug]
  9. The FFN typically has a “hidden dimension” 4× the model dimension (or 8/3× with SwiGLU). Why this ratio? What is the memory/compute implication? [conceptual]
  10. What is the “residual stream” view of transformers? How does it help reason about feature circuits and superposition? [conceptual]
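Question 9's 8/3× ratio comes from parameter matching: SwiGLU uses three projection matrices (gate, up, down) instead of two, so the hidden width is shrunk to keep the FFN parameter count roughly constant. A quick check (d = 4096 is an assumed example):

```python
d = 4096

# Classic 2-matrix FFN (e.g., GELU): up projection + down projection at 4x width.
classic = 2 * d * (4 * d)

# SwiGLU: 3 matrices, hidden width reduced to ~8/3 * d to match parameters.
h = int(8 / 3 * d)
swiglu = 3 * d * h

print(classic, swiglu)  # nearly equal parameter counts
```

In released models the hidden width is additionally rounded to a hardware-friendly multiple (e.g., LLaMA-2-7B uses 11008), so the ratio is approximate.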

1.5 Positional Encoding

  1. Why does a transformer need positional encoding? What happens without it? [conceptual]
  2. Explain sinusoidal positional encoding. Why sin/cos with different frequencies? [conceptual]
  3. What is RoPE (Rotary Position Embedding)? Give the one-line intuition. [conceptual]
  4. Compare absolute, relative, RoPE, and ALiBi — what’s the one-sentence difference? [tradeoff]
  5. What is “length extrapolation” and why do some position schemes degrade at longer contexts? [conceptual]
  6. How do NTK-aware scaling and YaRN extend RoPE to longer contexts? [conceptual]
  7. Your model was trained at 8K context but users need 128K — what are your options? [design]
  8. Why do some approaches (like ALiBi) claim to need no positional embeddings at all? What are they actually doing? [conceptual]
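For question 2, a minimal NumPy implementation of sinusoidal positional encoding (shapes and the base 10000 follow Vaswani et al. 2017):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_pe(128, 64)
# Each dimension pair rotates at a different frequency; position 0 is (0, 1) pairs.
```

The different frequencies let the model express relative offsets as linear functions of the encodings — the intuition behind the sin/cos choice in question 2.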

1.6 Decoding

  1. Compare greedy search, beam search, top-k, top-p (nucleus), and temperature sampling. [conceptual]
  2. What does temperature do mathematically? What fails at very low T vs very high T? [conceptual]
  3. Why can beam search reduce diversity and produce bland outputs? [conceptual]
  4. Top-k vs top-p — which is more robust across different prompts and why? [tradeoff]
  5. Explain “best-of-N” sampling. When would you use it vs single-sample generation? [tradeoff]
  6. What is self-consistency / majority voting and when is it useful? [conceptual]
  7. What are “stop sequences” and why do they matter for tool calls and structured outputs? [conceptual]
  8. How do repetition penalties work? What can they accidentally break? [debug]
  9. A user complains the model is “too repetitive” — what decoding knobs do you check first? [debug]
  10. Explain min_p sampling. How does it differ from top-p and why has it gained popularity in open-source inference? [conceptual]
  11. How does guided/constrained decoding (e.g., Outlines, LMFE) enforce a grammar or JSON schema during generation? What are the runtime costs? [conceptual]
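Questions 1, 2, and 4 are easiest to discuss with a reference implementation in hand. A minimal NumPy sketch of temperature scaling plus top-p (nucleus) truncation — not an optimized or production sampler:

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))    # stable softmax
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()                               # renormalize over the nucleus
    return rng.choice(len(probs), p=p)
```

Low temperature sharpens the distribution toward greedy; a small top_p prunes the long tail — the two failure modes in question 2 fall out of these formulas.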

1.7 Architecture

  1. Compare encoder-only, decoder-only, encoder-decoder, and PrefixLM architectures. [conceptual]
  2. Why are most modern chat LLMs decoder-only? [conceptual]
  3. When would you prefer encoder-only (BERT-style) over decoder-only? [tradeoff]
  4. What is Mixture of Experts (MoE)? Why does routing matter? [conceptual]
  5. In MoE, what is load balancing and why can it fail? [debug]
  6. What architectural choice tends to dominate inference memory at decode time? [conceptual]
  7. Give an overview of state-space models (Mamba). What bottleneck do they target? [conceptual]
  8. Compare transformers vs SSMs vs recurrent hybrids (RWKV) for long-context tasks. [tradeoff]
  9. What are diffusion language models and when might they outperform autoregressive generation? [conceptual]
  10. Explain the “modern decoder recipe”: RoPE + RMSNorm + SwiGLU + GQA. Why do LLaMA, Mistral, Qwen, and most open-source models converge on this pattern? [conceptual]
  11. What is “parallel attention + FFN” as used in PaLM? What is the compute saving vs the quality tradeoff? [tradeoff]
  12. What is Multi-Token Prediction (MTP) as used in DeepSeek V3? How does predicting multiple future tokens improve training signal and enable speculative decoding? [conceptual]

Chapter 2 — Pretraining

2.1 Data Pipeline & Quality

  1. Walk through a pretraining data pipeline: sources → filtering → dedup → tokenization → mixture. [design]
  2. Why is “data quality > data quantity” the most important pretraining principle? [conceptual]
  3. What filters would you apply to a Common Crawl dataset before pretraining? [design]
  4. Explain train-train deduplication vs train-test contamination — why does each matter? [conceptual]
  5. How would you build a pipeline that detects and prevents benchmark contamination? [design]
  6. You found that your model memorized chunks of training data verbatim — how do you diagnose and fix this? [debug]
  7. What data governance practices matter for pretraining (licensing, PII, audit trails)? [design]
  8. How do you handle PII in training data at scale? [design]
  9. What are common deduplication strategies (exact hash, MinHash, embedding similarity)? [conceptual]
  10. You need to create a classifier that filters toxic content from your pretraining corpus. What architecture do you use, and how do you handle false positives removing valuable data? [design]
  11. What is the role of “quality classifiers” (e.g., trained on Wikipedia vs random web) in pretraining data pipelines? How did Llama 3 approach this? [conceptual]

2.2 Data Mixture & Distribution

  1. Why does the training data mixture matter? Give an example where adding more code data hurts chat quality. [tradeoff]
  2. How would you design a data mixture for a general-purpose LLM? [design]
  3. You added domain-specific data and domain performance improved but general benchmarks dropped — what happened? [debug]
  4. What is “curriculum learning” in the context of pretraining? [conceptual]
  5. How do you estimate the marginal value of adding a new data source to the mixture? What proxy metrics can you track before running a full pretrain? [design]

2.3 Compute & Scaling

  1. What are the main cost drivers of pretraining (parameters, context, tokens, precision)? [conceptual]
  2. Explain the Chinchilla scaling laws — what is the compute-optimal ratio of parameters to tokens? [conceptual]
  3. Back-of-the-envelope: how many FLOPs to train a 70B model on 2T tokens? [estimation]
  4. Doubling model parameters roughly does what to FLOPs per token? [estimation]
  5. Explain data parallelism, tensor parallelism, and pipeline parallelism. When do you use each? [conceptual]
  6. What is ZeRO / FSDP and how does it reduce memory? [conceptual]
  7. What is activation checkpointing and what’s the tradeoff? [tradeoff]
  8. Explain mixed-precision training (BF16/FP16). Why does BF16 dominate for LLMs? [tradeoff]
  9. What is the “roofline model” and how does it help reason about GPU utilization? [conceptual]
  10. Estimate the minimum number of H100 GPUs and wall-clock time needed to pretrain a 7B-parameter model on 1T tokens. State your assumptions. [estimation]
  11. What is sequence parallelism and how does it complement tensor parallelism for long-context training? [conceptual]
  12. Explain the difference between Megatron-LM’s 3D parallelism and DeepSpeed ZeRO-3. When do you use each? [tradeoff]
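For questions 3, 4, and 10, the standard approximation is ~6 FLOPs per parameter per token (2 forward + 4 backward). A back-of-the-envelope sketch; the GPU count and 40% MFU below are assumptions for illustration, not recommendations:

```python
def train_flops(n_params, n_tokens):
    """Standard approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

flops = train_flops(70e9, 2e12)
print(f"{flops:.2e} FLOPs")  # ~8.4e23

# Wall-clock under assumed hardware: 1024 H100s at 40% MFU, 990 TFLOPS BF16 dense.
n_gpus, mfu, peak = 1024, 0.40, 990e12
days = flops / (n_gpus * mfu * peak) / 86400
print(f"~{days:.0f} days on {n_gpus} H100s")  # ~24 days
```

Doubling parameters doubles FLOPs per token under this model (question 4), since the 6N term is linear in N.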

2.4 Training Recipe & Monitoring

  1. What’s a typical learning rate schedule for LLM pretraining? [conceptual]
  2. Loss spiked 3× mid-run then slowly recovered — how do you triage? [debug]
  3. What are the top 5 failure modes in pretraining (instability, contamination, memorization, distribution shift, safety regression)? [conceptual]
  4. How do you set up a “probe set” for monitoring quality during pretraining? [design]
  5. Your training loss is decreasing but downstream eval scores plateaued — what’s wrong? [debug]
  6. When do you decide to stop pretraining? What signals do you look at? [tradeoff]
  7. What is gradient norm monitoring? How does a sudden gradient norm spike indicate a training issue, and what should you do about it? [debug]

2.5 Evaluation & Downstream Impact

  1. What does perplexity measure and what are its limitations? [conceptual]
  2. Lower perplexity doesn’t always mean better instruction following — why? [conceptual]
  3. How do pretraining choices constrain mid-training, SFT, and inference? [tradeoff]
  4. Your base model is weak at code — should you fix it in pretraining or later? [tradeoff]
  5. You pretrained two models with identical architectures but different data mixtures. Model A has lower perplexity but Model B scores higher on downstream tasks. How do you explain this? [debug]

Chapter 3 — Mid-Training / Continued Pretraining (CPT)

3.1 When & Why

  1. What is CPT and how does it differ from pretraining and SFT? [conceptual]
  2. When is CPT the right tool vs SFT vs prompt engineering? [tradeoff]
  3. A model has high perplexity on legal documents but good general performance — which training stage fixes this? [tradeoff]
  4. CPT uses the same next-token objective as pretraining but on a different distribution — what does that mean in practice? [conceptual]
  5. A startup wants to specialize an open-source 7B model for medical Q&A. They have 50K clinical notes. Should they do CPT, SFT, or both? Walk through the decision. [design]

3.2 Data & Mixture

  1. What is “general replay” and why is it critical for CPT? [conceptual]
  2. What replay ratio should you start with for CPT (e.g., 80/20), and how do you tune it? [design]
  3. What is document packing in CPT and why does it matter for throughput? [conceptual]
  4. Should you use curriculum learning during CPT (start with more replay, then anneal)? [tradeoff]
  5. How does data formatting differ between CPT and SFT? Why does packing documents with no attention masking across boundaries leak information? [conceptual]

3.3 The Stability Gap

  1. What is the “stability gap” in CPT and why does it happen? [conceptual]
  2. General benchmarks dropped sharply after starting CPT — walk through your debugging playbook. [debug]
  3. What role does learning rate play in CPT stability? [conceptual]
  4. How do regression gates work in CPT monitoring? [design]
  5. You ran CPT for 20B tokens and MMLU dropped 5 points. After continued training for 50B more tokens it partially recovered. Explain this “U-shaped” curve and how to set stopping criteria. [debug]

3.4 Tokenizer Extension

  1. When should you extend the tokenizer during CPT? [tradeoff]
  2. How do you initialize embeddings for new tokens added to the vocabulary? [conceptual]
  3. What is the “undertrained token” problem and how do you mitigate it? [debug]

3.5 Training Topology

  1. Compare the “three topologies”: packing (CPT), masking (SFT), rollouts (RL) — which tokens contribute gradients in each? [conceptual]
  2. What is the difference between CPT, SFT, DPO, and RL in terms of data construction and loss application? [conceptual]

3.6 CPT → RL Compatibility

  1. How can mid-training improve RL scaling (e.g., OctoThinker)? [conceptual]
  2. Given a product requirement, how do you decide between prompt → SFT → DPO → RL → CPT → distill? [design]

Chapter 4 — Post-Training

4.1 SFT (Supervised Fine-Tuning)

  1. What is SFT and when is it the first lever to pull? [conceptual]
  2. Describe the four shapes of SFT data: single-turn, multi-turn, tool-use trajectory, safety demonstration. [conceptual]
  3. What is the “chat template trap”? How can a whitespace change break your model? [debug]
  4. What is completion-only loss (user-token masking) and why does it improve SFT? [conceptual]
  5. Write the masked SFT loss formula. What does m_t = 0 vs m_t = 1 mean? [conceptual]
  6. Your SFT model echoes the user’s prompt back — what went wrong? [debug]
  7. Too many refusal examples in SFT data caused the model to refuse everything — what’s this called and how do you fix it? [debug]
  8. How do you ensure train-inference consistency with chat templates? [design]
  9. Your SFT model follows instructions but is too verbose — what do you do? [debug]
  10. You SFT’d on 10K examples and got great benchmark scores but users complain the model “sounds robotic.” What’s likely in your data, and how do you fix it? [debug]
  11. How many SFT examples do you typically need? Discuss the LIMA paper’s finding that “less is more” for alignment vs. the need for diversity. [tradeoff]

4.2 PEFT / LoRA

  1. Explain LoRA — what is the low-rank decomposition and why is it cheaper than full fine-tuning? [conceptual]
  2. LoRA rank r = 16 on a 4096×4096 weight matrix — how many trainable parameters vs full fine-tuning? [estimation]
  3. What is QLoRA and how does it reduce memory further? [conceptual]
  4. Where should you attach LoRA adapters (attention vs MLP) and why? [tradeoff]
  5. Merged adapter (single-tenant) vs online stacking (multi-tenant) — when do you use each? [tradeoff]
  6. What is adapter multi-tenancy? How do you serve one base model + 100 LoRA adapters? [design]
  7. A routing bug sent Tenant A’s request through Tenant B’s adapter — what’s the impact? [debug]
  8. How do you do regression testing per-adapter in a multi-tenant system? [design]
  9. Compare LoRA with other PEFT methods: prefix tuning, adapters (Houlsby), IA3. When does LoRA win and when might alternatives be better? [tradeoff]
  10. You trained a LoRA adapter on top of LLaMA-3-8B. The base model provider released a minor patch — can you reuse your adapter? What are the risks? [debug]
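Question 2 worked out: LoRA trains B (d_out × r) and A (r × d_in) in place of the full ΔW, so trainable parameters scale with r(d_in + d_out):

```python
def lora_params(d_in, d_out, r):
    """LoRA replaces delta-W (d_out x d_in) with B (d_out x r) @ A (r x d_in)."""
    return r * (d_in + d_out)

full = 4096 * 4096                     # 16,777,216 params in the full matrix
lora = lora_params(4096, 4096, r=16)   # 131,072 trainable params
print(f"LoRA trains {lora:,} params = {lora / full:.2%} of full fine-tuning")
```

At r = 16 on a 4096×4096 matrix that is under 1% of the full parameter count, which is why adapter checkpoints are megabytes rather than gigabytes.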

4.3 Preference Alignment (DPO / RLHF / PPO)

  1. What is preference learning? Why is “which is better” easier to label than “write the perfect answer”? [conceptual]
  2. Write the KL-regularized reward maximization objective. What does β control? [conceptual]
  3. Explain the full RLHF pipeline: preference labels → reward model → PPO. [conceptual]
  4. What is reward hacking? Give a concrete example. [conceptual]
  5. Why is full RLHF expensive as a system (not just a training job)? [tradeoff]
  6. Explain DPO in one sentence — how does it avoid the full RL loop? [conceptual]
  7. Compare DPO vs ORPO — when do you start with which? [tradeoff]
  8. What is the main limitation of DPO/ORPO (hint: data coverage)? [conceptual]
  9. “More preferred” does not always mean “more correct” — why does this matter? [conceptual]
  10. Your DPO-tuned model has higher preference win rate but lower factual accuracy — what happened? [debug]
  11. Explain iterative DPO / online DPO. Why is generating on-policy data during training better than fixed offline preference pairs? [conceptual]
  12. What is Constitutional AI (CAI)? How does it reduce reliance on human labelers? [conceptual]
  13. You run RLHF and the model learns to be very sycophantic (“Great question!”). What went wrong in the reward model and how do you fix it? [debug]
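For questions 2 and 6, the DPO loss for one preference pair fits in a few lines. A sketch assuming the sequence log-probs are already computed; β = 0.1 is an arbitrary example value:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    margin = ((policy_chosen_lp - policy_rejected_lp)
              - (ref_chosen_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen answer more than the reference does,
# the loss falls below log 2 (its value at zero margin).
assert dpo_loss(-5.0, -9.0, -6.0, -7.0) < math.log(2)
```

β plays the same role as in the KL-regularized objective of question 2: larger β penalizes drifting from the reference policy more strongly.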

4.4 Tool Use & RAG Training

  1. Why isn’t prompting alone sufficient for reliable tool use in production? [conceptual]
  2. Describe the tool-use interaction chain: select tool → fill arguments → consume output → answer. [conceptual]
  3. What is the difference between tool underuse (hallucinating) and tool overuse (unnecessary calls)? [debug]
  4. What is constrained decoding for tool calls? Why is schema validity different from semantic correctness? [tradeoff]
  5. How is RAG a special case of tool use? [conceptual]
  6. Your agent hallucinates answers instead of calling the retrieval tool — how do you fix this? [debug]
  7. How do you construct training data for multi-turn tool use? Discuss synthetic rollout generation, filtering by success, and the cold-start problem. [design]

4.5 Reasoning & Agentic RL

  1. Compare outcome reward models (ORM) vs process reward models (PRM). [tradeoff]
  2. What are STaR/ReST-style self-training loops? Why do they help with trajectory data scarcity? [conceptual]
  3. Explain GRPO — how does it avoid a learned critic and save memory? [conceptual]
  4. Why does GRPO matter for training 70B+ reasoning models? [tradeoff]
  5. What is reward hacking in reasoning RL? How does it manifest differently from chat alignment? [debug]
  6. Your reasoning model got longer and longer outputs over RL training — what’s happening? [debug]
  7. Compare Dr. GRPO, GSPO, DAPO, LUFFY — what knobs does each tune? [conceptual]
  8. Explain the “Aha moment” from DeepSeek R1-Zero. What emergent behaviors appeared during RL training without SFT, and why is this significant? [conceptual]
  9. What is the difference between chain-of-thought prompting, chain-of-thought fine-tuning, and training a reasoning model with RL? When do you use each? [tradeoff]

4.6 Distillation

  1. Black-box vs white-box distillation — when do you use each? [tradeoff]
  2. Write the standard KD loss (KL divergence between teacher and student distributions). [conceptual]
  3. Your distilled student matches the teacher on benchmarks but fails on safety — what went wrong? [debug]
  4. How do you manage teacher version drift when the teacher is a closed API? [design]
  5. What is the practical order: black-box first → white-box refinement? [design]
  6. DeepSeek distilled reasoning from R1 (671B MoE) into dense 1.5B–70B student models. What data was used, what was the training recipe, and how close did the students get? [conceptual]

4.7 Post-Training System Design

  1. Design a “Medical Scribe” that knows rare drug names (CPT) but refuses to prescribe (alignment + eval). [design]
  2. Fixed compute budget: DPO (training) or test-time scaling (inference)? How do you decide? [tradeoff]
  3. Your SFT model formats tool calls wrong — add more data or switch to constrained decoding? [debug]
  4. Your agent fails on multi-step tool chaining — what do you change in data, reward, and search? [design]
  5. You’re the ML lead for a legal-tech startup. You have: an open-source 8B model, 100K contract documents, 5K labeled QA pairs, and a 2-person team. Design the full post-training pipeline from CPT to production. [design]

Chapter 5 — Common Models & Benchmarks

5.1 Architecture Families

  1. Compare BERT, T5, GPT, LLaMA, Mistral, Qwen, GLM, and DeepSeek — what’s the core architecture choice in each? [conceptual]
  2. What is the “modern decoder recipe” (RoPE + RMSNorm + SwiGLU + GQA)? Which models use it? [conceptual]
  3. PaLM uses parallel transformer sublayers — what does that mean and why? [conceptual]
  4. What is Multi-head Latent Attention (MLA) in DeepSeek V2/V3? [conceptual]
  5. DeepSeek V3 uses Multi-Token Prediction (MTP) — what is it and why does it help? [conceptual]
  6. How does DeepSeek R1 train reasoning without SFT (R1-Zero)? [conceptual]
  7. What is sliding-window attention in Mistral and what does it sacrifice? [tradeoff]
  8. Mixtral is an MoE: 8 experts, top-2 routing, 47B total / ~13B active — why is this efficient? [conceptual]
  9. Trace the evolution: GPT-2 → GPT-3 → GPT-3.5 → GPT-4. What changed at each step (scale, RLHF, multimodality, MoE)? [conceptual]
  10. Compare LLaMA 1 vs LLaMA 2 vs LLaMA 3. What were the key training recipe changes (not just scale) at each generation? [conceptual]
  11. What is Gemma/Gemini’s approach to long context (up to 1M tokens)? How does it differ from the RoPE-extension approach? [conceptual]

5.2 Evaluation & Benchmarks

  1. Name a good minimal benchmark suite for a general-purpose LLM (knowledge, math, code, chat, safety). [design]
  2. What is pass@k in coding benchmarks and how does it differ from pass@1? [conceptual]
  3. MMLU vs MMLU-Pro — what’s the difference? [conceptual]
  4. What is the Chatbot Arena (LMSYS) and why is Elo-based ranking useful? [conceptual]
  5. Your model scores 90% on MMLU but fails on real customer queries — what’s the gap? [debug]
  6. What are common evaluation pitfalls (prompting format, temperature, contamination, scoring method)? [conceptual]
  7. LLM-as-a-judge: what biases exist and how do you mitigate them? [tradeoff]
  8. When would you use human preference tests over automatic benchmarks? [tradeoff]
  9. How do you evaluate long-context models (needle-in-a-haystack, LongBench)? [design]
  10. What is TruthfulQA and what failure mode does it target? [conceptual]
  11. How do you evaluate multimodal models (MMMU, MathVista)? [conceptual]
  12. How do you detect benchmark contamination (data leakage) in a trained model? [design]
  13. You need to evaluate a model’s safety: what benchmarks and red-team methods do you use? [design]
  14. What is the difference between “static” benchmarks (MMLU, HumanEval) and “dynamic” benchmarks (Chatbot Arena, LiveCodeBench)? Why are dynamic benchmarks becoming more important? [conceptual]
  15. Design an internal evaluation suite for a customer-facing chatbot. What dimensions do you test (accuracy, safety, latency, cost, style)? How do you weight them? [design]

Chapter 6 — Inference & Compression

6.1 Inference Physics

  1. Explain the two phases of LLM inference: prefill vs decode. Why do they have different bottlenecks? [conceptual]
  2. Prefill is compute-bound; decode is memory-bandwidth-bound — explain why. [conceptual]
  3. What is arithmetic intensity and how does it determine whether a workload is compute-bound or memory-bound? [conceptual]
  4. What is TTFT and what dominates it? What is TPOT/ITL and what dominates it? [conceptual]
  5. TTFT is too high — walk through your debugging steps. [debug]
  6. TPOT got worse after a config change — what do you check? [debug]
  7. An H100 has ~3.35 TB/s memory bandwidth and ~990 TFLOPS BF16. A 70B model in BF16 is ~140GB. What’s the theoretical maximum tokens/second during decode (batch size = 1)? What’s the bottleneck? [estimation]
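Question 7 worked out. At batch size 1, each generated token must stream the full weights from HBM once, so bandwidth, not FLOPs, sets the ceiling (KV-cache traffic ignored for simplicity):

```python
# Assumed H100 specs from the question: 3.35 TB/s HBM, 990 TFLOPS BF16 dense.
bandwidth = 3.35e12          # bytes/s
weights = 70e9 * 2           # 70B params in BF16 = 140 GB
max_tok_per_s = bandwidth / weights
print(f"~{max_tok_per_s:.0f} tokens/s upper bound")  # ~24

# Compute is nowhere near the limit: ~2 FLOPs per parameter per decoded token.
flops_per_token = 2 * 70e9
compute_bound = 990e12 / flops_per_token
print(f"compute alone would allow ~{compute_bound:.0f} tokens/s")
```

The ~300× gap between the two bounds is the quantitative version of "decode is memory-bandwidth-bound" from question 2, and it is why batching recovers throughput: the same weight traffic is amortized over many sequences.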

6.2 KV Cache

  1. Estimate KV cache size: given layers=80, KV heads=8, head dim=128, FP16 — how many bytes per token? [estimation]
  2. At 8K context, how much KV memory per sequence from that calculation? [estimation]
  3. How does GQA reduce KV cache vs MHA? Give the scaling factor. [conceptual]
  4. What is PagedAttention? How is it like virtual memory? [conceptual]
  5. Why does naive contiguous KV allocation waste memory? [conceptual]
  6. What happens when you quantize the KV cache (FP16 → INT8)? What tasks degrade first? [tradeoff]
  7. Your model runs out of memory on long prompts — walk through your mitigation steps. [debug]
  8. Compare KV cache compression techniques: quantization, eviction (H2O, StreamingLLM-style sink tokens), sliding window, cross-layer sharing. What works best at extreme context lengths? [tradeoff]
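Questions 1–3 worked out with the given config (80 layers, 8 KV heads, head dim 128, FP16; 64 query heads are assumed for the MHA comparison):

```python
# Bytes of KV cache per token: 2 (K and V) x layers x KV heads x head_dim x 2 bytes.
per_token = 2 * 80 * 8 * 128 * 2
print(per_token // 1024, "KiB per token")    # 320 KiB

# At 8K context, per sequence:
print(per_token * 8192 / 2**30, "GiB")       # 2.5 GiB

# GQA saving vs MHA: 64 query heads reduced to 8 KV heads shrinks the cache 8x.
mha_per_token = 2 * 80 * 64 * 128 * 2
print(mha_per_token // per_token, "x reduction")
```

The scaling factor for question 3 is simply n_query_heads / n_kv_heads, since only the KV heads are cached.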

6.3 System Optimizations

  1. What is continuous batching (in-flight batching) and how does it differ from static batching? [conceptual]
  2. Why does admission control matter for continuous batching? [conceptual]
  3. What is chunked prefill and how does it solve the “convoy effect”? [conceptual]
  4. What is prefix caching (prompt caching)? When does it give big wins? [conceptual]
  5. Explain speculative decoding: draft model → target verify. What’s the win condition? [conceptual]
  6. What is guided/constrained decoding and why does it matter for JSON/tool outputs? [conceptual]
  7. Compare vLLM, TGI, TensorRT-LLM, and SGLang — what are the key architectural differences and when would you choose each? [tradeoff]

6.4 Kernel Optimizations

  1. How does FlashAttention reduce HBM traffic? [conceptual]
  2. What is kernel fusion and why does it help both prefill and decode? [conceptual]
  3. FlashAttention vs PagedAttention — they solve different problems. Explain. [conceptual]
  4. What is FlashDecoding and how does it parallelize the decode phase across the KV sequence dimension? [conceptual]

6.5 Serving Architecture

  1. Why disaggregate prefill and decode onto separate fleets (P/D split)? [design]
  2. What is Multi-LoRA serving (the “Bento” pattern)? [design]
  3. How do you batch requests by adapter_id in a Multi-LoRA system? [design]
  4. What is “tenant bleed” in multi-adapter serving and how do you prevent it? [debug]
  5. How would you design autoscaling for an LLM inference service? What metrics should trigger scale-up vs. scale-down (queue depth, GPU utilization, p99 latency)? [design]

6.6 Compression

  1. Compare weight-only quantization, activation quantization, and KV quantization — when do you use each? [tradeoff]
  2. What is the “outlier problem” in LLM quantization? Why does naive quantization fail? [conceptual]
  3. What are AWQ and SmoothQuant? What problem does each solve? [conceptual]
  4. Compare unstructured pruning, structured pruning (N:M sparsity), and architectural sparsity (MoE). [tradeoff]
  5. What is knowledge distillation for inference? Compare response distillation vs logit distillation. [tradeoff]
  6. Why can pruning introduce non-linear “quality cliffs”? [conceptual]
  7. Compare GPTQ, AWQ, and GGML/GGUF quantization. Which is post-training, which is calibration-based? When do you use each? [tradeoff]
  8. What is quantization-aware training (QAT)? How does it differ from post-training quantization (PTQ) and when is the extra cost worth it? [tradeoff]

6.7 Test-Time Scaling

  1. What is test-time scaling? How does it improve reliability without retraining? [conceptual]
  2. Compare best-of-N, self-consistency voting, critique-revise loops, and tree search (MCTS). [tradeoff]
  3. When would you spend compute on test-time scaling vs retraining? [tradeoff]
  4. Explain how process reward models (PRMs) enable tree search at inference time. What is the tradeoff between breadth (more candidates) and depth (more reasoning steps)? [conceptual]
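Self-consistency voting from question 2 is the simplest test-time scaling method to implement: sample the model several times at nonzero temperature, extract only the final answer from each sample, and take the majority. A sketch, where `sample_fn` is a stand-in for one stochastic model call:

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    # sample_fn() represents one model call (temperature > 0) that returns
    # only the final answer string; majority vote picks the output.
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Note the cost model: reliability improves roughly with n at n times the inference cost, which frames the tradeoff in question 3.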

6.8 Metrics & Evaluation

  1. What are the four core inference metrics (TTFT, TPOT, throughput, $/1M tokens)? [conceptual]
  2. How do you set up quality regression gates after quantization? [design]
  3. You quantized your model and needle-in-a-haystack accuracy dropped — why? [debug]
  4. Your users report “variable latency” — sometimes fast, sometimes 5× slower for similar prompts. Diagnose the issue across batching, caching, and queue depth. [debug]
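The three latency-side metrics from question 1 fall directly out of per-token timestamps; a sketch of the arithmetic (field names are illustrative, and at least two tokens are assumed):

```python
def inference_metrics(request_start, token_times):
    # token_times: wall-clock timestamp (seconds) of each generated token.
    ttft = token_times[0] - request_start                  # time to first token
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1)    # time per output token
    throughput = n / (token_times[-1] - request_start)     # tokens/sec, end to end
    return {"ttft_s": ttft, "tpot_s": tpot, "tok_per_s": throughput}
```

TTFT is dominated by prefill (and queueing), TPOT by decode; $/1M tokens then follows from throughput per GPU and the GPU's hourly cost.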

6.9 System Design Drills

  1. Why does increasing batch size improve throughput but hurt per-request latency? [conceptual]
  2. Design an inference system for a latency-sensitive chat product. [design]
  3. Design an inference system for batch-mode document processing (throughput > latency). [design]
  4. Your service handles 10K requests/sec with long RAG contexts — what architecture do you propose? [design]
  5. You need to serve a 70B model but only have a single A100 80GB. Walk through the options (quantization, offloading, smaller model, distillation). [design]
  6. Compare cloud inference (OpenAI API, Bedrock, Vertex AI) vs self-hosted (vLLM on GPU, Ollama on Mac) — what factors determine the right choice? [tradeoff]
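Drill 5 starts with arithmetic. A back-of-envelope sketch, using assumed Llama-2-70B-like shapes (80 layers, 8 KV heads under GQA, head dim 128) purely for illustration:

```python
def weights_gb(params_b, bytes_per_param):
    # params in billions * bytes per param = GB (1e9 cancels).
    return params_b * bytes_per_param

def kv_gb_per_token(n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    # One K and one V tensor per layer, fp16 elements by default.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el / 1e9
```

At fp16 a 70B model needs ~140 GB for weights alone and cannot fit on one A100 80GB; at int4 it needs ~35 GB, leaving headroom for the KV cache (~0.33 MB per token with these shapes), activations, and batching.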

Chapter 7 — Applications / Agents

7.1 Agent Fundamentals

  1. What is an “agentic application”? How does it differ from a single LLM call? [conceptual]
  2. Workflow vs agent — when do you use each? Give a decision rule. [tradeoff]
  3. Describe the agent control loop: Sense → Think → Act → Check → Update State → Repeat. [conceptual]
  4. What are common agent products (enterprise search, support triage, deep research, coding agent, ops automation)? [conceptual]
  5. When should you NOT use an agent? [tradeoff]
  6. What is the “bitter lesson” for agents — why do simple designs with strong models outperform complex frameworks? [conceptual]

7.2 Architecture Components

  1. What is the “policy” in an agent system (model + routing + decoding settings)? [conceptual]
  2. What are output contracts and why are they critical for production agents? [design]
  3. Explain the validate → retry → repair prompting pattern. [design]
  4. What belongs in agent state (structured facts) vs context (token payload per turn)? [design]
  5. When do you use an explicit planner vs implicit planning in the policy prompt? [tradeoff]
  6. What are the common orchestration patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer? [conceptual]
  7. Explain the “ReAct” pattern (Reason + Act). How is it different from pure chain-of-thought? What are its failure modes? [conceptual]
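The validate → retry → repair pattern from question 3 reduces to a small loop: parse, validate against the output contract, and on failure feed the error back as a repair instruction. A sketch with assumed callables (`generate` wraps the model call, `schema_check` raises `ValueError` on contract violations):

```python
import json

def call_with_repair(generate, schema_check, max_retries=2):
    # generate(feedback) -> raw model text; feedback is None on the first
    # attempt and carries the validation error on repair attempts.
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            obj = json.loads(raw)
            schema_check(obj)
            return obj
        except (json.JSONDecodeError, ValueError) as e:
            feedback = f"Previous output invalid: {e}. Return corrected JSON only."
    raise RuntimeError("output contract not satisfied after retries")
```

The design point: the contract is enforced outside the model, so a malformed generation becomes a bounded retry rather than a downstream failure.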

7.3 Tool Interface Design

  1. What makes a good tool contract (schema, validation, timeouts, retries, idempotency, observability)? [design]
  2. What is MCP (Model Context Protocol) and how does it relate to tool serving? [conceptual]
  3. What is idempotency for write tools and why is it critical? [conceptual]
  4. How do you risk-rate tools (read_low → write_medium → write_high)? [design]
  5. Your agent made a duplicate refund because the tool wasn’t idempotent — what’s the fix? [debug]
  6. How do you enforce tool permissions outside the model (allowlists, authz)? [design]
  7. How do you handle tool errors gracefully? Should the error message go back to the model, and what are the risks of that? [design]
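The duplicate-refund bug in question 5 has a standard fix: the caller supplies an idempotency key, and the tool replays the stored result instead of re-executing the side effect. A minimal in-memory sketch (a real implementation would persist keys with a TTL in durable storage):

```python
executed = []   # side effects that actually reached the payment processor
_results = {}   # idempotency key -> stored result

def refund(idempotency_key, amount):
    # Replaying the same key returns the original result; the dangerous
    # write happens at most once per key.
    if idempotency_key in _results:
        return _results[idempotency_key]
    executed.append(amount)                       # the real write
    result = {"status": "refunded", "amount": amount}
    _results[idempotency_key] = result
    return result
```

This matters for agents specifically because retries are routine: a timeout mid-call leaves the agent unsure whether the write happened, and idempotency makes "retry" always safe.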

7.4 RAG Pipeline

  1. Walk through a RAG pipeline: ingest → chunk → embed → index → retrieve → rerank → generate. [design]
  2. Why does chunk size matter? What happens if chunks are too small or too big? [tradeoff]
  3. Compare dense retrieval (embeddings), sparse retrieval (BM25), and hybrid — when do you use each? [tradeoff]
  4. What is a cross-encoder reranker? Why is it better than bi-encoder similarity but more expensive? [tradeoff]
  5. How do you evaluate retrieval quality (recall@k, MRR, nDCG)? [conceptual]
  6. Your RAG system retrieves the right documents but the model ignores them — what’s wrong? [debug]
  7. How do you defend against prompt injection through retrieved documents? [design]
  8. What is query rewriting and multi-hop retrieval? When are they needed? [conceptual]
  9. How do you handle stale indexes in a production RAG system? [design]
  10. Compare vector databases (Pinecone, Weaviate, Qdrant, pgvector) — what are the tradeoffs? [tradeoff]
  11. You built a RAG system for a 10M-document legal corpus. Retrieval latency is acceptable but answer quality is poor. Walk through a systematic diagnosis: is it chunking, embedding model, retrieval, reranking, or generation? [debug]
  12. What is “agentic RAG”? How does it differ from single-shot retrieve-and-generate (e.g., iterative retrieval, query decomposition, self-reflection on retrieval quality)? [conceptual]
  13. How do you handle tables, images, and structured data in a RAG pipeline? [design]
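Hybrid retrieval from question 3 needs a way to merge rankings whose scores live on incompatible scales; reciprocal rank fusion (RRF) sidesteps score normalization by combining ranks. A sketch (k = 60 is the conventional constant from the original RRF work):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked doc-id lists from different retrievers,
    # e.g. one from BM25 and one from dense embedding search.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by multiple retrievers float to the top, which is why hybrid search is robust to either retriever failing on a given query.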

7.5 Guardrails & Safety

  1. What layers of guardrails should a production agent have (relevance, safety, PII, moderation, rules, tool safeguards)? [design]
  2. When should an agent escalate to a human-in-the-loop? [design]
  3. How do you handle prompt injection attacks in agentic systems? [design]
  4. Your agent approved an unauthorized action — what layers failed? [debug]
  5. What is the difference between “jailbreaking” (bypassing safety) and “prompt injection” (hijacking intent)? How do you defend against each? [conceptual]
  6. How do you build a red-team evaluation for an agentic system? What attack categories do you cover (direct injection, indirect injection, social engineering, tool abuse)? [design]

7.6 Memory & Context Engineering

  1. What is “context engineering” — how does it differ from prompt engineering? [conceptual]
  2. How do you manage context growth over 25+ tool-call loops? [design]
  3. What is “context rot” — why does retrieval precision degrade in very long contexts? [conceptual]
  4. Explain progressive disclosure for agents (metadata → full instructions → deep resources). [conceptual]
  5. What are the context layers: pinned context, working set, retrieved context, summaries? [design]
  6. When do you use sub-agent architectures to manage context? [tradeoff]
  7. How do you implement “memory” for a conversational agent that spans multiple sessions? Compare explicit memory stores vs. retrieval over past transcripts. [design]
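The layered-context idea from question 5 can be sketched as greedy budget packing: pinned context is always included, then the newest working-set turns, then retrieved chunks, until the token budget runs out. Items here are assumed (name, token_count) pairs; a real assembler would summarize evicted turns rather than drop them:

```python
def assemble_context(pinned, working, retrieved, budget):
    # pinned: always included (system prompt, contracts).
    # working: conversation turns, oldest first -- packed newest-first.
    # retrieved: RAG chunks, considered after the working set.
    ctx = [name for name, _ in pinned]
    used = sum(t for _, t in pinned)
    for name, t in list(reversed(working)) + retrieved:
        if used + t <= budget:
            ctx.append(name)
            used += t
    return ctx
```

The eviction order encodes a policy choice: recency wins over retrieval here, which suits chat agents but not research agents that depend on retrieved evidence.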

7.7 Evaluation & Observability

  1. How do you evaluate agents as systems (task success, tool accuracy, cost, safety)? [design]
  2. What should an agent observability system log? [design]
  3. Build a debug funnel for agent failures: tooling → retrieval → context → generation → guardrails. [design]
  4. How do you build a replayable evaluation harness with golden tasks? [design]
  5. What is the cost of an agentic task? How do you estimate and control it (token budgets, max iterations, cost caps)? [estimation]
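The estimation in question 5 has one non-obvious property: if the full transcript is re-sent every loop, input cost grows roughly quadratically in the number of iterations. A sketch with assumed per-1M-token prices (illustrative, not any provider's actual rates):

```python
def agent_task_cost(iterations, prompt_tokens, output_tokens,
                    price_in_per_1m=3.0, price_out_per_1m=15.0):
    # Assumes the whole transcript is re-sent each iteration and each
    # iteration produces output_tokens that join the next turn's context.
    cost = 0.0
    ctx = prompt_tokens
    for _ in range(iterations):
        cost += ctx * price_in_per_1m / 1e6 + output_tokens * price_out_per_1m / 1e6
        ctx += output_tokens
    return cost
```

This is why max-iteration caps, context summarization, and prompt caching are cost controls, not just quality controls: a 10-step task costs well over 10 times a 1-step task.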

7.8 Skills & Procedures

  1. What are “Agent Skills” and how do they differ from tools and prompts? [conceptual]
  2. Explain progressive disclosure levels for skills (metadata → full SKILL.md → linked resources). [conceptual]
  3. How do Skills and MCP fit together in an agent system? [design]

7.9 Agent System Design Drills

  1. Design a “news research agent” that finds, verifies, and summarizes news on a topic with citations. [design]
  2. Design a customer support triage agent with routing, tool use, and human handoff. [design]
  3. Design a coding/debugging agent that loops with repo search + tests as verification. [design]
  4. How do you prevent “search → fetch everything” tool spam in an agent? [design]
  5. Your agent’s latency is 30 seconds per query — where do you look first? [debug]
  6. Design an “AI data analyst” agent that takes a natural language question, writes SQL, executes it, interprets results, and visualizes them. What tools does it need? What guardrails? [design]
  7. You’re building a multi-agent system where a “planner” agent delegates to “executor” agents. How do you handle failure recovery when an executor fails mid-task? [design]

Chapter 8 — Taxonomy / Cross-Cutting Concepts

8.1 Key Distinctions

  1. Capability vs behavior — where does each come from in the LLM lifecycle? [conceptual]
  2. Latency vs throughput — when do you optimize for each? [tradeoff]
  3. Compute-bound vs memory-bound — how do you tell which regime you’re in? [conceptual]
  4. Dense models vs MoE — what are the pros and cons of each? [tradeoff]
  5. Online RL (PPO) vs offline preference learning (DPO) — when do you choose each? [tradeoff]

8.2 Common Confusions

  1. “Tokenization is just preprocessing” — why is this wrong? [conceptual]
  2. “Lower perplexity = better model” — when does this break? [conceptual]
  3. “KV cache is just an O(n²) attention concern” — what else dominates? [conceptual]
  4. “Byte-level BPE = character-level tokenization” — what’s the difference? [conceptual]
  5. “Pack more data = better model” — when does packing hurt? [conceptual]
  6. “RAG solves hallucination” — why is this only partially true? [conceptual]
  7. “Bigger context window = better performance” — when does longer context actually hurt? [conceptual]
  8. “Quantized models are always worse” — when can a quantized model outperform a larger full-precision model? [tradeoff]

8.3 End-to-End Design

  1. Trace a single user query through the full stack: tokenize → embed → transform → decode → serve → return. [design]
  2. How do training choices (architecture, tokenizer, context length) constrain inference? [tradeoff]
  3. You’re building an LLM-powered product from scratch for a vertical domain — walk through the full stack decisions. [design]
  4. Compare the engineering tradeoffs: latency vs quality, compute vs data, dense vs MoE, tool use vs reasoning. [tradeoff]

8.4 Cross-Chapter Integration (Advanced)

  1. How does tokenizer choice in pretraining affect KV cache size at inference? Trace the connection through vocab size → embedding dimension → KV heads → memory per token. [tradeoff]
  2. You trained a model with 4K context but need 128K for RAG at inference. Trace the full solution path through positional encoding extension, KV cache management, chunked prefill, and retrieval pipeline design. [design]
  3. Compare the cost of improving a model’s coding ability via: (a) adding code data in pretraining, (b) code-focused CPT, (c) code SFT, (d) code RL with execution feedback. When is each appropriate? [tradeoff]
  4. A model produces incorrect tool-call JSON. Diagnose whether the problem is: SFT data quality, tokenizer artifacts, decoding strategy, or the tool schema definition. How do you tell them apart? [debug]
  5. Your production LLM system has a cost of $0.50 per query. The business requires $0.05. Walk through every lever: model size, quantization, batching, caching, distillation, reducing output tokens, moving to MoE. [estimation]
  6. You’re evaluating three competing LLMs for deployment. Model A is largest and highest quality. Model B is medium with fast inference. Model C is smallest and cheapest. How do you structure the evaluation to make a recommendation for an enterprise customer support use case? [design]
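Question 5's 10× cost reduction is usually reached by stacking levers, since no single lever delivers 10× alone. A sketch with assumed multiplicative cost factors (the numbers are illustrative; real factors must be measured, with a quality gate at every step):

```python
def apply_levers(base_cost, levers):
    # levers: lever name -> assumed multiplicative cost factor.
    cost = base_cost
    for factor in levers.values():
        cost *= factor
    return cost

levers = {
    "distill 70B -> 8B":                      0.20,
    "int8 weight quantization":               0.60,
    "prefix caching (shared system prompt)":  0.70,
    "cap output tokens":                      0.80,
}
```

Stacked, these hypothetical factors take $0.50 to about $0.034 per query, under the target; the interview skill is justifying each factor and identifying which lever breaks quality first.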

Total: 368 questions

  • Conceptual: ~150
  • Tradeoff: ~85
  • Design: ~70
  • Debug: ~45
  • Estimation: ~15
  • Coding: ~5 (referenced in chapter code files)

References

Foundational Papers

Paper | Authors | Year | Relevant Chapters
Attention Is All You Need | Vaswani et al. | 2017 | Ch 1 (Attention, Architecture)
BERT: Pre-training of Deep Bidirectional Transformers | Devlin et al. | 2019 | Ch 1, Ch 5
Language Models are Unsupervised Multitask Learners (GPT-2) | Radford et al. | 2019 | Ch 1, Ch 2
Language Models are Few-Shot Learners (GPT-3) | Brown et al. | 2020 | Ch 1, Ch 2, Ch 5
Neural Machine Translation of Rare Words with Subword Units (BPE) | Sennrich, Haddow, Birch | 2016 | Ch 1 (Tokenization)
Efficient Estimation of Word Representations in Vector Space (word2vec) | Mikolov et al. | 2013 | Ch 1 (Embeddings)
GloVe: Global Vectors for Word Representation | Pennington, Socher, Manning | 2014 | Ch 1 (Embeddings)

Scaling & Pretraining

Paper | Authors | Year | Relevant Chapters
Scaling Laws for Neural Language Models | Kaplan et al. | 2020 | Ch 2
Training Compute-Optimal Large Language Models (Chinchilla) | Hoffmann et al. | 2022 | Ch 2
The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Gao et al. | 2020 | Ch 2
Deduplicating Training Data Makes Language Models Better | Lee et al. | 2022 | Ch 2
LLaMA: Open and Efficient Foundation Language Models | Touvron et al. | 2023 | Ch 2, Ch 5
Llama 2: Open Foundation and Fine-Tuned Chat Models | Touvron et al. | 2023 | Ch 2, Ch 4, Ch 5

Post-Training & Alignment

Paper | Authors | Year | Relevant Chapters
Training Language Models to Follow Instructions with Human Feedback (InstructGPT) | Ouyang et al. | 2022 | Ch 4 (RLHF)
Direct Preference Optimization (DPO) | Rafailov et al. | 2023 | Ch 4 (Preference Alignment)
LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2021 | Ch 4 (PEFT)
QLoRA: Efficient Finetuning of Quantized Language Models | Dettmers et al. | 2023 | Ch 4 (PEFT)
LIMA: Less Is More for Alignment | Zhou et al. | 2023 | Ch 4 (SFT)
Constitutional AI: Harmlessness from AI Feedback | Bai et al. | 2022 | Ch 4 (Alignment)
DeepSeekMath (introduces GRPO: Group Relative Policy Optimization) | Shao et al. | 2024 | Ch 4 (Reasoning RL)
STaR: Self-Taught Reasoner | Zelikman et al. | 2022 | Ch 4 (Reasoning)

Architecture & Models

Paper | Authors | Year | Relevant Chapters
RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) | Su et al. | 2021 | Ch 1 (Positional Encoding), Ch 5
ALiBi: Train Short, Test Long | Press et al. | 2022 | Ch 1, Ch 5
GQA: Training Generalized Multi-Query Transformer Models | Ainslie et al. | 2023 | Ch 1 (Attention), Ch 6
Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu & Dao | 2023 | Ch 1 (Architecture)
Mixtral of Experts | Mistral AI | 2024 | Ch 5
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | DeepSeek-AI | 2024 | Ch 5
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-AI | 2025 | Ch 4, Ch 5

Inference & Compression

Paper | Authors | Year | Relevant Chapters
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Dao et al. | 2022 | Ch 1, Ch 6
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | 2023 | Ch 6
Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) | Kwon et al. | 2023 | Ch 6
AWQ: Activation-aware Weight Quantization | Lin et al. | 2024 | Ch 6 (Compression)
SmoothQuant: Accurate and Efficient Post-Training Quantization | Xiao et al. | 2023 | Ch 6 (Compression)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | Frantar et al. | 2023 | Ch 6 (Compression)
Fast Inference from Transformers via Speculative Decoding | Leviathan, Kalman, Matias | 2023 | Ch 6 (System Optimizations)

Applications & Agents

Paper / Resource | Authors | Year | Relevant Chapters
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Lewis et al. | 2020 | Ch 7 (RAG)
ReAct: Synergizing Reasoning and Acting in Language Models | Yao et al. | 2023 | Ch 7 (Agents)
Toolformer: Language Models Can Teach Themselves to Use Tools | Schick et al. | 2023 | Ch 4, Ch 7
Building Effective Agents (blog) | Anthropic | 2024 | Ch 7
LLM Powered Autonomous Agents (blog) | Lilian Weng | 2023 | Ch 7

Interview Resources & Question Banks

Resource | Link | Notes
MLStack.cafe — LLMs Interview Questions (59 Qs) | mlstack.cafe/interview-questions/llms | Transformer architecture, attention, transfer learning, alignment
MLStack.cafe — ChatGPT Interview Questions (42 Qs) | mlstack.cafe/interview-questions/chatgpt | RLHF, tokenization, context handling, evaluation
MLStack.cafe — NLP Interview Questions (38 Qs) | mlstack.cafe/interview-questions/nlp | Positional encoding, encoder-decoder, CNNs vs transformers
Awesome Generative AI Guide — 60 Common GenAI Interview Qs | github.com/aishwaryanr/awesome-generative-ai-guide | Generative models, LLMs, embeddings, multimodal, training & evaluation
HuggingFace NLP Course | huggingface.co/learn/nlp-course | Ch 1–4 foundations
Stanford CS324 — Large Language Models | stanford-cs324.github.io/winter2022 | Training, evaluation, societal impact
Full Stack Deep Learning — LLM Bootcamp 2023 | fullstackdeeplearning.com/llm-bootcamp | End-to-end LLM application development
Chip Huyen — Designing ML Systems (O’Reilly 2022) | huyenchip.com | Training, serving, evaluation, data distribution
Jay Alammar — The Illustrated Transformer | jalammar.github.io/illustrated-transformer | Visual guide to transformer architecture
Sebastian Raschka — Build a Large Language Model From Scratch (2024) | manning.com | Architecture, pretraining, SFT, alignment
UC Berkeley CS294/194-196 — Large Language Model Agents | rdi.berkeley.edu/llm-agents/f24 | Agentic systems, tool use, planning